
2014 International Conference on Circuit, Power and Computing Technologies [ICCPCT]

Content Based Filtering in Online Social Network using Inference Algorithm
N.Thilagavathi
Assistant Professor,
Dept. of Information and Technology,
Sri Manakula Vinayagar Engineering College,
Puducherry, India.
thilagarajsanran@gmail.com

R.Taarika
PG Student,
Department of Computer Science,
Sri Manakula Vinayagar Engineering College,
Puducherry, India.
taarika.ravichandran8@gmail.com

978-1-4799-2397-7/14/$31.00 ©2014 IEEE

Abstract— A basic issue in online social networks is giving users the ability to manage the messages posted on their walls. Online social networks offer only minimal assistance in keeping unwanted content off a user's wall. To improve this support, a system is designed that filters unwanted messages and gives users direct control over the messages posted on their walls. This is achieved with a flexible rule-based system that allows users to specify filtering rules for their walls. In addition, inference algorithms are used to infer new information from the filtering rules, which increases the efficiency of the filtering process. A Machine Learning based soft classifier is employed to facilitate content-based filtering.

Index Terms— Online social networks, information filtering, short text classification

I. INTRODUCTION

Advances in computing and communication technologies enable people to get together and share information in innovative ways. Social networking sites are a familiar interactive medium for sharing a significant amount of personal information. In these networks, several types of content such as text, images, audio and video are exchanged every day. According to Facebook statistics, an average user creates 90 pieces of content each month, while more than 30 billion pieces of information are shared each month.

Information in social networks changes dynamically, and users are overwhelmed by large amounts of raw data. Content mining strategies are employed to extract the valuable information hidden within this enormous amount of data. These techniques provide effective support for complex tasks in online social networks such as, for instance, access control or information filtering.

Information filtering is the process of providing appropriate information to the people who need it. It chiefly concerns textual documents, specifically web content [1, 2, 3], and offers the user a classification mechanism for avoiding unnecessary information. In online social networks, information filtering is put to a particular use.

In online social networks, users can post or comment on the public or private walls of others. Information filtering techniques can therefore help users automatically control the unwanted messages posted on their walls. This is a key online social network service that has not been provided so far. Indeed, online social networks provide only minimal support for preventing unwanted messages on the wall. For instance, Facebook offers three different methods to filter messages from a user's wall.

The first method states which users are allowed to post messages on the wall: friends, friends of friends, or a defined group of friends. Only they are allowed to post messages on that user's wall. The second method blocks specific users from posting messages on the wall; it simply prevents those users from posting. The third method hides undesired posts from the timeline; it merely conceals the posted message.

The key limitation of the existing methods is that they cannot filter incoming posts based on content. They only block the user who posts the messages on the wall, and blocking users is not necessary in every situation. Because the existing methods do not support content-based filtering, it is impossible to prevent an undesired message without considering the user who posts it. We believe that this is a key online social network service that has not been provided so far.

To facilitate content-based filtering, this article introduces the Filtered Wall architecture, which filters incoming posts based on their content.

The remainder of the paper is organized as follows. Section 2 surveys the related work, Section 3 presents the architecture and concepts of the proposed system, and Section 4 illustrates the inference of new rules from existing rules. Finally, Section 5 concludes the paper.

II. RELATED WORK

The main goal of this paper is to design a system that provides customizable content-based message filtering for online social networks, based on machine learning techniques. Information filtering systems are designed to categorize information that is generated dynamically and to offer users the information that fulfills their requirements [6]. In a content-based filtering system, each user is assumed to operate independently. The filtering system therefore selects information based on the correlation between the content of the items and the user's preferences.

This contrasts with collaborative filtering systems, which select information based on the correlation between people with similar preferences [7], [8]. Information filtering was initially used to categorize email messages; subsequent papers have addressed various domains such as newswire articles, Internet news articles and network resources [9], [10], [11]. Content-based filtering mostly processes textual documents, which makes it closely related to text classification.

In fact, the activity of filtering can be modeled as a case of single-label, binary classification that partitions incoming documents into relevant and nonrelevant categories [12]. More complex filtering systems include multilabel text categorization, which automatically labels messages into partial thematic categories.

Content-based filtering is mainly based on the ML paradigm, in which a classifier is automatically induced by learning from a set of preclassified examples. The feature extraction procedure maps text into a compact representation of its content and is applied uniformly in the training and generalization phases. The Bag-of-Words (BoW) approach yields good performance and in general prevails over more sophisticated text representations that may have superior semantics but lower statistical quality [13], [14], [15].

There is a variety of key approaches to content-based filtering and text classification. Depending on the application, each approach has its own advantages and disadvantages. A detailed comparison analysis [4] has confirmed the superiority of classifiers such as Boosting-based classifiers [16], Neural Networks [17], [18], and Support Vector Machines [19] over other popular methods, such as Rocchio [20] and Naive Bayesian [21]. However, most of the work on text categorization by ML has been applied to long-form text, and the measured performance of text classification methods strictly depends on the nature of the textual documents.

Content-based filtering of messages posted on online social network user walls poses additional challenges, given the short length of these messages and the wide range of topics. It is difficult to define robust features, essentially because short texts are concise, with many misspellings, nonstandard terms, and noise. Zelikovitz and Hirsh [22] try to improve the classification of short text streams by developing a semi-supervised learning strategy based on a combination of labeled training data and a corpus of unlabeled but related documents.

This solution is inapplicable to online social networks, in which short messages are not summaries or parts of longer, semantically related documents. Another approach is proposed by Bobicev and Sokolova [23], who evade the problem of error-prone feature construction by adopting a statistical learning method that can work reasonably well without feature engineering. However, this method, named Prediction by Partial Mapping, produces a language model that is used in probabilistic text classifiers, which are hard classifiers by nature and do not easily integrate soft, multimembership paradigms. In our scenario, gradual membership to classes is a key feature for defining flexible policy-based personalization strategies and needs to be considered.

III. CONTENT BASED FILTERING

To support content-based filtering in online social networks, the Filtered Wall architecture is introduced. In this architecture, text mining techniques are employed to categorize incoming messages. Traditional text classification methods have major shortcomings when classifying short text messages, because short messages do not contain sufficient word occurrences. An automated system called Filtered Wall is designed in this paper to filter unwanted messages from user walls.

In this system, Machine Learning based text categorization [4] techniques are used to automatically assign to each short text message a set of categories based on its content. A Short Text Classifier is built to accurately extract and select a set of discriminating features from the message. A neural learning model is employed for efficient text classification. In particular, a Radial Basis Function Network (RBFN) [5] acts as a soft classifier to handle noisy data and intrinsically vague classes. The neural model is embedded in a hierarchical two-level classification. At the first level, the RBFN classifies short messages as Neutral or Nonneutral; at the second level, Nonneutral messages are classified according to their appropriateness to each of the considered categories.

In addition to the classification facilities, the system offers a robust rule layer for specifying Filtering Rules (FRs) in a flexible language. With it, users can specify what content should not be displayed on their walls. Different kinds of filtering rules can be combined and customized according to the user's needs. The system also supports user-defined Blacklists (BLs), that is, lists of users that are temporarily blocked from posting messages on the user's wall.

[Figure: block diagram with the components Filtered Wall GUI, Short Text Classification, Content Based Filtering, Filtering Rules, Blacklists Rules, and Social Network Manager]

Fig. 1. Filtered Wall Architecture
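As a rough illustration of the hierarchical two-level soft classification described in Section III, the sketch below uses one Gaussian radial basis unit per class and reads its activation as a graded membership. It is a deliberately simplified stand-in for a trained RBFN, not the paper's classifier: the 2-D feature vectors, the class centres, the `width` parameter and the category names `violence` and `vulgar` are all hypothetical.

```python
import math

def gaussian(x, center, width):
    """Gaussian radial basis activation in (0, 1]; equals 1 when x == center."""
    dist2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-dist2 / (2 * width ** 2))

def soft_memberships(x, centers, width=1.0):
    """One RBF unit per class; each activation is read as a graded membership."""
    return {label: gaussian(x, c, width) for label, c in centers.items()}

def classify_two_level(x, level1, level2, width=1.0):
    """Level 1: Neutral vs Nonneutral. Level 2: only for Nonneutral messages."""
    first = soft_memberships(x, level1, width)
    if first["neutral"] >= first["non_neutral"]:
        return first, {}  # neutral messages are not analysed further
    return first, soft_memberships(x, level2, width)

# Hypothetical class centres in a 2-D feature space, for illustration only.
LEVEL1 = {"neutral": (0.0, 0.0), "non_neutral": (1.0, 1.0)}
LEVEL2 = {"violence": (1.0, 0.8), "vulgar": (0.8, 1.5)}

first, second = classify_two_level((1.0, 0.9), LEVEL1, LEVEL2)
```

Because the second-level memberships are not forced to sum to one, a message can belong to several categories to different degrees, which is the multimembership behaviour a soft classifier is meant to provide.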


The architecture supporting online social network services comprises three major components (Figure 1): Social Network Manager, Short Text Classification and Content Based Filtering. The Social Network Manager (SNM) offers the basic online social network functionalities such as profile management and relationship management. Short Text Classification is employed to classify incoming posts.

Content Based Filtering provides the support for message filtering. Specifically, users interact with the system via a GUI to set up and manage their FRs/BLs. Moreover, the GUI provides users with a Filtered Wall (FW), that is, a wall where only messages that are authorized according to their FRs/BLs are published. As depicted in Figure 1, the path followed by a message, from its writing to its possible final publication, can be summarized as follows:

1. After entering the private wall of one of his/her contacts, the user tries to post a message, which is intercepted by the Filtered Wall.
2. A Machine Learning-based text classifier extracts metadata from the content of the message.
3. Filtered Wall uses the metadata provided by the classifier, together with data extracted from the social graph and users' profiles, to enforce the filtering and Black List rules.
4. Depending on the result of the previous step, the message will be published or filtered by Filtered Wall.

The core components of the proposed system are the Short Text Classifier module and the Content-Based Messages Filtering (CBMF) module.

A. Short Text Classifier

The Short Text Classifier aims to classify messages according to a set of categories. To that end, a classifier is built to extract and select the discriminating features of a short text message. Short text classification comprises two phases: text representation and Machine Learning based classification.

In the first phase, the short text message is represented in the vector space model [24], which puts the text in a format suitable for extracting its discriminating features. In this model, a short text message is represented as a vector of weights, Dj = (w1j, w2j, ..., wTj), where wkj is the weight of the k-th term in message j. The term frequency-inverse document frequency (tf-idf) weighting function is employed to calculate the weight of each term in the message.

In the second phase, a neural network classifier is employed to classify the incoming message. It automatically assigns the short message to the suitable category, namely neutral or non-neutral. Non-neutral messages are then further analyzed to determine their appropriateness to each category.

B. Content-Based Messages Filtering (CBMF)

CBMF exploits the message categorization to enforce the Filtering Rules specified by the user. First of all, in an online social network the same message may have different meanings and relevance depending on who writes it. As a consequence, FRs should allow users to state constraints on message creators. The creators to which an FR applies can be selected on the basis of several different criteria; one of the most relevant is imposing conditions on their profile attributes. In this way it is possible, for instance, to define rules that apply only to young creators or to creators with a given religious/political view.

Given the social network scenario, creators may also be identified by exploiting information on their social graph. Each filtering rule, specified by its author, comprises three essential elements: the creator specification, the content specification and the action of the system. The creator specification indicates the criteria for selecting creators. The content specification states which content needs to be filtered according to the user's preferences. The action of the system is either to notify or to block the message.

FR = (author, creatorSpec, contentSpec, action)

In general, more than one filtering rule can apply to the same user. A message is therefore published only if it is not blocked by any of the filtering rules that apply to the message creator. Note, moreover, that a user profile may not contain a value for the attribute(s) referred to by an FR (e.g., the profile does not specify a value for the attribute Hometown, whereas the FR blocks all messages authored by users coming from a specific city).

Black List rules can also be used to enhance the filtering process. The BL mechanism avoids messages from undesired creators, independently of their contents. BLs are managed directly by the system, which should be able to determine which users are to be inserted in the BL and to decide when a user's retention in the BL is finished.

To enhance flexibility, this information is given to the system through a set of rules, called BL rules. Wall owners specify who has to be banned from their walls and for how long. Banning considers two major measures: the number of times a user has been inserted into the blacklist within a given period, and whether the user's messages continue to fail the filtering rules.

A blacklist rule comprises four essential elements: author, creator specification, creator behavior and time. The author is the user who specifies the rule. The creator specification identifies the users to whom the rule applies. Creator behavior deals with the banning criteria. Time specifies how long the user is banned from posting messages.

BL = (author, creatorSpec, creatorBehavior, T)
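The FR and BL tuples and the publication condition above can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the system's implementation: the profile keys, the membership threshold, the day unit for T, the use of a blocked-message count as the creator behavior, and the choice to treat a missing profile attribute as "rule does not apply" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Profile = Dict[str, object]      # creator profile attributes (hypothetical keys)
Memberships = Dict[str, float]   # category -> membership level from the classifier

@dataclass
class FilteringRule:             # FR = (author, creatorSpec, contentSpec, action)
    author: str
    creator_spec: Callable[[Profile], bool]
    content_spec: Callable[[Memberships], bool]
    action: str                  # "block" or "notify"

@dataclass
class BlacklistRule:             # BL = (author, creatorSpec, creatorBehavior, T)
    author: str
    creator_spec: Callable[[Profile], bool]
    creator_behavior: Callable[[int], bool]  # assumed input: count of blocked messages
    T: int                                   # ban duration; days are an assumed unit

def decide(rules: List[FilteringRule], creator: Profile,
           memberships: Memberships, blacklisted: bool = False) -> str:
    """Return 'block', 'notify' or 'publish' for one incoming message."""
    if blacklisted:              # BLs act independently of message content
        return "block"
    decision = "publish"
    for rule in rules:
        if rule.creator_spec(creator) and rule.content_spec(memberships):
            if rule.action == "block":
                return "block"   # blocked by any applicable rule => not published
            decision = "notify"
    return decision

def ban_days(rule: BlacklistRule, creator: Profile, blocked_count: int) -> int:
    """Days to ban the creator under one BL rule (0 means no ban)."""
    if rule.creator_spec(creator) and rule.creator_behavior(blocked_count):
        return rule.T
    return 0

def is_young(profile: Profile) -> bool:
    age = profile.get("age")
    # Assumption: a missing attribute means the rule does not apply.
    return age is not None and age < 20

RULES = [FilteringRule("alice", is_young,
                       lambda m: m.get("violence", 0.0) > 0.8, "block")]
BAN = BlacklistRule("alice", lambda p: True, lambda n: n >= 3, 7)
```

With these example rules, a message with a high `violence` membership is blocked only when its creator's profile states an age below 20; the same message from a creator with no `age` attribute is published, which is one possible way to resolve the missing-attribute case noted above.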

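Returning to the text representation phase of Section III.A, the weight vector Dj can be sketched with a plain tf-idf computation. The paper does not state which tf-idf variant it uses, so the formula below (raw term count multiplied by log(N/df)) and the whitespace tokenizer are assumptions chosen for brevity.

```python
import math
from collections import Counter

def tfidf_vectors(messages):
    """Return (vocabulary, vectors): one tf-idf weight vector per message."""
    docs = [m.lower().split() for m in messages]      # naive tokenizer (assumption)
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of messages that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)                             # raw term counts in this message
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

vocab, vectors = tfidf_vectors(["happy birthday", "happy new year", "I hate you"])
```

Each row of `vectors` plays the role of Dj: terms that are frequent in a message but rare across messages receive the largest weights, and terms absent from a message receive weight zero.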

IV. INFERENCE RULE

In online social networks, unwanted messages are controlled by specifying filtering rules. Generally, filtering rules are static in nature: the rules do not change based on the content of the messages on the user's wall. In online social networks, however, large amounts of information change dynamically. Hence the predefined rules may not remain representative in the long term. To deal with this, online learning paradigms are used to infer new rules from the existing rules.

For example, consider a post that reads, “It is an offence for an Indian to sell weapons to enemy nations. James, who is Indian, sold some missiles to Pakistan, an enemy country”. Suppose the user has specified the filtering rule “Information regarding a criminal will not be displayed on my wall”. In the post, James is a criminal because he sold missiles to a hostile nation, but this information is not stated directly in the post. If the filtering rule is merely applied to the post, it will not be able to identify that James is a criminal. To find such indirect relationships in a post, inference rules are applied.

The inference process determines the entailment relationship between sentences and generates a new rule, which is added to the knowledge base. The forward-chaining algorithm is used to generate the new rules. Forward chaining is a reasoning technique that establishes the applicability of an implication by proving that all elements of its premise hold.

The inference procedure makes use of the ideas of renaming and composition of substitutions. One sentence is a renaming of another if the two are identical except for the names of their variables. A substitution maps variables to terms (for instance, to other variables); composing two substitutions yields a single substitution with the same effect as applying them one after the other.

The forward-chaining inference algorithm (Fig. 2) is triggered whenever a sentence p is to be added to the knowledge base. If a renaming of p is already stored in the knowledge base, the procedure simply returns. If p is a new sentence, it is added to the knowledge base, and each implication that has a premise matching p is considered. If all the remaining premises can be matched in the knowledge base, the conclusion is inferred. If the premises can be matched in several ways, the conclusion is inferred for each match.

procedure FORWARD-CHAIN(KB, p)
    if there is a sentence in KB that is a renaming of p then return
    Add p to KB
    for each (p1 ∧ ... ∧ pn => q) in KB such that for some i, UNIFY(pi, p) = θ succeeds do
        FIND-AND-INFER(KB, [p1, ..., pi-1, pi+1, ..., pn], q, θ)
    end

procedure FIND-AND-INFER(KB, premises, conclusion, θ)
    if premises = [] then
        FORWARD-CHAIN(KB, SUBST(θ, conclusion))
    else for each p' in KB such that UNIFY(p', SUBST(θ, FIRST(premises))) = θ2 do
        FIND-AND-INFER(KB, REST(premises), conclusion, COMPOSE(θ, θ2))
    end

Fig. 2. Inference Algorithm

V. CONCLUSION AND FUTURE WORK

To filter undesired messages from OSN walls, the system exploits a machine learning soft classifier to enforce customizable content-dependent filtering rules. Learning paradigms are used to infer new rules from the existing ones. Moreover, the flexibility of the system in terms of filtering options is enhanced through the management of blacklists. As future work, we intend to exploit similar techniques to infer BL rules.

REFERENCES

[1] G. Adomavicius and A. Tuzhilin, “Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions,” IEEE Trans. Knowledge and Data Eng., vol. 17, no. 6, pp. 734-749, June 2005.
[2] M. Chau and H. Chen, “A Machine Learning Approach to Web Page Filtering Using Content and Structure Analysis,” Decision Support Systems, vol. 44, no. 2, pp. 482-494, 2008.
[3] R.J. Mooney and L. Roy, “Content-Based Book Recommending Using Learning for Text Categorization,” Proc. Fifth ACM Conf. Digital Libraries, pp. 195-204, 2000.
[4] F. Sebastiani, “Machine Learning in Automated Text Categorization,” ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[5] M.J.D. Powell, “Radial Basis Functions for Multivariable Interpolation: A Review,” Algorithms for Approximation, pp. 143-167, Clarendon Press, 1987.
[6] N.J. Belkin and W.B. Croft, “Information Filtering and Information Retrieval: Two Sides of the Same Coin?” Comm. ACM, vol. 35, no. 12, pp. 29-38, 1992.
[7] P.J. Denning, “Electronic Junk,” Comm. ACM, vol. 25, no. 3, pp. 163-165, 1982.
[8] P.W. Foltz and S.T. Dumais, “Personalized Information Delivery: An Analysis of Information Filtering Methods,” Comm. ACM, vol. 35, no. 12, pp. 51-60, 1992.
[9] S. Jacobs and L.F. Rau, “Scisor: Extracting Information from On-Line News,” Comm. ACM, vol. 33, no. 11, pp. 88-97, 1990.
[10] S. Pollock, “A Rule-Based Message Filtering System,” ACM Trans. Office Information Systems, vol. 6, no. 3, pp. 232-254, 1988.
[11] P.E. Baclace, “Competitive Agents for Information Filtering,” Comm. ACM, vol. 35, no. 12, p. 50, 1992.
[12] P.J. Hayes, P.M. Andersen, I.B. Nirenburg, and L.M. Schmandt, “Tcs: A Shell for Content-Based Text Categorization,” Proc. Sixth IEEE Conf. Artificial Intelligence Applications (CAIA ’90), pp. 320-326, 1990.
[13] C. Apte, F. Damerau, S.M. Weiss, D. Sholom, and M. Weiss, “Automated Learning of Decision Rules for Text Categorization,” Trans. Information Systems, vol. 12, no. 3, pp. 233-251, 1994.
[14] S. Dumais, J. Platt, D. Heckerman, and M. Sahami, “Inductive Learning Algorithms and Representations for Text Categorization,” Proc. Seventh Int’l Conf. Information and Knowledge Management (CIKM ’98), pp. 148-155, 1998.
[15] D.D. Lewis, “An Evaluation of Phrasal and Clustered Representations on a Text Categorization Task,” Proc. 15th ACM Int’l Conf. Research and Development in Information Retrieval (SIGIR ’92), N.J. Belkin, P. Ingwersen, and A.M. Pejtersen, eds., pp. 37-50, 1992.
[16] R.E. Schapire and Y. Singer, “Boostexter: A Boosting-Based System for Text Categorization,” Machine Learning, vol. 39, nos. 2/3, pp. 135-168, 2000.
[17] H. Schütze, D.A. Hull, and J.O. Pedersen, “A Comparison of Classifiers and Document Representations for the Routing Problem,” Proc. 18th Ann. ACM/SIGIR Conf. Research and Development in Information Retrieval, pp. 229-237, 1995.
[18] E.D. Wiener, J.O. Pedersen, and A.S. Weigend, “A Neural Network Approach to Topic Spotting,” Proc. Fourth Ann. Symp. Document Analysis and Information Retrieval (SDAIR ’95), pp. 317-332, 1995.
[19] T. Joachims, “Text Categorization with Support Vector Machines: Learning with Many Relevant Features,” Proc. European Conf. Machine Learning, pp. 137-142, 1998.
[20] T. Joachims, “A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization,” Proc. Int’l Conf. Machine Learning, pp. 143-151, 1997.
[21] S.E. Robertson and K.S. Jones, “Relevance Weighting of Search Terms,” J. Am. Soc. for Information Science, vol. 27, no. 3, pp. 129-146, 1976.
[22] S. Zelikovitz and H. Hirsh, “Improving Short Text Classification Using Unlabeled Background Knowledge,” Proc. 17th Int’l Conf. Machine Learning (ICML ’00), P. Langley, ed., pp. 1183-1190, 2000.
[23] V. Bobicev and M. Sokolova, “An Effective and Robust Method for Short Text Classification,” Proc. 23rd Nat’l Conf. Artificial Intelligence (AAAI), D. Fox and C.P. Gomes, eds., pp. 1444-1445, 2008.
[24] C.D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge Univ. Press, 2008.
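As an illustration of the forward-chaining procedure in Fig. 2 (Section IV), the following sketch implements FORWARD-CHAIN and FIND-AND-INFER for a restricted case: atoms are flat tuples, variables are strings beginning with "?", and all stored facts are ground. Under that assumption the renaming test reduces to a membership check and COMPOSE reduces to merging two dictionaries. The predicate names and the fact that missiles are weapons are assumptions added to encode the James example; extracting such facts from a post is outside the scope of the sketch.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def walk(term, theta):
    """Follow variable bindings in theta until a non-bound term is reached."""
    while is_var(term) and term in theta:
        term = theta[term]
    return term

def subst(theta, atom):
    """SUBST: apply substitution theta to every term of a flat atom."""
    return tuple(walk(t, theta) for t in atom)

def unify(a, b):
    """UNIFY for flat atoms: return a substitution dict, or None on failure."""
    if len(a) != len(b):
        return None
    theta = {}
    for x, y in zip(a, b):
        x, y = walk(x, theta), walk(y, theta)
        if x == y:
            continue
        if is_var(x):
            theta[x] = y
        elif is_var(y):
            theta[y] = x
        else:
            return None
    return theta

def forward_chain(kb, p):
    if p in kb["facts"]:            # renaming check (facts are ground here)
        return
    kb["facts"].append(p)
    for premises, conclusion in kb["rules"]:
        for i, premise in enumerate(premises):
            theta = unify(premise, p)
            if theta is not None:
                rest = premises[:i] + premises[i + 1:]
                find_and_infer(kb, rest, conclusion, theta)

def find_and_infer(kb, premises, conclusion, theta):
    if not premises:
        forward_chain(kb, subst(theta, conclusion))
        return
    first = subst(theta, premises[0])
    for fact in list(kb["facts"]):  # copy: the KB may grow during recursion
        theta2 = unify(first, fact)
        if theta2 is not None:
            # COMPOSE(theta, theta2): a plain merge suffices for ground facts.
            find_and_infer(kb, premises[1:], conclusion, {**theta, **theta2})

# Indian(x) ∧ Weapon(y) ∧ Sells(x, y, z) ∧ Enemy(z) => Criminal(x)
kb = {"facts": [], "rules": [(
    [("Indian", "?x"), ("Weapon", "?y"), ("Sells", "?x", "?y", "?z"), ("Enemy", "?z")],
    ("Criminal", "?x"),
)]}
for fact in [("Indian", "James"), ("Weapon", "missiles"), ("Enemy", "Pakistan"),
             ("Sells", "James", "missiles", "Pakistan")]:
    forward_chain(kb, fact)
```

After the last fact is added, the knowledge base also contains `("Criminal", "James")`, so a filtering rule that targets information about criminals can now match the post even though the post never states that fact directly.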
