A New Approach For Ranking Micro-Blogs Content

JOURNAL OF COMPUTING, VOLUME 4, ISSUE 6, JUNE 2012, ISSN (Online) 2151-9617 https://sites.google.com/site/journalofcomputing WWW.JOURNALOFCOMPUTING.ORG

Ahmed Sulaiman M Alharbi
Abstract- Micro-blogs are a service associated with Web 2.0. Their features distinguish them from traditional blogs: they allow users to write and read short messages and to share these messages with other users. The most popular example of a micro-blog is Twitter. This paper investigates whether using Twitter's features improves the returned search results. These features are extracted from tweets: the hashtag feature, URL feature, user name tag feature, tweet length feature and user tweet number feature. To answer the research question, an experiment was conducted whose main goal was to examine each feature individually and to measure its impact on the returned results. Using traditional ranking approaches alone on micro-blogs might not be the ideal option. Therefore, one aim of this paper is to help researchers concerned with ranking results for a given query, or with filtering and clustering data in a micro-blog environment, to think about creating new ranking strategies.
Index Terms: Search Process, Information Retrieval, Context Analysis, Micro-blogging

Introduction

Micro-blogs have become a very common environment for sharing and publishing information instantly. A huge number of messages are posted every hour on well-known websites that provide micro-blog services, for example Twitter, Facebook, Tumblr and MySpace. Users of micro-blogs can write about what they are thinking, feeling or doing, and can express their opinions on different topics. However, micro-blogs are not used only as social media; they can also be used to keep track of breaking news locally or internationally. Moreover, micro-blog use is not restricted to individuals: according to [1], many companies and organizations use micro-blogs to communicate with their customers and for commercial purposes.

Furthermore, many Internet users have abandoned traditional ways of exchanging information, such as mailing lists [2], in favour of micro-blogs such as Twitter. The reasons for this shift are the ease of writing messages in an uncomplicated format and the number of channels, beyond the micro-blog websites themselves, through which users can send and share their ideas [3]. To take Twitter as an example, users who read a story on a news agency's website can generally post their comments or opinions about it through that website, without going to the Twitter website. This gives users access through multiple interfaces.

As the number of micro-blog users keeps rising, the volume of micro-blog data rises as well. Micro-blogs have therefore become a large source of information that can help information seekers find what they need. A study by Teevan, Ramage and Morris [4] demonstrates that users use micro-blogs not only for social communication but also as a useful and active source of information. In addition, micro-blogs in general, and Twitter in particular, are considered a rich source of real-time information, such as news, which might lose its value after a short time. Therefore, searching in micro-blogs is an important component of real-time search [5].

1.1 What Are Micro-blogs?

Micro-blogs are a form of digital communication that gives micro-bloggers (footnote 1) the ability to write and publish brief messages containing not only text but also multimedia, such as images or videos. These messages can be seen by the public or by a limited group of users defined in advance by the authors themselves [8]. Generally, the number of characters in a message is limited; for instance, Twitter allows its users to write tweets that do not exceed 140 characters. There are many examples of micro-blog platforms, such as Twitter, Tumblr, Facebook, MySpace, Plurk, Jaiku and Beeing. However, Twitter is considered one of the most popular micro-blog websites [7].

1.2 Twitter

Twitter was officially launched in 2006. The Twitter website has more than 20 million visitors per month [3]. According to the Twitter Blog (footnote 2), in June 2010 Twitter carried approximately 50 million tweets per day, which is about 600 tweets per second. According to the same source, the number of tweets per day in March 2011 was 177 million. Twitter has many features that make it one of the most popular micro-blog websites, and people use it not only for social communication but also as a useful and active source of information [2].

Teevan et al. [4] studied the behaviour of Twitter users. They surveyed 54 Twitter users and found that 83% of participants read tweets one or more times a day, whereas 59% write tweets one or more times a day. The study also illustrates what kind of information people look for: 49% of participants search for topics related to recent news or events.

Footnote 1: Micro-bloggers are users who use a micro-blog to write and read messages.
Footnote 2: http://blog.twitter.com/2010/02/measuring-tweets.html

1.3 Research Question

Given a set of point-in-time queries, each consisting of query terms and a timestamp, and a time-ordered set of tweets: 1) Can we identify the tweets that are most relevant to the query at that point in time? 2) Can we identify the tweets going forward in time that discuss the concepts from the query?
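To make the notion of a point-in-time query concrete, the following minimal Python sketch models a query as terms plus a timestamp and splits a time-ordered list of tweets around that timestamp. The class and function names are hypothetical, and reading question 1 as concerning tweets posted up to the query time and question 2 as concerning later tweets is an assumption, not something the research question states.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Tweet:
    tweet_id: int
    posted: datetime
    text: str


@dataclass(frozen=True)
class PointInTimeQuery:
    terms: str           # the query terms
    timestamp: datetime  # the point in time at which the query is issued


def split_by_query_time(query, tweets):
    """Split tweets into (posted at or before the query time, posted after it).

    Assumption: question 1 is answered over the first list and
    question 2 over the second list.
    """
    up_to_query = [t for t in tweets if t.posted <= query.timestamp]
    after_query = [t for t in tweets if t.posted > query.timestamp]
    return up_to_query, after_query
```

Ranking within each list is a separate step; this sketch only fixes which tweets are candidates for each question.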
Related Work

There are several major areas of research on micro-blogs. These include mining real-time data, automated summarisation of document content, and analysing and detecting events, such as sentiment detection or the detection of natural disasters such as earthquakes. Most of this research is related, at some stage, to the present work on micro-blogs and real-time search. In other words, all of the areas mentioned use a specific algorithm to search for the documents required to complete their task. For example, TwitterStand (footnote 3), a system that captures breaking news from tweets on the Twitter platform and then creates a map showing where the news happened, uses an algorithm to search for and retrieve news tweets before analysing their content to extract required data such as the location of a tweet. This section sheds light on some of these works.

2.1 Summarising Short-Content Documents

One notable experiment concerned the automated summarisation of short-content documents [16]. The aim is to create a short, condensed tweet (tweets being a perfect example of short-content documents) that gives a general picture of a Trending Topic. Trending Topics is a service provided by Twitter that lists the topics currently attracting users' attention. The main aim of such research is to save the user's time: instead of reading thousands of tweets on a specific topic, he or she only needs to read the summarised tweets.

2.2 Event Detection Using Micro-blog Data

Sakaki, Okazaki and Matsuo conducted an experiment in 2010 whose goal was to build an application able to track the locations of earthquakes around the world by analysing tweet content, and thus to notify people quickly, even before the USGS (United States Geological Survey) or the media [18]. This type of experiment takes advantage of the real-time nature that characterises micro-blogs in general and Twitter in particular.

The algorithm used for this experiment is called the Event Detection algorithm. There is a set of queries for tracking earthquakes. Using Twitter's search API, the system looks every S seconds for tweets (posts) containing words such as "earthquake" and "shaking", where S denotes the interval in seconds after which the queries are re-issued to keep the system up to date. The system then performs some calculations to determine the seriousness of the situation. Finally, the location information of the tweets is used to locate the earthquakes [18].

2.3 Classification and Filtering of Short-Content Documents

There are two main types of filtering system [19]: social filtering and content-based filtering. In the first, filtering is done using the comments or posts of a document's readers, whereas in the second, filtering is based on information extracted from the document's text. Most micro-blogging websites limit the number of characters in a comment or post, in most cases to no more than 200; on Twitter, the limit is 140 characters. This limitation produces content that is not dense enough for text classification. Different approaches have been suggested to solve this problem, for instance using Wikipedia as an extra source [20], [21], or using results from a web search engine [22]. Another solution is to use topic models [23]. The idea behind this approach is to apply Latent Dirichlet Allocation [24] to a general corpus to create a topic model, which is then used to classify the short posts.

Moreover, Sriram et al. [25] introduce an approach that classifies incoming tweets (a good example of short-content documents) into five categories: news, events, opinions, deals and private messages. This classification relies on author information in addition to other characteristics of the tweets. Another method, proposed by Ramage et al. [26], uses a topic model to classify tweets. In this method, every topic is assigned to one of four categories (Substance, Social, Status or Style); the system then assesses every tweet and gives the proportion of the four categories in that tweet.

2.4 Search Engines Using Micro-blogs

Because micro-blogs are real-time systems, they are considered a useful resource for new content. For that reason, Twitter can be used to detect new web pages. Accordingly, a number of search engines take advantage of Twitter's real-time character to add new web pages to their results. Both [33] and [34] use this real-time character to identify new URLs.

Footnote 3: TwitterStand is a system whose main goal is to capture breaking news automatically from tweets and to create a map that locates where the news happened. It was introduced by Sankaranarayanan, Samet, Teitler, Lieberman and Sperling [27].
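The keyword-polling step described in Section 2.2 can be sketched roughly as follows. This is only an illustration of the polling idea, not the algorithm of [18]: the function names are hypothetical, `fetch_recent` stands in for a call to a search API that is not shown, and the whole-word matching rule is an assumption.

```python
import re

# Example query words taken from the description in Section 2.2.
KEYWORDS = ("earthquake", "shaking")


def mentions_event(text, keywords=KEYWORDS):
    """Return True if the tweet text contains any keyword as a whole word."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return any(keyword in words for keyword in keywords)


def poll_once(fetch_recent, keywords=KEYWORDS):
    """Run one polling round.

    fetch_recent is a hypothetical callable that returns an iterable of
    tweet texts; a real system would call it again every S seconds.
    """
    return [text for text in fetch_recent() if mentions_event(text, keywords)]
```

Estimating the seriousness of the situation and locating the event from the matching tweets are further steps that this sketch does not attempt.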
Research Design
The main goal of this experiment is to find out whether using the Twitter features mentioned earlier in this paper improves search results on Twitter, i.e. improves the ranking of tweets. In other words, given the queries published by the TREC Micro-blog track 2011, we rank tweets using Twitter's features to find the most relevant tweets, those that meet the searcher's information need. To achieve that, a relatively large dataset is required.

3.1 Dataset Description

The dataset used in this research was provided by the TREC Micro-blog track 2011. It contains approximately sixteen million tweets posted over a period of about two weeks, from 24 January 2011 to 8 February 2011. The dataset includes various types of tweets: replies, re-tweets and independent tweets (footnote 4). The tweets are written in English and in other languages. The TREC Micro-blog track dataset is divided into files. Each file represents one day of tweeting and contains approximately 10,000 tweet IDs with their authors' names. Once all files have been downloaded, the actual tweets whose IDs match those in the files are obtained directly from Twitter via the Twitter API. Table 1 shows a sample of the list of tweet IDs and authors' names that each file contains. Table 2 shows an example of an actual tweet after the Twitter API has been used to complete the dataset.

Table 1 Sample of tweet IDs with their authors' names

Twitter ID          User Name (footnote 5)   Hash Code
28965133296340992   -                        d53af5c74a1c0cf1425f8e8ab5f7c256
28965133988401153   -                        379bc361b0b29529f73834aa4f3c9a29
...

Table 2 Example of actual tweets after using the Twitter API

Twitter ID          User Name   Status   Posted Time                Content of the Tweet
34399760722952192   -           200      Sun Feb 06 23:55:16 2011   #bbcsuperbowl think Jake will be lookin tired at 3am?

In Table 2, the first column holds the tweet's ID, followed by the author's name. The third column holds the HTTP status code returned by the Twitter API (footnote 6). The fourth column holds the timestamp showing when the tweet was created (posted). The last column holds the actual content of the tweet.

Table 3 Dataset summary statistics

Number of tweets                              16,000,000
Number of actual tweets                       14,691,916
Number of users                                4,668,704
Number of tweets with a hashtag (#)            2,488,261
Number of tweets with a URL                    2,672,818
Number of tweets with a user name tag (@)      9,520,135

3.2 The Judgment Approach

To evaluate the results, a list of manually judged tweets was created for each query in the query set. The top 300 tweets for each query were evaluated, so with 12 queries the total number of judged tweets is 3,600. We use this list as the baseline. The tweets returned by the modified ranking functions are compared programmatically with the baseline created by the judge. That is, the baseline is used to measure how well the runs with new weight values identify the most relevant tweets.

Fig. 1. Screenshot of the judgment application interface

To facilitate the evaluation process, an application with a simple interface was created that allows a judge to evaluate a tweet with one mouse click. This application (Figure 1) displays each tweet returned by Lucene's search engine along with the query topic. The interface has four buttons, which allow the judge to mark a tweet as relevant to the query, not relevant, written in a language other than English (since this experiment focuses on English), or spam. Once the judge finishes the evaluation, the application writes all the results to a file.

Footnote 4: An independent tweet in this context is a tweet in which the author updates her or his status. Such a tweet may or may not have comments (replies) and may or may not be re-tweeted.
Footnote 5: For privacy reasons, user names are removed.
Footnote 6: https://dev.twitter.com/docs/error-codes-responses
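Counts such as those in the dataset summary statistics (tweets containing a hashtag, a URL or a user name tag) could be gathered along the following lines. This is a minimal sketch under stated assumptions: the regular expressions and the function name `summarise` are illustrative and are not the procedure used to produce the reported figures.

```python
import re

# Assumed patterns: a URL starts with http:// or https://, a hashtag is '#'
# followed by word characters, a user name tag is '@' followed by word characters.
URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#\w+")
USER_NAME_TAG = re.compile(r"@\w+")


def summarise(tweets):
    """Count how many tweet texts contain each kind of marker."""
    stats = {"tweets": 0, "with_hashtag": 0, "with_url": 0, "with_user_name_tag": 0}
    for text in tweets:
        stats["tweets"] += 1
        if URL.search(text):
            stats["with_url"] += 1
        # Remove URLs first so that '#' or '@' inside a URL is not counted.
        rest = URL.sub(" ", text)
        if HASHTAG.search(rest):
            stats["with_hashtag"] += 1
        if USER_NAME_TAG.search(rest):
            stats["with_user_name_tag"] += 1
    return stats
```

A pattern this simple will also treat an e-mail address as a user name tag, so the counts it produces are approximate.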
3.3 Baseline

As a baseline, this experiment uses Lucene's similarity function, against which our results are compared. See Section 3.2, The Judgment Approach, for how the baseline was constructed.

3.4 Ranking Experiment

The main goal of this experiment is to find out whether using Twitter's features enhances the search results and, if there is any improvement, what the ideal weight for each feature is. Therefore, each feature is examined individually. To do that, we re-rank the top-k results (footnote 7) by assigning weights ranging from -0.9 to 2.0 to each Twitter feature and adding the weighted feature value to the score obtained from Lucene's similarity function. We use the following ranking formula:

$$\mathrm{Score}(q, t) = \mathrm{Lucene}(q, t) + w \cdot \mathrm{Feature}(q, t)$$

where, for a query q and a tweet t, w is a weight taking its value from the weight set {-0.9 to 2.0}. To explain further, suppose we would like to examine the hashtag feature: the system creates a list of returned tweets for each hashtag weight value, adding only the weighted hashtag feature to Lucene's score and ignoring all other features. As mentioned earlier, each list is compared programmatically with the baseline list created by a judge beforehand. The following subsections describe and examine each feature separately. The features covered in this paper are the hashtag feature, URL feature, user name tag feature, tweet length feature and user tweet number feature.

3.4.1 Hashtag Feature

Many tweets contain the # symbol, which is called a hashtag. Hashtags are created by users themselves and are used to create keywords for specific topics [39]. In addition, Twitter users use hashtags to make it easier to keep track of all discussions of a hashtag's topic [3]. Twitter uses hashtags to follow trending topics: if a hashtag is discussed widely, Twitter considers it a trend and lists it among the hot topics. This list can be found on the right side of Twitter's front page under the label "Trends". Most hashtags have an explicit meaning; for instance, Figure 2 shows a tweet containing the hashtag #TwoThingsThatDontMix, and there are others such as #thingsIfear, #android and #justathought. The most popular hashtags are considered Trending Topics.

Fig. 2. Screenshot of a tweet

Comparing the precision values (see Figure 3), we found that at precision at 1 there was no difference between the baseline and any hashtag weight. At precision at 10, however, the hashtag feature did better than the baseline, although the improvement was small. Similarly, precision at 20 shows that using the hashtag feature returns more relevant tweets than the baseline. It is important to note that at precision at 20 the hashtag feature did slightly better with positive weights than with negative ones, especially for weights from 0.1 to 0.5. Although the hashtag feature achieves a limited enhancement, it does not give the level of improvement we expected.
It is important to mention that the number of unjudged tweets at precision at 1, 10 and 20 is zero. This means that all 300 tweets returned for each of the 12 queries were within our manually judged list. At precision at 30, all returned tweets were judged when the hashtag weight was between -0.9 and 1.0; however, when the hashtag weight was between 0.2 and 2.0, one tweet had not been judged.

3.4.2 URL Feature

One useful feature of Twitter is that it enables users to insert URLs into their tweets. A URL in a tweet could be a good indicator of the tweet's quality. According to Nagmoti, Teredesai and De Cock [5], there are several purposes for inserting a URL in a tweet. The first is to guide the author's followers, or other readers of the tweet, to further information that cannot be included in the tweet itself because of the length limit. Another is to confirm the veracity of what the author has written. However, URLs in tweets have to be short, so they are shortened using shortening services (Twitter has recently begun providing such a service to its users), and as a result their destinations cannot be guessed. That is, not every URL in a tweet directs readers to a website related to the content of the tweet that contains it. This creates a favourable environment for spammers [3].

Figure 4 compares precision at 1, 10, 20 and 30. At first glance, using the URL feature to rank tweets has no effect on precision at 1. However, precision at 10 shows that the URL feature increases the number of relevant tweets, especially for weights from 0.5 to 1.0; beyond these values the effectiveness of the URL feature declines. For example, at a URL weight of 2.0, the number of relevant tweets is lower than without the URL feature. In general, precision at 20 and at 30 both show that using the URL feature enhances the search results.

Footnote 7: The top-k results are those returned by Lucene's similarity function.
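As an illustration of the re-ranking step in Section 3.4, the sketch below adds a weighted feature value to a base score and sorts the results again, then builds the weight sweep from -0.9 to 2.0 in steps of 0.1 (the step size is read from the axes of the figures). The function names are hypothetical, the base scores are taken as given rather than computed by Lucene, and the binary URL feature is an assumption, since the paper does not state how feature values are computed.

```python
def url_feature(query, text):
    """Assumed binary feature: 1.0 if the tweet contains a URL, else 0.0."""
    return 1.0 if "http://" in text or "https://" in text else 0.0


def rerank(query, results, feature, w):
    """Re-rank (tweet_text, base_score) pairs by base_score + w * feature.

    The base score stands in for the score of Lucene's similarity function.
    """
    rescored = [(text, base + w * feature(query, text)) for text, base in results]
    # sorted() is stable, so tweets with equal scores keep their original order.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Weight sweep from -0.9 to 2.0 in steps of 0.1 (30 values).
WEIGHTS = [i / 10 for i in range(-9, 21)]
```

With a weight of 0 the ordering is that of the baseline, which is why the zero weight serves as the reference point in the comparisons.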
3.4.3 User Name Tag Feature

This feature sometimes makes Twitter confusing even for daily users [5]. Like most of Twitter's features, it can be recognised by its symbol, which is @. It is variously called the username tag, the reply tag or the mention tag (footnote 8), and all of these names are correct. It is called the username tag because what follows the @ sign directly is a username, and it is called the reply tag because its main purpose is to reply to another user. Twitter allows its users not only to write tweets but also to reply to each other. To do so, a user begins the tweet with the @ sign followed by the username of the author of the tweet being replied to, so that readers recognise the tweet as a reply to another tweet. Sometimes users need, for one reason or another, to write the names of other users elsewhere in the body of a tweet; Twitter calls this a mention. Accordingly, a reply is a form of mention. The only difference between a reply and a mention is the position of the @ sign: if it is at the beginning of the tweet, the tweet is considered a reply; otherwise, it is a mention.

The user name feature plays an important role in some studies and experiments that try to measure user influence. For example, Cha, Haddadi, Benevenuto and Gummadi [40] tried to identify patterns of user influence on Twitter. They measure user influence using three measures: in-degree, re-tweets and mentions. In-degree is the number of users who follow a user. Re-tweets is the number of times other users re-publish a user's tweets, whereas mentions is the number of tweets that mention the user's name. They found that measuring a user's influence on Twitter should take all three measures into account; using only one of them will sometimes lead to inaccurate measurement. We recommend the work of Nagmoti et al. [5] to the interested reader.

Figure 5 compares precision at 1, 10, 20 and 30 when the user name feature weight ranges from -0.9 to 2.0. It can be clearly seen that using the user name feature generally does not enhance the quality of search, especially for weights between 0.3 and 2.0. To begin with, at precision at 1, weights from -0.9 to -0.1 and from 0.1 to 1.7 have no influence compared with the zero weight, which represents the baseline. Weights beyond 1.7 affect the search results negatively. Precision at 10, 20 and 30 follow similar patterns. Mostly, the negative side (from -0.9 to 0.1) performed the same as the baseline, except at a weight of -0.2 for precision at 30.
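The distinction between a reply and a mention drawn in Section 3.4.3 depends only on where the @ sign appears, which can be sketched as follows. The function name and the pattern for a user name tag are assumptions made for illustration.

```python
import re

# Assumed pattern: '@' followed by one or more word characters.
USER_NAME_TAG = re.compile(r"@\w+")


def classify_user_name_tag(text):
    """Return 'reply' if the tweet starts with a user name tag,
    'mention' if a tag appears elsewhere, and 'none' otherwise."""
    stripped = text.lstrip()
    if USER_NAME_TAG.match(stripped):   # tag at the very beginning
        return "reply"
    if USER_NAME_TAG.search(stripped):  # tag anywhere else in the body
        return "mention"
    return "none"
```

Because a reply is a form of mention, a feature that simply tests for the presence of a user name tag would treat both outcomes alike.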
On the other hand, positive weights for the user name feature make the search results worse, except for weights between 0.1 and 0.2 at precision at 10.

3.4.4 Tweet Length Feature

One of the main characteristics that distinguish Twitter from other micro-blog platforms is the limit on the length of what a user can write. Because of this, most authors try to write as much as they can within the 140-character limit. They therefore use many abbreviations, acronyms and phonetic abbreviations [41], which makes indexing these data a challenging process [5]. According to Nagmoti et al. [5], the length of a tweet could be used as a measure of its quality: long tweets could be considered more relevant to a given query than short ones because they contain more information. With this in mind, our experiment takes the actual length of tweets into account. By actual length we mean the number of words a tweet contains after any URLs, user names and hashtags have been removed. Accordingly, long tweets are boosted towards the top of the result list.

Figure 6 compares precision at 1, 10, 20 and 30 when the tweet length feature weight ranges from -0.9 to 2.0. It can be clearly seen that using the tweet length feature generally enhances the returned results. At precision at 1, in most cases there is no change between the baseline and the tweet length feature; the exceptions are weights between -0.9 and -0.7 and between 0.9 and 1.3, which make the system perform worse. It is important to take the number of unjudged tweets into account. For instance, at a tweet length weight of 1.9, one of the twelve tweets (footnote 9) was not judged.
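The actual length defined in Section 3.4.4 can be computed with a short function such as the one below. It is a sketch under assumptions: tokens are split on whitespace, and a token is treated as a URL, user name or hashtag according to its first characters; the function name is hypothetical.

```python
# Prefixes that mark a token as a user name, a hashtag or a URL (assumed rule).
EXCLUDED_PREFIXES = ("@", "#", "http://", "https://")


def actual_length(text):
    """Count the words in a tweet, ignoring URLs, user names and hashtags."""
    return sum(1 for token in text.split() if not token.startswith(EXCLUDED_PREFIXES))
```

For the tweet shown in Table 2, "#bbcsuperbowl think Jake will be lookin tired at 3am?", the hashtag is ignored and the actual length is 8.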
However, it is difficult to rely on precision at 1 alone to judge the effectiveness of the tweet length feature, so it is important to examine its performance at precision at 10, 20 and 30. All three show clearly that positive weights for the tweet length feature did better than negative weights. Moreover, positive weights in general improve the returned results well beyond the baseline. All in all, keeping the number of unjudged tweets in mind, using the tweet length feature to rank tweets does provide some improvement, especially with positive weights.

Footnote 8: In this research we call this feature the user name feature.
Footnote 9: There are 12 tweets here because precision at 1 examines the first tweet of the list returned for each of the 12 queries.

3.4.5 User Tweet Number Feature

The number of tweets that a user creates could be a good indicator of the quality of her or his tweets. Moreover, as mentioned in the discussion of the user name tag feature with regard to measuring user influence, the user tweet number is used in many studies (such as [5] and [40]) to evaluate the authority of tweet authors. Therefore, the more tweets a user has, the higher those tweets are placed in the list of returned results. In addition, the number of tweets a user creates gives us, to a certain degree, evidence about the importance of her or his tweets. If we take, for example, news agencies that have Twitter accounts, such as Reuters (footnote 10), BBC (footnote 11) and CNN (footnote 12), we find that they have thousands of tweets. This suggests that they have more credibility and reliability than other accounts. Therefore, for a given query, if there are two relevant tweets from two different authors and the second author has more tweets than the first, our formula boosts the second author's tweet to a higher position in the list of returned results.

In Figure 7, it can be clearly seen that using the user tweet number feature generally affects the returned results negatively, in particular for weights from 0.4 to 2.0. Although the baseline by itself seems to do better than this feature, we should not ignore the number of tweets that have not been judged. For example, at precision at 30 with a weight of 2.0, 191 tweets were unjudged, which means that, of the 360 tweets examined (30 for each of the 12 queries), nearly half were judged.

3.5 Experimental Results of First Phase

Our experiment covered five features of Twitter: the hashtag feature, URL feature, user name tag feature, tweet length feature and user tweet number feature. The data used in this experiment were gathered by collecting the top 300 tweets returned for each query in the first query set, which consists of 12 queries, each covering a different topic. In all, the training data contain 3,600 tweets.
Comparing the results for all of the features, it seems that some of them, namely the URL feature and the tweet length feature, performed better than the baseline. The remaining features (hashtag, user name tag and user tweet number) in general showed no enhancement over the baseline. We can therefore order the features by their contribution to enhancing the search results: the tweet length feature comes first, followed by the URL feature and then the hashtag feature, with the user tweet number feature and the user name tag feature in last place.
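The precision values compared throughout Section 3.4, together with the count of unjudged tweets, could be computed along the following lines. This is a sketch, not the evaluation code of the experiment: the function name and the data layout are hypothetical, and counting an unjudged tweet as not relevant is an assumption.

```python
def precision_at_k(ranked_ids, judgments, k):
    """Return (precision at k, number of unjudged tweets in the top k).

    ranked_ids: tweet IDs in ranked order for one query.
    judgments:  dict mapping a judged tweet ID to True (relevant) or False.
    Assumption: a tweet missing from judgments counts as not relevant.
    """
    top = ranked_ids[:k]
    relevant = sum(1 for tweet_id in top if judgments.get(tweet_id) is True)
    unjudged = sum(1 for tweet_id in top if tweet_id not in judgments)
    return relevant / k, unjudged
```

Reporting the unjudged count next to each precision value makes it visible when a low precision reflects missing judgments rather than poor ranking, as in the user tweet number results.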
Conclusion
As mentioned before, the characteristics of micro-blogs distinguish them from other social media such as blogs, and even from traditional web pages. These characteristics include their real-time nature, which makes them a good and rich source of news and hot topics, and the size of their content (footnote 13), which is one of the challenges facing researchers. In this research, we took Twitter as a good example of a micro-blog. The research question stated at the beginning was whether using Twitter's features improves the returned results. The features covered are the hashtag feature, URL feature, user name tag feature, tweet length feature and user tweet number feature. To answer the research question, an experiment was conducted. Its aim was to examine each feature individually, to investigate its impact on the returned results, and to obtain the ideal weight for each feature. The dataset was only partly provided by TREC, in that it contains only tweet IDs and the user names of the authors, without the tweet content. The tweet content was then obtained directly from Twitter via the Twitter API. After the dataset had been completely obtained, the baseline was created. It is important to note that this experiment used the set of queries provided by TREC. The finding of the first phase of this experiment is that, comparing the results for all features, the URL feature and the tweet length feature performed better than the baseline, whereas the remaining features (hashtag, user name tag and user tweet number) in general showed no enhancement over the baseline.
4.1 Future Work

The results of this experiment show that there is a lot of room for improvement. One interesting option is to identify more features or to extend some of the features mentioned. Take the URL feature as an example: if a tweet is relevant to a given query and contains a URL, this does not necessarily mean that the URL is relevant too; it would therefore be useful for a system to examine the relevance of that URL to the query. While our research is about ranking the returned results, it would also be beneficial to implement relevance feedback mechanisms, which in turn could be used to improve the returned results.
Footnote 10: http://twitter.com/#!/Reuters
Footnote 11: http://twitter.com/#!/BBCNews
Footnote 12: http://twitter.com/#!/CNN
Footnote 13: The size of the content means, for example, the number of characters in the content of a tweet.
Fig. 3. Hashtag Feature. Precision values (p@1, p@10, p@20 and p@30; vertical axis from 0 to 0.8) plotted against weights for the hashtag feature (horizontal axis from -0.9 to 2).

Fig. 4. URL Feature. Precision values (p@1, p@10, p@20 and p@30; vertical axis from 0 to 0.8) plotted against weights for the URL feature (horizontal axis from -0.9 to 2).

Fig. 5. User Name Tag Feature. Precision values (p@1, p@10, p@20 and p@30; vertical axis from 0 to 0.8) plotted against weights for the user name tag feature (horizontal axis from -0.9 to 2).

Fig. 6. Tweet Length Feature. Precision values (p@1, p@10, p@20 and p@30; vertical axis from 0 to 0.8) plotted against weights for the tweet length feature (horizontal axis from -0.9 to 2).

Fig. 7. User Tweet Number Feature. Precision values (p@1, p@10, p@20 and p@30; vertical axis from 0 to 0.8) plotted against weights for the user tweet number feature (horizontal axis from -0.9 to 2).
References
[1] D. A. Shamma, L. Kennedy, and E. F. Churchill, "Tweet the debates: understanding community annotation of uncollected sources," in Proceedings of the first SIGMM workshop on Social media, Beijing, China, 2009, pp. 3-10.
[2] G. Mishne and M. de Rijke, "A study of blog search," Advances in Information Retrieval, pp. 289-301, 2006.
[3] H. Kwak, C. Lee, H. Park, and S. Moon, "What is Twitter, a social network or a news media?," presented at the Proceedings of the 19th international conference on World wide web, Raleigh, North Carolina, USA, 2010.
[4] J. Teevan, D. Ramage, and M. R. Morris, "# TwitterSearch: a comparison of microblog search and web search," in Proceedings of the fourth ACM international conference on Web search and data mining, Hong Kong, China, 2011, pp. 35-44.
[5] R. Nagmoti, A. Teredesai, and M. De Cock, "Ranking Approaches for Microblog Search," 2010, pp. 153-157.
[6] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to information retrieval. Cambridge University Press, 2009.
[7] M. Ebner and M. Schiefner, "Microblogging-more than fun," in IADIS Mobile Learning Conference, 2008, pp. 155-159.
[8] S. Westman and L. Freund, "Information interaction in 140 characters or less: genres on twitter," presented at the Proceeding of the third symposium on Information interaction in context, New Brunswick, New Jersey, USA, 2010.
[9] J. Weng, E. P. Lim, J. Jiang, and Q. He, "Twitterrank: finding topic-sensitive influential twitterers," in Proceedings of the third ACM international conference on Web search and data mining, New York, New York, USA, 2010, pp. 261-270.
[10] D. Gayo-Avello, "Nepotistic Relationships in Twitter and their Impact on Rank Prestige Algorithms," CoRR, vol. 1004.0816, 2010.
[11] M. Bianchini, M. Gori, and F. Scarselli, "Inside PageRank," ACM Trans. Internet Technol., vol. 5, pp. 92-128, 2005.
[12] J. M. Kleinberg, "Authoritative sources in a hyperlinked environment," J. ACM, vol. 46, pp. 604-632, 1999.
[13] K. Bharat and M. R. Henzinger, "Improved algorithms for topic distillation in a hyperlinked environment," presented at the Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, Melbourne, Australia, 1998.
[14] M. Richardson and P. Domingos, "The intelligent surfer: Probabilistic combination of link and content information in pagerank," Advances in neural information processing systems, vol. 2, pp. 1441-1448, 2002.
[15] J. M. Pujol, R. Sangüesa, and J. Delgado, "Extracting reputation in multi agent systems by means of social network topology," presented at the Proceedings of the first international joint conference on Autonomous agents and multiagent systems: part 1, Bologna, Italy, 2002.
[16] B. Sharifi, M. A. Hutton, and J. Kalita, "Automatic Summarization of Twitter Topics," 2010.
[17] B. Sharifi, M. A. Hutton, and J. K. Kalita, "Experiments in Microblog Summarization," in Social Computing (SocialCom), IEEE Second International Conference on, Minneapolis, MN, 2010, pp. 49-56.
[18] T. Sakaki, M. Okazaki, and Y. Matsuo, "Earthquake shakes Twitter users: realtime event detection by social sensors," in Proceedings of the 19th international conference on World wide web, Raleigh, North Carolina, USA, 2010, pp. 851-860.
[19] D. W. Oard, "The state of the art in text filtering," User Modeling and User-Adapted Interaction, vol. 7, pp. 141-178, 1997.
[20] S. Banerjee, K. Ramanathan, and A. Gupta, "Clustering short texts using wikipedia," in Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, Amsterdam, The Netherlands, 2007, pp. 787-788.
[21] P. Schönhofen, "Identifying document topics using the Wikipedia category network," Web Intelligence and Agent Systems, vol. 7, pp. 195-207, 2009.
[22] D. Bollegala, Y. Matsuo, and M. Ishizuka, "Measuring semantic similarity between words using web search engines," in Proceedings of the 16th international conference on World Wide Web, Banff, Alberta, Canada, 2007, pp. 757-786.
[23] X. H. Phan, L. M. Nguyen, and S. Horiguchi, "Learning to classify short and sparse text & web with hidden topics from large-scale data collections," in Proceeding of the 17th international conference on World Wide Web, Beijing, China, 2008, pp. 91-100.
[24] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," The Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[25] B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu, and M. Demirbas, "Short text classification in twitter to improve information filtering," in Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval, Geneva, Switzerland, 2010, pp. 841-842.
[26] D. Ramage, S. Dumais, and D. Liebling, "Characterizing microblogs with topic models," ICWSM'10, 2010.
[27] J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman, and J. Sperling, "Twitterstand: news in tweets," in Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, Seattle, Washington, 2009, pp. 42-51.
[28] A. Angel, N. Koudas, N. Sarkas, and D. Srivastava, "What's on the grapevine?," in Proceedings of the 35th SIGMOD international conference on Management of data, Providence, Rhode Island, USA, 2009, pp. 1047-1050.
[29] N. Bansal and N. Koudas, "Blogscope: a system for online analysis of high volume text streams," in Proceedings of the 33rd international conference on Very large data bases, Vienna, Austria, 2007, pp. 1410-1413.
[30] D. Shamma, L. Kennedy, and E. Churchill, "Tweetgeist: Can the twitter timeline reveal the structure of broadcast events," CSCW Horizons, 2010.
[31] B. E. Teitler, M. D. Lieberman, D. Panozzo, J. Sankaranarayanan, H. Samet, and J. Sperling, "NewsStand: a new view on news," presented at the Proceedings of the 16th ACM SIGSPATIAL international conference on Advances in geographic information systems, Irvine, California, 2008.
[32] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern classification, 2nd ed. New York; Chichester: Wiley, 2001.
[33] A. Dong, R. Zhang, P. Kolari, J. Bai, F. Diaz, Y. Chang, Z. Zheng, and H. Zha, "Time is of the essence: improving recency ranking using twitter data," in Proceedings of the 19th international conference on World wide web, Raleigh, North Carolina, USA, 2010, pp. 331-340.
[34] A. Dong, Y. Chang, Z. Zheng, G. Mishne, J. Bai, R. Zhang, K. Buchner, C. Liao, and F. Diaz, "Towards recency ranking in web search," in Proceedings of the third ACM international conference on Web search and data mining, New York, New York, USA, 2010, pp. 11-20.
[35] L. Guangxia, S. C. H. Hoi, C. Kuiyu, and R. Jain, "Micro-blogging Sentiment Detection by Collaborative Online Learning," in Data Mining (ICDM), 2010 IEEE 10th International Conference on, 2010, pp. 893-898.
[36] B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up?: sentiment classification using machine learning techniques," presented at the Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10, 2002.
[37] S. Agrawal and T. J. Siddiqui, "Using syntactic and contextual information for sentiment polarity analysis," presented at the Proceedings of the 2nd International Conference on Interaction Sciences: Information Technology, Culture and Human, Seoul, Korea, 2009.
[38] A. Go, R. Bhayani, and L. Huang, "Twitter sentiment classification using distant supervision," CS224N Project Report, Stanford, 2009.
[39] M. Efron, "Hashtag retrieval in a microblogging environment," presented at the Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval, Geneva, Switzerland, 2010.
[40] M. Cha, H. Haddadi, F. Benevenuto, and K. P. Gummadi, "Measuring user influence in twitter: The million follower fallacy," 2010.
[41] A. Ritter, C. Cherry, and B. Dolan, "Unsupervised modeling of Twitter conversations," presented at the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, California, 2010.
[42] W. Weerkamp, S. Carter, and M. Tsagkias, "How People use Twitter in Different Languages," in Proceedings of the ACM WebSci'11, Koblenz, Germany, June 14-17, 2011, pp. 1-2.
[43] E. M. Gold, "Language identification in the limit," Information and Control, vol. 10, pp. 447-474, 1967.
[44] S. Carter, M. Tsagkias, and W. Weerkamp, "Semi-supervised priors for microblog language identification," in Dutch-Belgian Information Retrieval workshop (DIR 2011), 2011.
[45] S. Robertson, "Understanding inverse document frequency: on theoretical arguments for IDF," Journal of Documentation, vol. 60, pp. 503-520, 2004.
Ahmed Sulaiman M Alharbi is a lecturer in the Department of Computer Science and Information, Taibah University, Saudi Arabia. He received his M.Sc. in the field of Information Technology from Monash University, Australia, in 2012. He received his B.Sc. from Taibah University, Saudi Arabia, in 2007.


