Twitter has attracted millions of users to share and disseminate the most up-to-date information,
resulting in large volumes of data produced every day. However, many applications in
Information Retrieval (IR) and Natural Language Processing (NLP) suffer severely from the
noisy and short nature of tweets. In this paper, we propose HybridSeg, a novel framework for
batch-mode tweet segmentation. Splitting tweets into meaningful
segments preserves their semantic or context information and makes it easy for
downstream applications to extract. HybridSeg finds the optimal segmentation of a tweet by
maximizing the sum of the stickiness scores of its candidate segments. The stickiness score
considers the probability of a segment being a phrase in English (i.e., global context) and the
probability of a segment being a phrase within the batch of tweets (i.e., local context). For the
latter, we propose and evaluate two models to derive local context by considering the
linguistic features and term-dependency in a batch of tweets, respectively. HybridSeg is also
designed to iteratively learn from confident segments as pseudo feedback. Experiments on
two tweet data sets show that tweet segmentation quality is significantly improved by
learning both global and local contexts compared with using global context alone. Through
analysis and comparison, we show that local linguistic features are more reliable than
term-dependency for learning local context. As an application, we show that high
accuracy is achieved in named entity recognition by applying segment-based part-of-speech
(POS) tagging.
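
The objective described above, choosing the segmentation that maximizes the sum of the stickiness scores of its segments, can be illustrated with a small dynamic program. The following is only a minimal sketch under stated assumptions, not the HybridSeg implementation: the function names, the `max_len` cap on segment length, the toy probability tables, the fixed score for single words, and the linear mix of global and local context with weight `lam` are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Segment = Tuple[str, ...]

# Toy probability tables (made up for illustration only).
GLOBAL_PROB: Dict[Segment, float] = {("new", "york"): 0.9}
LOCAL_PROB: Dict[Segment, float] = {("new", "york"): 0.8}


def toy_stickiness(seg: Segment, lam: float = 0.5) -> float:
    """Hypothetical stickiness: a linear mix of global and local context.

    The mixing rule and the flat 0.1 score for single words are
    assumptions of this sketch, not taken from the abstract.
    """
    if len(seg) == 1:
        return 0.1
    return (1 - lam) * GLOBAL_PROB.get(seg, 0.0) + lam * LOCAL_PROB.get(seg, 0.0)


def segment(
    tokens: List[str],
    stickiness: Callable[[Segment], float] = toy_stickiness,
    max_len: int = 3,
) -> List[Segment]:
    """Return the segmentation with the highest total stickiness."""
    n = len(tokens)
    best = [float("-inf")] * (n + 1)  # best[i]: best score for tokens[:i]
    back = [0] * (n + 1)              # back[i]: start of the last segment
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start] + stickiness(tuple(tokens[start:end]))
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end to recover the segments.
    segments: List[Segment] = []
    end = n
    while end > 0:
        start = back[end]
        segments.append(tuple(tokens[start:end]))
        end = start
    segments.reverse()
    return segments


result = segment(["i", "love", "new", "york"])
```

With these toy tables, the two-word segment scores higher than its words taken separately, so the dynamic program keeps "new york" together and leaves the remaining words as single-word segments. Swapping in a different `stickiness` function changes the scores but not the search procedure.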
LIST OF FIGURES: