TOWARDS EFFICIENT FRAMEWORK FOR SEMANTIC QUERY SEARCH ENGINE

IN LARGE-SCALE DATA COLLECTION

ABSTRACT

Short texts are different from long documents, they have unique characteristics which
make difficult to understand and handle. Billions of short texts are generated everyday, in the
form of queries, tweets, tags, messenger conversations, social network posts etc. Most of the
generated queries contain less than 5 words. These texts, do not always observe the syntax of a
written language. Hence, traditional NLP methods do not always apply to short texts. Many
applications, including search engines, ads, online advertising etc. rely on short texts.
Understanding short texts retrieval, classification and processing become more difficult task.

In the existing system, a Deep Neural Network (DNN) with 3-autoencoders mechanism is
used to enrich short texts semantically based on concepts and co-occurring terms in the probase
knowledge system. The goal of probase is to make machines to better understand human
communications through texts. The semantic hashing method converts the text into a compact
binary codes for better mapping. Auto-encoder based DNN model is able to capture the abstract
features and complex correlations from the input text such that the learned compact binary codes
can be used to represent the meaning of that text. However, it performs poorly in large-scale data
collection i.e. low accuracy, predictive error rate, classification of text, bit time consuming in text
processing.

In the present system, Deep Neural Network (DNN) is replaced with Recurrent Neural
Network (RNN) to improve the accuracy, prediction rate and proccesing time in large-scale data
collection. The RNN allows to score arbitrary sentences based on how likely they are to occur in
the real world. This gives a measure of grammatical and semantic correctness. The RNN are
associated with stacked de-noising autoencoders enables to maintain the input and output units
are same. It can be predicted that improvement will be observed through the improved accuracy
and predicting rate over large-scale data.