JOURNAL OF COMPUTING, VOLUME 4, ISSUE 7, JULY 2012, ISSN (Online) 2151-9617 https://sites.google.com/site/journalofcomputing WWW.JOURNALOFCOMPUTING.ORG

Product Feature Extraction Using Natural Language Processing Techniques


Niloufar Salehi Dastjerdi, Roliana Ibrahim, and Seyed Hamid Ghorashi
Abstract: Nowadays many customers buy products through online services, and many products and services are sold on the Web. On the one hand, a considerable number of people express their opinions about a particular product or service by writing comments on it; on the other hand, users' opinions are collected and fed back to the service or product providers. Both are important ways of increasing the chance of business success and improving service quality. To this end, identifying opinion sentences and extracting the relevant features from review sentences is an important task. This paper applies natural language processing techniques to detect sentences that express an opinion. Subsequently, product features are identified in each opinion sentence by considering appropriate grammatical relations in the review sentence. The review sentences are the input, and the extracted features and opinions are the output. Finally, the results and accuracy of the system are measured with the relevant evaluation metrics.

Index Terms: Customer Reviews, Dependency Relation, NLP (Natural Language Processing), Product Feature.

1 INTRODUCTION

Due to the rapid development of electronic commerce (e-commerce), many customers use online services to buy products, and many products and services are presented through the Web. Producers and online sellers need to enable their customers to express their ideas about the products they have bought, as this is believed to enhance customer satisfaction with the shopping experience. Consequently, a considerable number of people now write reviews of specific subjects on the Web. These reviews are useful for both information publishers and users. Manufacturers collect customer feedback in order to improve their products, and people can decide impartially whether to buy a product by evaluating the opinions of others, which will probably affect their ideas about the product. On the other hand, it is difficult for customers to read all reviews, especially for well-known products, as the number of comments may reach hundreds or even thousands. Thus, an important task of opinion mining is to extract people's opinions on features of an entity [1]. A vital question is then how to extract features from a corpus. In recent years, review mining and text analysis have therefore been active research areas in NLP (Natural Language Processing) for understanding people's opinions. Most text analysis contains a stage of text feature extraction to determine the effectiveness of various sentences in documents.

A review sentence may contain feature words and opinion words. It is therefore necessary to detect valid feature-opinion pairs after determining the feature and opinion words in a comment. Grammatical relations are a useful way to recognize the relationship between feature words and the opinion words of a related feature. The mined relations are then applied to determine convincing feature-opinion pairs. Some existing works have applied association rule mining to extract features, determining frequent item sets that are most likely to be frequent features in the reviews. Since some features are mentioned by only a few reviewers, frequency matching is not a proper method for identifying product features. Other works have applied natural language processing techniques to solve this issue, as in the work of Zhuang et al. on a movie dataset [1]. In movie reviews, related names such as a movie name or names of people may also be features, while in the product domain a feature can be a part of the product, its manufacturer, or an attribute of that product. Therefore, this paper uses techniques from Natural Language Processing and focuses on a product dataset to identify the features in product reviews and achieve higher precision than association rule mining approaches.

Niloufar Salehi Dastjerdi is with the Software Engineering Research Group (SERG), Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, Johor Bahru, Malaysia.
Roliana Ibrahim is with the Software Engineering Research Group (SERG), Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, Johor Bahru, Malaysia.
Seyed Hamid Ghorashi is with the Software Engineering Research Group (SERG), Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, Johor Bahru, Malaysia.
2012 Journal of Computing Press, NY, USA, ISSN 2151-9617 http://sites.google.com/site/journalofcomputing/

2 RELATED WORK

In some approaches based on association rule mining, a technique has been proposed for determining nouns as feature candidates in product reviews [2]. The main idea is that people often use the same words when they express opinions about the same product features. These approaches also use association mining to determine frequent item sets. In the work of Qadir, opinion sentences are identified with certain types of dependency relations. A product feature is detected in several steps: counting frequent words, normalizing the product feature scope with the tf.idf metric, and finally assigning product features [3]. However, frequency matching does not seem to be a proper method for identifying product features. Extra work is needed to find features that are discussed by only a minority of people. Moreover, some words are repeated often in the reviews without being proper features.

Some approaches introduce extraction systems such as OPINE, which decompose review mining with a step that identifies product and opinion features [4]. However, Popescu's system OPINE is not readily available, which makes the technique difficult to use. Other works compare different algorithms [5]: they apply two feature extraction algorithms, the Association Mining Approach and the Likelihood Ratio Test Approach, and analyze how they perform. Another approach to feature extraction applies a dependency grammar graph to determine relations between feature and opinion words in the reviews of a movie dataset (IMDB) [1]. Movie reviews usually contain several sentences with neutral information about the characters, directors, actors or plot of the movie. Although these sentences are not mainly intended to express the author's opinions, they can include several negative and positive terms. As a result, many confusing feature-opinion pairs may exist in these sentences, which causes low precision.

Some works present the concept of phrase dependency parsing [6]. Phrase dependency parsing divides the input sentence into phrases and concentrates on the phrases and the relations between them, instead of on the single words in each phrase. That work defines phrase dependency parsing and then constructs phrase dependency trees.
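The limitation of frequency matching noted above can be illustrated with a small sketch. It is not the association mining algorithm of [2]; it only counts, under assumed toy data, how many reviews mention each noun and keeps the nouns above a support threshold. The function name `frequent_nouns`, the threshold, and the review contents are all hypothetical.

```python
from collections import Counter


def frequent_nouns(reviews, min_support):
    """Return nouns that appear in at least `min_support` of the reviews.

    `reviews` is a list of noun lists (one list per review); `min_support`
    is a fraction between 0 and 1.
    """
    # Count each noun at most once per review.
    counts = Counter(noun for nouns in reviews for noun in set(nouns))
    threshold = min_support * len(reviews)
    return {noun for noun, count in counts.items() if count >= threshold}


# Hypothetical toy data: nouns already extracted from four camera reviews.
reviews = [
    ["battery", "lens"],
    ["battery", "screen"],
    ["battery", "lens"],
    ["lens", "strap"],
]

frequent = frequent_nouns(reviews, min_support=0.5)
missed = {noun for nouns in reviews for noun in nouns} - frequent
print(sorted(frequent))  # nouns kept as candidate features
print(sorted(missed))    # nouns mentioned by too few reviewers
```

In this toy setting any feature mentioned by a single reviewer falls below the threshold and is dropped, which is the behaviour that motivates the grammatical approach used in this paper.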

3 DATA
The dataset employed in this paper consists of comments expressed by customers on a set of products. These review sentences were obtained from Amazon.com and Cnet.com via the URL below and focus on the following electronic products: a DVD player, two digital cameras, an MP3 player and a cell phone.
http://www.cs.uic.edu/liub/FBS/CustomerReviewData.zip

4 THE PROPOSED TECHNIQUE

The first task of our methodology was to convert an annotated dataset into a MySQL database. To this end, each product is considered as a table of the database, and each review is represented as a row of the table. The other preparation steps, Part-of-Speech tagging and simplification, are treated as preprocessing. Generally the system consists of three main steps:
1. Preprocessing
2. Classification of reviews
3. Feature extraction

4.1 Preprocessing
The data for this study are comments produced by users, so they are in unstructured formats. To make them usable for further text mining processes, they are converted into a structured format through two preprocessing steps: part-of-speech tagging (POS tagging) and text simplification. POS tagging identifies the units of a sentence such as nouns, adjectives, verbs, etc. [7]. It finds the role of each word in a sentence in order to prepare data for the next step and to create an appropriate dictionary for the second step (classification of reviews). For text simplification, compound sentences are split into multiple sentences, so that the patterns defined in the third step (feature extraction) match the simplified sentences better. For both POS tagging and text simplification, the libraries and classes of the Stanford Parser were applied (http://nlp.stanford.edu/software/lexparser.shtm). For example, applying text simplification to the review sentence "I am satisfied with good lens but the battery life is short" yields two simple sentences:
1. I am satisfied with good lens
2. the battery life is short

These sentences are better suited to the next step. Fig. 1 shows the preprocessing techniques. The parts labeled S indicate a sentence. In addition, the role of each word in the sentence is identified by POS tagging.

Fig. 1. Grammatical structure of sample sentence
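The text simplification step can be approximated with a minimal sketch. The paper relies on the Stanford Parser for this step; the code below does not use the parser and instead splits at a short, assumed list of clause-level conjunctions, so it only illustrates the input and output of the step. The names `CLAUSE_CONJUNCTIONS` and `simplify` are hypothetical.

```python
import re

# Assumed split points; a real simplifier would rely on the parse tree.
CLAUSE_CONJUNCTIONS = ("but", "whereas")


def simplify(sentence):
    """Split a compound sentence into simple clauses at the conjunctions above."""
    alternatives = "|".join(re.escape(word) for word in CLAUSE_CONJUNCTIONS)
    # Optional comma, whitespace, a whole conjunction word, whitespace.
    splitter = re.compile(r",?\s+(?:" + alternatives + r")\s+", re.IGNORECASE)
    clauses = splitter.split(sentence.strip().rstrip("."))
    return [clause.strip() for clause in clauses if clause.strip()]


clauses = simplify("I am satisfied with good lens but the battery life is short")
for number, clause in enumerate(clauses, start=1):
    print(number, clause)
```

A conjunction such as "and" is deliberately left out of the list, because it frequently joins nouns rather than clauses; deciding that case correctly needs the grammatical structure that a parser provides.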

4.2 Classification of Reviews
Because a massive volume of online text in the form of reviews and comments is available through the World Wide Web, the domain of reviews needs to be limited for further use. In this paper, a dictionary of adjectives is built in order to separate opinion sentences from non-opinion ones. Most opinions are expressed as adjectives, so for each adjective its synonyms are extracted from the WordNet dictionary, and a table of adjectives is created. All the reviews are then compared against these adjectives. If a review sentence contains any of them, it is considered an opinion sentence and is put into the opinion sentence class.

4.3 Feature Extraction
Feature extraction is one of the important steps in most text analysis and identifies the words that make up each document; extracting people's opinions expressed on features of entities is therefore an important task. How to extract features from a textual document is thus a considerable problem [8]. In recent years, opinion mining has been an active research area in NLP. Several models exist for describing natural languages, and most of them aim to capture the semantics and syntax of a language. One that is broadly applied in today's natural language processing is dependency grammar [9]. Dependency grammars represent the structure of a sentence in a textual document as a set of dependency relationships [10]. In this paper, this step creates patterns that give the grammatical relations between the elements (words) in common review sentences and determine the roles of words in a sentence, in order to identify specific words as features or opinions. To find the relationships between words in the review sentences, the Stanford Parser, developed at Stanford, is used.
Among the defined dependency relations, some correspond better with the structures of our opinion sentences, so they can be considered appropriate formats for creating a pattern. As mentioned before, some works such as [1] use grammatical relation patterns to identify features and corresponding opinions in review sentences, but they work on a movie dataset (IMDB) instead of a product dataset. Therefore, there are some differences between our work and [1]. First, a movie feature is different from a feature in the product dataset. A movie feature can be a movie element such as music, or movie-related people such as actors, on whom authors comment. A product feature, however, refers to the physical characteristics, manufacturers, brands, attributes or components of a product, such as size, color, etc. Second, movie reviews contain several sentences with information about the directors, characters or actors of the movie, whereas in product reviews few customers pay attention to issues such as who the vendor or designer of the product is. Moreover, this paper proposes patterns that differ from those defined in [1]. The relations used in this paper are amod, acomp, xcomp and nsubj, and are explained in Table 1.

TABLE 1
DEPENDENCY RELATIONS PATTERN

As is clear from Table 1, the noun words (labeled with N) are considered features, and the adjective words (labeled with J) are opinions in our system. For example, in the sentence "Design looks amazing", there is a notable grammatical relation between "looks" and "amazing": acomp(looks, amazing). Table 2 gives more examples of review sentences for the defined patterns.

TABLE 2
EXAMPLES OF OPINION SENTENCES MATCHING THE PATTERNS

Fig. 2 shows the syntactic tree of grammatical relations and the words labeled with their corresponding roles in a sample sentence. It is produced by GrammarScope (a GUI for the Stanford Dependencies representation), available at http://grammarscope.sourceforge.net/.
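The classification and feature extraction steps of Sections 4.2 and 4.3 can be sketched together as follows. This is a minimal illustration under stated assumptions, not the system evaluated in this paper: the adjective table is a three-word stand-in for the WordNet-based table, the dependency triples are written by hand in the style of Stanford typed dependencies rather than produced by the parser, and the three matching rules are assumed readings of the amod, nsubj, acomp and xcomp relations, not a reproduction of Table 1.

```python
# Hypothetical adjective table for Section 4.2; the paper builds its table
# from review adjectives and their WordNet synonyms.
OPINION_ADJECTIVES = {"good", "short", "amazing"}


def is_opinion_sentence(tokens):
    """A sentence is an opinion sentence if it contains a listed adjective."""
    return any(token.lower() in OPINION_ADJECTIVES for token in tokens)


def extract_pairs(dependencies, tags):
    """Return (feature, opinion) pairs from (relation, governor, dependent) triples.

    `tags` maps each word to a POS tag; nouns start with "N" and
    adjectives with "J". The three rules are illustrative assumptions.
    """
    def is_noun(word):
        return tags.get(word, "").startswith("N")

    def is_adjective(word):
        return tags.get(word, "").startswith("J")

    # Map each governor to its subject so acomp/xcomp can find the feature.
    subjects = {gov: dep for rel, gov, dep in dependencies if rel == "nsubj"}
    pairs = []
    for rel, gov, dep in dependencies:
        if rel == "amod" and is_noun(gov) and is_adjective(dep):
            pairs.append((gov, dep))            # e.g. "good lens"
        elif rel == "nsubj" and is_adjective(gov) and is_noun(dep):
            pairs.append((dep, gov))            # e.g. "life is short"
        elif rel in ("acomp", "xcomp") and is_adjective(dep):
            subject = subjects.get(gov)
            if subject is not None and is_noun(subject):
                pairs.append((subject, dep))    # e.g. "Design looks amazing"
    return pairs


# Hand-written triples in the style of Stanford typed dependencies.
tokens = ["Design", "looks", "amazing"]
dependencies = [("nsubj", "looks", "Design"), ("acomp", "looks", "amazing")]
tags = {"Design": "NN", "looks": "VBZ", "amazing": "JJ"}

pairs = extract_pairs(dependencies, tags) if is_opinion_sentence(tokens) else []
print(pairs)
```

Keying the tags by word is a simplification that breaks when the same word occurs twice in a sentence; indexing words by position, as the parser output does, would avoid this.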

6 CONCLUSION
Although most comments are written to express opinions on a specific subject, some share the same writing format, from which general patterns can be created. This paper has attempted to use NLP techniques, specifically dependency relations, to identify the features and opinions in reviews. According to the results of this work, extracting features based on the defined patterns improved the evaluation metrics of precision and recall. It can therefore be concluded that our system, based on Natural Language Processing techniques and tools, extracted more correct features with fewer irrelevant results than the other approaches.

Fig. 2. Sample syntactic tree of acomp grammatical relation.

5 EVALUATING THE ACCURACY OF THE SYSTEM


Precision, recall, and the F-measure are evaluation metrics used to assess the accuracy of a system. A high precision indicates that most of the features returned by the system have been predicted correctly, although some features may not have been identified. A high recall shows that fewer features are missing from the results, although there might be some irrelevant features among them. Table 3 shows the recall and precision of our system. As is clear from Table 3, precision and recall increase for all the products in comparison with the association rule mining results from [2]. These results indicate that most of the reviews matched our defined patterns well, and that these patterns return features and opinions correctly most of the time.
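For reference, the three metrics can be computed as in the sketch below, using their standard definitions: precision is the share of extracted features that are correct, recall is the share of correct features that were extracted, and the F-measure is their harmonic mean. The feature sets are hypothetical toy values and are not results from this paper.

```python
def precision_recall_f1(extracted, gold):
    """Compute precision, recall and F-measure for a set of extracted features."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: features returned by a system vs. annotated features.
extracted = {"lens", "battery life", "price", "thing"}
gold = {"lens", "battery life", "price", "screen", "zoom"}

precision, recall, f1 = precision_recall_f1(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case one returned word ("thing") is irrelevant, which lowers precision, and two annotated features are missed, which lowers recall.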

TABLE 3
RECALL AND PRECISION OF EXTRACTING FEATURES WITH ASSOCIATION RULE MINING AND NLP TECHNIQUES

REFERENCES
[1] L. Zhuang, F. Jing, and X. Zhu, "Movie review mining and summarization," Proceedings of the 15th ACM international conference on Information and knowledge management - CIKM '06, 2006.
[2] M. Hu and B. Liu, "Mining and summarizing customer reviews," Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '04, 2004.
[3] A. Qadir, "Detecting Opinion Sentences Specific To Product Features in Customer Reviews using Typed Dependency Relations," Text, 2009.
[4] A. Popescu and O. Etzioni, "Extracting product features and opinions from reviews," Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing - HLT '05, 2005.
[5] L. Ferreira, N. Jakob, and I. Gurevych, "A Comparative Study of Feature Extraction Algorithms in Customer Reviews," 2008 IEEE International Conference on Semantic Computing, 2008.
[6] Y. Wu, Q. Zhang, X. Huang, and L. Wu, "Phrase Dependency Parsing for Opinion Mining," Language, 2009.
[7] K. Buss, "Literature Review on Preprocessing for Text Mining," Text, 2008.
[8] I. Conference and E. Bej, "Computational Linguistics and Intelligent Text Processing," Word Journal Of The International Linguistic Association, 2011.
[9] E. Zmecnkov and P. Horcek, FORMAL MODELS IN NATURAL LANGUAGE PROCESSING, 2009.
[10] G. Somprasertsri, "Mining Feature-Opinion in Online Customer Reviews for Opinion Summarization," Computer, vol. 16, 2010.


Niloufar Salehi Dastjerdi received her B.Sc. degree in Software Computer Engineering from Azad University of Najafabad, Iran (IAUN) in 2007, and her M.Sc. degree in Computer Science from Universiti Teknologi Malaysia (UTM) in 2012. Her research interests include Natural Language Processing (NLP), Opinion Mining, Text Mining and Data Mining.

Roliana Ibrahim is a senior lecturer at the Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia. She received her PhD in System Science from Loughborough University and is currently a member of the Software Engineering Research Group. Her research interests are the adoption of system thinking methodologies and ontology as innovative solutions for complex systems integration and development, data warehousing and data mining.

Seyed Hamid Ghorashi is a PhD candidate in computer science at Universiti Teknologi Malaysia (UTM). His research interests are data mining, text mining, and opinion mining. He received his bachelor's degree in computer hardware from the University of Kashan in 2006 and his master's degree in computer science from Universiti Teknologi Malaysia in 2012.
