Article · August 2012


Advances in Applied Information Science

Usage of cluster analysis in consumer behavior research


PAVEL TURCINEK, JIRI STASTNY, ARNOST MOTYCKA
Department of Informatics
Mendel University in Brno
Zemedelska 1, 61300 Brno
CZECH REPUBLIC
pavel.turcinek@mendelu.cz, jiri.stastny@mendelu.cz, mot@mendelu.cz

Abstract: - This article presents a case study on the application of clustering methods in data
mining research on consumer behavior in the food market. Data obtained from a questionnaire
survey by the Institute of Marketing and Trade of the Faculty of Business and Economics of Mendel
University in Brno are processed with different types of cluster analysis algorithms to find market
segments. The aim of the study is to identify the possibilities of these methods in this area and to
describe their suitability or unsuitability for solving such problems.

Key-Words: - Cluster analysis, Data mining, Consumer behavior, Marketing Research, Application of methods
of knowledge discovery in marketing, Data processing.

1 Introduction
The issue of consumer behavior falls into the field of marketing. The topic of consumer behavior
includes knowledge and understanding of how consumers think, feel, evaluate, and choose among
different alternatives, how they are influenced by their environment, how they act during decision
making and purchasing, how they are limited by their knowledge and ability to process information,
what motivates them, and how their decision making differs depending on the importance of, or
their interest in, the product [1].

Perceptions of information and communication technologies are gradually changing from something
rather unique, bringing a competitive advantage in the market, to a necessity that determines the
existence or absence of a business organization among its competitors [2]. Nowadays marketing
cannot be imagined without the involvement of information technology. The development of the
Internet has greatly simplified data collection, and so the volume of data usable in the search for
information has increased manifold.

For research on consumer behavior, data can be obtained from multiple sources. Secondary research
typically uses data from national and international sources, such as the Czech Statistical Office or
Eurostat, which provide data in electronic form that are easily accessible via the Web. Primary
research most often uses data from surveys in which consumers respond to specific questions.

This article addresses the possibilities of finding specific groups of consumers in the food market
of the Czech Republic. It is based on primary data taken from research of the Institute of
Marketing and Trade at the Faculty of Business and Economics of Mendel University in Brno [3].

2 Consumer behavior
Consumer behavior is a multidisciplinary topic, as no single discipline is able to provide a
comprehensive view. Experimental psychology focuses on analyzing the role of the product in the
processes of perception, learning, and remembering. Clinical psychology examines the role of the
product in psychological adjustment, microeconomics looks at the role of the product in the
allocation of individual and family resources, social psychology analyzes the role of the product
in the behavior of the customer as a member of social groups, and sociology asks what role the
product plays in social institutions and group relations [1].

2.1 Collection and processing of data
The collection and processing of data focus on applying several methods whose logical sequence
allows a complex analysis of consumer behavior or decision making, and hence the making of
predictions and qualified

ISBN: 978-1-61804-113-5 172



estimations of future development. These methods are the collection of secondary data from national
and international sources, the collection of primary data through marketing research, and data
processing by applying selected statistical methods [4]. The data usually need to be prepared for
analysis [5].

3 Cluster analysis
Cluster analysis is a multidimensional statistical method used to classify objects. The basic
problem of cluster analysis is to classify objects into groups (clusters) so that two objects in
the same cluster are more similar than two objects from different clusters [6].

To illustrate how difficult it can be to determine what constitutes a cluster, Figure 1 shows
twenty points and three different ways of dividing them into groups. The shapes of the markers
indicate cluster membership. Figures 1 (b) and 1 (d) divide the data into two and six parts,
respectively. However, the apparent division of each of the two larger clusters into three
subclusters may simply be an artifact of the human visual system. In some cases it may also be
reasonable to classify the points into four groups, as shown in Figure 1 (c). Figure 1 thus shows
that the definition of a cluster is not clear-cut. The optimal division depends on the nature of
the data and the desired results [7].

Fig. 1. Different ways of clustering identical points [7].
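
The requirement that objects in the same cluster be more similar than objects from different
clusters presupposes some way of measuring similarity. One common choice for ratings on a numeric
scale is the Euclidean distance. The short Python sketch below illustrates the idea; the three
respondents, their ratings, and the three criteria are hypothetical and are not taken from the
survey data, and the study does not state which distance measure was used.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equally long rating vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical respondents: importance ratings (1-10) for three criteria.
# The values are invented for illustration and do not come from the survey.
price_focused_a = [9, 3, 2]
price_focused_b = [10, 2, 3]
quality_focused = [2, 9, 8]

# Two respondents with similar priorities should be close to each other ...
within = euclidean_distance(price_focused_a, price_focused_b)
# ... and far from a respondent with different priorities.
between = euclidean_distance(price_focused_a, quality_focused)

print(f"within-group distance:  {within:.3f}")
print(f"between-group distance: {between:.3f}")
```

A clustering in which the first two respondents share a cluster and the third does not satisfies
the definition above, because the within-group distance is smaller than the between-group distance.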

The first problem is to determine the similarity of two objects. For similarity to be measured,
each object must be characterized by its properties [6]. Properties of objects can be divided into
several categories. Han and Kamber [8] list these types of variables:
• Interval-scaled variables
• Binary variables
• Categorical variables
• Ordinal variables
• Ratio-scaled variables
How to determine the similarity or dissimilarity of individual criteria is described in detail in
[8, 9] and elsewhere.

3.1 Types of methods of cluster analysis
A great many algorithms exist for creating clusters. It is, however, difficult to divide them
clearly into categories, as some of them may fall into more than one category. Han and Kamber [8]
offer the following breakdown:

Partitioning methods divide 𝑛 objects into 𝑘 groups, where 𝑘 ≤ 𝑛. These methods must satisfy two
requirements:
a) each group must contain at least one object,
b) each object must belong to exactly one group.
First, the method creates an initial division of the objects into 𝑘 groups. The algorithm then
iteratively relocates objects between groups to improve the division. It can be terminated after a
certain number of iterations or when objects no longer move between groups.

Hierarchical methods create a hierarchical decomposition of the objects. Depending on how this
decomposition is carried out, hierarchical methods are divided into agglomerative (bottom-up),
where at the beginning each object forms its own class and the most similar classes are then
merged into


one class, and divisive (top-down), where all objects start in one class, which is subdivided until
the desired level of division is reached.

Density-based methods find clusters as regions with a high density of objects in the data space,
separated by regions with a low density of objects. These methods can find clusters of different
shapes and can also deal with noise and outliers in the data.

Grid-based methods transform the object space into a finite number of cells that form a grid
structure. All clustering operations are performed on the grid structure. The main advantage of
this approach is its processing speed, which is usually independent of the number of data objects
and depends only on the number of cells in each dimension.

Model-based methods try to optimize the fit between the dataset and some mathematical model, which
means that they try to find the clusters that best correspond to that model.

Methods for clustering high-dimensional data transform or select attributes to reduce the number of
dimensions while preserving the relevant distances between objects [8].

Individual examples of algorithms are presented and described in [6, 8, 10, 11, 12, 13].

4 Data source
The data file on which the knowledge discovery is performed was acquired in a survey of the
Department of Marketing and Trade of the Faculty of Business and Economics of Mendel University in
Brno. The marketing research focused on the behavior of consumers in the food market in the Czech
Republic. The questionnaire contained thirty items related to the research questions (low price,
product composition, ...), which respondents rated on a scale from 1 to 10, where a value of 10
meant that the criterion has the highest importance to the interviewee. Another eight questions
characterized the respondent (age, sex, educational level, etc.) [3]. By applying data mining
methods to this kind of data, interesting patterns in customer behavior can be identified [14].

5 Finding the clusters
Several algorithms were used to find clusters: the K-means algorithm, Expectation-Maximization,
DBSCAN, and an algorithm for hierarchical clustering. Because the number of clusters in the dataset
was not known, we first focused on methods that do not require this information. The thirty items
related to consumer behavior were selected as input criteria.

The DBSCAN method was tested first. With the default setting (𝑀𝑖𝑛𝑃𝑡𝑠 = 6, 𝜀 = 0.9), no cluster was
found; all objects were identified as noise. The parameters were therefore modified. The following
table shows the results for selected parameter values.

Table 1. Results of the DBSCAN method.

No.  𝑀𝑖𝑛𝑃𝑡𝑠  𝜀     Number of clusters (the frequency of individual clusters)          Number of objects identified as noise
1    6       0.9   0                                                                  2020
2    4       3.9   7 (90, 4, 7, 11, 4, 4, 4)                                          1896
3    4       5.9   1 (2020)                                                           0
4    4       4.9   1 (2020)                                                           0
5    5       4.9   1 (2020)                                                           0
6    7       4.9   1 (2020)                                                           0
7    7       4.3   3 (1068, 7, 7)                                                     938
8    5       4.0   3 (90, 7, 11)                                                      1912
9    5       4.1   6 (212, 5, 5, 5, 5, 5)                                             1783
10   4       4.1   8 (223, 20, 7, 5, 5, 4, 4, 5)                                      1747
11   4       4.2   10 (643, 4, 4, 4, 6, 4, 4, 4, 4, 4)                                1339
12   3       4.2   20 (725, 3, 3, 6, 3, 4, 3, 3, 3, 4, 5, 4, 7, 4, 5, 3, 5, 3, 3, 3)  1221
13   3       4.0   9 (107, 3, 8, 11, 3, 3, 3, 4, 4)                                   1874
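
The sensitivity to 𝑀𝑖𝑛𝑃𝑡𝑠 and 𝜀 seen in Table 1 can be reproduced on a toy example. The following is
a minimal Python sketch of the density-based idea behind DBSCAN, not the Weka implementation used
in the study. The function names, the nine two-dimensional points, and the parameter values are
hypothetical and chosen only for illustration. A point is treated as a core point when at least
min_pts points, including itself, lie within distance eps.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # already assigned or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE             # may later become a border point
            continue
        labels[i] = cluster               # i is a core point: start a cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster       # border point: joins, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors) # j is a core point: keep growing
        cluster += 1
    return labels

# Hypothetical toy data: two dense groups and one isolated point.
points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (5, 15)]

too_small = dbscan(points, eps=0.5, min_pts=3)   # every point is noise
moderate = dbscan(points, eps=1.5, min_pts=3)    # two clusters, one noise point
too_large = dbscan(points, eps=20.0, min_pts=3)  # everything in one cluster
print(too_small, moderate, too_large, sep="\n")
```

On this toy data a very small eps marks every point as noise, a very large eps merges everything
into one cluster, and only an intermediate value separates the two groups. This is the same pattern
as in rows 1 and 3 to 6 of Table 1.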


The table does not list all the tested values, but all other attempts to set the parameters so that
the method would produce clusters of comparable size failed as well. As the table shows, applying
the DBSCAN method to the survey data usually yields one large cluster and several others of
negligible size. For this reason, the method can be considered unsuitable for such data.

Hierarchical clustering was chosen as the second option. This method did not lead to the desired
division either. When it was applied, almost always a single object was separated from the rest.
For this reason, the hierarchical clustering method is unsuitable for our purpose.

Another method tested was Expectation-Maximization. To obtain the outputs of this method, the value
of the minimum standard deviation (𝑚𝑖𝑛𝑆𝑡𝑑𝐷𝑒𝑣) and the maximum number of iterations were gradually
adjusted. Table 2 shows the results.

As the table shows, the influence of the number of iterations on the result is small. Apart from
one case, in which the number of elements in individual clusters differed by at most two, the
number of iterations had no effect. A further increase in the number of iterations might cause
major changes, but the running time of the algorithm would then be too long.

Table 2. Results of EM method.

No.  𝑚𝑖𝑛𝑆𝑡𝑑𝐷𝑒𝑣  Number of iterations  Number of clusters (the frequency of individual clusters)
1    0.000001   100                   10 (166, 251, 234, 267, 176, 195, 138, 274, 156, 163)
2    0.000001   200                   10 (166, 251, 234, 267, 176, 195, 138, 274, 156, 163)
3    0.00001    100                   10 (166, 251, 234, 267, 176, 195, 138, 274, 156, 163)
4    0.0001     100                   10 (166, 251, 234, 267, 176, 195, 138, 274, 156, 163)
5    0.0001     500                   10 (166, 251, 234, 267, 176, 195, 138, 274, 156, 163)
6    0.001      100                   10 (166, 251, 234, 267, 176, 195, 138, 274, 156, 163)
7    0.005      1000                  10 (166, 251, 234, 267, 176, 195, 138, 274, 156, 163)
8    0.01       100                   7 (472, 342, 237, 189, 323, 282, 175)
9    0.1        100                   7 (472, 342, 237, 189, 323, 282, 175)
10   0.5        100                   8 (308, 305, 151, 288, 290, 256, 249, 173)
11   1          100                   10 (178, 244, 193, 356, 99, 165, 132, 293, 206, 154)
12   1          1000                  10 (176, 244, 191, 358, 101, 164, 133, 292, 206, 155)
13   10         100                   11 (192, 211, 72, 234, 108, 159, 162, 228, 137, 162, 355)
14   100        100                   11 (192, 211, 72, 234, 108, 159, 162, 228, 137, 162, 355)
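
The role of the two parameters varied in Table 2 can be sketched in a one-dimensional setting. The
Python code below is a simplified illustration of Expectation-Maximization for a mixture of normal
distributions, not the Weka implementation used in the study. It assumes that the minimum standard
deviation acts as a lower bound on the standard deviation of each cluster, and it fixes the number
of clusters in advance, whereas in Table 2 the number of clusters varies between runs. The function
names, the data, and the starting means are hypothetical.

```python
import math

def normal_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def em_1d(data, initial_means, min_std_dev=1e-6, max_iterations=100):
    """Fit a 1-D Gaussian mixture; returns (means, stds, weights)."""
    k = len(initial_means)
    means = list(initial_means)
    overall_mean = sum(data) / len(data)
    spread = math.sqrt(sum((x - overall_mean) ** 2 for x in data) / len(data))
    stds = [max(spread, min_std_dev)] * k
    weights = [1.0 / k] * k
    for _ in range(max_iterations):
        # E-step: probability that each object belongs to each cluster.
        resp = []
        for x in data:
            dens = [weights[c] * normal_pdf(x, means[c], stds[c]) for c in range(k)]
            total = sum(dens) or 1e-300   # guard against numerical underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate the parameters from the weighted objects.
        for c in range(k):
            n_c = sum(r[c] for r in resp)
            if n_c == 0:
                continue                  # empty cluster: keep old parameters
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / n_c
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / n_c
            stds[c] = max(math.sqrt(var), min_std_dev)  # assumed floor
            weights[c] = n_c / len(data)
    return means, stds, weights

# Hypothetical data: two well separated groups of five values each.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 9.0, 9.2, 8.8, 9.1, 8.9]

means, stds, weights = em_1d(data, initial_means=[0.8, 9.2])
floored_means, floored_stds, _ = em_1d(data, [0.8, 9.2], min_std_dev=0.5)
print(means, stds, weights)
print(floored_means, floored_stds)
```

With the small default floor the estimated standard deviations follow the data, while with a floor
of 0.5 both are held at the floor. A floor that is large relative to the spread of the data changes
how sharply objects are assigned to clusters, which is one plausible reason why the results in
Table 2 change once 𝑚𝑖𝑛𝑆𝑡𝑑𝐷𝑒𝑣 becomes large enough.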

At the beginning of the experiments, it seemed that changing the minimum standard deviation did not
change the number or composition of the clusters either. A change occurred between the values 0.005
and 0.01, where the number of clusters decreased from ten to seven. A further increase of this
parameter then increased the number of clusters again.

This method of identifying clusters has provided results that at first glance appear to be usable.
The output is a small number of clusters, each with an acceptable number of objects.

The 𝑘-means algorithm was used as the last method. This procedure requires the number of clusters
to be known in advance. The following table presents the results for different numbers of clusters.


Table 3. Results of 𝑘-means method.


No. Number of clusters Frequency of individual clusters
1 3 737, 604, 679
2 4 493, 448, 584, 495
3 5 349, 344, 468, 435, 424
4 6 334, 298, 318, 382, 376, 312
5 7 281, 212, 275, 289, 406, 283, 274
6 8 216, 213, 319, 284, 306, 245, 206, 231
7 9 171, 192, 320, 193, 279, 251, 198, 196, 220
8 10 158, 172, 282, 211, 265, 207, 178, 190, 208, 149
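
The 𝑘-means procedure is an instance of the partitioning methods described in Section 3.1: objects
are repeatedly reassigned to the nearest cluster center until no object moves. The Python sketch
below shows this loop in its simplest form. It is not the Weka implementation used in the study;
the function name, the six two-dimensional rating vectors, and the choice of starting centers are
hypothetical and serve only as an illustration.

```python
import math

def kmeans(points, initial_centroids, max_iterations=100):
    """Basic k-means; returns (cluster index per point, final centroids)."""
    centroids = [list(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each object goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break                         # no object moved: stop
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical ratings (1-10) of two criteria by six respondents.
ratings = [(9, 2), (8, 3), (10, 1), (2, 9), (3, 8), (1, 10)]

# The number of clusters (here 2) must be supplied by the user.
assignment, centroids = kmeans(ratings, initial_centroids=[ratings[0], ratings[3]])
sizes = [assignment.count(c) for c in range(len(centroids))]
print(assignment, centroids, sizes)
```

The list sizes corresponds to the kind of cluster frequencies reported in Table 3. As in the study,
the algorithm returns a division for any requested number of clusters, so the choice of that number
has to be justified separately.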

With this method we obtained clusters of comparable size. However, it is difficult to determine
which number of clusters is closest to reality. This is a task for an expert in the field, who is
able to assess whether the clusters are meaningful. The same applies to the previous method.

6 Conclusion
Weka software was chosen as the tool for data analysis. Weka (Waikato Environment for Knowledge
Analysis) is a machine learning tool written in Java and developed at the University of Waikato,
New Zealand. It is freely available under the GNU General Public License. Weka is a collection of
machine learning algorithms designed for data mining tasks. The algorithms can be applied directly
to a data file or called from one's own Java code. Weka contains tools for preprocessing,
classification, regression, clustering, association rules, and visualization. It is also suitable
for developing new machine learning schemes [15].

This case study was performed to obtain information about the behavior of consumers in the food
market using the methods of cluster analysis. The study dealt with the issue primarily in terms of
the suitability of the selected methods for this type of data.

The study concluded that applying cluster analysis to data with this number of attributes is
possible, but not all types of methods are suitable for this purpose. However, the results need to
be consulted with an expert in consumer behavior to determine whether they are relevant. For this
type of data it would be useful to test other approaches, such as the creation of association
rules.

In the research on consumer behavior in the food market, the data were analyzed with the following
cluster analysis methods: DBSCAN, HierarchicalClusterer, Expectation-Maximization, and K-means. The
analysis shows that for the given dataset only the EM and K-means methods are suitable, as they
create usable (reasonable) clusters from the input data. These methods can be used to advantage in
a preparatory phase before the application of methods for classifying data according to set
parameters [16, 17].

References:
[1] Solomon, M. R. Consumer Behavior: Buying, Having, and Being. Pearson Prentice Hall, Saddle
River, 2004, 621 p. ISBN 0-13-123011-5.
[2] Chalupová, N., Motyčka, A. Situation and trends in trade-supporting information technologies.
Acta Universitatis Agriculturae et Silviculturae Mendelianae Brunensis, 2008, vol. LVI, no. 6,
pp. 25-36. ISSN 1211-8516.
[3] Turčínková, J., Kalábová, J. Preferences of Moravian consumers when buying food. Acta
Universitatis Agriculturae et Silviculturae Mendelianae Brunensis, 2011, vol. LIX, no. 2,
pp. 371-376.
[4] Turčínková, J., Stávková, J., Stejskal, L. Chování a rozhodování spotřebitele. 2007, 102 p.
ISBN 978-80-7392-013-5.
[5] Munk, M., Kapusta, J., Švec, P., Turčáni, M. Data Advance Preparation Factors Affecting
Results of Sequence Rule Analysis in Web Log Mining. E & M Ekonomie a Management, 2010, vol. 13,
no. 4, pp. 143-160. ISSN 1212-3609.
[6] Řezanková, H., Húsek, D., Snášel, V. Shluková analýza dat. Professional Publishing, Praha,
2007, 1st edition, 196 p. ISBN 978-80-86946-26-9.
[7] Tan, P.-N., Steinbach, M., Kumar, V. Introduction to Data Mining. Addison-Wesley, 2006,
769 p. ISBN 9780321321367.


[8] Han, J., Kamber, M. Data Mining: Concepts and Techniques. Elsevier, Amsterdam, 2006, 2nd
edition, 770 p. ISBN 1-55860-901-6.
[9] Grabmeier, J., Rudolph, A. Techniques of Cluster Algorithms in Data Mining. Data Mining and
Knowledge Discovery, 2002, vol. 6, no. 4, pp. 303-360. ISSN 1573-756X.
[10] Kaufman, L., Rousseeuw, P. J. Finding Groups in Data: An Introduction to Cluster Analysis.
John Wiley & Sons, Inc., New Jersey, 2005, 2nd edition, 342 p. ISBN 0-471-73578-7.
[11] Romesburg, H. C. Cluster Analysis for Researchers. Lulu Press, North Carolina, 2004, 334 p.
ISBN 1-4116-0617-5.
[12] Everitt, B. S., Landau, S., Leese, M. Cluster Analysis. Arnold, London, 2001, 4th edition,
237 p. ISBN 978-0-340-76119-9.
[13] Ester, M., Kriegel, H., Sander, J., Xu, X. A density-based algorithm for discovering clusters
in large databases with noise. Proceedings of the 2nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, AAAI Press, Portland, 1996, pp. 226-231.
[14] Munk, M., Drlík, M. Influence of Different Session Timeouts Thresholds on Results of Sequence
Rule Analysis in Educational Data Mining. Communications in Computer and Information Science,
Springer, 2011, vol. 166, pp. 60-74. ISSN 1865-0929.
[15] Weka. Weka 3 - Data Mining with Open Source Machine Learning Software in Java. [on-line].
HTML Document. 2011. [cit. 2012-06-16]. http://www.cs.waikato.ac.nz/ml/weka/
[16] Turčínek, P., Šťastný, J., Motyčka, A. Usage of Data Mining Techniques on Marketing Research
Data. In Proceedings of the 11th WSEAS International Conference on Applied Computer and
Computational Science (ACACOS '12). Rovaniemi, Finland, WSEAS Press, 2012, pp. 159-164.
ISBN 978-1-61804-084-8.
[17] Šťastný, J., Turčínek, P., Motyčka, A. Using Neural Networks for Marketing Research Data
Classification. In International WSEAS Conference on Mathematical Methods and Techniques in
Engineering & Environmental Science. Catania, Italy, WSEAS Press, 2011, pp. 252-256.
ISBN 978-1-61804-046-6.
