
Konferensi Nasional Ilmu Komputer (KONIK) 2021 P-ISSN : 2338-2899

E-ISSN: 2807-1271

Big Mart Sales Prediction Exploratory with the Concepts of Clustering
Dian Tri Wiyanti¹, Isnaini Rosyid²
¹,²Department of Mathematics, Faculty of Mathematics and Natural Science, Universitas Negeri Semarang
Corresponding Author: diantriwiyanti@mail.unnes.ac.id

Abstract: Big marts record product sales data, together with various dependent and independent factors, as an important step toward forecasting future demand and managing inventory. Big Mart tries to understand the properties of products and outlets that play an important role in increasing sales. An increase in sales cannot be separated from the relationship with customers. Customers can be grouped into different categories so that marketing people can use targeted marketing and retain customers; this is commonly known as Customer Relationship Management (CRM). Managing a successful CRM implementation requires an integrated and balanced approach to technology, processes, and people. Cluster analysis is widely used in many applications such as market research, pattern recognition, and data analysis, as it can help marketers find distinct clusters in their customer base.

Keywords: big mart, CRM, clustering.

I. INTRODUCTION

Customer Relationship Management (CRM) is a company's approach to understanding and influencing customer behavior through good communication in order to improve customer acquisition, customer retention, customer loyalty, and customer profitability [1]. Customers can be clustered into different categories for which marketing people can employ targeted marketing and retain the customers [2]. Rules may therefore be generated to increase business performance. At each customer touchpoint, the organization reinforces the value delivery while simplifying how customers interact and relate. To maximize customer lifetime value, the sales representative strives to build a relationship and efficiently generate revenue from high-potential prospects [3]. CRM is an active, participatory, and interactive relationship between business and customer, with the objective of achieving a comprehensive view of customers and being able to consistently anticipate and react to their needs with targeted and effective activities at every customer touchpoint [4]. Managing a successful CRM implementation requires an integrated and balanced approach to technology, process, and people. Big Mart records data related to product sales, with its various dependent or independent factors, as an important step to assist in future demand prediction and inventory management. The dataset is built from data collected through customers, as well as data related to inventory management in the data warehouse, and is then refined to get accurate predictions [5]. Data mining is an iterative and interactive process for finding a new pattern or model that is valid, usable, and understandable in a very large database; it involves searching for patterns or trends in a large database to assist future decision making [4]. There is a natural fit between data mining and CRM in that data mining techniques, when applied properly to the right data, can be a powerful tool for formulating and implementing a good CRM strategy [6]. The CRM method often used to describe a set of market information is currently subject to changing trends, and CRM applications are supported by data from the data warehouse [7]. Data mining methods and applications can be used for decision-making in CRM in the areas of customer value and customer experience [4].

In this paper, sales data from Big Mart, a one-stop shopping center, are used to build a predictive model and predict the sales of each product at a particular outlet. With the model obtained, Big Mart will try to understand the properties of products and outlets that play an important role in increasing sales. The data may have missing values, as some stores may not report all data due to technical glitches; these must therefore be treated accordingly.

The predictive power comes from a unique design that combines the process of extracting and discovering patterns in large data sets, involving methods at the intersection of machine learning and statistics, while organizations get help in using their current reporting capabilities to discover and identify the hidden patterns in databases [2]. Clustering is a type of unsupervised learning where the goal is to partition a set of objects into groups called clusters; these groups can be mutually exclusive or may overlap, depending on the approach used [8]. Cluster analysis is widely used in many applications such as market research, pattern recognition, and data analysis, as it can help marketers find distinct groups in their customer base. In addition, they can characterize their customer groups based on buying patterns. As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data and to observe the characteristics of each cluster [9]. Data mining is a powerful technique to help companies find patterns and trends in their customer preferences; it is also a well-known tool for CRM [10].

The main data mining process uses data exploration technology to extract data, create predictive models using decision trees, and test and verify the stability and effectiveness of the model. The K-means method groups customers based on billing, loyalty, and

Tri Wiyanti, et al.
IJCCSISSN
payment behavior to create a decision tree-based model. Determining the number of clusters k in a data set with limited prior knowledge of the appropriate value is a general problem that is distinct from solving the data clustering problem itself [11][12]. X-Means is a data mining algorithm that extends K-Means and is stated to be able to cover the shortcomings of K-Means [13]. We use the K-Means and X-Means algorithms for predictive analysis of the Big Mart case. X-means clustering is used to overcome one of the main weaknesses of K-means clustering, namely that it requires prior knowledge of the number of clusters (K) [14]. We use the two algorithms to show the difference in output and how the results can provide input for decision-makers.

II. METHODS

A. Algorithms and Research Data

This study uses the K-Means and X-Means algorithms, which are data mining algorithms used to cluster data. The research data come from data scientists at Big Mart, who collected sales data for 1,559 products in 10 stores in various cities. In addition, certain attributes of each product and store have been defined. Given these facts, the goal is to build a predictive model and predict the sales of each product in certain outlets. To evaluate the performance of the prediction and clustering under test, we use a collection of sales data from Kaggle, the online machine learning repository and data scientist community. The dataset contains a total of 8,522 samples with randomly selected training and testing data. A few of the training and testing data can be seen in Table 1 and Table 2.

A structured approach to data analytics is needed to discover useful information from a collection of data. In this research, it runs from basic preprocessing up to results. The specific stages are described as follows:

1) Basic preprocessing
At this stage, the data set is loaded and some basic preprocessing tasks are performed. This stage provides all labeled and unlabeled data points to which the model should be applied later, and normalizes the data, keeping the normalization model so that the data can be changed later.
2) Feature engineering & modeling
Performs selection using multi-purpose optimization and information preservation concepts, and performs the actual grouping on the changed data.
3) Visualization
Creates a visualization of the cluster model that shows which data points belong to which cluster.
4) Process results

B. K-Means

The K-means algorithm is a simple iterative clustering algorithm. Using distance as the metric and given K classes in the data set, it calculates the mean distance and gives the initial centroid, with each class described by its centroid. For a given data set X containing n multidimensional data points and K categories into which it is to be divided, the Euclidean distance is selected as the similarity index, and the clustering objective is to minimize the within-class sum of squares; that is, it minimizes [15]

$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mathcal{U}_k \rVert^2 \qquad (1)$$

where $K$ is the number of cluster centers, $C_k$ is the set of points assigned to the kth cluster, $\mathcal{U}_k$ is the kth center, and $x_i$ is the ith point in the data set. The solution for the centroid $\mathcal{U}_k$ is as follows:

$$\frac{\partial J}{\partial \mathcal{U}_k} = -2 \sum_{x_i \in C_k} (x_i - \mathcal{U}_k) \qquad (2)$$

Setting Equation (2) to zero gives $\mathcal{U}_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$; that is, each centroid is the mean of the points assigned to its cluster.

The central idea of the algorithm's implementation is to randomly draw K sample points from the sample set as the initial cluster centers, assign each sample point to the cluster represented by the nearest center, and then take the mean of all sample points in each cluster as that cluster's new center. These steps are repeated until the cluster centers no longer change or the set number of iterations is reached. The results of the algorithm change with the choice of the initial centers, which makes the results unstable. The determination of the centers depends on the choice of the K value, which is the focus of the algorithm; it directly affects the clustering results, such as whether a local or a global optimum is reached [16].

C. X-Means

X-Means is a clustering algorithm that determines the correct number of centroids based on a heuristic. It begins with a minimum set of centroids and then iteratively checks whether using more centroids makes sense according to the data. Whether a cluster is split into two sub-clusters is determined by the Bayesian Information Criterion (BIC), which balances the trade-off between precision and model complexity. In essence, the algorithm starts with K equal to the lower bound of the given range and continues to add centroids where they are needed until the upper bound is reached. During this process, the centroid set that achieves the best score is recorded, and this is the one that is finally output. The algorithm consists of the following two operations, repeated until completion:

1) The improve-params operation: consists of running
conventional K-means to convergence.

TABLE 1
A SAMPLE OF THE TRAINING DATA USED

| Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FDA15 | 9.3 | Low Fat | 0.016047301 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| DRC01 | 5.92 | Regular | 0.019278216 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| FDN15 | 17.5 | Low Fat | 0.016760075 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| FDX07 | 19.2 | Regular | 0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | | Tier 3 | Grocery Store | 732.38 |
| NCD19 | 8.93 | Low Fat | 0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |
| FDP36 | 10.395 | Regular | 0 | Baking Goods | 51.4008 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 556.6088 |
| FDO10 | 13.65 | Regular | 0.012741089 | Snack Foods | 57.6588 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 343.5528 |
| FDP10 | | Low Fat | 0.127469857 | Snack Foods | 107.7622 | OUT027 | 1985 | Medium | Tier 3 | Supermarket Type3 | 4022.7636 |

TABLE 2
A SAMPLE OF THE TESTING DATA USED

| Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FDW58 | 20.75 | Low Fat | 0.007564836 | Snack Foods | 107.8622 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | FDW58 |
| FDW14 | 8.3 | reg | 0.038427677 | Dairy | 87.3198 | OUT017 | 2007 | | Tier 2 | Supermarket Type1 | FDW14 |
| NCN55 | 14.6 | Low Fat | 0.099574908 | Others | 241.7538 | OUT010 | 1998 | | Tier 3 | Grocery Store | NCN55 |
| FDQ58 | 7.315 | Low Fat | 0.015388393 | Snack Foods | 155.034 | OUT017 | 2007 | | Tier 2 | Supermarket Type1 | FDQ58 |
| FDY38 | | Regular | 0.118599314 | Dairy | 234.23 | OUT027 | 1985 | Medium | Tier 3 | Supermarket Type3 | FDY38 |
| FDH56 | 9.8 | Regular | 0.063817206 | Fruits and Vegetables | 117.1492 | OUT046 | 1997 | Small | Tier 1 | Supermarket Type1 | FDH56 |
| FDL48 | 19.35 | Regular | 0.082601537 | Baking Goods | 50.1034 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | FDL48 |
| FDC48 | | Low Fat | 0.015782495 | Baking Goods | 81.0592 | OUT027 | 1985 | Medium | Tier 3 | Supermarket Type3 | FDC48 |
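
Tables 1 and 2 contain missing Item_Weight and Outlet_Size entries and inconsistent labels such as "reg" and "Regular". The paper does not state how these were treated, so the following is only a minimal sketch of the basic preprocessing stage, assuming mean imputation and z-score normalization. The function names and the label map are hypothetical and are not taken from the study.

```python
from statistics import mean, pstdev

# Hypothetical label map: Table 2 shows "reg" where Table 1 uses "Regular".
FAT_LABELS = {"reg": "Regular"}


def clean_fat_label(label):
    """Map a variant spelling onto one canonical label; leave others as-is."""
    stripped = label.strip()
    return FAT_LABELS.get(stripped.lower(), stripped)


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def zscore(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Item_Weight column of Table 1; FDP10 has no recorded weight.
item_weight = [9.3, 5.92, 17.5, 19.2, 8.93, 10.395, 13.65, None]
filled = impute_mean(item_weight)
normalized = zscore(filled)
```

Normalizing before clustering keeps attributes with large ranges, such as Item_Outlet_Sales, from dominating the Euclidean distance over attributes with small ranges, such as Item_Weight.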

2) The improve-structure operation: finds out if and where new centroids should appear. This is achieved by letting some centroids split in two. Two obvious splitting strategies are first described and dismissed, after which their strengths are combined and their weaknesses avoided in the X-means strategy.
3) If $K > K_{max}$, stop and report the best-scoring model found during the search. Otherwise, go to 1 [17].

III. RESULTS AND DISCUSSION

This stage consists of an evaluation of the patterns, to identify useful patterns representing knowledge on which appropriate actions can be based, and knowledge presentation, to present the mined knowledge to decision-makers.

Figure 1. K-Means scatter plot.

Figure 1 shows that K-Means gives 2 clusters, namely clusters 0 and 1. In cluster 0 (blue dots), item_outlet_sales is on average 45.59% smaller, item_MRP is on average 33.83% smaller, and item_weight is on average 2.62% smaller. In cluster 1 (red dots), item_outlet_sales is on average 67.42% larger, item_MRP is on average 50.02% larger, and item_weight is on average 3.87% larger. From Figure 1 we can see that the two most important attributes for future prediction decisions are item_outlet_sales and item_MRP.

Figure 2. X-Means scatter plot.

Meanwhile, Figure 2 shows the result of X-Means, which gives 4 clusters, namely clusters 0 (blue dots), 1 (dark green dots), 2 (light green dots), and 3 (red dots). In cluster 0, item_outlet_sales is on average 41.70% smaller, outlet_establishment_year is on average 38.71% larger, and item_MRP is on average 32.77% smaller. In cluster 1, item_outlet_sales is on average 44.27% smaller, outlet_establishment_year is on average 91.44% smaller, and item_MRP is on average 18.44% smaller. In cluster 2, item_outlet_sales is on average 49.31% smaller, outlet_establishment_year is on average 34.18% larger, and item_MRP is on average 47.11% larger. Finally, in cluster 3, item_outlet_sales is on average 114.70% larger, outlet_establishment_year is on average 96.30% smaller, and item_MRP is on average 34.07% larger. Figure 2 also indicates that the two most important attributes for future prediction decisions are item_outlet_sales and outlet_establishment_year.

For the final result, the centroid tables of K-Means and X-Means can be seen in Tables 3 and 4. The item_outlet_sales attribute is the sales of the product in the particular store, and item_MRP is the maximum retail price (list price) of the product. The item_weight attribute is the weight of the product, and outlet_establishment_year is the year in which the store was established. The summary statistics obtained are shown in Table 5.

IV. CONCLUSION

The main objective of this paper is to convey the use of clustering in a CRM system, an effective technique for customer clustering that can produce useful information. Thus, Big Mart will try to understand the properties of products and outlets that play an important role in increasing sales, using sales data for 1,559 products in 10 stores in various cities. In the implementation, the clustering results show that not all attributes affect high sales results. Several important attributes need to be considered by decision-makers, as well as how certain outlet locations record the highest sales; other shopping locations need to follow the same pattern to increase sales.

The overall goal of the data mining process is to extract information from a large data set and convert it into an understandable form for further use. Clustering is important in data analysis and data mining applications; it groups a set of objects so that objects in the same group (cluster) are more similar to each other than to those in other groups.

An interesting direction for future work is the use of data mining classification techniques in CRM systems, in order not only to analyze customer behavior but also to predict it. In addition, it would be interesting to integrate clustering and classification algorithms in business intelligence systems to make them easier for marketing and sales teams to use.

TABLE 3
K-MEANS CENTROID TABLE

| Cluster | Item_Fat_Content | Item_MRP | Item_Outlet_Sales | Item_Type | Item_Weight | Outlet_Identifier | Outlet_Location_Type | Outlet_Size | Outlet_Type |
|---|---|---|---|---|---|---|---|---|---|
| Cluster 0 | 1 | 103.889 | 1202.132 | 13 | 12.640 | 3 | 2 | 1 | 1 |
| Cluster 1 | 1 | 195.880 | 3629.646 | 6 | 13.179 | 5 | 1 | 1 | 1 |
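
Centroids such as those in Table 3 result from the iterative procedure of Section II.B. The study does not list its implementation, so what follows is a minimal standard-library sketch of plain K-means on small numeric tuples. The function name `kmeans` and its parameters (`init`, `max_iter`, `seed`) are illustrative assumptions.

```python
import math
import random


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length numeric tuples.

    `init` optionally fixes the starting centers; otherwise K sample
    points are drawn at random, as described in Section II.B.
    """
    rng = random.Random(seed)
    centers = list(init) if init is not None else rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centers[j]  # an empty cluster keeps its old center
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:  # centers unchanged: converged
            break
        centers = new_centers
    return centers, clusters


# Two well-separated toy groups (illustrative data, not from the Big Mart set).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers, clusters = kmeans(points, 2, init=[(0, 0), (10, 10)])
```

The starting centers are fixed here because, as noted in Section II.B, the result depends on the initial choice. On this toy data, starting from (0, 1) and (1, 0) instead leaves the algorithm in a poor local optimum that mixes the two groups.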

TABLE 4
X-MEANS CENTROID TABLE

| Cluster | Item_Fat_Content | Item_MRP | Item_Outlet_Sales | Item_Type | Item_Weight | Outlet_Establishment_Year | Outlet_Identifier | Outlet_Location_Type | Outlet_Size | Outlet_Type |
|---|---|---|---|---|---|---|---|---|---|---|
| Cluster 0 | 1 | 105.049 | 1285.568 | 6 | 12.194 | 2002.799 | 7 | 1 | 3 | 1 |
| Cluster 1 | 1 | 120.774 | 1230.366 | 13 | 12.965 | 1986.099 | 1 | 2 | 0 | 1 |
| Cluster 2 | 1 | 192.684 | 3240.694 | 13 | 13.718 | 2002.218 | 9 | 0 | 1 | 1 |
| Cluster 3 | 1 | 178.381 | 4645.302 | 6 | 12.917 | 1985.475 | 5 | 2 | 1 | 3 |
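
The number of clusters in Table 4 comes from X-Means deciding, cluster by cluster, whether a split improves the BIC (Section II.C). The paper does not give the score it used. The sketch below is one common reading of the BIC in [17], assuming spherical Gaussian clusters that share a single variance; the names `bic` and `should_split` are hypothetical.

```python
import math


def bic(clusters):
    """BIC of a clustering under a spherical Gaussian model with shared variance.

    `clusters` is a list of clusters, each a list of equal-length numeric
    tuples. Assumes more points than clusters and a non-zero spread.
    """
    k = len(clusters)
    r = sum(len(c) for c in clusters)  # total number of points
    d = len(clusters[0][0])            # dimensionality
    centroids = [tuple(sum(x) / len(c) for x in zip(*c)) for c in clusters]
    sq_err = sum(
        math.dist(p, mu) ** 2 for c, mu in zip(clusters, centroids) for p in c
    )
    variance = sq_err / (r - k)        # pooled variance estimate
    log_lik = 0.0
    for c in clusters:
        rn = len(c)
        log_lik += (
            -rn / 2 * math.log(2 * math.pi)
            - rn * d / 2 * math.log(variance)
            - (rn - k) / 2
            + rn * math.log(rn / r)
        )
    n_params = (k - 1) + k * d + 1     # class priors, centroid coordinates, variance
    return log_lik - n_params / 2 * math.log(r)


def should_split(parent, child_a, child_b):
    """Keep a split only if two centroids score a higher BIC than one."""
    return bic([child_a, child_b]) > bic([parent])


# Illustrative toy data, not taken from the Big Mart set.
blob_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
blob_b = [(10, 10), (10, 11), (11, 10), (11, 11)]
split_far_groups = should_split(blob_a + blob_b, blob_a, blob_b)
split_one_group = should_split(blob_a, blob_a[:2], blob_a[2:])
```

The penalty term grows with the number of parameters, so splitting a single compact group is rejected while separating two distant groups is accepted. This is the trade-off between precision and model complexity that the BIC is meant to balance.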

TABLE 5
DATA TESTING

| Value | item_outlet_sales | item_MRP | item_weight | outlet_establishment_year |
|---|---|---|---|---|
| Minimum | 33.290 | 31.290 | 4.555 | 1985 |
| Maximum | 13086.965 | 266.888 | 21.350 | 2009 |
| Average | 2181.455 | 141.000 | 12.857 | 1997.832 |
| Standard Deviation | 1706.531 | 62.275 | 4.644 | 8.372 |
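
The four statistics in Table 5 can be computed per attribute with the Python standard library. This is a small sketch only: the helper name `summarize` is hypothetical, and the use of the sample standard deviation is an assumption, since the paper does not say which estimator was used.

```python
from statistics import mean, stdev


def summarize(values):
    """Return the four statistics reported per attribute in Table 5."""
    return {
        "minimum": min(values),
        "maximum": max(values),
        "average": mean(values),
        "standard_deviation": stdev(values),  # sample standard deviation (assumed)
    }


# Item_MRP values of the eight training rows shown in Table 1.
item_mrp = [249.8092, 48.2692, 141.618, 182.095, 53.8614, 51.4008, 57.6588, 107.7622]
summary = summarize(item_mrp)
```

Because only eight rows are used here, the resulting figures describe the sample in Table 1 and are not expected to match the full-data values in Table 5.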

REFERENCES

[1] R. Swift, Accelerating Customer Relationship Using CRM and Relationship Technologies. New York: Prentice Hall Inc., 2001.
[2] I. Enesi, L. Liço, A. Biberaj, and D. Shahu, “Analysing Clustering Algorithms Performance in CRM Systems,” in Proceedings of the 23rd International Conference on Enterprise Information Systems (ICEIS 2021), vol. 1, no. 1, pp. 803–809, 2021.
[3] C. Fisher, “New Technologies for Mobile Salesforce Management and CRM,” Am. J. Ind. Bus. Manag., vol. 7, no. 4, pp. 548–558, 2017.
[4] A. Dwiastuti, A. Larasati, and E. Prahastuti, “The implementation of Customer Relationship Management (CRM) on textile supply chain using k-means clustering in data mining,” MATEC Web Conf., vol. 204, 2018.
[5] N. Malik and K. Singh, “Sales Prediction Model for Big Mart,” Parichay Maharaja Surajmal Inst. J. Appl. Res., vol. 3, no. 1, pp. 22–32, 2020.
[6] R. S. Winer, “A framework for customer relationship management,” Calif. Manage. Rev., vol. 43, no. 4, pp. 89–105, 2001.
[7] A. Khan, N. Ehsan, E. Mirza, and S. Z. Sarwar, “Integration between Customer Relationship Management (CRM) and Data Warehousing,” Procedia Technol., vol. 1, pp. 239–249, 2012.
[8] A. Cornuéjols, C. Wemmert, P. Gançarski, and Y. Bennani, “Collaborative clustering: Why, when, what and how,” Inf. Fusion, vol. 39, pp. 81–95, 2018.
[9] Tutorials Point, “Data Mining - Cluster Analysis,” https://www.tutorialspoint.com/data_mining/dm_cluster_analysis.htm.
[10] H. I. Arumawadu, R. M. K. T. Rathnayaka, and S. K. Illangarathne, “Mining Profitability of Telecommunication Customers Using K-Means Clustering,” J. Data Anal. Inf. Process., vol. 03, no. 03, pp. 63–71, 2015.
[11] R. M. K. T. Rathnayaka, “Cross-Cultural Dimensions of Business Communication: Evidence from Sri Lanka,” Int. Rev. Manag. Bus. Res., vol. 3, no. 3, pp. 1579–1588, 2014.
[12] R. M. K. T. Rathnayaka, D. M. K. Seneviratna, and W. Jianguo, “Grey system based novel approach for stock market forecasting,” Grey Syst. Theory Appl., vol. 5, no. 2, pp. 178–193, 2015.
[13] A. Radwan et al., “X-means clustering for wireless sensor networks,” J. Robot. Netw. Artif. Life, vol. 7, no. 2, pp. 111–115, 2020.
[14] M. Mughnyanti, S. Efendi, and M. Zarlis, “Analysis of determining centroid clustering x-means algorithm with davies-bouldin index evaluation,” IOP Conf. Ser. Mater. Sci. Eng., vol. 725, no. 1, 2020.
[15] Q. Wang, C. Wang, Z. Feng, and J. Ye, “Review of K-means clustering algorithm,” Electron. Des. Eng., vol. 20, pp. 21–24, 2012.
[16] R. R. Rathod and R. D. Garg, “Design of electricity tariff plans using gap statistic for K-means clustering based on consumers monthly electricity consumption data,” Int. J. Energy Sect. Manag., vol. 11, no. 2, pp. 295–310, 2017.
[17] D. Pelleg and A. W. Moore, “X-means: Extending k-means with efficient estimation of the number of clusters,” in ICML, pp. 727–734, 2000.

