
CHAPTER ONE

INTRODUCTION

1.1 Background of the study


The rise in commercial competition, as well as the availability of massive historical
data repositories, has driven the widespread use of data mining techniques to find
valuable and strategic information buried in company databases. Data mining is the
process of extracting useful information from a dataset and presenting it in a
human-friendly format for decision-making purposes (Tekin et al., 2018). Data mining
approaches draw on statistics, artificial intelligence, machine learning, and
database systems. Bioinformatics, weather forecasting, fraud detection,
financial research, and customer segmentation are just a few of the applications of
data mining.
Customer segmentation is the subdivision of a company's customer base into
segments, each of which comprises customers with similar market
characteristics. This segmentation is based on factors that might directly or
indirectly influence the market or business, such as product preferences or
expectations, location, and behavior. The benefits of customer segmentation
include the ability of a business to customize marketing programs to suit
each of its customer segments; business decision support in risky
situations such as credit relationships with its customers; identification of products
associated with each segment and how to manage the forces of demand and supply;
the uncovering of hidden relationships and associations among customers, among
products, or between customers and products that the firm may not be
aware of; the capacity to predict customer defection and which customers are most
likely to defect; and the ability to raise more market research questions while also
providing guidance toward finding solutions.
Clustering is an iterative approach for extracting knowledge from large amounts of
unstructured data. Clustering is an exploratory data mining technique used in a
variety of applications, including machine learning, classification, and pattern
recognition. Clustering has been shown to be effective for detecting subtle yet
tactical patterns or relationships hidden within a repository of unlabeled datasets.
Unsupervised learning, also known as unsupervised machine learning, is a method
of evaluating and clustering unlabeled information with the use of machine
learning algorithms. These algorithms identify hidden patterns or data groupings
without the need for human intervention. The k-Means algorithm and the
Self-Organizing Map (SOM) are examples of
clustering algorithms. These algorithms are capable of detecting clusters in a
dataset without prior knowledge of the dataset by comparing input patterns
repeatedly until stable clusters in the training examples are achieved depending on
the clustering criterion or criteria. Each cluster comprises data points that are
similar to one another yet dissimilar from data points in other clusters. Pattern
recognition, image analysis, bioinformatics, and other fields benefit greatly from
clustering.
The K-means algorithm is an iterative algorithm that attempts to split a dataset into
K separate non-overlapping subgroups (clusters), such that each data point belongs
to only one cluster. It attempts to make the data points within a cluster as similar
as possible while keeping the clusters as distinct (far apart) as possible. It assigns
data points to clusters in such a way that the sum of the squared distances between
them and the cluster's centroid (the arithmetic mean of all the data points in that
cluster) is as small as possible. The less variance there is within a cluster, the more
homogeneous (similar) its data points are. In this study, the k-means clustering
algorithm is applied to customer segmentation by splitting the records of a mall
customer database based on spending behavior.

1.2 Statement of the Problem


The benefits of customer segmentation analysis are clear. By having a stronger
understanding of their consumer base, retailers can properly allocate resources to
collect and mine relevant information to boost profits. However, getting to the
point of performing high-level customer segmentation analysis is more difficult
than originally thought for many retailers. Many retailers may have the rights to
the necessary data to perform the analysis, but do not have either the ability to
access it in a user-friendly manner or have an employee that has the skill set to
work with it. The lack of proper personnel or equipment to handle the necessary
volume of data is perhaps the biggest hindrance to smaller firms being able to
perform such analysis. The popularity of open source programming software such
as R or Python has certainly helped make this type of analysis more accessible, but
it still would require retailers having someone on their team who can code in either
of those languages. Additionally, some retailers are simply unaware of either the
extent of their data collection or are not yet inspired to dig into it. Nevertheless,
retailers that have not fully adopted customer segmentation analysis are likely not
doing so simply because they cannot afford to spend the time, money, or labor to
perform the analysis. Therefore, it is an aim of this paper to show that this rich
analysis can be performed cheaply and efficiently.

However, there is a far subtler but still consequential reason why retailers do not
implement customer segmentation analysis: it is too complicated to understand.

When compared to traditional demographic segmentation or Recency, Frequency
and Monetary (RFM) analysis, high-level customer segmentation analysis requires
far more precise knowledge of machine learning and the mathematics that describe
how the algorithms work. In addition, traditional marketing analysts are not
equipped with the math or programming skills necessary to successfully implement
customer segmentation analysis with machine learning methods (Ozan, 2018), while
programmers and data analysts are not well-suited to handle marketing tasks. This
poses another conundrum as it involves transforming a typical marketing
assignment—segmenting customers based on purchasing behaviors— into a purely
programming one, which means the marketing team does not have the skills to
code it up themselves but the programming team does not have the marketing skills
to interpret the results. As a result, a blended role with business, programming, and
marketing experience is required. In today's workplaces, this position is known as a
data scientist or an information specialist.

In summary, customer segmentation analysis is the process of trying to understand
a consumer base by splitting it up into segments. While traditional analysts found
some success with demographic or RFM analysis, these models simply do not have
the technological capabilities to provide rich insight into more specific details
regarding the customers. On the other hand, customer segmentation analysis that is
combined with machine learning methods has the ability to transform the way a
retailer thinks about their data. As such, retailers are trying to find cheap, easy
ways to implement and communicate how clustering can be used to segment their
customers.

1.3 Aim and objectives of the study
The aim of this project is to perform customer segmentation on a sample mall
customers dataset and segment the data into homogeneous groups based on
spending habits.

The specific objectives are as follows:

i. To study the working of the current system.
ii. To design a system to perform customer segmentation using Unified
Modeling Language (UML) diagrams and techniques.
iii. To implement a computerized system for customer segmentation using the
K-means algorithm and the Python programming language.
iv. To evaluate the system to identify the appropriate segmentation type.

1.4 Scope of the project


There are several ways in which profit can be improved using customer
segmentation. This study is limited to using the k-means clustering algorithm to
segment a mall customers dataset into customer types based on their annual
income and their spending habits.

1.5 Significance of the project


In addition to the scientific contribution, commercial businesses will find the
following benefits:
• This study contributes to the understanding of customer characteristics and
their relationship to customer loyalty, satisfaction, and profitability.
• The investigation is supported by customer data and k-means clustering
techniques, which gives the opportunity to contribute to the field of
Customer Relationship Management (CRM) analysis by applying general
knowledge in a specific environment.
• Business organizations can use the results to describe the levels of profitable
customers and to develop effective strategies.
• It also helps to better understand the structure of profitable relationships and
to realize the implications for better managing a customer's profitable bond
with the company.

1.6 Definition of terms


Customer Segmentation: Customer segmentation is the practice of dividing a
company’s customers into groups that reflect similarity among customers in each
group. The goal of segmenting customers is to decide how to relate to customers in
each segment in order to maximize the value of each customer to the business.

Data mining: Data mining is the process of finding anomalies, patterns and
correlations within large data sets to predict outcomes. Using a broad range of
techniques, you can use this information to increase revenues, cut costs, improve
customer relationships, reduce risks and more.
Databases: A database is an organized collection of structured information, or
data, typically stored electronically in a computer system. A database is usually
controlled by a database management system (DBMS).

Clustering algorithm: Clustering is a Machine Learning technique that involves
the grouping of data points. Given a set of data points, we can use a clustering
algorithm to classify each data point into a specific group.

Machine learning: Machine learning (ML) is the subset of artificial intelligence
(AI) that focuses on building systems that learn or improve performance based on
the data they consume.
RFM (recency, frequency, monetary) Analysis: RFM analysis is a marketing
approach that ranks and groups customers numerically based on the recency,
frequency, and monetary amount of their recent transactions in order to find the
best customers and launch focused marketing campaigns.
Customer Relationship Management: Customer relationship management
(CRM) is the combination of practices, strategies and technologies that companies
use to manage and analyze customer interactions and data throughout the customer
lifecycle. The goal is to improve customer service relationships and assist in
customer retention and drive sales growth.

CHAPTER TWO

LITERATURE REVIEW

2.1 Overview of Cluster Algorithm


Clustering can be defined as the procedure of splitting data into groups. The main
objective is that instances/elements in each group have significantly more
similarity between them than with those outside the group. The subsets/groups
should be meaningful with respect to a specific similarity measure. Once a specific
number of significantly distinct and dissimilar groups has been produced in
the feature set, clustering techniques are effectively used to obtain summaries and
visualize data (Jain et al., 1999). There have been breakthrough applications of
clustering methods in everyday problems involving customer segmentation,
gene expression data, document grouping and many more examples (Shaw et al.,
2001; Chang, 2009; Liu et al., 2008; Liang, 2010). Overall, clustering techniques
are useful in the following main ways:

• Summarization - derivation of a miniaturized representation of the full data
set
• Discovery - finding and identifying new insights into the structure
of a dataset

There are other numerous uses such as investigation of the validity of pre-existing
group assignments and as a precursor to prediction by either regression or

classification. Clustering is categorized as an unsupervised learning type of
machine learning, where the machine receives inputs but no desired targets
(outputs) or rewards from the surroundings. Usually, the objective is establishing
patterns in the data above and beyond what would be considered noise.

2.2 Cluster Analysis in Market Segmentation


For marketing researchers, cluster analysis has become a standard technique. The
technique is used by both academic and marketing applications researchers to
create empirical groupings of people, products, or events that can be used as a
starting point for additional analysis. Despite the widespread use of clustering
algorithms, little is known about their features or how they should be used.
Numerous authors in the marketing literature have failed to state which
clustering approach is being utilized, indicating a general lack of awareness of
clustering methodology. Another example is some authors' failure to recognize
that certain techniques differ only in name.

All segmentation research, regardless of the method used, is designed to
identify groups of entities that share certain common characteristics (attitudes,
purchase propensities, media habits etc.). Without the specific data used to arrive at
these and the detailed layout of the scope and objectives of the research,
segmentation is equivalent to a grouping exercise. Clustering techniques have also
had an essential role to play in seeking improved
comprehension of buyer behaviors by establishing homogeneous classes of
consumers. Over the years, clustering techniques have been used across a wide
array of industries to segment an organization's customers. Brito et al. (2015)
examined two separate techniques for customer segmentation: subgroup
discovery and clustering. The models obtained produced six market segments and
forty-nine rules that provided an improved comprehension of customer preferences
in a highly customized fashion manufacturing organization.

Jansen (2007) performs segmentation and subsequent profiling of Vodafone
customers based on call usage behavior. He utilizes several progressive clustering
techniques that are adapted and applied for customer segment creation. An
optimality criterion is defined to measure the performance of each, and the best
clustering technique is used to perform customer segmentation. A description of
each segment is provided and then analyzed. Finally, the Support Vector
Machines (SVM) algorithm is employed to estimate the segment into which
a customer will fall, using the customer's profile. With the SVM
approach, it is possible to classify a customer's segment correctly from its profile
in 80.3% of the cases for the four-segment scenario. A classification accuracy of
78.5% is achieved for six distinct segments.

Ansari and Riasi (2016) used a combination of a genetic algorithm and fuzzy
c-means techniques to segment steel market customers. The customers were
grouped into two segments by using the LRFM (length, recency, frequency,
monetary value) variables model. From the results, customers in the first segment
had a greater trade recency, relationship length, and trade frequency.
However, their monetary value was lower in comparison to the mean values for
these parameters across the customer base.

2.3 Classification of Clustering


Clustering techniques are commonly divided into the following broad categories:

• Hierarchical clustering
• Partitioning clustering
• Density-based clustering

However, this classification is neither straightforward nor entirely canonical. The
classes overlap in reality (Singh, 2010).

2.3.1 Hierarchical methods


This method constructs a hierarchy of clusters by allowing
clusters to have their own sub-clusters, forming a systematic sequence of clusters.
Each leaf in the sequence, also known as a tree, represents a data instance. This is
the smallest possible group. The node at the root, on the other hand, represents the
group that contains every data object. This is the biggest cluster possible. Every
internal node within the sequence is a group whose components are all the objects
in its child nodes (the union of the sub-clusters). Cutting the tree at a given
level makes it possible to extract a collection of non-overlapping clusters.

Partitioning takes place sequentially. In the end, this process could cluster all the
instances into one group or into n groups of one instance each. A two-dimensional
diagram (a dendrogram) is used to illustrate hierarchical clustering by showing the
divisions or fusions formed at each successive level of the clustering process.
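
The bottom-up (fusion) direction described above can be sketched in a few lines. This is a minimal illustration, not the method used in this study: it assumes one-dimensional data, single-linkage distance, and hypothetical values, and the helper names `single_linkage_distance` and `agglomerate` are invented for the example.

```python
from itertools import combinations


def single_linkage_distance(a, b, points):
    """Smallest pairwise distance between members of clusters a and b."""
    return min(abs(points[i] - points[j]) for i in a for j in b)


def agglomerate(points):
    """Merge the two closest clusters until one remains; return every level."""
    clusters = [(i,) for i in range(len(points))]  # start: n groups of one instance
    history = [list(clusters)]
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_linkage_distance(pair[0], pair[1], points),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
        history.append(list(clusters))
    return history


# Hypothetical one-dimensional values (for example, spending scores).
scores = [5, 7, 40, 42, 90]
levels = agglomerate(scores)
for level in levels:
    print(level)
```

Each entry of `levels` is one level of the hierarchy, holding tuples of indices into `scores`: the first has n singleton clusters and the last has a single cluster containing every instance. Picking any intermediate level gives a set of non-overlapping clusters.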

Hierarchical methods are advantageous in that they provide built-in flexibility
regarding the level of granularity and easily handle any type of similarity or
distance. They are also applicable to any attribute type, be it numeric or
categorical. However, they tend to be vague when it comes to the termination
criteria, and most algorithms do not revisit previously constructed clusters with the
purpose of improvement.

2.3.2 Partitioning methods
These simply divide the objects/elements into a set of M groups, where each
element has membership to one group. It is the most popular method. A unique
centroid or cluster representative acts as the representative of each group. The
centroid provides a near summary, if not a precise one, of the cluster objects. A
precise characterization is dependent on the form of the object under consideration.
In instances where the value of the data is available, the arithmetic mean of the
variables for every object within a group gives a fitting representative. Whenever
these values are unavailable, centroids in other forms may be needed.

2.4 K-means Clustering Algorithms


This is an iterative technique whose objective is to minimize the within-cluster sum
of squares for a given number of clusters. The algorithm commences with an
initial guess of the center of every cluster. Each instance is subsequently allocated
to the group to which it is most similar. This step is followed by updating the
cluster centers, and the procedure is repeated until the centers no longer shift.
An auxiliary clustering technique, such as a hierarchical algorithm, can be
applied at the onset to arrive at the cluster center starting values.

The k-Means algorithm works thus: given a set of d-dimensional training input
vectors {x1, x2, ..., xn}, the k-Means clustering algorithm partitions the n training
examples into k sets of data points or clusters S = {S1, S2, ..., Sk}, where k ≤ n,
such that the within-cluster sum of squares is minimized.
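
Written out, the criterion takes the following standard form. The symbol \mu_i for the centroid of cluster S_i is introduced here for illustration and does not appear in the text above.

```latex
\underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2},
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
```

The inner sum is the squared Euclidean distance of every point in a cluster to that cluster's mean, and the outer sum adds this quantity over all k clusters.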

2.4.1 Steps to perform customer segmentation
Machine learning, a branch of artificial intelligence, can examine data sets of
similar customers and identify the best-performing and worst-performing
customer segments.

The following steps are one of many ways to approach customer segmentation
with machine learning. Any suitable tools and skills can be used to
carry them out. Outlined below are the basic steps the
researcher must follow in order to perform customer segmentation.

Step 1: Design a Proper Business Case before you Start


In the case research, we need to examine consumer habits and patterns from different
perspectives. This should not be approached carelessly; otherwise, the
results will be messy and disorganized.

Instead, a good business case is required to start with. The prospect of
applying machine learning and artificial intelligence can be framed with questions
such as:

• Can the customer base be organized into groups in order to build customized
relationships with them?
• Is it worthwhile to identify the most valuable customer groups within the entire
pool of customers?

To fully understand customers' spending and behavior, the
following metrics can be kept in mind:

• Number of items ordered
• Average return rate
• Cumulative spending

Once you’ve prepared the business case, proceed to the next step.

Step 2: Collect and prepare data


The next step is to assemble the data in order to discover the different patterns and
trends inside the datasets.

It will also be necessary to define composite features depending on the most
relevant metrics for the organization. These may include:

• Average lifetime value
• Customer acquisition cost
• Customer satisfaction
• Retention rate
• Net earnings

The data must be scaled and preprocessed, and missing values filled, using the
open-source tools available in Python, such as pandas and NumPy. This step must be
done carefully because its output feeds the visualization step later.

The more customer data are available, the more precise the decisions
made in customer segmentation with machine learning.
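
The filling and scaling just described can be sketched with the standard library only, so that the arithmetic stays visible. In practice pandas and NumPy would do the same work on whole tables. Mean imputation and z-score standardization are assumed choices here, and the income values are hypothetical.

```python
from statistics import mean, pstdev


def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Centre on zero and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # constant column: nothing to scale
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical annual incomes (in thousands) with one missing entry.
incomes = [20, None, 60, 100]
filled = fill_missing(incomes)
scaled = standardize(filled)
print(filled)
print(scaled)
```

Scaling matters for k-means because the algorithm relies on Euclidean distance, so a feature measured in large units would otherwise dominate the clustering.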

Figure 1: Data properties (Source: Prateek, 2021)

Step 3: Performing Segmentation Using k-Means Clustering


K-means clustering is a well-known method of unsupervised machine learning. This
method finds the distinct “clusters” in the data and groups the data points into them
while keeping each cluster as compact as possible.

Figure 2: K-means clustering illustration (Source: Prateek, 2021)

The algorithm works in this manner:

• First, we choose the value of k, the number of clusters, and randomly initialize
k centroids.
• Next, we assign each data point to its nearest centroid, using Euclidean
distance, to form separate groups, and then relocate each centroid to the mean
of all the points in its cluster.
• While repeating the preceding steps, the algorithm tries to
reduce the sum of squared distances between the clustered points and their
centroid for all clusters.
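
The three steps above can be sketched as a short pure-Python routine. This is an illustrative sketch rather than the implementation used in this project: the function name `kmeans`, its parameters, and the sample (income, spending score) pairs are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: k random data points as centroids
    labels = None
    for _ in range(max_iter):
        # step 2: assign every point to its nearest centroid (Euclidean distance)
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # step 3: move each centroid to the mean of the points assigned to it
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    # sum of squared distances of points to their centroid (the inertia)
    inertia = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, inertia


# Hypothetical (annual income, spending score) pairs forming two obvious groups.
customers = [(15, 80), (16, 82), (17, 78), (90, 10), (92, 12), (88, 14)]
labels, centroids, inertia = kmeans(customers, k=2)
print(labels, centroids, inertia)
```

The cluster numbers themselves are arbitrary, so only the grouping is meaningful. On real data the result can depend on the random starting centroids, which is why library implementations usually run several initializations and keep the best one.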

Step 4: Tuning the Optimal Hyper-parameters for the Model


Determining the best set of hyper-parameters for an algorithm is the
next step in customer segmentation with machine learning because it helps
us obtain the most genuine and meaningful customer groups.

When choosing the k value, we rely on the optimization criterion of
K-means, the inertia, and apply the elbow method.

With the elbow method, we choose the k value at which the drop in the inertia
levels off.
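
The elbow is usually read off a plot by eye, but the idea can be approximated with a simple rule. The sketch below assumes the inertia has already been computed for each candidate k and picks the first k after which the relative improvement falls below a threshold. The function name `pick_elbow`, the 20% threshold, and the inertia values are illustrative assumptions, not results from this study.

```python
def pick_elbow(inertia_by_k, min_relative_drop=0.2):
    """Return the smallest k after which adding a cluster no longer helps much.

    inertia_by_k maps each candidate k to the inertia of a fitted model.
    min_relative_drop is an illustrative threshold, not a standard constant.
    """
    ks = sorted(inertia_by_k)
    for k, next_k in zip(ks, ks[1:]):
        current = inertia_by_k[k]
        if current == 0:  # perfect fit already: more clusters cannot help
            return k
        drop = (current - inertia_by_k[next_k]) / current
        if drop < min_relative_drop:
            return k
    return ks[-1]


# Hypothetical inertia values for k = 1..6 (not results from this study).
inertias = {1: 1000.0, 2: 520.0, 3: 260.0, 4: 120.0, 5: 110.0, 6: 104.0}
best_k = pick_elbow(inertias)
print(best_k)
```

With these made-up values the inertia falls steeply up to k = 4 and only slightly afterwards, so the rule selects 4. A threshold rule like this is a convenience and should be checked against the plotted curve.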

Step 5: Visualization of the results


Finally, we visualize the results using the open-source Plotly Python library, a
plotting library for making interactive graphs, plots, and charts. Then we
interpret the charts and graphs to inform business decisions.

Having accurate customer profiles at hand will help improve
the targeting of marketing operations, product launches, and the product
roadmap.

It will give the organization a much clearer idea of which
customers have the best retention rates, contracts, and the other metrics
initially planned.

2.5 Empirical review


In this section, the researcher is going to cover some notable works that have been
carried out by previous researchers on the topic.

2.5.1 An Empirical Study on Customer Segmentation by Purchase Behaviors


Using a RFM Model and K-Means Algorithm
This study was conducted by Jun et al. (2020). The authors presented a
computational method based on a real-world transaction dataset. They extracted
the buying behavior characteristics of each type of user through customer
segmentation and then developed precise marketing tactics based on this
information. The research was divided into three stages, as listed below:
Phase 1: Numerical Experiments
The dataset used for this experiment consists of 10,248 purchase data entries
created at a community shopping platform from November 1, 2017 to April 15,
2019, involving 1,013 customers. This platform sells 134 types of commodities,
mostly including cooked food and pasta. The following data processing steps are
carried out in the research:

Step 1: data cleaning


The data entries are initially composed of 12 components such as user ID, product
ID, quantity purchased, and consumption date. The three components user ID,
consumption amount, and consumption date are selected, and outliers and
abnormal information are removed to form the initial dataset (See Table 1).

Step 2: the range method is used to standardize the initial dataset and get an initial
standardized dataset.

Step 3: principal component analysis is performed to objectively weight RFM
indicators to obtain a final standardized dataset.
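
Steps 1 and 2 can be sketched as follows. The code derives recency, frequency, and monetary values from the three selected components (user ID, consumption amount, consumption date) and then applies range (min-max) standardization. The rows are made up for illustration and are not taken from the dataset of Jun et al.; the function names and the choice of April 15, 2019 as the reference date are assumptions, and the PCA weighting of Step 3 is not shown.

```python
from datetime import date


def rfm_table(transactions, today):
    """Aggregate (user_id, amount, purchase_date) rows into per-user (R, F, M)."""
    table = {}
    for user_id, amount, purchased in transactions:
        days_ago = (today - purchased).days
        recency, frequency, monetary = table.get(user_id, (days_ago, 0, 0.0))
        # R: days since the latest purchase, F: purchase count, M: total spent
        table[user_id] = (min(recency, days_ago), frequency + 1, monetary + amount)
    return table


def range_scale(column):
    """Range (min-max) standardization to the interval [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]


# Made-up rows for illustration only.
rows = [
    ("u1", 30.0, date(2019, 4, 10)),
    ("u1", 20.0, date(2019, 3, 1)),
    ("u2", 10.0, date(2018, 12, 25)),
    ("u3", 100.0, date(2019, 4, 14)),
]
rfm = rfm_table(rows, today=date(2019, 4, 15))
users = sorted(rfm)
monetary_scaled = range_scale([rfm[u][2] for u in users])
print(rfm)
print(monetary_scaled)
```

After range standardization the customer with the highest total purchase amount maps to 1 and the lowest to 0, which puts the three indicators on a comparable scale before clustering.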

Phase 2: User Classification Results


The K-means clustering algorithm is used to cluster the data. Judging by the elbow
method (see Figure 3), the decrease in SSE is not significant when K is higher
than 4. Hence, choosing K = 4 would yield a favorable result. The open-source
scikit-learn library for Python is used to implement the K-means algorithm,
and the results are shown in Figures 3 and 4.

Figure 3: Result of optimal user cluster number with elbow method (Source: Jun
et al., 2020)

Figure 4: User clustering scatter plot based on K-means algorithm (Source: Jun et
al., 2020)
In the plot, the X axis represents total purchase amount, the Y axis represents the
most recent purchase time, and the Z axis represents purchase frequency. It can be
seen that most user data are close to 0 on the X axis, which represents the total
purchase amount. In the range method, the customer with the highest total purchase
amount is taken as the maximum value. The plot shows that a small number of
customers far exceed the average purchase amount.

Phase 3: Precision Marketing Strategy


Through customer grouping, we can accurately extract the purchase behavior
characteristics of each type of customer and make accurate marketing strategies.

Customer Group 1: These customers represent certain loss risks and need to be
further observed. Active measures can be taken to make them feel more attached to
the platform, including by informing them of attractive promotion activities like
holiday discounts and clearance sales or sending SMS to remind them of gift
packages offered for returning customers.

Customer Group 2: All indicators of these users are the highest and above the
average. Apparently, they spend more time and money shopping on the platform,
and for the platform operator, they represent an important value source. For these
customers, especially VIP ones, marketing activities can focus on improving their
purchase satisfaction and experiences and maintaining their loyalty to the platform.

Customer Group 3: These customers completed their last purchase at an
earlier-than-average time, and their total purchase amount and purchase frequency are
relatively low, implying that they have brought limited profits to the platform.

Customer Group 4: Although these customers have formed certain consumption
habits, they have not left an impressive consumption record on the platform.
The platform should make active efforts to further cultivate the willingness
to purchase among such customers.

2.5.2 Customer Segmentation by using RFM Model and Clustering Methods:
A Case Study in Retail Industry
This research was carried out by Onur et al. (2018). The research methodology
includes three major steps. The first phase was related to pre-analysis efforts, which
refers to data cleaning and transformation. Second, data were analyzed by using RFM
analysis, two-step cluster analysis and K-means clustering. Finally, the results
were presented. The full step methodology process is presented in Fig. 1. The
secondary data set was obtained from the customer loyalty card accounts in the
database of a sports retailing company, as suggested by Hu & Yeh (2014). The
study used data collected by a retail store chain that is one of
the biggest in Turkey in sports retailing.

Like other sports retailing companies, the company offers products such as
footwear, shirts, sweats, accessories and sports equipment. Managers decided
to create a customer loyalty card system in 2010 for the purpose of
segmenting customers and creating a customer loyalty program. The loyalty card
program consisted of three card levels: bronze, gold, and premium. Customers who
are members of the loyalty program are upgraded based on the points they earn,
which depend upon their spending in one calendar year.

2.6 Summary of Literature


Customer segmentation is essential, and machine learning can automate much of the
process. Discovering the different groups that make up the
customer base makes it possible to understand customers and give them
what they want, enhancing their engagement and increasing profits.
This customer segmentation project will be carried out using the k-means
algorithm.

CHAPTER THREE

SYSTEM ANALYSIS AND DESIGN

3.1 System Analysis


System analysis is the study of a business issue area in order to provide
recommendations for improvements and to define the solution's business needs and
priorities. It entails studying and comprehending an issue, as well as discovering
various solutions, deciding on the best course of action, and constructing the
chosen answer. It entails determining how current systems function and the issues
that come with them. It is important to remember that before designing a new
system, it is required to research the system that will be enhanced or replaced, if
one exists. A system analysis is carried out to investigate a system or a component
of it in order to determine its goals. The purpose of system analysis is to specify
what the system should accomplish. It entails gathering information, examining an
existing solution, and creating a logical model of the system.

The System Development Life Cycle was used as the research approach in this
study. The SDLC technique used for software development was the Waterfall
Model. The whole development process is separated into several phases in the
waterfall model method. The output of one phase serves as the serial input for the
following phase.

3.1.1 Fact finding


In order to carry out this research, the researcher sought information from
different sources, including the internet, shopping malls, and retail shops. The
dataset used for the development of the proposed customer segmentation
system was obtained from kaggle.com.
The Mall Customer Dataset was used for this study. It is an
intriguing dataset that contains fictitious customer information. The dataset has 5
attributes and 200 tuples, which represent 200 customers. The five attributes are
Customer ID, Gender, Age, Annual Income, and Spending Score (behavior).
In order to get an accurate outcome in this investigation, four steps
were taken. The four primary generic steps in the k-Means algorithm are feature
normalization, centroid initialization, assignment, and updating. The first 5 rows
of the dataset in a Jupyter notebook are shown in the figure below.

Figure 5: Dataset description on Jupyter notebook
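
A minimal sketch of reading the two clustering features from such a file is given below. The column headers are an assumption about how the downloaded CSV is laid out, and the three rows are illustrative stand-ins for the real file; the function name `load_features` is also hypothetical.

```python
import csv
import io

# Illustrative rows in the assumed layout; a real run would open the
# downloaded CSV file instead of this in-memory sample.
SAMPLE = """CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
1,Male,19,15,39
2,Male,21,15,81
3,Female,20,16,6
"""


def load_features(handle):
    """Return (annual income, spending score) pairs from a CSV file handle."""
    reader = csv.DictReader(handle)
    return [
        (float(row["Annual Income (k$)"]), float(row["Spending Score (1-100)"]))
        for row in reader
    ]


features = load_features(io.StringIO(SAMPLE))
print(features)
```

Only annual income and spending score are extracted because these are the two attributes this study uses for segmentation. The customer ID carries no behavioral information and is left out of the feature set.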

3.1.2 Analysis of the existing system


Manual customer segmentation was done by analyzing customer data and records.
While the goal of customer segmentation research has been the same for many
years, past approaches relied on significantly less powerful analytical tools than
those available today. It would be unfair to criticize firms in the past for not making
the best use of their data; technology and data infrastructure were simply not
accessible or inexpensive enough to allow them to collect massive amounts of data
as they do now. The use of this antiquated technology has various disadvantages,
necessitating the construction of a new automated system.

3.1.3 Disadvantages of the existing system


Outlined below are the disadvantages of the existing system in customer
segmentation.

• Collecting customer data manually can be a tedious task.
• High cost of human resources to perform data analysis.
• Data analysis and possible customer segmentation take a long time.
• Risk of losing customer data.
• Reduced data integrity.

3.1.4 Advantages of the proposed system

• Focus on acquiring the right customers
• Produce healthier customers
• Improve your product or service
• Refine messaging

3.2 Modeling the Proposed System


System modeling is the process of developing abstract models of a system, with
each model presenting a different perspective of the system. It is all about
representing a system using graphical notation. Models help the analyst to
understand the functionality of the system.

The Unified Modeling Language (UML) is a general-purpose developmental modeling
language in the field of software engineering, which is intended to provide a
standard way to visualize the design of a system. It can be used to model the
structure of an application, its behavior, and even business processes. The central
idea behind the usage of UML in this research is to capture the significant details
about the system, so that the problem is clearly understood, a solution
architecture can be developed, and a chosen implementation scheme can be clearly
identified and constructed.

3.2.1 Proposed System Architecture


System architecture is the conceptual model that defines the structure, behaviour
and representation of a system. The architecture of the system is shown in the
figure below:

Figure 6: System architecture of the proposed system

The architecture shows that after the dataset is obtained, it is first
preprocessed, which involves transforming the raw data into an understandable
format and checking whether there are missing values. Preprocessing removes rows
or columns that have missing values due to mistakes that might have occurred when
the data was entered into the CSV file. This is important because it prevents
runtime problems, such as Not a Number (NaN) errors, from stopping the system from
functioning properly. After that, the cluster model is built, which involves
determining how many clusters the dataset should be divided into and fitting the
preprocessed dataset to the clustering model.
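
The missing-value check and removal described above could look roughly like the
following pandas sketch. The in-memory CSV text, its column names, and its values
are hypothetical stand-ins for an uploaded file; dropping whole rows is only one
possible policy for handling missing values.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for an uploaded file; the column names
# follow the attributes listed for the Mall Customer data, the values are made up.
raw_csv = io.StringIO(
    "CustomerID,Gender,Age,Annual Income,Spending Score\n"
    "1,Male,19,15,39\n"
    "2,Female,,15,81\n"
    "3,Female,20,16,6\n"
    "4,Male,23,,77\n"
)

df = pd.read_csv(raw_csv)

# Count the missing values in each column before cleaning.
missing_per_column = df.isna().sum()

# Drop every row that contains at least one missing value.
clean_df = df.dropna(axis=0, how="any").reset_index(drop=True)
```

After this step `clean_df` contains no NaN entries, so later numeric steps such as
distance computations do not propagate NaN values.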

3.2.2 Use Case Diagram


A use case diagram is a representation of a user’s interaction with the system that
shows the relationship between the user and the different use cases in which the
user is involved. Use case diagrams are a way to capture the system’s functionality
and requirements in UML. They capture the dynamic behavior of a live system. A use
case diagram consists of use cases and actors. The system’s use case diagram is
shown below.

[Use case diagram. Actors: User, System. Use cases: Register, Login, Upload
dataset, Apply K means Cluster, Display Result, View Result.]

Figure 7: Use case diagram of the system

3.3 System design


System design is the process of defining the components, modules, interfaces, and
data for a system to satisfy specified requirements. System development is the
process of creating or altering systems, along with the processes, practices, models,
and methodologies used to develop them (Wasson, 2005).

The purpose of the system design process is to provide sufficiently detailed data
and information about the system and its elements to enable an implementation that
is consistent with the architectural entities defined in the models and views of
the system architecture. In this section, the input and output design of the
customer segmentation system is illustrated.

3.3.1 Input Design


The input part is a prerequisite for the customer segmentation system. The dataset
is uploaded in this part. Data cleaning and exploratory data analysis are performed
in order to get an overview of the data. Then the k-means algorithm is applied to
the dataset.
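
As a rough illustration of the overview step, the following pandas sketch computes
summary statistics, a gender distribution, and a correlation matrix, which are the
kinds of quantities a pie chart or heatmap would display. The small data frame is
hypothetical and only mimics the shape of the Mall Customer attributes.

```python
import pandas as pd

# Hypothetical sample with the same kind of columns as the Mall Customer data.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Age": [19, 21, 20, 23, 31],
    "Annual Income": [15, 15, 16, 16, 17],
    "Spending Score": [39, 81, 6, 77, 40],
})

# Summary statistics (count, mean, std, quartiles) for the numeric columns.
summary = df.describe()

# Share of each gender, the kind of figure a pie chart would display.
gender_share = df["Gender"].value_counts(normalize=True)

# Pairwise correlations between numeric columns, the input for a heatmap.
correlations = df[["Age", "Annual Income", "Spending Score"]].corr()
```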

3.3.2 Output design

This part visualizes the segmented dataset and adds a category label to each
record in the dataset.

3.4 Program Design


Program design is the process of translating system requirements into a program
that can be executed on a computer system. The program was designed around the
k-means clustering algorithm.

3.4.1 Program flowchart


A flowchart shows the steps of a program and the order in which they are carried
out. The flowchart of the proposed system is given below:

[Flowchart. Nodes in order of appearance: Start; Insert Dataset; Data
initialization; Setting of matrix S; Operational; Detecting whether (No/Yes);
Whether iteration ends (No/Yes); Output cluster number and cluster; Initialize
k-means algorithm with cluster; Run the k-means algorithm and get clustering
result; Stop.]

Figure 8: Flowchart of the proposed system

3.5 Description of modules


The new customer segmentation system comprises four basic modules. The modules of
the proposed system are discussed below.

3.5.1 Data collection module


This module allows a user to upload a customer database record or dataset in CSV
format.

3.5.2 The decision module


Based on the dataset uploaded by the user, this module employs the k-means
clustering algorithm to perform customer segmentation on the data and then assigns
customers to appropriately labeled categories.

3.5.3 The Output Module


After the data has been analyzed and the customers have been segmented, the result
needs to be displayed to the user, hence the need for this module. This module
displays the visual analysis of the segmented dataset, with the category label
included as a column, and the result can be downloaded.
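
A minimal sketch of how a category label column might be attached to the segmented
data and serialized for download is shown below. The customer rows, the cluster
indices, and the category names are all hypothetical; in the real system the
cluster indices would come from the clustering step.

```python
import pandas as pd

# Hypothetical segmented output: three customers and the cluster index
# assigned to each of them by a clustering step.
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Annual Income": [15, 16, 90],
    "Spending Score": [80, 78, 10],
})
cluster_ids = [0, 0, 1]

# Assumed mapping from cluster index to a readable category label.
category_names = {0: "Low income / high spending", 1: "High income / low spending"}

labeled = customers.copy()
labeled["Cluster"] = cluster_ids
labeled["Category"] = labeled["Cluster"].map(category_names)

# Serialize to CSV text; a web front end could offer this string as a download.
csv_text = labeled.to_csv(index=False)
```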

3.5.4 The user interface module


This module makes the system easy to use. The interface of this new customer
segmentation system was built using Streamlit, a Python framework built for the
purpose of easy deployment of machine learning models developed in Python.

3.6 Development Tools

These are programs that software developers use to create, debug, maintain, or
otherwise support other programs and applications. The tools used are discussed
below:

3.6.1 Choice of programming language used


The programming language used for this project is Python.

3.6.2 Advantages of python over other languages


i. Fast development: Python has a syntax that is easy to understand and
friendly. Furthermore, its numerous frameworks and libraries speed up
software development. By using out-of-the-box solutions, a lot can be done
with a few lines of code.
ii. Flexible integrations: Python projects can be integrated with other systems
coded in different programming languages. This makes it much easier to blend
them with other AI projects written in other languages.
iii. Fast code testing: Python provides many code review and test tools.
Developers can easily check the correctness and quality of the code.
iv. Visualization tools: Python comes with a wide variety of libraries, some of
which offer good visualization tools. In AI, machine learning and deep
learning, it is important to present data in a human-readable format.
Therefore Python is a good choice for implementing this feature.
v. Python community: One reason that Python is so well known is its community.
As the data science community continues to adopt it, more users are
volunteering by creating additional data science libraries. The community is
a tight-knit one, which makes it easier to find solutions to challenging
problems.

3.6.3 Choice of development environment
The Python development environment used for this project is the Jupyter notebook.

3.6.4 Advantages of Jupyter notebook over other IDEs


i. Exploratory Data Analysis (EDA): Usually, the first step in answering any
business problem is to analyze the available data using visualization tools.
Jupyter notebook is an excellent tool here, because you can easily create
clean and clear reports to communicate your findings.
ii. Model training: After the first step, you need to create new features,
normalize them and transform them in order to build models on top of them.
Here Jupyter is only an adequate tool, although it makes the results easy to
see.
iii. Fast data exploration: One of the most important advantages of Jupyter
notebook comes from the cell-by-cell nature of the notebook: splitting the
work into logical steps allows the data at hand to be explored in a very
interactive manner.
iv. Data caching: Because each cell is executed independently, its results are
kept automatically for future reference, without the need to run the whole
script again to reach a certain point.
v. Embedded documentation: One of the simplest but most effective features of
Jupyter notebook is its documentation pop-up.
vi. Language independent: Because notebooks are stored in JSON format, Jupyter
Notebook is platform-independent as well as language-independent. In
addition, notebooks can be run with kernels for several languages and can be
converted to other file formats such as Markdown, HTML, and PDF.

CHAPTER FOUR

RESULT AND DISCUSSION

4.1 System Implementation


The sample dataset and the results of this research project are presented in this
chapter, from the system requirements to the data visualization of the customer
segmentation of the dataset.

4.2 Program testing


After the design and coding of an application, it is imperative to run a test to
ascertain that the actual results match the expected results.

For the customer segmentation system using the k-means clustering algorithm, the
testing strategies employed were unit testing and integration testing. While
testing the system, errors were uncovered. The testing also ensured that the
system works as expected.

4.2.1 Unit testing


Unit testing is a software development process in which the smallest testable parts
of an application, called units, are individually and independently scrutinized for
proper operation. This testing is done during the development process by the
software developers and sometimes by Quality Assurance (QA) staff. During the
development of this customer segmentation system, each module was tested
individually after development, and the necessary adjustments and corrections were
made accordingly.
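
As an illustration of what testing a single unit can look like in Python, the
sketch below uses the standard `unittest` module on a small normalization helper.
The helper `min_max_normalize` and its test cases are hypothetical and are not
taken from the project code.

```python
import unittest

def min_max_normalize(values):
    """Scale a list of numbers to the range [0, 1] (hypothetical helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant input: map everything to 0
    return [(v - lo) / (hi - lo) for v in values]

class MinMaxNormalizeTest(unittest.TestCase):
    def test_range_endpoints(self):
        # The smallest value maps to 0, the largest to 1, the midpoint to 0.5.
        self.assertEqual(min_max_normalize([15, 55, 95]), [0.0, 0.5, 1.0])

    def test_constant_input(self):
        # A constant column must not cause a division by zero.
        self.assertEqual(min_max_normalize([7, 7, 7]), [0.0, 0.0, 0.0])

# Run the tests programmatically so the snippet also works inside a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(MinMaxNormalizeTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```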
4.2.2 Integration testing
Integration testing is the phase in software testing in which individual software
modules are combined and tested as a group. Integration testing is conducted to
evaluate the compliance of a system or component with specified functional
requirements. It occurs after unit testing and before system testing. After all the
modules of this customer segmentation system had been developed, the modules were
added one at a time and tested until the entire system was integrated and tested.
During this integration testing, some errors, e.g. missing libraries, were
encountered, but all the errors were fixed and the system worked as expected.

4.3 Results
The software was executed as specified below. The outputs were evaluated to
determine the performance of the software.

Figure 9: The homepage

Figure 10: The Login page

Figure 11: Uploading Dataset

Figure 12: Dataset Description

Figure 13: Exploratory data analysis 1 (Pie chart for gender distribution)

Figure 14: Exploratory data analysis 2 (Heatmap for data distribution)

Figure 15: Exploratory data analysis 3 (Distribution chart for Age in the dataset)

Figure 16: Result for k means clustering algorithm on elbow method charts and
scatter plots

Figure 17: Result for k means clustering algorithm (3D scatter plots and data table)

Figure 18: Result for k means clustering algorithm (CSV file)

4.4 Discussion of Results


Based on the results obtained from the tests carried out, the system produced the
expected outcome by clustering the uploaded dataset into 5 groups.
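
The elbow method mentioned in the caption of Figure 16 is a common heuristic for
choosing the number of clusters: the within-cluster sum of squares (WCSS) is
computed for several values of k, and the value where the curve stops dropping
sharply is chosen. The sketch below illustrates the idea on synthetic data. The
helper `wcss_for_k`, the three synthetic blobs, and the seeds are assumptions for
illustration and do not reproduce the study's data or its result.

```python
import numpy as np

def wcss_for_k(X, k, n_iter=50, seed=0):
    """Run a basic k-means and return the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    # Fancy indexing returns a copy, so updating centroids leaves X untouched.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # skip empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(sq_dists.min(axis=1).sum())

# Synthetic data (hypothetical): three well-separated blobs of 30 points each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])

# WCSS for k = 1..6; plotting these values against k gives the elbow chart.
wcss = {k: wcss_for_k(X, k) for k in range(1, 7)}
```

Because this basic k-means uses a single random initialization, the WCSS values
are not guaranteed to decrease strictly with k; library implementations usually
run several initializations and keep the best one.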

4.5 System requirement


System requirements are statements that identify the functionality that is needed by
a system in order to satisfy the customer’s requirements. The requirements of this
customer segmentation system using the k-means clustering algorithm are classified
into hardware, software, functional and non-functional requirements.

Stated below are the requirements of the proposed system:

4.5.1 Hardware Requirement:
The hardware requirements comprise the physical parts or components that are
required in order for the system to satisfy its purpose. Outlined below are the
hardware requirements of this new system:
• Core i2 Processor Based Computer or higher
• Memory: 1 GB RAM
• Hard Drive: 50 GB
• Monitor

4.5.2 Software Requirement:

• Windows 7 or higher
• Python
• Pandas
• Matplotlib
• NumPy
• seaborn
• Plotly
• scikit-learn
• Streamlit framework

4.5.3 Functional requirement


The proposed customer segmentation system is built on the following functional
requirements:

• An interface to upload the customer dataset.
• A function to preview the dataset.
• A function to perform exploratory data analysis of the dataset.
• A function to perform k-means clustering of the dataset.
• A function to display data visualizations of the dataset and to download a CSV
file of the segmented data.

4.5.4 Non-Functional requirement


The non-functional requirements of the project are as follows:

• The system should be user-friendly.
• The system should meet all its requirements as stated above.

4.6 User Documentation


User documentation assists users by providing them with clear and comprehensible
information about the software. For the customer segmentation system, the steps
are:

• Copy the folder that contains the code for the new system and paste it in any
location on the computer system.
• Launch the command prompt and change directory to the software location on the
computer.
• Type streamlit run customerapp.py
• Click on the URL http://127.0.0.1:8501/ that is displayed; this URL opens in a
web browser and displays the user interface.
• Upload the customer dataset for clustering.

CHAPTER FIVE

RECOMMENDATION AND CONCLUSION

5.1 CONCLUSION
Improvements in computing technology have made it possible to examine
information that could not previously be captured, processed and analyzed. New
analytical techniques from machine learning can now be used. This study is an
exploratory attempt to use the k-means clustering algorithm to group an
organization’s customer dataset into distinct groups based on age, spending
score and annual income.

The study has shown that the k-means clustering algorithm is an important tool
for clustering a company’s customer base, so that the company can gain knowledge
and a better understanding of its target customers. However, this machine
learning tool also has limitations.

The choice of algorithm depends on a number of factors, such as the size of the
dataset, the computing power of the equipment, and the time available to wait for
the results.

To conclude, machine learning is very useful for finding the relations between
attributes and building a model according to those relations. By using the k-means
clustering algorithm, which is an unsupervised machine learning technique, the
system is able to cluster customers into distinct segments with similar market
characteristics.

5.2 RECOMMENDATION
The efficiency and effectiveness of using machine learning to handle customer
segmentation have been identified by the researcher; therefore, the researcher
recommends:

• That machine learning based customer segmentation should be adopted in
segmenting an organization’s customer base.

5.2.1 Suggested areas for further research


This system focused mainly on the main objective. Other areas, such as developing
a GUI for collecting customers’ data directly into the system, were not
considered.

Hence, for the system to improve, the researcher suggests that:

• Other researchers should perform more research on the topic in order to
improve the system’s functionality.
• Since this project was built using the k-means clustering algorithm, other
researchers should work with other machine learning algorithms, such as
hierarchical clustering, fuzzy c-means and the K nearest neighbor algorithm, to
see which is more optimal.
• A form page should be developed to collect data directly from customers.

REFERENCES

A. Kumar, Day marketing research-7th, 2000

A. Nagpal, A. Jatain, and D. Gaur, “Review based on data clustering algorithms,”
in 2013 IEEE Conference on Information & Communication Technologies, IEEE,
2013, pp. 298–303.

Bailey, C., Baines, P., Wilson, H. and Clarke, M. (2009), "Segmentation and
customer insight in contemporary services marketing practice: why grouping
customers is no longer enough", Journal of Marketing Management, Vol. 25,
No. 3/4, pp. 227-252.

Brito, P. Q., Soares, C., Almeida, S., Monte, A., & Byvoet, M. (2015). Customer
segmentation in a large database of an online customized fashion
business. Robotics and Computer-Integrated Manufacturing, 36, 93-100.
https://doi.org/10.1016/j.rcim.2014.12.014

Castillo, G. M., González, Y. C., & Mena, A. V. (2017). Data clustering: An
approach for evaluating the adequate number of groups in partitioned
techniques. Journal of Computer Science and Information
Technology. https://doi.org/10.15640/jcsit.v5n1a3

CHANG, H.H. and P.W. KU (2009) Implementation of relationship quality for
CRM performance: acquisition of BPR and organisational learning. Total Quality
Management & Business Excellence, 20, 327-348.

Chin-Feng Lin. (2002) "Segmenting customer brand preference: demographic or
psychographic", Journal of Product & Brand Management, Vol. 11 Issue: 4,
pp. 249-268. https://doi.org/10.1108/10610420210435443

E. Mattila, Behavioral segmentation of telecommunication customers.
Datavetenskap och kommunikation, Computer Science and Communication . . .,
2008.

Han, J., Pei, J., & Kamber, M. (2011). Data mining: Concepts and techniques.
Elsevier.

JAIN, A.K., M.N. MURTY and P.J. FLYNN (1999) Data clustering: a review. ACM
Computing Surveys (CSUR), 31, 264-323.

Jun W., Li S., Wen-Pin L., Sang-Bing T., Yuanyuan L., Liping Y., and Guangshu
X. (2020). An Empirical Study on Customer Segmentation by Purchase Behaviors
Using a RFM Model and K-Means Algorithm. School of Economic and
Management, Beijing University of Chemical Technology, Beijing 100029, China.

K. Tulankar and R. Wajgi, “Clustering telecom customers using emergent self
organizing maps for business profitability 1,” 2012

Laiderman, J. (2005). "A structured approach to B2B segmentation", Database
Marketing and Customer Strategy Management, Vol. 13, No. 1, pp. 64-75.

LIANG, Y-H. (2010) Integration of data mining technologies to analyze customer
value for the automotive maintenance industry, Expert Systems with Applications,
37, 7489-7496.

Oliveira, R. D. (2020). Análise do USO Da cor no Diagrama de classes Da
Linguagem Unificada de Modelagem (UML) | Analysis of the use of color in the
unified modeling language class diagram (UML). InfoDesign - Revista Brasileira
de Design da Informação, 17(1), 116-130. doi:10.51358/id.v17i1.783

Ozan, S. (2018). A case study on customer segmentation by using machine
learning methods. 2018 International Conference on Artificial Intelligence and
Data Processing (IDAP). https://doi.org/10.1109/idap.2018.8620892

Schamai, W., Fritzson, P., Paredis, C., & Pop, A. (2009). Towards unified system
modeling and simulation with ModelicaML: Modeling of executable behavior
using graphical notations. Proceedings of the 7th International Modelica
Conference, Como, Italy. doi:10.3384/ecp09430081

SHAW, M.J., C. SUBRAMANIAM, G.W. TAN and M.E. WELGE (2001)
Knowledge management and data mining for marketing. Decision Support
Systems, 31, 127-137.

Tekin, M., Etlioğlu, M., Koyuncuoğlu, Ö., & Tekin, E. (2018). Data mining in
digital marketing. Proceedings of the International Symposium for Production
Research 2018, 44-61. https://doi.org/10.1007/978-3-319-92267-6_4

W. Verbeke, K. Dejaeger, D. Martens, J. Hur, and B. Baesens, “New insights into
churn prediction in the telecommunication sector: A profit driven data mining
approach,” European Journal of Operational Research, vol. 218, no. 1,
pp. 211–229, 2012.

Wasson, C. S. (2015). System engineering analysis, design, and development:
Concepts, principles, and practices. John Wiley & Sons.

WHITE, C. and Y.T. YU (2005) Satisfaction emotions and consumer behavioural
intentions. Journal of Services Marketing, 19, 411-420.

Yamamoto, C., Tanigawa, I., Hisazumi, K., Sato, M., Ohkawa, T., Ogura, N., &
Watanabe, H. (2021). Layer modeling and its code generation based on
context-oriented programming. Proceedings of the 9th International Conference on
Model-Driven Engineering and Software Development.
doi:10.5220/0010328303300336

APPENDIX

# Basic operations
import os
import sqlite3
import time

import numpy as np
import pandas as pd
from pandas import plotting
# import SessionState

# Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')

# Interactive visualizations
import plotly.offline as py
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
from plotly import tools
init_notebook_mode(connected = True)

# Machine learning
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans

# Streamlit, components and image handling
import streamlit as st
import streamlit.components.v1 as components
from streamlit_pandas_profiling import st_profile_report
from PIL import Image, ImageFilter, ImageEnhance

# Project modules
# from predict_page import predictor
from EDAappnew import show_main
from EDAappnew import explore_page
from analysis import show_customer_Analysis
from cleanData import stringOutput

# User database
conn = sqlite3.connect("data.db")
c = conn.cursor()

header = st.container()
inp = st.container()
pred = st.container()

# HTML footer rendered on the About page
footer_temp = """

<!-- CSS -->
<link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet">
<link href="https://cdnjs.cloudflare.com/ajax/libs/materialize/1.0.0/css/materialize.min.css" type="text/css" rel="stylesheet" media="screen,projection"/>
<link href="static/css/style.css" type="text/css" rel="stylesheet" media="screen,projection"/>
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.5.0/css/all.css" integrity="sha384-B4dIYHKNBt8Bc12p+WXckhzcICo0wtJAoU8YZTY5qE0Id1GSseTk6S+L3BlXeVIU" crossorigin="anonymous">

<footer class="page-footer grey darken-4">
  <div class="container" id="aboutapp">
    <div class="row">
      <div class="col l6 s12">
        <h5 class="white-text">About Customer Segmentation System</h5>
        <p class="grey-text text-lighten-4">Using Streamlit,Python and Pandas Profile for Market customer segmentation.</p>
      </div>

      <div class="col l3 s12">
        <h5 class="white-text">Connect With Me</h5>
        <ul>
          <a href="https://facebook.com/Akinwande Alex " target="_blank" class="white-text">
            <i class="fab fa-facebook fa-4x"></i>
          </a>
          <a href="https://gh.linkedin.com/in/Akinwande Alexander" target="_blank" class="white-text">
            <i class="fab fa-linkedin fa-4x"></i>
          </a>
          <a href="https://www.youtube.com/channel/UC2wMHF4HBkTMGLsvZAIWzRg" target="_blank" class="white-text">
            <i class="fab fa-youtube-square fa-4x"></i>
          </a>
          <a href="https://github.com/Akinwande/" target="_blank" class="white-text">
            <i class="fab fa-github-square fa-4x"></i>
          </a>
        </ul>
      </div>
    </div>
  </div>
  <div class="footer-copyright">
    <div class="container">
      Made by <a class="white-text text-lighten-3" href="https://akinalex21@gmail.com">Fakorede Akinwande Alexander</a><br/>
      <a class="white-text text-lighten-3" href="https://akinwandealex95@gmail.com">akinwandealex95@gmail.com</a>
    </div>
  </div>
</footer>

"""

def create_table():
    """Create the user table if it does not already exist."""
    c.execute('CREATE TABLE IF NOT EXISTS usertable(username TEXT, password TEXT)')

def add_userdata(username, password):
    """Store a new username/password pair."""
    c.execute('INSERT INTO usertable(username,password) VALUES (?,?)', (username, password))
    conn.commit()

def login_user(username, password):
    """Return the rows matching the given credentials (empty list if none)."""
    c.execute('SELECT * FROM usertable WHERE username=? AND password=?', (username, password))
    data = c.fetchall()
    return data

def view_all_users():
    """Return every row of the user table."""
    c.execute('SELECT * FROM usertable')
    data = c.fetchall()
    return data

def Dataset_upload():
    """Let the user upload a CSV file and preview it; return None if nothing is uploaded."""
    # st.markdown('## Upload dataset')  # Streamlit also accepts markdown
    data_file = st.file_uploader("Upload a CSV file", type="csv")  # data uploader
    # st.sidebar.markdown('## Data Import')  # Streamlit also accepts markdown
    # data_file = st.sidebar.file_uploader("Upload a CSV file", type="csv")  # data uploader
    df = None  # stays None until a file has been uploaded
    if data_file is not None:
        df = pd.read_csv(data_file)
        st.markdown('### Data Preview')
        st.dataframe(df)
        # st.warning("To get an overview of the dataset click the Data Info button in the sidebar")
    return df

def main():
    """Customer Segmentation using k-means clustering algorithm"""

    menu = ["Home", "Login", "Signup", "About"]
    choice = st.sidebar.selectbox("Menu", menu)

    if choice == "Home":
        st.subheader("Home")
        st.image('bck.jpg')
        st.warning("Check the sidebar menu to login into the system")

    elif choice == "Login":
        st.subheader("Login Section")
        username = st.sidebar.text_input("Username")
        password = st.sidebar.text_input("Password", type='password')
        if st.sidebar.checkbox("Login"):
            create_table()
            result = login_user(username, password)

            # if password=="12345":
            if result:
                st.success("Logged In as {}".format(username))
                st.image('bck.jpg')
                task = st.selectbox("Task", ["Homepage", "Analytics", "Exploration", "Profiles"])
                if task == "Homepage":
                    st.write("Welcome to the customer segmentation system")
                elif task == "Analytics":
                    show_main()
                elif task == "Exploration":
                    explore_page()
                elif task == "Profiles":
                    # List every registered user in a table
                    st.subheader("User Profiles")
                    u_data = view_all_users()
                    clean_db = pd.DataFrame(u_data, columns=["Username", "Password"])
                    st.dataframe(clean_db)
            else:
                st.warning("Incorrect Username/password")

    elif choice == "Signup":
        st.subheader("Create New Account")
        new_user = st.text_input('Username')
        new_password = st.text_input("Password", type="password")

        if st.button("Signup"):
            create_table()
            add_userdata(new_user, new_password)
            st.success("You have successfully created a valid Account")
            st.info("Go to Login Menu to login")

    elif choice == "About":
        st.subheader("About App")
        # components.iframe('akinwandealex95@gmail.com')
        components.html(footer_temp, height=500)

if __name__ == '__main__':
    main()
