To Develop Clusters of the Users Using ML for Customer Segmentation
Prepared By: Name of Student/Students (Er. No), Semester: VIII, BE-CE, SOCET
Submitted To: Name of Guide, Asst. Prof., CE Dept., SOCET
OUTLINE
1. Introduction (Description of the Project Title/Definition) (Must be Shown in review-1)
2. Literature Survey/Background Study (Must be Shown in review-1)
3. Problem Statement (Must be Shown in review-1)
4. Scopes and Objectives (Must be Shown in review-1)
5. Proposed Flow Chart/Proposed Model (Must be Shown in review-1)
6. Description of modules in your system (Must be Shown in review-1)
7. System Analysis (Diagrams: Use case, Class, State, Activity, Interaction etc.)
8. Tools and Technology for Implementation
9. Comparison of the existing system with your proposed system
10. Learning Outcomes (What you have learned till now and how these learned concepts are used in your projects) (Must be Shown in review-1)
11. Implementation
12. Timeline Chart(Must be Shown in review-1)
13. Conclusion
14. References
INTRODUCTION
• With the exponential growth of the Internet, e-commerce has also developed rapidly. In particular, Business-to-Consumer
(B2C) services are growing quickly because Internet access is easy, and the convenience of e-commerce services
encourages customers to buy online. The major problem that comes with this ease of buying is information overload:
many variants of the same product, offered by different vendors, are listed on the cloud server. Several techniques are
available on the platform to reduce this overload, such as product recommendation, link recommendation, and
filtering based on customer characteristics.
DESCRIPTION OF THE PROJECT TITLE

• Resolving information overload and providing personalized services on an e-commerce platform, while maintaining
the loyalty of existing customers, is a complex task. Acquiring new customers, providing services according to their
needs, and integrating new customer characteristics is also tedious for platform providers. Customer segmentation
should therefore be carried out before personalization is implemented. It offers a dynamic way to personalize
e-commerce services based on the current circumstances of the customers.
LITERATURE SURVEY/BACKGROUND STUDY
Customer segmentation is performed by processing the customer databases held by e-commerce platforms, which
contain information such as demographics and purchase history. Much research has been conducted in this field.
(Magento, 2014) used different kinds of variables to implement customer segmentation, such as transaction variables,
product variables, hobbies and geographic variables. (Baer, 2012) and (R., 2011) proposed a customer segmentation
method based on business rules, combined with supervised and unsupervised clustering methods. They also added
purchase affinity clustering to identify similarities in customer behaviour.
(Wen, 2006) identifies a problem with the Mean-shift algorithm: it does not work well with high-dimensional data,
where the number of clusters changes randomly. Mean-shift gives no direct control over the number of clusters,
although some applications need a specific number, and it cannot distinguish meaningful modes from meaningless
ones. The proposed system will instead use silhouette analysis, which yields a value from -1 to 1. A value near 1
means the clusters are well apart from each other and clearly distinguished.
(Luai, 2006) identifies that in the Elbow method, K-Means clustering is applied for each value in a range of candidate
values of K, and the number of clusters is then picked from that range. The right number of clusters cannot always be
identified unambiguously, which makes this method subjective and unreliable. Silhouette analysis, on the other hand,
can be used to study the separation distance between the resulting clusters, which also makes outliers easier to
identify.
PROBLEM STATEMENT

• Z-score normalization does not produce data on the same scale as the original data. To interpret the original value of
an attribute, the normalized value has to be transformed back, which increases computation time. So, I have to work
on making the back-transformation computationally efficient.
• When the standard deviation (SD) is zero, i.e. there is no variability in the sample, the Z-test cannot be performed.
The Z-test measures the difference between the observed value and the expected value in standard errors (SE):
Z = (observed value - expected value) / SE. SE is proportional to the standard deviation (√N × SD for the sum of N
observations, SD / √N for their mean), so when SD is zero, SE is also zero and the Z-test would require division by
zero. That is why I have to find an appropriate solution for this problem.
• I have to make sure that the model produces clusters of similar size, because the K-Means clustering technique is
less effective when the cluster sizes differ widely.
• The number of clusters plays a major role in the analysis when performing K-Means clustering. Silhouette analysis
gives better results when the number of clusters is 2 or 4. With 3, 5 or 6 clusters the silhouette plot fluctuates.
So, I have to work on how to resolve this problem.
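The quantities in the first two points can be written out as follows. This is a sketch using standard definitions. The symbols x, μ, σ and N (attribute value, mean, standard deviation, sample size) are notation assumed here and do not appear on the original slides.

```latex
\begin{align}
z &= \frac{x - \mu}{\sigma}, & x &= z\,\sigma + \mu \\
Z &= \frac{\text{observed} - \text{expected}}{SE}, &
SE_{\text{sum}} &= \sqrt{N}\,\sigma, \quad SE_{\text{mean}} = \frac{\sigma}{\sqrt{N}}
\end{align}
```

The back-transformation is one multiplication and one addition per value once μ and σ are stored. Both forms of SE vanish when σ = 0, which is why Z is undefined for a sample with no variability.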
SCOPES AND OBJECTIVES

The main objectives of this system are given below.
• Group customers on the basis of their homogeneous characteristics, such as tastes and needs, using K-Means
clustering.
• Determine marketing strategies, targets and goals.
• Measure the quality of the customer clusters using silhouette analysis.
• Identify potential customers that can grow the business using K-Means.
PROPOSED FLOW CHART/PROPOSED MODEL
DESCRIPTION OF MODULES IN YOUR SYSTEM.
This system consists of four main modules: Data Pre-Processing, Z-score Normalization, the K-Means Clustering
Algorithm, and Evaluation by Silhouette Analysis.

A. Data Pre-Processing:
Customer segmentation requires customer data from various sources. (Magento, 2014) categorizes the data into internal
and external data. Internal data, such as customer registration, customer profile and purchase history, are obtained from
the database of the e-commerce platform. External data include census data, media browsing, surveys and market
research, cookies, and web and social media analysis.

Data Set for the Research:


I will use the sales data captured by a UK-based online retail store for different products over a period of one year
(Nov 2016 to Dec 2017). The organization primarily sells gifts on its online platform. Some customers purchase directly
for themselves, while others are small businesses that buy in bulk and sell to other customers through retail outlets.
The dataset contains the attributes Invoice Number, Stock Code, Product Description, Quantity, Invoice Date, Unit Price,
Customer Id, and Country.
Reference: https://www.kaggle.com/vik2012kvs/retail-dataset
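As a rough illustration of the pre-processing step, the sketch below turns transaction rows into one feature row per customer using pandas. The column names (`InvoiceNo`, `Quantity`, `UnitPrice`, `CustomerID`), the sample rows, the two derived features, and the choice to drop rows with a missing customer or a non-positive quantity are all assumptions made for this example. They are not taken from the dataset file or from the project.

```python
import pandas as pd

# Made-up rows for illustration only; the column names are assumed.
rows = [
    {"InvoiceNo": "A1", "Quantity": 2, "UnitPrice": 5.0, "CustomerID": "C101"},
    {"InvoiceNo": "A1", "Quantity": 1, "UnitPrice": 10.0, "CustomerID": "C101"},
    {"InvoiceNo": "A2", "Quantity": 3, "UnitPrice": 4.0, "CustomerID": "C102"},
    {"InvoiceNo": "A3", "Quantity": 1, "UnitPrice": 7.0, "CustomerID": None},
    {"InvoiceNo": "A4", "Quantity": -1, "UnitPrice": 4.0, "CustomerID": "C102"},
]
sales = pd.DataFrame(rows)


def customer_features(df):
    """Aggregate transaction rows into one row of features per customer."""
    # Rows without a customer cannot be assigned to a segment.
    clean = df.dropna(subset=["CustomerID"])
    # Assumption: non-positive quantities are returns or corrections.
    clean = clean[clean["Quantity"] > 0].copy()
    clean["Amount"] = clean["Quantity"] * clean["UnitPrice"]
    return clean.groupby("CustomerID").agg(
        total_spend=("Amount", "sum"),
        n_invoices=("InvoiceNo", "nunique"),
    )


features = customer_features(sales)
print(features)
```

The resulting table has one row per customer and only numeric columns, which is the shape the normalization and clustering modules expect.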
B. Z-score Normalization:
Normalization is a scaling or mapping technique used as a pre-processing stage (Luai A. Shalabi, 2006), in which a new
range is derived from an existing one. It can be very helpful for prediction and forecasting (S.Gopal Krishna Patro,
2015). There are many ways to predict or forecast, and their results can vary widely from each other, so normalization
is needed to bring these large variations closer together. The Z-score expresses each value as its distance from the
mean in standard deviations, which allows the model to relate a score to its probability of occurring within a normal
distribution.
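A minimal sketch of the Z-score step in plain Python, assuming the population standard deviation is used. The function names and the sample `spend` values are hypothetical. The sketch also shows the back-transformation to the original scale and one possible convention for a column with zero standard deviation (mapping it to all zeros), which is an assumption rather than the project's confirmed approach.

```python
from statistics import fmean, pstdev


def zscore_fit(values):
    """Return (mean, std) of a column, using the population standard deviation."""
    return fmean(values), pstdev(values)


def zscore_transform(values, mean, std):
    """Map each value to its distance from the mean in standard deviations."""
    # A constant column has std == 0; return zeros instead of dividing by zero.
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def zscore_inverse(scores, mean, std):
    """Back-transformation: x = z * std + mean, one multiply-add per value."""
    return [z * std + mean for z in scores]


spend = [10.0, 20.0, 30.0, 40.0, 50.0]  # made-up values
mean, std = zscore_fit(spend)
scores = zscore_transform(spend, mean, std)
restored = zscore_inverse(scores, mean, std)
```

Keeping `mean` and `std` from the fitting step is what makes the back-transformation cheap: no statistics have to be recomputed to recover the original values.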
C. K-Means Clustering:
Clustering is the process of partitioning or grouping a given set of patterns into disjoint clusters, such that patterns in
the same cluster are alike and patterns belonging to different clusters are different. The K-means algorithm is one of the
partitioning-based clustering algorithms (Sudhir Singh, 2013). Its general objective is to obtain a fixed number of
partitions/clusters that minimize the sum of squared Euclidean distances between objects and their cluster centroids.
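The following sketch shows how this step might look with scikit-learn's `KMeans`. The two synthetic customer groups (spend and purchase frequency), the random seeds and the choice of two clusters are made up for illustration. `StandardScaler` is used here as the Z-score normalization step.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features (made up): columns are total spend and purchase count.
low = rng.normal(loc=[20.0, 2.0], scale=[3.0, 0.5], size=(50, 2))
high = rng.normal(loc=[200.0, 15.0], scale=[20.0, 2.0], size=(50, 2))
X = np.vstack([low, high])

# Z-score normalization so that both features contribute on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means with a fixed number of clusters; n_init restarts guard against
# a poor random initialization.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X_scaled)

# inertia_ is the sum of squared distances of samples to their closest
# centroid, i.e. the objective that K-Means minimizes.
print(model.cluster_centers_)
print(model.inertia_)
```

Each entry of `labels` is the cluster index assigned to the corresponding customer row, so it can be attached back to the customer table as a segment column.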
D. Evaluation by Silhouette Analysis:
Silhouette refers to a method of interpretation and validation of consistency within clusters of data. The technique provides a
succinct graphical representation of how well each object has been classified. The silhouette value is a measure of how similar an
object is to its own cluster compared to other clusters.
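One way to use the silhouette value for choosing the number of clusters is sketched below with scikit-learn's `silhouette_score`. The three synthetic blobs, the candidate range of K from 2 to 6 and the seeds are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)

# Synthetic data (made up): three compact, well-separated groups.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])

# Mean silhouette value for each candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The candidate with the highest mean silhouette value is preferred.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Every score lies between -1 and 1. On data like this, with three clearly separated groups, the highest mean value should occur at K = 3, because merging two groups or splitting one lowers the silhouette of the affected points.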
SYSTEM ANALYSIS
TOOLS AND TECHNOLOGY FOR IMPLEMENTATION
The following libraries are used in my project:
SciPy: Provides functions to perform various operations on sparse data.
Primarily we use two types of sparse matrices:
CSC - Compressed Sparse Column, which provides efficient column slicing.
CSR - Compressed Sparse Row, which provides efficient row slicing.
NLTK: Provides functions for tokenization, tagging, stemming, topic analysis etc.
TextBlob: Provides an API for natural language processing tasks such as sentiment analysis, classification, translation etc.
Sklearn (scikit-learn): Provides supervised and unsupervised learning algorithms.
Django: A web framework for Python that is useful for rapid and secure website development. It can be used to build any
kind of website, from content management systems to social media and news sites. It works with any client-side framework
and can deliver content in formats such as HTML, RSS, JSON and XML.
COMPARISON OF THE EXISTING SYSTEM WITH YOUR PROPOSED SYSTEM.

The existing systems have various problems that the proposed system addresses, such as:
• Min-Max normalization is sensitive to outliers: if, for example, 99 values lie between 0 and 40 and a single value
is 100, the 99 values are squeezed into the range 0 to 0.4.
• The Mean-shift algorithm does not work well with high-dimensional data, where the number of clusters changes
randomly.
• In the Elbow method, K-Means clustering is applied for each value in a range of candidate values of K, and the number
of clusters is picked from that range. The right number of clusters cannot always be identified unambiguously, which
makes this method subjective and unreliable.
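The Min-Max case from the first point above can be reproduced in a few lines of plain Python. The evenly spaced sample values and the function name are made up to match the description (99 values between 0 and 40, plus one value of 100).

```python
def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# 99 evenly spaced values from 0 to 40, plus a single outlier at 100.
values = [i * 40 / 98 for i in range(99)] + [100.0]
scaled = min_max(values)

# Largest scaled value among the 99 regular points.
largest_regular = max(scaled[:-1])
print(largest_regular, scaled[-1])
```

The single outlier takes the value 1.0, while the other 99 points occupy only the lower 40% of the output range.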
LEARNING OUTCOMES

Using this product, any business that has customer sales data can obtain a predictive analysis of the potential customers
that can grow its business in the future. Many stakeholders can use this product, such as retail stores, shopping malls
and wholesalers.
• This product generates its output from the training data using the K-means clustering technique and silhouette
analysis. As the database grows, the model will have more data to train on, which is expected to improve the
accuracy of the prediction.
IMPLEMENTATION
TIMELINE CHART
CONCLUSION

Improving data collection and using that data to derive useful insights is a crucial task in the field of customer
segmentation. In this project, all the parameters are used to find useful information in the available data.
Various techniques are used to achieve the objectives of the proposed system: grouping customers based on their
homogeneous characteristics, such as tastes and needs, using K-Means clustering; measuring the quality of the customer
clusters using silhouette analysis; and identifying potential customers that can grow the business using K-Means.
Normalization is applied using Z-score normalization, whose main purpose is to find the probability of a score occurring
within the normal distribution of the data. The clustering technique is then applied to the unlabeled data to find a
given number of groups, with data points clustered based on the similarity of their features.
REFERENCES

Baer, 2012. CSI: Customer Segmentation Intelligence for Increasing Profits. s.l., SAS Global.

Ezenkwu, C. P., 2015. Application of K-Means Algorithm for Efficient Customer Segmentation: A Strategy for Targeted Customer Services. International Journal of Advanced Research in Artificial Intelligence (IJARAI).

Han, 2012. Segmentation of telecom customers based on customer value by decision. ELSEVIER, p. 3.

Magento, 2014. An Introduction to Customer Segmentation. [Online] Available at: https://magento.com/sites/default/files8/2019-01/introduction-to-customer-segmentation-v2.pdf

Ma, 2015. A Study on Customer Segmentation for E-Commerce Using the Generalized Association Rules and Decision Tree. American Journal of Industrial and Business Management, 5(12), pp. 813-818.

Patro, S. G. K., 2015. Technical Analysis on Financial Forecasting. International Journal of Computer Sciences and Engineering, 3(1), pp. 1-6.

Thank you
