
MACHINE LEARNING AND CLUSTERING

Ms. Vedika V. Joshi#1, Ms. Srushti M. Deshmukh#2


Department of Computer Engineering,
Government Polytechnic, Amravati.
#1 vedikajoshi18@gmail.com
#2 srushtideshmukh2315@gmail.com
------------------------------------------------------------------------------------------------------------------------------------------------------------
Abstract :-

Machine learning is the scientific study of algorithms and statistical
models that computer systems use to perform a specific task without being
explicitly programmed. Every time a web search engine such as Google is
used to search the internet, one of the reasons it works so well is that a
learning algorithm has learned how to rank web pages. These algorithms are
used for various purposes such as data mining, image processing and
predictive analytics. The main advantage of machine learning is that, once
an algorithm learns what to do with data, it can do its work automatically.
The foremost illustrative task in the data mining process is clustering. It
plays an exceedingly important role in categorizing data, which is one of
the most rudimentary steps in knowledge discovery. It is an unsupervised
learning task used in exploratory data analysis to find hidden patterns
that are present in the data but cannot be categorized clearly. This paper
gives a brief review of machine learning, clustering, their types and their
applications.

Keywords :- Artificial Intelligence, unsupervised, unlabeled dataset,
cluster-ID, ML system.

I. INTRODUCTION

Machine learning is a vital and core area of artificial intelligence (AI)
that allows software applications to become more accurate at predicting
outcomes without being explicitly programmed to do so. Machine learning
algorithms build a model based on sample data, known as training data.
They are used in a wide variety of applications, such as medicine, email
filtering, speech recognition and computer vision, where it is difficult
or infeasible to develop conventional algorithms to perform the needed
tasks.

Machine learning is an essential skill for any aspiring data analyst or
data scientist, and for anyone who wishes to transform a massive amount of
raw data into trends and predictions. The machine learning process starts
with feeding machines good-quality data and then training them by building
various machine learning models using the data and different algorithms.
The choice of algorithm depends on what type of data we have and what kind
of task we are trying to automate.

Fig. 1. Machine Learning

II. BACKGROUND HISTORY

The term machine learning was coined in 1959 by Arthur Samuel, an IBM
employee and pioneer in the field of computer gaming and artificial
intelligence. The synonym "self-teaching computers" was also used in this
period. By the early 1960s an experimental "learning machine" with punched
tape memory, called Cybertron, had been developed by Raytheon Company to
analyse sonar signals, electrocardiograms and speech patterns using
rudimentary reinforcement learning. In 1981 a report was given on using
teaching strategies so that a neural network learns to recognize 40
characters (26 letters, 10 digits and 4 special symbols) from a computer
terminal.

III. IMPORTANCE OF MACHINE LEARNING

Machine learning is important because it gives enterprises a view of
trends in customer behaviour and business operational patterns, and
supports the development of new products. Many of today's leading
companies, such as Facebook, Google and Uber, make machine learning a
central part of their operations. Machine learning has become a
significant competitive differentiator for many companies.

Fig. 2. Machine Learning Process

IV. WORKING

Machine learning teaches computers to think in a way similar to how humans
do: by learning from and improving upon past experiences. It works by
exploring data and identifying patterns, and involves minimal human
intervention. Almost any task that can be completed with a data-defined
pattern or set of rules can be automated with machine learning. This
allows companies to transform processes that were previously only possible
for humans to perform, such as responding to customer service calls,
bookkeeping and reviewing resumes.

V. TYPES OF MACHINE LEARNING

A. Supervised Learning:

In this type of machine learning, data scientists supply algorithms with
labelled training data and define the variables they want the algorithm to
assess for correlations. Both the input and the output of the algorithm
are specified.
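
To make the idea of labelled training data concrete, the following is a minimal sketch of a supervised learner, assuming a nearest-centroid rule. The function names, the feature values and the labels are hypothetical and chosen only for illustration; the paper does not prescribe a particular algorithm.

```python
import math

def train_nearest_centroid(samples, labels):
    """Learn one centroid (mean point) per class label from labelled data."""
    sums, counts = {}, {}
    for point, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for i, value in enumerate(point):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(model, point):
    """Return the label whose learned centroid is closest to the point."""
    return min(model, key=lambda label: math.dist(model[label], point))

# Hypothetical labelled training data: both inputs and outputs are specified.
train_x = [(150, 50), (155, 55), (180, 85), (185, 90)]
train_y = ["small", "small", "large", "large"]

model = train_nearest_centroid(train_x, train_y)
print(predict(model, (152, 52)))  # small
print(predict(model, (182, 88)))  # large
```

The labels are supplied during training, which is what distinguishes this setting from the unsupervised methods discussed in the rest of the paper.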

B. Unsupervised Learning:

This type of machine learning involves algorithms that train on unlabelled
data. The algorithm scans through data sets looking for any meaningful
connection. The data that algorithms train on, as well as the predictions
or recommendations they output, are predetermined.

C. Reinforcement Learning:

Data scientists typically use reinforcement learning to teach a machine to
complete a multi-step process for which there are clearly defined rules.
Data scientists program an algorithm to complete a task and give it
positive or negative cues as it works out how to complete the task. For
the most part, however, the algorithm decides on its own what steps to
take along the way.

Fig. 3. Types of Machine Learning

VI. CLUSTERING

Clustering is the task of dividing a population or set of data points into
a number of groups such that data points in the same group are more
similar to each other than to the data points in other groups. It is
essentially a grouping of objects on the basis of the similarity and
dissimilarity between them, and it is a type of unsupervised learning
method. It works by finding similar patterns in the unlabelled dataset,
such as shape, size, colour or behaviour, and divides the data according
to the presence or absence of those patterns. After the clustering
technique is applied, each cluster or group is given a cluster-ID. An ML
system can use this ID to simplify the processing of large and complex
datasets.

If the examples are labelled, then clustering becomes classification. The
process of classifying the input instances based on their corresponding
class labels is known as classification, whereas grouping the instances
based on their similarity without the help of class labels is known as
clustering.

Clustering can be divided into two subgroups:
• Hard Clustering: In hard clustering, each data point either belongs to a
  cluster completely or does not belong to it at all.
• Soft Clustering: In soft clustering, instead of putting each data point
  into a single cluster, a probability or likelihood of that data point
  belonging to each cluster is assigned.

Fig. 4. Clustering Example

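The difference between hard and soft clustering described in Section VI can be illustrated with a small sketch. It assumes that the cluster centres are already known; the function names and the softmax-over-distance weighting are illustrative assumptions, not a method taken from the paper.

```python
import math

def hard_assign(point, centroids):
    """Hard clustering: return the index of the single nearest centroid."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def soft_assign(point, centroids):
    """Soft clustering: return one membership weight per cluster, summing to 1.

    The weights come from a softmax over negative distances, which is an
    illustrative choice rather than a specific published algorithm.
    """
    scores = [math.exp(-math.dist(point, c)) for c in centroids]
    total = sum(scores)
    return [s / total for s in scores]

centroids = [(0.0, 0.0), (4.0, 0.0)]   # hypothetical cluster centres
point = (1.0, 0.0)

print(hard_assign(point, centroids))   # 0
print(soft_assign(point, centroids))   # two weights, the larger one for cluster 0
```

The hard assignment yields a single cluster-ID, while the soft assignment keeps a degree of membership for every cluster.
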
VII. TYPES OF CLUSTERING MODELS

A. Connectivity Model:

In these models, data points that are closer together in the data space
exhibit more similarity to each other than data points lying farther
away. These models can follow two approaches. In the first approach, they
start by classifying all data points into separate clusters and then
aggregate them as the distance decreases. In the second approach, all data
points are classified as a single cluster, which is then partitioned as
the distance increases.

B. Centroid Model:

These are iterative clustering algorithms in which the notion of
similarity is derived from the closeness of a data point to the centroid
of a cluster. In these models, the number of clusters required at the end
has to be specified beforehand, which makes it important to have prior
knowledge of the dataset. These models run iteratively to find a local
optimum.

C. Distribution Model:

These clustering models are based on the notion of how probable it is that
all data points in a cluster belong to the same distribution (for example,
the normal, or Gaussian, distribution). These models often suffer from
overfitting.

D. Density Model:

These models search the data space for areas of varying density of data
points. They isolate the different density regions and assign the data
points within each region to the same cluster. Popular examples of density
models are DBSCAN and OPTICS.

VIII. TYPES OF CLUSTERING ALGORITHMS

A. K-Means Clustering:

The k-means algorithm is one of the most popular clustering algorithms. It
classifies the dataset by dividing the samples into different clusters of
equal variance. The number of clusters must be specified in advance. It is
fast and requires few computations: the cost of each iteration is linear
in the number of samples, O(n), for a fixed number of clusters and
dimensions. Each iteration refines the clusters, and the algorithm
converges to a local optimum rather than a guaranteed global one.

Fig. 5. K-Means Clustering

B. Hierarchical Clustering:

As the name suggests, this is an algorithm that builds a hierarchy of
clusters. The algorithm starts with every data point assigned to a cluster
of its own. The two nearest clusters are then merged into one cluster, and
this step is repeated. The algorithm terminates when only a single cluster
is left. The results of hierarchical clustering can be shown using a
dendrogram.

C. Density Based Spatial Clustering of Applications with Noise (DBSCAN):

DBSCAN is a base algorithm for density-based clustering. It can discover
clusters of different shapes and sizes in a large amount of data that
contains noise and outliers. It is an example of a density-based model. In
this algorithm, areas of high density are separated by areas of low
density. Because of this, clusters of any arbitrary shape can be found.
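
As an illustration of the density-based idea, the following is a simplified sketch of DBSCAN in plain Python. The parameter names `eps` (neighbourhood radius) and `min_pts` (minimum neighbours for a dense point) follow common usage and are assumptions here, as are the sample points; the brute-force neighbour search is chosen for readability, not speed.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)              # None means "not visited yet"

    def neighbours(i):
        # All points within eps of point i (the point itself included).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:               # not dense enough to start a cluster
            labels[i] = -1                     # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:                # noise next to a dense point joins as a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:   # j is dense too, so keep expanding
                queue.extend(j_neighbours)
    return labels

# Two hypothetical dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point is labelled -1, which shows how low-density regions separate the clusters and how outliers are set aside as noise.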

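The k-means procedure of Section VIII-A can be sketched as follows. This is a minimal version of the standard assignment and update loop, assuming that the initial centroids are supplied by the caller; the function name, the sample points and the starting centroids are hypothetical.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Alternate assignment and centroid-update steps until nothing changes."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: math.dist(p, centroids[k]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:         # converged to a local optimum
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data with two visible groups; the number of clusters (2) is
# fixed in advance through the two starting centroids.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = k_means(points, [(1, 1), (9, 8)])
print(clusters[0])  # [(1, 1), (1, 2), (2, 1)]
print(clusters[1])  # [(8, 8), (8, 9), (9, 8)]
```

Different starting centroids can lead to different final clusters, which is why the result is a local rather than a global optimum.
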
D. T-Distributed Stochastic Neighbour Embedding (t-SNE) Algorithm:

t-SNE is a non-linear dimensionality reduction algorithm used for
exploring high-dimensional data. It maps multi-dimensional data to two or
three dimensions suitable for human observation. With the help of the
t-SNE algorithm, you may need fewer exploratory data analysis plots the
next time you work with high-dimensional data.

E. Spectral Clustering:

Spectral clustering is a growing clustering algorithm that has performed
better than many traditional clustering algorithms in many cases. It
treats each data point as a graph node and thus transforms the clustering
problem into a graph-partitioning problem.

F. Agglomerative Hierarchical Algorithm:

The agglomerative hierarchical algorithm performs bottom-up hierarchical
clustering. Each data point is treated as a single cluster at the outset,
and clusters are then successively merged. It is not as well known as
K-Means, but it is quite flexible and often easier to interpret. The
cluster hierarchy can be represented as a dendrogram or tree structure.

IX. APPLICATIONS OF CLUSTERING

A. Spam Filter –

You know the junk folder in your email inbox? It is the place where emails
that have been identified as spam by the algorithm end up.

What the problem is: Spam emails are at best an annoying part of
modern-day marketing techniques and at worst an example of people phishing
for your personal data. To keep these emails out of your main inbox, email
companies use algorithms. The purpose of these algorithms is to flag
correctly whether an email is spam or not.

How clustering works: K-Means clustering techniques have proven to be an
effective way of identifying spam. They work by looking at the different
sections of the email (header, sender and content). The data is then
grouped together. These groups can then be classified to identify which
are spam. Including clustering in the classification process is reported
to improve the accuracy of the filter to 97%.

B. Identifying Fake News –

Fake news is not a new phenomenon, but it is one that is becoming
prolific.

What the problem is: Fake news is being created and spread at a rapid rate
due to technological innovations such as social media.

How clustering works: The algorithm takes in the content of the articles
(the corpus), examines the words used and then clusters them. These
clusters help the algorithm determine which pieces are genuine and which
are fake news. Certain words are found more commonly in sensationalized,
click-bait articles. A high percentage of such terms in an article gives a
higher probability that the material is fake news.

C. Market Analysis –

Personalization and targeting in marketing are big business. This is
achieved by looking at specific characteristics of a person and sharing
with them campaigns that have been successful with other, similar people.

What the problem is: If you are a business trying to get the best return
on your marketing investment, it is crucial that you target people in the
right way. If you get it wrong, you risk not making any sales or, worse,
damaging your customers' trust.

How clustering works: Clustering algorithms are able to group together
people with similar traits and a similar likelihood to purchase. Once you
have the groups, you can run tests on each group with different marketing
copy, which will help you better target your messaging to them in the
future.

D. Classifying Network Traffic –

Imagine you want to understand the different types of traffic coming to
your website. You are particularly interested in understanding which
traffic is spam or coming from bots.

What the problem is: As more and more services begin to use APIs on your
application, or as your website grows, it is important to know where the
traffic is coming from. For example, you want to be able to block harmful
traffic and double down on areas driving growth. However, it is hard to
know which is which when it comes to classifying the traffic.

How clustering works: K-means clustering is used to group together
characteristics of the traffic sources. Once the clusters are created, you
can classify the traffic types. With precise information on traffic
sources, you are able to grow your site and plan capacity effectively.

E. Identifying Fraudulent or Criminal Activity –

In this scenario, we are going to focus on fraudulent taxi driver
behaviour. However, the technique has been used in multiple scenarios.

What the problem is: You need to look into fraudulent driving activity.
The challenge is how to identify which activity is genuine and which is
false.

How clustering works: By analysing the GPS logs, the algorithm is able to
group similar behaviours. Based on the characteristics of the groups, you
are then able to classify them into those that are real and those that are
fraudulent.

F. Document Analysis –

There are many different reasons why you would want to run an analysis on
a document. In this scenario, you want to be able to organize the
documents quickly and efficiently.

What the problem is: Imagine you are limited in time and need to organize
information held in documents quickly. To complete this task you need to
understand the theme of the text, compare it with other documents and
classify it.

How clustering works: Hierarchical clustering has been used to solve this
problem. The algorithm is able to look at the text and group it into
different themes. Using this technique, you can cluster and organize
similar documents quickly using the characteristics identified in the
text.

G. Fantasy Football and Sports –

Now let’s look at a truly critical issue: fantasy football!

What the problem is: Who should you have in your team? Which players are
going to perform best for your team and allow you to beat the competition?
The challenge at the start of the season is that there is very little, if
any, data available to help you identify the winning players.
multiple scenarios.

How clustering works: When there is little performance data available to
train your model on, unsupervised learning has an advantage. In this type
of machine learning problem, you can find similar players using some of
their characteristics. This has been done using K-Means clustering.
Ultimately this means you can get a better team more quickly at the start
of the year, giving you an advantage.

X. ADVANTAGES OF CLUSTERING

A. Clustering is very useful in data analysis and pattern recognition.
B. Clustering with dendrograms gives a clear visualization that is
   practical and easy to understand.

XI. DISADVANTAGES OF CLUSTERING

A. Algorithm selection is not easy.
B. The speed at which a clustering algorithm produces its results is a
   challenge.

XII. BENEFITS OF CLUSTERING

The following benefits relate to clustering of computing resources
(cluster computing) rather than to the clustering of data:

A. Increased resource availability.
B. Strategic resource usage.
C. Multiple machines provide greater processing power, hence increasing
   the performance.
D. Greater scalability.
E. Clustering simplifies the management of large or rapidly growing
   systems.

XIII. CONCLUSION

In summary, machine learning is a technique for training machines to
perform the activities a human brain can do, albeit somewhat faster and
better than an average human being. Clustering is an integral part of data
mining and machine learning. It segments datasets into groups with similar
characteristics, which can help you make better predictions of user
behaviour. Various clustering models and algorithms are available to help
create the best potential groups of data objects. Clustering has a large
number of applications in various domains such as data mapping and
customer reports. Moreover, clustering can improve the accuracy of a
machine learning approach. We have discussed the various ways of
performing clustering.

Although clustering is easy to implement, you need to take care of some
important aspects, such as treating outliers in your data and making sure
each cluster has a sufficient population. Finally, clustering is rarely
the end of your analysis; often, it is just the beginning.

XIV. REFERENCES

[1] https://towardsdatascience.com/hac-hierarchical-agglomerative-clustering-is-it-better-than-k-means-4ff6f459e390

[2] https://www.educba.com/clustering-algorithms/

[3] https://developers.google.com/machine-learning/clustering/clustering-algorithms

[4] https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/

[5] https://www.kdnuggets.com/2020/04/dbscan-clustering-algorithm-machine-learning.html

[6] Amandeep Kaur Mann and Navneet Kaur, "Survey Paper on Clustering
Techniques", IJSETR: International Journal of Science Engineering and
Technology Research, vol. 2, no. 4, April 2013, ISSN 2278-7798.

[7] Arun K. Pujari, Data mining techniques-a reference book, pp. 114-147.

[8] Anil K. Jain and Richard C. Dubes, Algorithms for Clustering Data,
Prentice Hall (1988).
