
UNIT-IV

What Is a Classifier in Machine Learning?

A classifier in machine learning is an algorithm that automatically orders or categorizes data into one or more of a set of “classes.”

One of the most common examples is an email classifier that scans emails and filters them by class label: Spam or Not Spam.

What’s the Difference Between a Classifier and a Model?

A classifier is the algorithm itself – the rules used by machines to classify data.

A classification model, on the other hand, is the end result of your classifier’s machine
learning.

The model is trained using the classifier, and it is the model that ultimately classifies your data.
There are both supervised and unsupervised classifiers.

Unsupervised machine learning classifiers are fed only unlabeled datasets, which they classify according to patterns, structures, and anomalies in the data.

Supervised and semi-supervised classifiers are fed training datasets, from which they
learn to classify data according to predetermined categories.

What is the role of a classifier?

For example, a classifier can be used to classify features extracted from a face image and provide a label (the identity of the individual).

The typical recognition/classification framework in Artificial Vision uses a set of object features for discrimination.
How do you describe classifier accuracy?

For classification, the accuracy estimate is the overall number of correct classifications from the k iterations (for example, the k folds of cross-validation), divided by the total number of tuples in the initial data.

For prediction, the error estimate can be computed as the total loss from the k iterations, divided by the total number of initial tuples.

Classifier Accuracy

Evaluating and estimating the accuracy of classifiers is important in that it allows one to evaluate how accurately a given classifier will label future data, that is, data on which the classifier has not been trained.

For example, suppose you used data from previous sales to train a classifier to predict customer purchasing behavior. You would want an estimate of how accurately the classifier can predict the purchasing behavior of future customers, that is, customers on whom it has not been trained.
Techniques/Methods To Find Accuracy Of The Classifiers
1. Holdout Method
2. Random Subsampling
3. K-fold Cross-Validation
4. Bootstrap Methods
Holdout Method

In the holdout method, the given dataset is randomly divided into three subsets:
The training set is the subset of the dataset used to build predictive models.

The validation set is a subset of the dataset used to assess the performance of the model built in the training phase. It provides a test platform for fine-tuning the model’s parameters and selecting the best-performing model. Not all modeling algorithms need a validation set.

The test set, or unseen examples, is the subset of the dataset used to assess the likely future performance of the model. If a model fits the training set much better than it fits the test set, overfitting is probably the cause.
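
A minimal sketch of the three-way holdout split, assuming scikit-learn and its built-in Iris dataset; the 60/20/20 proportions and variable names are illustrative choices, not prescribed by the method:

    # Holdout: carve the data into training, validation, and test subsets.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # Hold out 20% of the data as the unseen test set.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Split the remainder into training and validation sets
    # (0.25 of the remaining 80% gives 60/20/20 overall).
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=0)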
Random Subsampling

Random subsampling is a variation of the holdout method in which the holdout procedure is repeated K times.

Each repetition randomly splits the data into a training set and a test set.

A model is trained on the training set, and the mean squared error (MSE) is obtained from its predictions on the test set.

Because the MSE depends on the particular split, a new split can give a new MSE; relying on a single split is therefore not recommended, and the results are averaged over the K repetitions.

The overall error is calculated as $E = \frac{1}{K}\sum_{i=1}^{K} E_i$, where $E_i$ is the error from the $i$-th repetition.
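
A minimal sketch of random subsampling, assuming scikit-learn, the Iris dataset, a decision tree classifier, and K = 10 repetitions; the averaged error matches the formula above:

    # Random subsampling: repeat the holdout split K times and average the error.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    K = 10
    errors = []
    for k in range(K):
        # A different random_state gives a different random split each time.
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=k)
        model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
        errors.append(1.0 - model.score(X_te, y_te))  # per-split error E_i

    print("Overall error estimate:", np.mean(errors))  # E = (1/K) * sum of E_i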


Cross-Validation

K-fold cross-validation is used when only a limited amount of data is available, in order to achieve an unbiased estimate of the model's performance.

Here, we divide the data into K subsets of equal sizes.

We build models K times, each time leaving out one of the subsets from training and using it as the test set.

If K equals the sample size, this is called “Leave-One-Out” cross-validation.
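
A minimal sketch of K-fold cross-validation, assuming scikit-learn with the Iris dataset, a decision tree, and K = 5 (an illustrative choice):

    # K-fold cross-validation: each of the K folds serves once as the test set.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
    print("Per-fold accuracy:", scores)
    print("Mean accuracy:", scores.mean())
    # Setting cv to the number of samples would give Leave-One-Out cross-validation.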


Bootstrapping

Bootstrapping is a technique used to make estimations from the data by taking an average of the estimates from smaller data samples.

The bootstrapping method involves the iterative resampling of a dataset with replacement.

With resampling, instead of estimating a statistic only once on the complete data, we can estimate it many times.

Repeating this multiple times helps to obtain a vector of estimates.

From these estimates, bootstrapping can compute the variance, expected value, and other relevant statistics.
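
A minimal sketch of bootstrapping with NumPy, estimating the mean of an assumed toy sample; the sample, the number of resamples B, and the chosen statistic are illustrative assumptions:

    # Bootstrap: resample with replacement, re-estimate the statistic each time.
    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=50.0, scale=10.0, size=200)  # assumed toy sample

    B = 1000                      # number of bootstrap resamples
    boot_means = np.empty(B)
    for b in range(B):
        sample = rng.choice(data, size=data.size, replace=True)  # with replacement
        boot_means[b] = sample.mean()

    # The vector of estimates approximates the statistic's distribution.
    print("Expected value:", boot_means.mean())
    print("Variance:", boot_means.var())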
How do you evaluate accuracy of a classifier?

Accuracy. The accuracy of a classifier is the number of correct predictions divided by the total number of instances, usually expressed as a percentage.

Recall. Recall, one of the most used evaluation metrics for unbalanced datasets, measures the fraction of actual positive instances that the model identifies correctly.

Precision. Precision measures the fraction of instances predicted as positive that are actually positive.

F1 Score. The F1 score is the harmonic mean of precision and recall.

ROC Curve. The ROC curve plots the true positive rate against the false positive rate across classification thresholds; the area under it summarizes performance.
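
A minimal sketch computing the listed metrics with scikit-learn; y_true, y_pred, and y_score are made-up illustrative values:

    # Computing accuracy, precision, recall, F1, and the area under the ROC curve.
    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, roc_auc_score)

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]                      # actual labels
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]                      # predicted labels
    y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.6, 0.3]     # predicted probabilities

    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall   :", recall_score(y_true, y_pred))
    print("F1 score :", f1_score(y_true, y_pred))
    print("ROC AUC  :", roc_auc_score(y_true, y_score))    # summarizes the ROC curve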
Bootstrapping is a resampling technique that helps in estimating the uncertainty of a statistical model.
It involves sampling the original dataset with replacement, generating multiple new datasets of the same size as the original.
Each of these new datasets is then used to calculate the desired statistic, such as the mean or standard deviation.
This process is repeated multiple times, and the resulting values are used to construct a probability distribution for the desired statistic.
Clustering

Clustering or cluster analysis is a machine learning technique that groups an unlabeled dataset.
It can be defined as “a way of grouping the data points into different clusters, consisting of similar data points. The objects with possible similarities remain in a group that has few or no similarities with another group.”

It does this by finding similar patterns in the unlabeled dataset, such as shape, size, color, and behavior, and divides the data according to the presence or absence of those patterns.

It is an unsupervised learning method; hence, no supervision is provided to the algorithm, and it deals with unlabeled data.

After applying this clustering technique, each cluster or group is given a cluster ID. An ML system can use this ID to simplify the processing of large and complex datasets.
Example: Let's understand the clustering technique with the real-world example of a shopping mall:

When we visit any shopping mall, we can observe that things with similar usage are grouped together.

For instance, t-shirts are grouped in one section and trousers in another; similarly, in the fruit and vegetable section, apples, bananas, mangoes, etc., are grouped separately, so that we can easily find things.

The clustering technique works in the same way. Another example of clustering is grouping documents by topic.
Types of Clustering Methods

Clustering methods are broadly divided into hard clustering (each data point belongs to only one group) and soft clustering (data points can also belong to other groups).

Various other clustering approaches exist as well.

Below are the main clustering methods used in Machine learning:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
1. Partitioning Clustering

It is a type of clustering that divides the data into non-hierarchical groups. It is also known as
the centroid-based method.

The most common example of partitioning clustering is the K-Means Clustering algorithm.

In this type, the dataset is divided into a set of K groups, where K defines the number of pre-specified groups.

Cluster centers are placed so that the distance between the data points of one cluster and their own centroid is minimal compared with their distance to any other cluster's centroid.
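
A minimal sketch of K-Means with scikit-learn; the synthetic blob data and k = 3 are illustrative assumptions:

    # Partitioning clustering: K-Means assigns each point to the nearest centroid.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(kmeans.cluster_centers_)   # one centroid per cluster
    print(kmeans.labels_[:10])       # cluster ID assigned to each point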
2. Density-Based Clustering

The density-based clustering method connects highly dense areas into clusters, and arbitrarily shaped clusters are formed as long as the dense regions can be connected.

The algorithm does this by identifying distinct dense regions in the dataset and connecting the areas of high density into clusters.

The dense areas in data space are separated from each other by sparser areas.

These algorithms can have difficulty clustering the data points if the dataset has varying densities and high dimensionality.
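
A minimal sketch of density-based clustering using scikit-learn's DBSCAN; the two-moons data and the eps and min_samples values are illustrative assumptions:

    # Density-based clustering: DBSCAN grows clusters from dense neighborhoods.
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
    labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
    print(set(labels))  # cluster IDs; -1 marks points left in sparse regions (noise)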
3. Distribution Model-Based Clustering

In the distribution model-based clustering method, the data is divided based on the probability that each data point belongs to a particular distribution.

The grouping is done by assuming some distribution for the data, most commonly the Gaussian distribution.

An example of this type is the Expectation-Maximization clustering algorithm, which uses Gaussian Mixture Models (GMM).
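
A minimal sketch of distribution model-based clustering with scikit-learn's GaussianMixture, which is fitted via Expectation-Maximization; the blob data and n_components = 3 are illustrative assumptions:

    # Distribution-based clustering: fit a Gaussian Mixture Model with EM.
    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
    print(gmm.means_)                 # mean of each Gaussian component
    print(gmm.predict_proba(X[:5]))   # probability of belonging to each distribution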
4. Hierarchical Clustering

Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no requirement to pre-specify the number of clusters to be created.

 In this technique, the dataset is divided into clusters to create a tree-like structure, which
is also called a dendrogram.

Any desired number of clusters can then be obtained by cutting the tree at the appropriate level.

The most common example of this method is the Agglomerative Hierarchical algorithm.
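
A minimal sketch of agglomerative hierarchical clustering with scikit-learn; the blob data and the choice to cut the tree at three clusters are illustrative assumptions:

    # Hierarchical clustering: merge points bottom-up, then cut the tree.
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
    labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
    print(labels[:10])  # cluster ID per observation
    # scipy.cluster.hierarchy (linkage + dendrogram) can draw the full tree.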
5. Fuzzy Clustering

Fuzzy clustering is a type of soft clustering in which a data object may belong to more than one group or cluster.

Each data point has a set of membership coefficients that reflect its degree of membership in each cluster.

The Fuzzy C-means algorithm is an example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.
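
Scikit-learn does not ship Fuzzy C-means, so the sketch below implements the standard update rules directly in NumPy; the cluster count c, fuzzifier m, and iteration count are illustrative assumptions:

    # Fuzzy C-Means: every point gets a membership coefficient for every cluster.
    import numpy as np

    def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        U = rng.random((len(X), c))
        U /= U.sum(axis=1, keepdims=True)       # memberships of each point sum to 1
        for _ in range(n_iter):
            W = U ** m                          # fuzzified membership weights
            centers = (W.T @ X) / W.sum(axis=0)[:, None]   # weighted centroids
            dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
            U = dist ** (-2.0 / (m - 1.0))      # closer centers get higher membership
            U /= U.sum(axis=1, keepdims=True)
        return centers, U

Hard labels, if needed, can be recovered with U.argmax(axis=1).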
What Is Data Visualization?

Data visualization is the process of graphical representation of data in the form of geographic maps, charts, sparklines, infographics, heat maps, or statistical graphs.

Data presented through visual elements is easy to understand and analyze, enabling the
effective extraction of actionable insights from the data.

Relevant stakeholders can then use the findings to make more efficient real-time
decisions.

Data visualization tools, incorporating support for streaming data, AI integration, embeddability, collaboration, interactive exploration, and self-service capabilities, facilitate the visual representation of data.
1. Tableau

Tableau is a highly popular tool for visualizing data for two main reasons: it's easy to use
and very powerful.

You can connect it to lots of data sources and create all sorts of charts and maps.
Salesforce owns Tableau, and it is widely used by individuals and large companies.

Tableau is available in different versions, including desktop, server, and web-based options, along with integrated customer relationship management (CRM) software.

Providing integration with advanced databases, including Teradata, SAP, MySQL, Amazon AWS, and Hadoop, Tableau efficiently creates visualizations and graphics from large, constantly evolving datasets used for artificial intelligence, machine learning, and Big Data applications.
2. Dundas BI

Dundas BI offers highly customizable data visualizations with interactive scorecards, maps, gauges, and charts, optimizing the creation of ad-hoc, multi-page reports.

By providing users full control over visual elements, Dundas BI simplifies the complex operations of cleansing, inspecting, transforming, and modeling big datasets.

3. JupyteR

A web-based application, JupyteR is one of the top-rated data visualization tools, enabling users to create and share documents containing visualizations, equations, narrative text, and live code.

JupyteR is ideal for data cleansing and transformation, statistical modeling, numerical
simulation, interactive computing, and machine learning.
4. Zoho Reports

Zoho Reports, also known as Zoho Analytics, is a comprehensive data visualization tool that integrates Business Intelligence and online reporting services, allowing the quick creation and sharing of extensive reports in minutes.

The high-grade visualization tool also supports the import of Big Data from major
databases and applications.

5. Google Charts

One of the major players in the data visualization market, Google Charts, coded with SVG and HTML5, is famed for its capability to produce graphical and pictorial data visualizations.

Google Charts offers zoom functionality, and it provides users with unmatched cross-
platform compatibility with iOS, Android, and even the earlier versions of the Internet
Explorer browser.
6. Visual.ly

Visual.ly is a data visualization tool renowned for its impressive distribution network, which illustrates project outcomes.

Employing a dedicated creative team for data visualization services, Visual.ly streamlines the process of importing and outsourcing data, even to third parties.

7. Highcharts

Deployed by seventy-two of the world's top hundred companies, the Highcharts tool is well suited to visualizing streaming big data analytics.

Running on a JavaScript API and offering integration with jQuery, Highcharts provides cross-browser support that facilitates easy access to interactive visualizations.
