
Top Data Science Tools

Here is a list of 14 of the best data science tools that most data scientists use.
1. SAS
It is one of those data science tools that are specifically designed for statistical operations. SAS is
closed-source, proprietary software that is used by large organizations to analyze data. SAS uses
the base SAS programming language for statistical modeling. It is widely used by
professionals and companies that rely on reliable commercial software. SAS offers numerous
statistical libraries and tools that you, as a Data Scientist, can use for modeling and organizing
data. While SAS is highly reliable and has strong support from the company, it is highly expensive
and is generally used only by larger enterprises. Also, SAS pales in comparison with some of the more
modern open-source tools. Furthermore, several libraries and packages in SAS
are not available in the base pack and can require an expensive upgrade.
2. Apache Spark
Apache Spark, or simply Spark, is a powerful analytics engine and one of the most used Data
Science tools. Spark is specifically designed to handle both batch processing and stream processing. It
comes with many APIs that let Data Scientists access data repeatedly for Machine
Learning, SQL storage, etc. It is an improvement over Hadoop and can perform up to 100 times
faster than MapReduce. Spark provides many Machine Learning APIs that help Data Scientists
make powerful predictions from the given data.
Spark does better than other Big Data platforms in its ability to handle streaming data. This means
that Spark can process real-time data, whereas many other analytical tools process only
historical data in batches. Spark offers APIs that are programmable in Python, Java, and R,
but its most powerful pairing is with the Scala programming language, which runs on the
Java Virtual Machine and is cross-platform in nature.
Spark is highly efficient at cluster management, which makes it much better suited to processing than
Hadoop, since the latter is primarily used for storage. It is this cluster management system that allows
Spark to process applications at high speed.
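
As an illustration of Spark's Python API, here is a minimal PySpark sketch (assuming Spark and the pyspark package are installed locally; the file name and column names are hypothetical placeholders):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; on a real cluster this would use YARN, Kubernetes, etc.
spark = SparkSession.builder.appName("sales-summary").master("local[*]").getOrCreate()

# Read a CSV file into a DataFrame ("sales.csv" and its columns are placeholders)
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Group by region and sum the revenue column, then print the result
df.groupBy("region").agg(F.sum("revenue").alias("total_revenue")).show()

spark.stop()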
3. BigML
BigML is another widely used Data Science tool. It provides a fully interactive, cloud-based
GUI environment that you can use for running Machine Learning algorithms. BigML provides
standardized, cloud-based software for industry requirements. Through it, companies can
apply Machine Learning algorithms across various parts of the business. For example, the same
software can be used for sales forecasting, risk analytics, and product innovation. BigML
specializes in predictive modeling. It offers a wide variety of Machine Learning techniques such as
clustering, classification, time-series forecasting, etc.
BigML provides an easy-to-use web interface and REST APIs, and you can create a free account or
a premium account based on your data needs. It allows interactive visualization of data and
lets you export visual charts to your mobile or IoT devices.
Furthermore, BigML comes with various automation methods that can help you automate the
tuning of model hyperparameters and even automate the workflow of reusable scripts.
4. D3.js
JavaScript is mainly used as a client-side scripting language. D3.js, a JavaScript library, allows you
to make interactive visualizations in your web browser. With the several APIs of D3.js, you can use
many functions to create dynamic visualization and analysis of data in your browser. Another
powerful feature of D3.js is its animated transitions. D3.js makes documents dynamic by
allowing updates on the client side and actively using changes in the data to update visualizations in
the browser.
You can combine it with CSS to create polished, animated visualizations that help you
implement customized graphs on web pages. Overall, it can be a very useful tool for Data
Scientists who are working on IoT-based applications that require client-side interaction for
visualization and data processing.
5. MATLAB
MATLAB is a multi-paradigm numerical computing environment for processing mathematical
information. It is a closed-source software that facilitates matrix functions, algorithmic
implementation and statistical modeling of data. MATLAB is most widely used in several scientific
disciplines.
In Data Science, MATLAB is used for simulating neural networks and fuzzy logic. Using the MATLAB
graphics library, you can create powerful visualizations. MATLAB is also used in image and signal
processing. This makes it a very versatile tool for Data Scientists as they can tackle all the
problems, from data cleaning and analysis to more advanced Deep Learning algorithms.
Furthermore, MATLAB's easy integration with enterprise applications and embedded systems makes
it an ideal Data Science tool. It also helps in automating various tasks ranging from extraction of
data to re-use of scripts for decision making. However, it suffers from the limitation of being
closed-source proprietary software.
6. Excel
It is probably the most widely used data analysis tool. Microsoft developed Excel mostly for
spreadsheet calculations, and today it is widely used for data processing, visualization, and
complex calculations. Excel is a powerful analytical tool for Data Science. While it has been the
traditional tool for data analysis, Excel still packs a punch.
Excel comes with various formulas, tables, filters, slicers, etc. You can also create your own
custom functions and formulas in Excel. While Excel is not built for handling huge amounts of
data, it is still an ideal choice for creating powerful data visualizations and spreadsheets. You can
also connect Excel to SQL and use it to manipulate and analyze data. A lot of Data Scientists
use Excel for data cleaning as it provides an interactive GUI environment to pre-process
information easily.

With the Analysis ToolPak add-in for Microsoft Excel, it is now much easier to perform complex
analyses. However, it still pales in comparison with much more advanced Data Science tools
like SAS. Overall, at a small, non-enterprise scale, Excel is an ideal tool for data analysis.
7. ggplot2
ggplot2 is an advanced data visualization package for the R programming language. Its
developers created it to replace R's native graphics package, and it uses powerful
grammar-of-graphics commands to create polished visualizations. It is one of the most widely used
libraries that Data Scientists use for creating visualizations from analyzed data.
ggplot2 is part of the tidyverse, a collection of R packages designed for Data Science. One way in which
ggplot2 is much better than most other data visualization tools is aesthetics. With ggplot2, Data
Scientists can create customized visualizations in order to engage in enhanced storytelling. Using
ggplot2, you can annotate your data in visualizations, add text labels to data points and improve the
interactivity of your graphs. You can also create various styles of maps such as choropleths,
cartograms, hexbins, etc. It is among the most used data science tools.
8. Tableau
Tableau is a Data Visualization software that is packed with powerful graphics to make interactive
visualizations. It is focused on industries working in the field of business intelligence. The most
important aspect of Tableau is its ability to interface with databases, spreadsheets, OLAP (Online
Analytical Processing) cubes, etc. Along with these features, Tableau can visualize
geographical data, plotting longitudes and latitudes on maps.
Along with visualizations, you can also use its analytics tool to analyze data. Tableau comes with
an active community and you can share your findings on the online platform. While Tableau is
enterprise software, it comes with a free version called Tableau Public.
9. Jupyter
Project Jupyter is an open-source tool based on IPython that helps developers make open-
source software and experience interactive computing. Jupyter supports multiple languages like
Julia, Python, and R. It is a web-application tool used for writing live code, visualizations, and
presentations. Jupyter is a widely popular tool that is designed to address the requirements of
Data Science.
It is an interactive environment through which Data Scientists can perform all of their
responsibilities. It is also a powerful tool for storytelling, as it has various presentation features.
Using Jupyter Notebooks, one can perform data cleaning, statistical computation, and
visualization, and create predictive machine learning models. It is 100% open-source and is,
therefore, free of cost. There is an online Jupyter-style environment called Google Colaboratory,
which runs in the cloud and can store data in Google Drive.
10. Matplotlib
Matplotlib is a plotting and visualization library developed for Python. It is among the most popular
tools for generating graphs from analyzed data. It is mainly used for plotting complex graphs using
simple lines of code. Using it, one can generate bar plots, histograms, scatter plots, etc.
Matplotlib has several essential modules. One of the most widely used modules is pyplot, which offers
a MATLAB-like interface. Pyplot is also an open-source alternative to MATLAB's graphics
modules.
Matplotlib is a preferred tool for data visualization and is used by Data Scientists over other
contemporary tools. As a matter of fact, NASA used Matplotlib for illustrating data visualizations
during the landing of the Phoenix spacecraft. It is also an ideal tool for beginners learning data
visualization with Python.
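
For example, here is a short matplotlib sketch that draws a histogram and a scatter plot; the data below is randomly generated purely for illustration:

import numpy as np
import matplotlib.pyplot as plt

# Generate some sample data (purely illustrative)
rng = np.random.default_rng(0)
heights = rng.normal(170, 10, 500)        # 500 normally distributed values
x = rng.uniform(0, 10, 100)
y = 2 * x + rng.normal(0, 2, 100)         # a noisy linear relationship

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of the simulated heights
ax1.hist(heights, bins=30, color="steelblue")
ax1.set_title("Histogram")

# Scatter plot of the noisy linear relationship
ax2.scatter(x, y, s=15)
ax2.set_title("Scatter plot")

plt.tight_layout()
plt.show()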
11. NLTK
Natural Language Processing has emerged as one of the most popular fields in Data Science. It deals with
the development of statistical models that help computers understand human language. These
statistical models are part of Machine Learning and, through several of its algorithms, are able to
assist computers in understanding natural language. Python comes with a collection of
libraries called the Natural Language Toolkit (NLTK) developed for this particular purpose.
NLTK is widely used for various language processing techniques like tokenization, stemming,
tagging, parsing and machine learning. It consists of over 100 corpora, which are collections of
data for building machine learning models. It has a variety of applications such as Part-of-Speech
Tagging, Word Segmentation, Machine Translation, Text-to-Speech, Speech Recognition, etc.
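
A small NLTK sketch showing tokenization, stop-word removal, and stemming (this assumes the relevant NLTK data packages, e.g. "punkt" and "stopwords", can be downloaded):

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# One-time downloads of the tokenizer models and stop-word lists
nltk.download("punkt")
nltk.download("stopwords")

text = "Natural Language Processing helps computers understand human language."

tokens = word_tokenize(text)                      # split the sentence into word tokens
stop_words = set(stopwords.words("english"))      # common English words to ignore
filtered = [t for t in tokens if t.lower() not in stop_words and t.isalpha()]

stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in filtered]       # reduce words to their stems

print(tokens)
print(stems)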
12. Scikit-learn
Scikit-learn is a Python library that is used for implementing Machine Learning
algorithms. It is a simple, easy-to-use tool that is widely used for analysis and data
science. It supports a variety of Machine Learning features such as data preprocessing,
classification, regression, clustering, dimensionality reduction, etc.
Scikit-learn makes it easy to use complex machine learning algorithms. It is therefore well suited to
situations that require rapid prototyping and is also an ideal platform for research requiring basic
Machine Learning. It makes use of several underlying Python libraries such as SciPy, NumPy,
Matplotlib, etc.
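
As a small illustration, the following scikit-learn sketch trains a classifier on the built-in Iris dataset and reports its accuracy on a held-out test set (the model and split sizes are arbitrary choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a small built-in dataset and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Standardize the features, then fit a simple classifier
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=200)
model.fit(scaler.transform(X_train), y_train)

# Evaluate on the unseen test data
preds = model.predict(scaler.transform(X_test))
print("Test accuracy:", accuracy_score(y_test, preds))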
13. TensorFlow
TensorFlow has become a standard tool for Machine Learning. It is widely used for advanced
machine learning algorithms like Deep Learning. Developers named TensorFlow after Tensors
which are multidimensional arrays. It is an open-source and ever-evolving toolkit which is known
for its performance and high computational abilities. TensorFlow can run on both CPUs and GPUs
and has recently emerged on more powerful TPU platforms. This gives it an unprecedented edge in
terms of the processing power of advanced machine learning algorithms.
Due to its high processing ability, TensorFlow has a variety of applications such as speech
recognition, image classification, drug discovery, and image and language generation. For Data
Scientists specializing in Machine Learning, TensorFlow is a must-know tool.
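
A minimal TensorFlow/Keras sketch of a small neural network for classifying handwritten digits (the MNIST dataset is bundled with Keras; the layer sizes and epoch count here are arbitrary choices for illustration):

import tensorflow as tf

# Load the MNIST digits dataset bundled with Keras and scale pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A small feed-forward network; the layer sizes are arbitrary choices
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Train briefly and evaluate on the test set
model.fit(x_train, y_train, epochs=2, batch_size=128)
model.evaluate(x_test, y_test)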
14. Weka
Weka or Waikato Environment for Knowledge Analysis is a machine learning software written in
Java. It is a collection of various Machine Learning algorithms for data mining. Weka consists of
various machine learning tools like classification, clustering, regression, visualization and data
preparation.
It is open-source GUI software that allows easier implementation of machine learning
algorithms through an interactive platform. You can understand how Machine
Learning works on the data without having to write a line of code. It is ideal for Data Scientists who are
beginners in Machine Learning.

Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are
programmed to think like humans and mimic their actions. The term may also be applied to any
machine that exhibits traits associated with a human mind such as learning and problem-solving.
The ideal characteristic of artificial intelligence is its ability to rationalize and take actions that have
the best chance of achieving a specific goal. A subset of artificial intelligence is machine learning,
which refers to the concept that computer programs can automatically learn from and adapt to
new data without being assisted by humans. Deep learning techniques enable this automatic
learning through the absorption of huge amounts of unstructured data such as text, images, or
video.
Understanding Artificial Intelligence (AI)
When most people hear the term artificial intelligence, the first thing they usually think of is
robots. That's because big-budget films and novels weave stories about human-like machines that
wreak havoc on Earth. But nothing could be further from the truth.
Artificial intelligence is based on the principle that human intelligence can be defined in a way that
a machine can easily mimic it and execute tasks, from the most simple to those that are even
more complex. The goals of artificial intelligence include mimicking human cognitive activity.
Researchers and developers in the field are making surprisingly rapid strides in mimicking
activities such as learning, reasoning, and perception, to the extent that these can be concretely
defined. Some believe that innovators may soon be able to develop systems that exceed the
capacity of humans to learn or reason out any subject. But others remain skeptical because all
cognitive activity is laced with value judgements that are subject to human experience.
As technology advances, previous benchmarks that defined artificial intelligence become outdated.
For example, machines that calculate basic functions or recognize text through optical character
recognition are no longer considered to embody artificial intelligence, since this function is now
taken for granted as an inherent computer function.
AI is continuously evolving to benefit many different industries. Machines are wired using a cross-
disciplinary approach based on mathematics, computer science, linguistics, psychology, and more.
Algorithms often play a very important part in the structure of artificial intelligence, where simple
algorithms are used in simple applications, while more complex ones help frame strong artificial
intelligence.
Applications of Artificial Intelligence
The applications for artificial intelligence are endless. The technology can be applied to many
different sectors and industries. AI is being tested and used in the healthcare industry for dosing
drugs, tailoring different treatments to patients, and assisting with surgical procedures in the operating room.
Other examples of machines with artificial intelligence include computers that play chess and self-
driving cars. Each of these machines must weigh the consequences of any action they take, as
each action will impact the end result. In chess, the end result is winning the game. For self-
driving cars, the computer system must account for all external data and compute it to act in a
way that prevents a collision.
Artificial intelligence also has applications in the financial industry, where it is used to detect and
flag activity in banking and finance such as unusual debit card usage and large account deposits—
all of which help a bank's fraud department. Applications for AI are also being used to help
streamline and make trading easier. This is done by making supply, demand, and pricing of
securities easier to estimate.
Categorization of Artificial Intelligence
Artificial intelligence can be divided into two different categories: weak and strong. Weak
artificial intelligence embodies a system designed to carry out one particular job. Weak AI
systems include video games such as the chess example from above and personal assistants such
as Amazon's Alexa and Apple's Siri. You ask the assistant a question, and it answers it for you.
Strong artificial intelligence systems are systems that carry out tasks considered to be
human-like. These tend to be more complex and complicated systems. They are programmed to
handle situations in which they may be required to problem solve without having a person
intervene. These kinds of systems can be found in applications like self-driving cars or in hospital
operating rooms.

Machine Learning
Machine learning is a growing technology which enables computers to learn automatically from
past data. Machine learning uses various algorithms for building mathematical models and
making predictions using historical data or information. Currently, it is being used for
various tasks such as image recognition, speech recognition, email filtering, Facebook
auto-tagging, recommender system, and many more.
This machine learning tutorial gives you an introduction to machine learning along with the wide
range of machine learning techniques such as Supervised, Unsupervised,
and Reinforcement learning. You will learn about regression and classification models, clustering
methods, hidden Markov models, and various sequential models.
What is Machine Learning
In the real world, we are surrounded by humans who can learn from their experiences
with their learning capability, and by computers or machines that simply work on our instructions.
But can a machine also learn from experience or past data the way a human does? This is where
Machine Learning comes in.

Machine Learning is a subset of artificial intelligence that is mainly concerned with the
development of algorithms which allow a computer to learn from data and past experience on
its own. The term machine learning was first introduced by Arthur Samuel in 1959. We can
define it in a summarized way as:
Machine learning enables a machine to automatically learn from data, improve performance from
experiences, and predict things without being explicitly programmed.
With the help of sample historical data, which is known as training data, machine learning
algorithms build a mathematical model that helps in making predictions or decisions without
being explicitly programmed. Machine learning brings computer science and statistics together for
creating predictive models. Machine learning constructs or uses algorithms that learn from
historical data. The more information we provide, the better the performance.
A machine has the ability to learn if it can improve its performance by gaining more
data.
How does Machine Learning work
A Machine Learning system learns from historical data, builds the prediction models, and
whenever it receives new data, predicts the output for it. The accuracy of predicted output
depends upon the amount of data, as the huge amount of data helps to build a better model which
predicts the output more accurately.
Suppose we have a complex problem where we need to make some predictions. Instead of
writing code for it, we just need to feed the data to generic algorithms, and with the help of
these algorithms, the machine builds the logic as per the data and predicts the output. Machine
learning has changed our way of thinking about such problems.

Features of Machine Learning:


o Machine learning uses data to detect various patterns in a given dataset.
o It can learn from past data and improve automatically.
o It is a data-driven technology.
o Machine learning is similar to data mining, as both deal with huge amounts of
data.
Need for Machine Learning
The need for machine learning is increasing day by day. The reason is that machine
learning is capable of doing tasks that are too complex for a person to implement directly.
As humans, we have limitations: we cannot manually process huge amounts of data, so
we need computer systems, and this is where machine learning makes things
easy for us.
We can train machine learning algorithms by providing them with huge amounts of data and letting
them explore the data, construct models, and predict the required output automatically. The
performance of a machine learning algorithm depends on the amount of data, and it can be
assessed by the cost function. With the help of machine learning, we can save both time and
money.
The importance of machine learning can be easily understood by its use cases. Currently,
machine learning is used in self-driving cars, cyber fraud detection, face recognition,
friend suggestions by Facebook, etc. Various top companies such as Netflix and Amazon
have built machine learning models that use vast amounts of data to analyze user
interests and recommend products accordingly.
Following are some key points which show the importance of Machine Learning:
o Rapid increase in the production of data
o Solving complex problems which are difficult for a human
o Decision making in various sectors, including finance
o Finding hidden patterns and extracting useful information from data
Classification of Machine Learning
At a broad level, machine learning can be classified into four types:
1. Supervised learning
2. Unsupervised learning
3. Semi-Supervised learning
4. Reinforcement learning

1) Supervised Learning
Supervised learning is a type of machine learning method in which we provide sample labeled data
to the machine learning system in order to train it, and on that basis, it predicts the output.
The system creates a model using labeled data to understand the datasets and learn about each
data point. Once training and processing are done, we test the model by providing sample
data to check whether it predicts the correct output.
The goal of supervised learning is to map input data to output data. Supervised learning
is based on supervision, much like a student learning things under the supervision of
a teacher. An example of supervised learning is spam filtering.
Supervised learning can be grouped further in two categories of algorithms:
o Classification
o Regression
2) Unsupervised Learning
Unsupervised learning is a learning method in which a machine learns without any supervision.
The training is provided to the machine with the set of data that has not been labeled, classified,
or categorized, and the algorithm needs to act on that data without any supervision. The goal of
unsupervised learning is to restructure the input data into new features or a group of objects with
similar patterns.
In unsupervised learning, we don't have a predetermined result. The machine tries to find useful
insights from huge amounts of data. It can be further classified into two categories of
algorithms:
o Clustering
o Association
3) Semi-Supervised Learning
Semi-Supervised Learning is a learning method in which a machine learns from a mix of labeled and
unlabeled data. It is a combination of Supervised and Unsupervised Learning.

4) Reinforcement Learning
Reinforcement learning is a feedback-based learning method, in which a learning agent gets a
reward for each right action and a penalty for each wrong action. The agent learns
automatically from this feedback and improves its performance. In reinforcement learning, the
agent interacts with the environment and explores it. The goal of the agent is to collect the most
reward points, and hence, it improves its performance.
A robotic dog that automatically learns the movement of its legs is an example of
Reinforcement learning.

Difference between Artificial intelligence and Machine learning


Artificial intelligence and machine learning are parts of computer science that are closely related
to each other. These two technologies are among the most trending technologies used for
creating intelligent systems.
Although they are related and people sometimes use them as synonyms for
each other, they are still two different terms in various contexts.
On a broad level, we can differentiate both AI and ML as:
AI is a bigger concept to create intelligent machines that can simulate human thinking capability
and behavior, whereas, machine learning is an application or subset of AI that allows machines to
learn from data without being programmed explicitly.
Below are some main differences between AI and machine learning along with the overview of
Artificial intelligence and machine learning.
Artificial Intelligence
Artificial intelligence is a field of computer science which makes computer systems that can mimic
human intelligence. It is composed of two words, "Artificial" and "intelligence", which together mean "a
human-made thinking power." Hence we can define it as:
Artificial intelligence is a technology with which we can create intelligent systems that can
simulate human intelligence.
An Artificial intelligence system does not need to be pre-programmed; instead, it uses
algorithms which can work with their own intelligence. It involves machine learning
algorithms such as Reinforcement learning and deep learning neural networks. AI is
being used in many places, such as Siri, Google's AlphaGo, AI in chess playing, etc.
Based on capabilities, AI can be classified into three types:
Weak AI
General AI
Strong AI
Currently, we are working with weak AI and general AI. The future of AI is Strong AI, which, it is
said, will be more intelligent than humans.
Machine learning
Machine learning is about extracting knowledge from the data. It can be defined as,
Machine learning is a subfield of artificial intelligence, which enables machines to learn from past
data or experiences without being explicitly programmed.
Machine learning enables a computer system to make predictions or take some decisions using
historical data without being explicitly programmed. Machine learning uses massive amounts of
structured and semi-structured data so that a machine learning model can generate accurate
results or give predictions based on that data.
Machine learning works on algorithms which learn on their own using historical data. It works only
for specific domains; for example, if we are creating a machine learning model to detect pictures of
dogs, it will only give results for dog images, and if we provide new data such as a cat image, the
model will fail to recognize it. Machine learning is being used in various places such as online
recommender systems, Google search algorithms, email spam filters, Facebook auto friend
tagging suggestions, etc.
Key differences between Artificial Intelligence (AI) and Machine learning (ML):

o AI: Artificial intelligence is a technology which enables a machine to simulate human behavior.
  ML: Machine learning is a subset of AI which allows a machine to automatically learn from past data without being explicitly programmed.

o AI: The goal of AI is to make a smart computer system, like humans, to solve complex problems.
  ML: The goal of ML is to allow machines to learn from data so that they can give accurate output.

o AI: In AI, we make intelligent systems to perform any task like a human.
  ML: In ML, we teach machines with data to perform a particular task and give an accurate result.

o AI: Machine learning and deep learning are the two main subsets of AI.
  ML: Deep learning is a main subset of machine learning.

o AI: AI has a very wide range of scope.
  ML: Machine learning has a limited scope.

o AI: AI is working to create an intelligent system which can perform various complex tasks.
  ML: Machine learning is working to create machines that can perform only those specific tasks for which they are trained.

o AI: An AI system is concerned with maximizing the chances of success.
  ML: Machine learning is mainly concerned with accuracy and patterns.

o AI: The main applications of AI are Siri, customer support using chatbots, Expert Systems, online game playing, intelligent humanoid robots, etc.
  ML: The main applications of machine learning are online recommender systems, Google search algorithms, Facebook auto friend tagging suggestions, etc.

o AI: On the basis of capabilities, AI can be divided into three types: Weak AI, General AI, and Strong AI.
  ML: Machine learning can also be divided into mainly three types: Supervised learning, Unsupervised learning, and Reinforcement learning.

o AI: It includes learning, reasoning, and self-correction.
  ML: It includes learning and self-correction when introduced to new data.

o AI: AI deals with structured, semi-structured, and unstructured data.
  ML: Machine learning deals with structured and semi-structured data.

Classification Algorithm in Machine Learning

As we know, Supervised Machine Learning algorithms can be broadly classified into Regression
and Classification algorithms. Regression algorithms predict the output for
continuous values, but to predict categorical values, we need Classification algorithms.

What is the Classification Algorithm?

The Classification algorithm is a Supervised Learning technique that is used to identify the
category of new observations on the basis of training data. In Classification, a program learns from
the given dataset or observations and then classifies new observations into a number of classes or
groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can also be called
targets/labels or categories.

Unlike regression, the output variable of Classification is a category, not a value, such as "Green or
Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised learning technique,
it takes labeled input data, which means the input comes with the corresponding output.

The main goal of the Classification algorithm is to identify the category of a given dataset, and
these algorithms are mainly used to predict the output for the categorical data.

Classification algorithms can be better understood using the below diagram. In the below diagram,
there are two classes, class A and Class B. These classes have features that are similar to each
other and dissimilar to other classes.
The algorithm which implements the classification on a dataset is known as a classifier. There are
two types of Classifications:

o Binary Classifier: If the classification problem has only two possible outcomes, it is
called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.

o Multi-class Classifier: If a classification problem has more than two outcomes, it is
called a Multi-class Classifier.
Examples: classification of types of crops, classification of types of music.

Learners in Classification Problems:

In the classification problems, there are two types of learners:

1. Lazy Learners: A Lazy Learner first stores the training dataset and waits until it receives the
test dataset. In the lazy learner's case, classification is done on the basis of the most related
data stored in the training dataset. It takes less time in training but more time for
predictions.
Example: K-NN algorithm, Case-based reasoning

2. Eager Learners: Eager Learners develop a classification model based on a training dataset
before receiving a test dataset. In contrast to Lazy learners, Eager learners take more time in
training and less time in prediction. Example: Decision Trees, Naïve Bayes, ANN.

Classification Algorithms can be further divided into mainly two categories:

o Linear Models

  o Logistic Regression

  o Support Vector Machines

o Non-linear Models

  o K-Nearest Neighbours

  o Kernel SVM

  o Naïve Bayes

  o Decision Tree Classification

  o Random Forest Classification

Classification:

Classification is a process of finding a function which helps in dividing the dataset into classes
based on different parameters. In Classification, a computer program is trained on the training
dataset and based on that training, it categorizes the data into different classes.

The task of the classification algorithm is to find the mapping function to map the input(x) to the
discrete output(y).

Example: The best example to understand the Classification problem is Email Spam Detection.
The model is trained on the basis of millions of emails on different parameters, and whenever it
receives a new email, it identifies whether the email is spam or not. If the email is spam, then it is
moved to the Spam folder.
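
To make the spam-detection example concrete, here is a hedged scikit-learn sketch of a text classifier; the tiny training set below is invented purely for illustration, and a real system would be trained on a large labeled email corpus:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A tiny invented training set of emails labeled "spam" or "not spam"
emails = [
    "Win a free prize now", "Lowest price on meds, buy now",
    "Meeting rescheduled to Monday", "Please review the attached report",
]
labels = ["spam", "spam", "not spam", "not spam"]

# Convert text to word counts, then fit a Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

# Classify a new, unseen email
print(model.predict(["Free prize waiting for you"]))   # expected: ['spam']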

Types of ML Classification Algorithms:

Classification Algorithms can be further divided into the following types:

o Logistic Regression

o K-Nearest Neighbours

o Support Vector Machines

o Kernel SVM

o Naïve Bayes

o Decision Tree Classification

o Random Forest Classification

Regression:

Regression is a process of finding the correlations between dependent and independent variables.
It helps in predicting the continuous variables such as prediction of Market Trends, prediction of
House prices, etc.

The task of the Regression algorithm is to find the mapping function to map the input variable(x)
to the continuous output variable(y).

Example: Suppose we want to do weather forecasting, so for this, we will use the Regression
algorithm. In weather prediction, the model is trained on the past data, and once the training is
completed, it can easily predict the weather for future days.

Types of Regression Algorithm:

o Simple Linear Regression

o Multiple Linear Regression

o Polynomial Regression

o Support Vector Regression

o Decision Tree Regression


o Random Forest Regression

Difference between Regression and Classification

o Regression: In Regression, the output variable must be of continuous nature or a real value.
  Classification: In Classification, the output variable must be a discrete value.

o Regression: The task of the regression algorithm is to map the input value (x) to the continuous output variable (y).
  Classification: The task of the classification algorithm is to map the input value (x) to the discrete output variable (y).

o Regression: Regression Algorithms are used with continuous data.
  Classification: Classification Algorithms are used with discrete data.

o Regression: In Regression, we try to find the best fit line, which can predict the output more accurately.
  Classification: In Classification, we try to find the decision boundary, which can divide the dataset into different classes.

o Regression: Regression algorithms can be used to solve regression problems such as Weather Prediction, House price prediction, etc.
  Classification: Classification Algorithms can be used to solve classification problems such as identification of spam emails, Speech Recognition, identification of cancer cells, etc.

o Regression: The Regression Algorithm can be further divided into Linear and Non-linear Regression.
  Classification: The Classification algorithms can be divided into Binary Classifier and Multi-class Classifier.

Clustering in Machine Learning

Clustering or cluster analysis is a machine learning technique, which groups the unlabelled
dataset. It can be defined as "A way of grouping the data points into different clusters, consisting
of similar data points. The objects with the possible similarities remain in a group that has less or
no similarities with another group."

It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color,
behavior, etc., and divides the data points according to the presence or absence of those patterns.

It is an unsupervised learning method, hence no supervision is provided to the algorithm, and it
deals with an unlabeled dataset.

After applying this clustering technique, each cluster or group is given a cluster ID. The ML
system can use this ID to simplify the processing of large and complex datasets.

The clustering technique is commonly used for statistical data analysis.

Example: Let's understand the clustering technique with the real-world example of a shopping mall. When we
visit a shopping mall, we can observe that items with similar usage are grouped together:
t-shirts are grouped in one section and trousers in another, and in the fruit and
vegetable section, apples, bananas, mangoes, etc. are grouped separately so that we
can easily find things. The clustering technique works in the same way. Another
example of clustering is grouping documents according to their topic.
The clustering technique can be widely used in various tasks. Some most common uses of this
technique are:
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Apart from these general uses, it is used by Amazon in its recommendation system to
provide recommendations based on past product searches. Netflix also uses this technique
to recommend movies and web series to its users based on their watch history.
The below diagram explains the working of the clustering algorithm: the different fruits
are divided into several groups with similar properties.
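
A short scikit-learn sketch of the clustering idea using K-Means on synthetic data (the data and the number of clusters here are arbitrary choices for illustration):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate unlabeled points that naturally form 3 groups (synthetic data)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Ask K-Means to find 3 clusters and assign a cluster ID to every point
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

print(cluster_ids[:10])          # cluster ID for the first 10 points
print(kmeans.cluster_centers_)   # coordinates of the cluster centers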

What is Dimensionality Reduction?


The number of input features, variables, or columns present in a given dataset is known as
dimensionality, and the process to reduce these features is called dimensionality reduction.
In many cases, a dataset contains a huge number of input features, which makes the predictive
modeling task more complicated. Because it is very difficult to visualize or make predictions for a
training dataset with a high number of features, dimensionality reduction
techniques are required in such cases.
A dimensionality reduction technique can be defined as "a way of converting a higher-dimensional
dataset into a lower-dimensional dataset while ensuring that it provides similar
information." These techniques are widely used in machine learning for obtaining a better-fitting
predictive model while solving classification and regression problems.
It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.
Feature Selection
Feature selection is the process of selecting the subset of the relevant features and leaving out the
irrelevant features present in a dataset to build a model of high accuracy. In other words, it is a
way of selecting the optimal features from the input dataset.
Three methods are used for the feature selection:
1. Filters Methods
In this method, the dataset is filtered, and a subset that contains only the relevant features is
taken. Some common techniques of filters method are:
o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.
2. Wrappers Methods
The wrapper method has the same goal as the filter method, but it uses a machine learning
model for its evaluation. In this method, some features are fed to the ML model and the
performance is evaluated. The performance decides whether to add or remove those features to increase the
accuracy of the model. This method is more accurate than the filter method but more
computationally expensive. Some common techniques of wrapper methods are:
o Forward Selection
o Backward Selection
o Bi-directional Elimination
3. Embedded Methods: Embedded methods check the different training iterations of the machine
learning model and evaluate the importance of each feature. Some common techniques of
Embedded methods are:
o LASSO
o Elastic Net
o Ridge Regression, etc.
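
As an illustration of the filter approach described above, the following scikit-learn sketch scores features with a chi-square test and keeps only the most relevant ones (the choice of k = 2 and the Iris dataset are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Load a small dataset with 4 input features
X, y = load_iris(return_X_y=True)

# Filter method: score each feature against the target with the chi-square test
selector = SelectKBest(score_func=chi2, k=2)     # keep the 2 highest-scoring features
X_reduced = selector.fit_transform(X, y)

print("Original shape:", X.shape)                # (150, 4)
print("Reduced shape:", X_reduced.shape)         # (150, 2)
print("Feature scores:", selector.scores_)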

Linear Regression in Machine Learning


Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more
independent variables (x), hence the name linear regression. Since linear regression shows a
linear relationship, it finds how the value of the dependent variable changes
according to the value of the independent variable.

The linear regression model provides a sloped straight line representing the relationship between
the variables. Consider the below image:
Mathematically, we can represent a linear regression as:
y= a0+a1x+ ε
Here,
Y= Dependent Variable (Target Variable)
X= Independent Variable (predictor Variable)
a0= intercept of the line (Gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value).
ε = random error
The values for x and y variables are training datasets for Linear Regression model representation.
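
A brief sketch of fitting such a line with scikit-learn; the salary/experience numbers below are invented purely for illustration, and the learned intercept and slope play the roles of a0 and a1:

import numpy as np
from sklearn.linear_model import LinearRegression

# Invented training data: years of experience (x) and salary (y)
x = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([30000, 35000, 41000, 45000, 52000, 56000])

model = LinearRegression()
model.fit(x, y)

print("Intercept a0:", model.intercept_)           # intercept of the fitted line
print("Slope a1:", model.coef_[0])                 # linear regression coefficient
print("Prediction for 7 years:", model.predict([[7]])[0])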
Types of Linear Regression
Linear regression can be further divided into two types of the algorithm:
o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Simple Linear Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Multiple Linear
Regression.
Linear Regression Line
A linear line showing the relationship between the dependent and independent variables is called
a regression line. A regression line can show two types of relationship:
o Positive Linear Relationship:
If the dependent variable increases on the Y-axis and independent variable increases on X-
axis, then such a relationship is termed as a Positive linear relationship.

o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis and the independent variable increases on
the X-axis, then such a relationship is called a negative linear relationship.
Logistic Regression in Machine Learning

o Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.

o Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or
False, etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic
values which lie between 0 and 1.

o Logistic Regression is quite similar to Linear Regression except for how they are
used. Linear Regression is used for solving Regression problems, whereas Logistic
Regression is used for solving classification problems.

o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).

o The curve from the logistic function indicates the likelihood of something such as whether
the cells are cancerous or not, a mouse is obese or not based on its weight, etc.

o Logistic Regression is a significant machine learning algorithm because it has the ability to
provide probabilities and classify new data using continuous and discrete datasets.

o Logistic Regression can be used to classify observations using different types of data
and can easily determine the most effective variables for the classification. The below
image shows the logistic function:

Logistic Function (Sigmoid Function):

o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.

o It maps any real value into another value within a range of 0 and 1.

o The output value of logistic regression must be between 0 and 1 and cannot go beyond this
limit, so it forms a curve like the "S" shape. The S-shaped curve is called the Sigmoid function
or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the
probability of either 0 or 1. Values above the threshold tend to 1, and
values below the threshold tend to 0.

Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.

o The independent variable should not have multi-collinearity.

Logistic Regression Equation:

The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get the Logistic Regression equation are given below:

o We know the equation of a straight line can be written as:
  y = b0 + b1x1 + b2x2 + ... + bnxn

o In Logistic Regression, y can be between 0 and 1 only, so let's divide the above
  equation by (1 - y):
  y / (1 - y)

o But we need a range between -[infinity] and +[infinity], so we take the logarithm of the equation, and it
  becomes:
  log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for Logistic Regression.
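
A minimal scikit-learn sketch showing that logistic regression outputs probabilities between 0 and 1 rather than raw continuous values; the pass/fail study-hours data is invented purely for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: hours studied (x) and whether the exam was passed (1) or failed (0)
x = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(x, y)

# Probabilities from the sigmoid curve lie between 0 and 1
print(model.predict_proba([[4.5]]))   # [P(fail), P(pass)] for 4.5 hours of study
print(model.predict([[4.5]]))         # the predicted class after thresholding at 0.5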

Linear Regression vs Logistic Regression


Linear Regression and Logistic Regression are two famous Machine Learning algorithms which
come under the supervised learning technique. Since both algorithms are supervised in nature,
they use labeled datasets to make predictions. But the main difference
between them is how they are used: Linear Regression is used for solving Regression
problems, whereas Logistic Regression is used for solving Classification problems. The
description of both algorithms is given below, along with a difference table.

Linear Regression:
o Linear Regression is one of the simplest Machine Learning algorithms that comes under
the Supervised Learning technique and is used for solving regression problems.
o It is used for predicting the continuous dependent variable with the help of independent
variables.
o The goal of Linear regression is to find the best fit line that can accurately predict the
output for the continuous dependent variable.
o If a single independent variable is used for prediction, it is called Simple Linear
Regression, and if there is more than one independent variable, such regression is
called Multiple Linear Regression.
o By finding the best fit line, the algorithm establishes the relationship between the dependent variable
and the independent variables, and this relationship should be linear in nature.
o The output of Linear regression should only be continuous values such as price, age,
salary, etc. The relationship between the dependent variable and independent variable can
be shown in the below image:

In the above image, the dependent variable is on the Y-axis (salary) and the independent variable is on the X-
axis (experience). The regression line can be written as:
y = a0 + a1x + ε
Where a0 and a1 are the coefficients and ε is the error term.

Logistic Regression:
o Logistic regression is one of the most popular Machine Learning algorithms that comes
under the Supervised Learning technique.
o It can be used for Classification as well as Regression problems, but it is mainly used for
Classification problems.
o Logistic regression is used to predict the categorical dependent variable with the help of
independent variables.
o The output of a Logistic Regression problem can only be between 0 and 1.
o Logistic regression can be used where probabilities between two classes are required,
such as whether it will rain today or not, 0 or 1, true or false, etc.
o Logistic regression is based on the concept of Maximum Likelihood estimation. According to
this estimation, the observed data should be the most probable.
o In logistic regression, we pass the weighted sum of inputs through an activation function
that maps values to between 0 and 1. This activation function is known as the sigmoid
function, and the curve obtained is called the sigmoid curve or S-curve. Consider the below
image:

o The equation for logistic regression is:
  log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn
Difference between Linear Regression and Logistic Regression:

o Linear Regression: Linear regression is used to predict the continuous dependent variable using a given set of independent variables.
  Logistic Regression: Logistic Regression is used to predict the categorical dependent variable using a given set of independent variables.

o Linear Regression: Linear Regression is used for solving Regression problems.
  Logistic Regression: Logistic regression is used for solving Classification problems.

o Linear Regression: In Linear regression, we predict the value of continuous variables.
  Logistic Regression: In Logistic Regression, we predict the values of categorical variables.

o Linear Regression: In Linear regression, we find the best fit line, by which we can easily predict the output.
  Logistic Regression: In Logistic Regression, we find the S-curve by which we can classify the samples.

o Linear Regression: The least squares estimation method is used for estimation of accuracy.
  Logistic Regression: The maximum likelihood estimation method is used for estimation of accuracy.

o Linear Regression: The output for Linear Regression must be a continuous value, such as price, age, etc.
  Logistic Regression: The output of Logistic Regression must be a categorical value such as 0 or 1, Yes or No, etc.

o Linear Regression: In Linear regression, the relationship between the dependent variable and the independent variable must be linear.
  Logistic Regression: In Logistic regression, a linear relationship between the dependent and independent variables is not required.

o Linear Regression: In Linear regression, there may be collinearity between the independent variables.
  Logistic Regression: In Logistic regression, there should not be collinearity between the independent variables.
Gaussian Distribution:
In probability theory, a normal (or Gaussian or Gauss or Laplace–Gauss) distribution is a type
of continuous probability distribution for a real-valued random variable. The general form of
its probability density function is

f(x) = (1 / (σ√(2π))) · e^(-(x - µ)² / (2σ²))

In the above formula, all the symbols have their usual meanings: σ is the Standard Deviation
and µ is the Mean.

There are many cases where the data tends to be around a central value with no bias to the left or
right, and it gets close to a "Normal Distribution".

The "Bell Curve" is a Normal Distribution; a histogram of such data typically follows the curve
closely, but not perfectly (which is usual).

Many things closely follow a Normal Distribution:

 heights of people
 size of things produced by machines
 errors in measurements
 blood pressure
 marks on a test

Standardization
Standardization (or standardisation) is, in general, the process of implementing and developing technical
standards based on the consensus of different parties, including firms, users, interest groups,
standards organizations and governments.

Data standardization is the process of rescaling one or more attributes so that they have a
mean value of 0 and a standard deviation of 1. Standardization assumes that your data has a
Gaussian (bell curve) distribution.
The z-score is a measure of how many standard deviations away a data point is from the
mean. Mathematically,

z = (x - µ) / σ

The exponent in the normal density formula above is minus one half times the square of the z-score. This is
in accordance with the observations made above: values away from the mean have a lower
probability compared to values near the mean. Values away from the mean have a larger (absolute)
z-score and consequently a lower probability, since the exponent is negative. The opposite is true
for values closer to the mean.
This gives rise to the 68-95-99.7 rule, which states that the values lying within bands around
the mean of width two, four, and six standard deviations (that is, within ±1σ, ±2σ, and ±3σ)
comprise 68%, 95%, and 99.7% of all the values in a normal distribution.
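
A quick numerical check of this rule, as a sketch using NumPy on simulated data (the sample size is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0, 1
data = rng.normal(mu, sigma, 1_000_000)   # simulated normally distributed values

# Fraction of values within 1, 2 and 3 standard deviations of the mean
for k in (1, 2, 3):
    frac = np.mean(np.abs((data - mu) / sigma) <= k)   # uses the z-score
    print(f"within {k} standard deviation(s): {frac:.3f}")
# Expected output is close to 0.683, 0.954 and 0.997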

Standard Normal Probability Distribution in Excel

The NORMDIST function is categorized under Excel Statistical functions. It returns the normal
distribution for a stated mean and standard deviation. That is, it will calculate the normal
probability density function or the cumulative normal distribution function for a given set of
parameters.

To understand what a normal distribution is, consider an example. Suppose we take an average of
30 minutes to commute to the office daily, with a standard deviation of 5 minutes. Assuming a
normal distribution for the time it takes to go to work, we can calculate the percentage of time
that the commuting time would be between 25 minutes and 35 minutes.

As a financial analyst, the NORMDIST function is useful in stock market analysis. When investing,
we need to balance risk and return and aim for the highest possible return. Normal distribution
helps quantify the amount of return and risk by the mean for return and standard deviation for
risk.

Formula

=NORMDIST(x,mean,standard_dev,cumulative)
The NORMDIST function uses the following arguments:

1. X (required argument) – This is the value for which we wish to calculate the distribution.
2. Mean (required argument) – The arithmetic mean of the distribution.
3. Standard_dev (required argument) – This is the standard deviation of the distribution.
4. Cumulative (required argument) – This is a logical value. It specifies the type of
distribution to be used: TRUE (Cumulative Normal Distribution Function) or FALSE (Normal
Probability Density Function).

The formula used for calculating the normal probability density is:

f(x) = (1 / (σ√(2π))) · e^(-(x - μ)² / (2σ²))

Where:

 μ is the mean of the distribution
 σ² is the variance, and x is the independent variable for which you want to evaluate the
function
 The Cumulative Normal Distribution function is given by the integral, from -∞ to x, of the
Normal Probability Density function.

How to use the NORMDIST Function in Excel?

To understand the uses of the NORMDIST function, let's look at an example:

Example – Normal Distribution Excel

Suppose we are given the following data:

 Value for which we need distribution: 52


 Arithmetic mean of the distribution: 50
 Standard deviation of the distribution: 2.5

If we wish to calculate the cumulative distribution function for the data above, the formula to use
is:

=NORMDIST(52, 50, 2.5, TRUE)

We get the result below: approximately 0.7881, i.e. about a 78.8% chance that a value from this
distribution is at most 52.

If we wish to calculate the probability density function for the data above, the formula to use is:

=NORMDIST(52, 50, 2.5, FALSE)

We get the result below: approximately 0.1159, the height of the normal density curve at x = 52.
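
The same numbers can be reproduced outside Excel; here is a brief check using SciPy's normal distribution functions (scipy.stats.norm), which mirror NORMDIST's cumulative and density modes:

from scipy.stats import norm

mean, sd = 50, 2.5

# Equivalent of =NORMDIST(52, 50, 2.5, TRUE): cumulative probability up to x = 52
print(norm.cdf(52, loc=mean, scale=sd))   # approximately 0.7881

# Equivalent of =NORMDIST(52, 50, 2.5, FALSE): density of the curve at x = 52
print(norm.pdf(52, loc=mean, scale=sd))   # approximately 0.1159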


What is the STANDARDIZE Z-Score Function?

The STANDARDIZE Function is available under Excel Statistical functions. It will return a
normalized value (z-score) based on the mean and standard deviation. A z-score, or standard
score, is used for standardizing scores on the same scale by dividing a score's deviation by the
standard deviation in a data set. The result is a standard score. It measures the number of
standard deviations that a given data point is from the mean.

A z-score can be negative or positive. A negative score indicates a value less than the mean, and a
positive score indicates a value greater than the mean. The average of every z-score for a data set
is zero.

Z-scores are a way to compare results from a test to a "normal" population. The results from tests
or surveys can include thousands of possible results and units, and the raw results often seem
meaningless. For example, knowing that someone's height is 180 cm can be useful information.
However, if we want to compare it to the "average" person's height, looking at a vast table of data
can be overwhelming (especially if some heights are recorded in feet). A z-score can tell us where
that person's height is, as compared to the population's mean (average) height.

Z-Score Formula

=STANDARDIZE(x, mean, standard_dev)

The STANDARDIZE function uses the following arguments:

1. X (required argument) – This is the value that we want to normalize.
2. Mean (required argument) – The arithmetic mean of the distribution.
3. Standard_dev (required argument) – This is the standard deviation of the distribution.

How to use the STANDARDIZE Z-Score Function in Excel?

To understand the uses of the STANDARDIZE function, let's consider an example:

Example 1

Suppose we are given the following data:

The formula we use is:


We get the result below:
How z-Tables Are Used
Take a look at ACT scores to illustrate this. The ACT is taken by high school students all across the country as a means of determining their aptitude for attending college. Let's suppose in this instance that the mean score for the population is 21, and the standard deviation is 5. How will you determine the probability that a score would fall within a particular range?

Consider the probability that the score would be higher than 30. How do you determine what that value is? The first thing you do is use the z-score formula to figure out what the z-score is. In this case, it is the difference between 30 and 21, which is 9, divided by the standard deviation of 5, which gives you a z-score of 1.8. Looking up 1.8 in a z-table gives you a probability value of 0.9641.

Now, what does that tell you? It tells you that the area to the left of that point is 0.9641, or
roughly 96% of the area under the curve falls to the left of z equal to 1.8. To determine the
probability that the score is greater than 30, you're interested in the difference between 0.9641 and 1, which gives you a probability of 0.0359.
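The same lookup can be reproduced directly from the distribution (a small Python sketch with SciPy, using the ACT figures above):

from scipy.stats import norm

mean, sd = 21, 5
p_left = norm.cdf(30, loc=mean, scale=sd)   # area to the left of 30, about 0.9641
p_right = 1 - p_left                        # P(score > 30), about 0.0359
print(round(p_left, 4), round(p_right, 4))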

What if you're interested in the probability that a score falls between 23 and 27? In this instance, you need to calculate two different z-scores, one for 23 and one for 27. Locate both of these z-scores on the z-table, then subtract the lower probability from the greater one. The result is the probability that a randomly drawn score would fall between 23 and 27.

What if you're looking at some areas to the left of the mean? Suppose you're interested in the probability that a score falls between 15 and 20. You run the same type of calculation. The z-score for 20 is negative 0.2, and the z-score for 15 is negative 1.2. Once again, you find the probabilities associated with each value and take the difference, which works out to be a probability of 0.3056.

Lastly, look at the probability that a score will be less than 20. Here you simply need to find the z-score for 20, which is equal to -0.20. The probability for this value is 0.4207, which is the probability that a score would be below 20, based upon a normal distribution.
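These range probabilities can be checked the same way, by subtracting cumulative probabilities (again a short Python sketch with SciPy, using the same mean and standard deviation):

from scipy.stats import norm

mean, sd = 21, 5
print(round(norm.cdf(27, mean, sd) - norm.cdf(23, mean, sd), 4))  # P(23 < X < 27), about 0.2295
print(round(norm.cdf(20, mean, sd) - norm.cdf(15, mean, sd), 4))  # P(15 < X < 20), about 0.306
print(round(norm.cdf(20, mean, sd), 4))                           # P(X < 20), about 0.4207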
What if you're interested in determining the value of a variable based upon a predetermined probability? Hypothetically speaking, let's look at the miles driven per year by the American driver. Say that in the population that you're looking at, the mean miles driven per year is 16,550, with a population standard deviation of 2,100.

Suppose you were interested in finding the number of miles driven per year such that only 1.5% of all observations fall below that value. How do you determine that figure? Well, you start from your probability of 1.5%. Then divide that by 100 to arrive at a probability equal to 0.015, which is the value you will look up in your z-table.

You scan through the numbers until you find the value that's equal to 0.015, which is going to be on the negative side of the z-table. Looking at the columns and the rows, you will find that 0.015 corresponds to a z-value of -2.17. Plugging this into the z-score formula and solving for the unknown value, x = 16,550 + (-2.17)(2,100), you wind up with roughly 11,993 miles per year. Only 1.5% of drivers drive fewer than that number of miles per year.
Suppose instead that you start with a probability. Say you're looking for the value corresponding to 69.5%. Once again, divide this value by 100 to arrive at a probability of 0.695, and scan the z-table to find that 0.695 corresponds to a z-value of 0.51.

To solve for your value, set (x − 16,550) / 2,100 = 0.51 and solve for x. A little cross multiplication gives x = 16,550 + (0.51)(2,100), which comes to about 17,621 miles per year; 69.5% of drivers drive fewer miles per year than that value.
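This inverse lookup is exactly what a statistics library's inverse CDF (percent-point function) computes; a brief Python sketch with the figures above:

from scipy.stats import norm

mean, sd = 16550, 2100
print(round(norm.ppf(0.015, loc=mean, scale=sd)))  # about 11,993 miles: 1.5% of drivers fall below this
print(round(norm.ppf(0.695, loc=mean, scale=sd)))  # about 17,621 miles: 69.5% of drivers fall below this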

What Is the Central Limit Theorem (CLT)?

In the study of probability theory, the central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution (also known as a "bell curve") as the sample size becomes larger, assuming that all samples are identical in size, and regardless of the shape of the population distribution.

Said another way, the CLT is a statistical theory stating that, given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately equal to the mean of the population. Furthermore, the sample means will follow an approximately normal distribution, with a variance approximately equal to the variance of the population divided by the sample size.

In probability theory, the central limit theorem (CLT) establishes that, in many situations,
when independent random variables are added, their properly normalized sum tends toward
a normal distribution (informally a bell curve) even if the original variables themselves are not
normally distributed. The theorem is a key concept in probability theory because it implies that
probabilistic and statistical methods that work for normal distributions can be applicable to many
problems involving other types of distributions.
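A quick way to see the theorem in action is to simulate it. The sketch below (Python with NumPy; the sample size and number of samples are arbitrary choices for illustration) draws many samples from a decidedly non-normal exponential distribution and shows that the sample means cluster around the population mean with a variance close to the population variance divided by the sample size:

import numpy as np

rng = np.random.default_rng(0)
sample_size, n_samples = 50, 10_000

# Draw many samples from a skewed population (exponential, mean 1, variance 1) and record each sample's mean
samples = rng.exponential(scale=1.0, size=(n_samples, sample_size))
sample_means = samples.mean(axis=1)

print(round(sample_means.mean(), 3))  # close to the population mean of 1.0
print(round(sample_means.var(), 3))   # close to population variance / sample size = 1 / 50 = 0.02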
Algebra with Gaussians

Gaussian elimination is the name of the method we use to perform the three types of matrix row operations on an augmented matrix coming from a linear system of equations in order to find the solutions for such a system. This technique is also called row reduction and it consists of two stages: forward elimination and back substitution.
These two stages are distinguished not by the operations you can use in them, but by the result they produce. The forward elimination step refers to the row reduction needed to simplify the matrix in question into its echelon form. This stage reveals whether the system of equations represented by the matrix has a unique solution, infinitely many solutions, or no solution at all. If the system turns out to have no solution, there is no reason to continue row reducing the matrix through the next stage.
If it is possible to obtain solutions for the variables involved in the linear system, then Gaussian elimination with back substitution is carried out. This last stage produces the reduced echelon form of the matrix, which in turn provides the general solution to the system of linear equations.
The Gaussian elimination rules are the same as the rules for the three elementary row operations; in other words, you can algebraically operate on the rows of a matrix in the following three ways (or a combination of them):
1. Interchanging two rows
2. Multiplying a row by a constant (any constant which is not zero)
3. Adding a row to another row

And so, solving a linear system with matrices using Gaussian elimination happens to be a
structured, organized and quite efficient method.
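To make the procedure concrete, here is a minimal sketch of Gaussian elimination with back substitution in Python using NumPy (partial pivoting is added for numerical stability; the 3x3 system at the bottom is a hypothetical example, not the one worked through below):

import numpy as np

def gaussian_elimination(A, b):
    # Solve Ax = b for a square, non-singular A via forward elimination and back substitution
    M = np.hstack([A.astype(float), b.reshape(-1, 1).astype(float)])  # augmented matrix [A | b]
    n = len(b)

    # Forward elimination: reduce the augmented matrix to row echelon form
    for col in range(n):
        pivot = np.argmax(np.abs(M[col:, col])) + col   # partial pivoting
        M[[col, pivot]] = M[[pivot, col]]                # interchange two rows
        for row in range(col + 1, n):
            factor = M[row, col] / M[col, col]
            M[row] -= factor * M[col]                    # add a multiple of one row to another row

    # Back substitution: solve for the variables from the last row upwards
    x = np.zeros(n)
    for row in range(n - 1, -1, -1):
        x[row] = (M[row, -1] - M[row, row + 1:n] @ x[row + 1:]) / M[row, row]
    return x

A = np.array([[2.0, 1.0, -1.0], [-3.0, -1.0, 2.0], [-2.0, 1.0, 2.0]])
b = np.array([8.0, -11.0, -3.0])
print(gaussian_elimination(A, b))  # [ 2.  3. -1.]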
How to do Gaussian elimination
There is really not an established set of Gaussian elimination steps to follow in order to solve a system of linear equations; it all depends on the matrix you have in your hands and the row operations needed to simplify it. With that in mind, let us work through our first Gaussian elimination example so you can start seeing the whole process and the intuition needed when working through it:
Example 1

 If we were to have the following system of linear equations containing three equations for three
unknowns:

 We know from our lesson on representing a linear system as a matrix that we can represent such a system as an augmented matrix like the one below:

Transcribing the linear system into an augmented matrix


 Let us row-reduce (use Gaussian elimination) so we can simplify the matrix:

Row reducing (applying the Gaussian elimination method to) the augmented matrix

 Resulting in the matrix:

Reduced matrix into its echelon form


Notice that at this point we can already see that this system of linear equations is solvable, with a unique solution for each of its variables. What we have performed so far is the first stage of row reduction: forward elimination. We could continue simplifying this matrix even further (which would take us to the second stage, back substitution), but we really don't need to, since at this point the system is easily solvable. Thus, we look at the resulting system and solve it directly:

Resulting linear system of equations to solve


From this set, we can immediately observe that the value of the variable z is z = -2. We use this knowledge to substitute it into the second equation to solve for y, and then substitute both the y and z values into the first equation to solve for x:

Applying the values of y and z to the first equation

Solving the resulting linear system of equations

And the final solution for the system is:
