
Module 2 AI & ML

Chapter 2 - Introduction to ML
Machine Learning is a branch of Artificial Intelligence that develops algorithms which learn
the hidden patterns in datasets and use them to make predictions on new, similar data,
without being explicitly programmed for each task.
Machine Learning (ML) is a promising and flourishing field. It can enable the top management
of an organization to extract knowledge from the data stored in the various archives of the
business organization to facilitate decision making. Such decisions can be useful for
organizations to design new products, improve business processes, and develop decision
support systems.

2.1 NEED FOR MACHINE LEARNING


Business organizations use huge amounts of data for their daily activities. Earlier, the full
potential of this data was not utilized for two reasons. First, the data was scattered across
different archive systems and organizations were not able to integrate these sources fully.
Second, there was a lack of awareness about software tools that could help to unearth the
useful information from data.
Machine learning has become so popular because of three reasons:

1. High volume of available data to manage: Big companies such as Facebook, Twitter,
and YouTube generate huge amounts of data that grow at a phenomenal rate. It is
estimated that the data approximately doubles every year.

2. The cost of storage has reduced and hardware costs have also dropped. Therefore, it is
easier now to capture, process, store, distribute, and transmit digital information.

3. The third reason for the popularity of machine learning is the availability of complex
algorithms. Especially with the advent of deep learning, many algorithms are now
available for machine learning.

With the popularity and ready adoption of machine learning by business organizations, it has
become a dominant technology trend. Before starting the machine learning journey, let
us establish these terms – data, information, knowledge, intelligence, and wisdom. A
knowledge pyramid is shown in Figure 1.1.

What is data? All facts are data. Data can be numbers or text that can be processed by a
computer. Today, organizations are accumulating vast and growing amounts of data with data
sources such as flat files, databases, or data warehouses in different storage formats.
Processed data is called information. This includes patterns, associations, or relationships
among data. For example, sales data can be analyzed to extract information such as which
product is the fastest selling. Condensed information is called knowledge. For example, the
historical patterns and future trends obtained from the above sales data can be called
knowledge. Unless knowledge is extracted, data is of no use. Similarly, knowledge is not
useful unless it is put into action. Intelligence is applied knowledge for actions; an actionable
form of knowledge is called intelligence. Computer systems have been successful up to this
stage. The ultimate objective of the knowledge pyramid is wisdom, which represents a maturity
of mind that is, so far, exhibited only by humans.

2.2 MACHINE LEARNING IN RELATION TO OTHER FIELDS

Machine learning is an important branch of AI, which is a much broader subject. The aim of
AI is to develop intelligent agents. An agent can be a robot, a human, or any autonomous
system. Initially, the idea of AI was ambitious, that is, to develop intelligent systems like
human beings. The focus was on logic and logical inferences. AI has seen many ups and
downs; the down periods were called AI winters. The resurgence in AI happened due to the
development of data-driven systems, whose aim is to find the relations and regularities present
in the data. Machine learning is the sub-branch of AI whose aim is to extract patterns for
prediction. It is a broad field that includes learning from examples and other areas like
reinforcement learning.

Deep learning is a sub-branch of machine learning. In deep learning, the models are
constructed using neural network technology. Neural networks are based on models of the
human neuron. Many neurons form a network, connected through activation functions that
trigger further neurons to perform tasks.

Machine Learning, Data Science, Data Mining, and Data Analytics


Data science is an 'umbrella' term that encompasses many fields. Machine learning starts
with data; therefore, data science and machine learning are interlinked, and machine learning
is a branch of data science. Data science deals with the gathering of data for analysis. It is a
broad field that includes the areas described below.

Big Data: Data science concerns the collection of data. Big data is the field of data science
that deals with data having the following characteristics:
1. Volume: Huge amounts of data are generated by big companies like Facebook, Twitter,
and YouTube.
2. Variety: Data is available in a variety of forms like images and videos, and in different
formats.
3. Velocity: It refers to the speed at which the data is generated and processed.

Big data is used by many machine learning algorithms for applications such as language
translation and image recognition. Big data influences the growth of subjects like Deep
learning. Deep learning is a branch of machine learning that deals with constructing models
using neural networks.

Data Mining: Data mining's original genesis is in business. Just as mining the earth yields
precious resources, it is often believed that unearthing the data produces hidden information
that would otherwise have eluded the attention of the management. Nowadays, many consider
data mining and machine learning to be the same. There is little difference between these
fields, except that data mining aims to extract the hidden patterns that are present in the data,
whereas machine learning aims to use those patterns for prediction.

Data Analytics: Another branch of data science is data analytics. It aims to extract useful
knowledge from raw data. There are different types of analytics; predictive data analytics is
used for making predictions. Machine learning is closely related to this branch of analytics
and shares almost all of its algorithms.

Pattern Recognition: It is an engineering field. It uses machine learning algorithms to extract
features for pattern analysis and pattern classification. One can view pattern recognition
as a specific application of machine learning.

2.3 Types of Machine Learning

Machine learning is a subset of AI, which enables the machine to automatically learn
from data, improve performance from past experiences, and make predictions. Machine
learning contains a set of algorithms that work on a huge amount of data. Data is fed to these
algorithms to train them, and on the basis of training, they build the model & perform a
specific task.

Labelled and Unlabelled Data Data is a raw fact. Normally, data is represented in the form
of a table. Data also can be referred to as a data point, sample, or an example. Each row of the
table represents a data point. Features are attributes or characteristics of an object. Normally,
the columns of the table are attributes. Out of all attributes, one attribute is important and is
called a label. Label is the feature that we aim to predict. Thus, there are two types of data –
labelled and unlabelled.
Labelled Data: To illustrate labelled data, let us take one example dataset called the Iris flower
dataset or Fisher's Iris dataset. The dataset has 150 samples of Iris flowers – 50 from each
class – with four attributes, the length and width of sepals and petals. The target variable is
called class. There are three
classes – Iris setosa, Iris virginica, and Iris versicolor. The partial data of Iris dataset is shown
in Table 1.1.

A dataset need not always be numbers. It can be images or video frames. Deep neural
networks can handle images with labels. In Figure 1.6, a deep neural network
takes images of dogs and cats with labels for classification.

Fig: (a) Labelled Dataset (b) Unlabelled Dataset

In unlabelled data, there are no labels in the dataset.
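
As a minimal sketch (assuming Python with scikit-learn and pandas available), the labelled Iris
dataset described above can be loaded and inspected as follows; dropping the class column
yields an unlabelled version of the same data.

```python
# Minimal sketch: a labelled dataset (Iris) and its unlabelled counterpart.
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris(as_frame=True)          # 150 samples, 4 feature columns
X = iris.data                            # features: sepal/petal length and width
y = iris.target                          # label: 0 = setosa, 1 = versicolor, 2 = virginica

labelled = pd.concat([X, y.rename("class")], axis=1)
print(labelled.head())                   # each row is one labelled data point

unlabelled = labelled.drop(columns=["class"])   # same data without the label
```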

These ML algorithms help to solve different business problems like regression,
classification, forecasting, clustering, and association. Based on the methods and the way
of learning, machine learning is mainly divided into four types:

 Supervised Machine Learning
 Unsupervised Machine Learning
 Semi-Supervised Machine Learning
 Reinforcement Learning

1. Supervised Machine Learning

Supervised learning is when a model is trained on a "Labelled Dataset". Labelled datasets
have both input and output parameters. In supervised learning, algorithms learn to map
inputs to the correct outputs. Both the training and validation datasets are labelled.

Fig: Supervised Learning

Let’s understand it with the help of an example.


Example: Consider a scenario where you have to build an image classifier to differentiate
between cats and dogs. If you feed a dataset of labelled dog and cat images to the
algorithm, the machine will learn to classify a dog or a cat from these labelled images.
When we input new dog or cat images that it has never seen before, it will use the
learned model to predict whether it is a dog or a cat. This is how supervised
learning works, and this particular case is image classification.
There are two main categories of supervised learning that are mentioned below:
 Classification
 Regression

Classification

Classification deals with predicting categorical target variables, which represent discrete
classes or labels. For instance, classifying emails as spam or not spam, or predicting whether
a patient has a high risk of heart disease. Classification algorithms learn to map the input
features to one of the predefined classes.
Here are some classification algorithms:
 Support Vector Machine (SVM)
 Random Forest
 Decision Tree
 K- Nearest Neighbors (KNN)
 Naïve Bayes
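
As a minimal sketch (assuming Python with scikit-learn), one of the listed classifiers,
K-Nearest Neighbors, can be trained and evaluated on the Iris data as follows.

```python
# Minimal sketch: training and evaluating a K-Nearest Neighbors classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = KNeighborsClassifier(n_neighbors=3)   # maps input features to one of the classes
clf.fit(X_train, y_train)                   # learn from the labelled training examples

y_pred = clf.predict(X_test)                # predict classes for unseen samples
print("Accuracy:", accuracy_score(y_test, y_pred))
```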


Regression

Regression, on the other hand, deals with predicting continuous target variables, which
represent numerical values. For example, predicting the price of a house based on its size,
location, and amenities, or forecasting the sales of a product. Regression algorithms learn to
map the input features to a continuous numerical value.
Here are some regression algorithms:
 Linear Regression
 Polynomial Regression
 Decision Tree
 Random Forest
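
As a minimal sketch (assuming Python with scikit-learn and NumPy), a linear regression model
can be fit to predict a continuous value; the house sizes and prices below are made up for
illustration.

```python
# Minimal sketch: linear regression predicting a continuous target (price from size).
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: house size in square feet (input) and price in lakhs (output).
X = np.array([[500], [750], [1000], [1250], [1500]])
y = np.array([25.0, 37.0, 50.0, 62.0, 76.0])

model = LinearRegression()
model.fit(X, y)                          # learn a mapping from size to price

print(model.predict([[1100]]))           # predicted price for a 1100 sq. ft. house
print(model.coef_, model.intercept_)     # learned slope and intercept
```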

2. Unsupervised Machine Learning

Unsupervised learning is a type of machine learning technique in which an algorithm
discovers patterns and relationships using unlabelled data. Unlike supervised learning,
unsupervised learning doesn't involve providing the algorithm with labelled target outputs.
The primary goal of unsupervised learning is often to discover hidden patterns, similarities,
or clusters within the data, which can then be used for various purposes, such as data
exploration, visualization, dimensionality reduction, and more.

Fig: Unsupervised Learning

Let’s understand it with the help of an example.


Example: Consider that you have a dataset that contains information about the purchases you
made from a shop. Through clustering, the algorithm can group customers with similar
purchasing behaviour, revealing customer segments without predefined labels. This type of
information can help businesses target customers as well as identify outliers.
There are two main categories of unsupervised learning that are mentioned below:
 Clustering
 Association

Clustering

Clustering is the process of grouping data points into clusters based on their similarity. This
technique is useful for identifying patterns and relationships in data without the need for
labeled examples.
Here are some clustering algorithms:
 K-Means Clustering
 DBSCAN Algorithm
 Principal Component Analysis (strictly a dimensionality reduction technique, often used
alongside clustering)
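
As a minimal sketch (assuming Python with scikit-learn and NumPy), K-Means can group
unlabelled points into clusters; the two-column customer data below is made up for
illustration.

```python
# Minimal sketch: grouping unlabelled data points into clusters with K-Means.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabelled data: [annual income, spending score] per customer.
X = np.array([[15, 39], [16, 81], [17, 6], [18, 77],
              [70, 29], [71, 95], [72, 11], [73, 88]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)           # cluster assignment for each customer

print(labels)                            # which cluster each customer belongs to
print(kmeans.cluster_centers_)           # centre of each discovered cluster
```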

Association

Association rule learning is a technique for discovering relationships between items in a


dataset. It identifies rules that indicate the presence of one item implies the presence of
another item with a specific probability.
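
As a minimal sketch in plain Python (with made-up shopping transactions), the support and
confidence of a rule such as {bread} → {butter} can be computed directly; dedicated
algorithms like Apriori build on these same quantities.

```python
# Minimal sketch: support and confidence of an association rule from transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule: bread -> butter
sup = support({"bread", "butter"})       # P(bread and butter together)
conf = sup / support({"bread"})          # P(butter | bread)
print(f"support = {sup:.2f}, confidence = {conf:.2f}")
```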

3. Semi-supervised Learning

There are circumstances where the dataset has a huge collection of unlabelled data and only
some labelled data. Labelling is a costly process and is difficult for humans to perform. Semi-
supervised algorithms use the unlabelled data by assigning pseudo-labels; the labelled and
pseudo-labelled data can then be combined to train the model.
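
As a minimal sketch (assuming Python with scikit-learn, whose convention is to mark
unlabelled samples with -1), a self-training classifier can assign pseudo-labels to the
unlabelled portion of the data.

```python
# Minimal sketch: semi-supervised self-training with pseudo-labels.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Pretend most labels are unknown: -1 marks an unlabelled sample.
rng = np.random.RandomState(0)
y_partial = y.copy()
y_partial[rng.rand(len(y)) < 0.7] = -1

base = SVC(probability=True)                         # base learner that outputs probabilities
model = SelfTrainingClassifier(base, threshold=0.9)  # pseudo-label only confident predictions
model.fit(X, y_partial)

print(model.predict(X[:5]))                          # predictions for the first few samples
```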

4. Reinforcement Machine Learning

Reinforcement learning is a learning method in which an agent interacts with the
environment by producing actions and discovering errors. Trial, error, and delayed reward are
the most relevant characteristics of reinforcement learning. In this technique, the model keeps
improving its performance using reward feedback to learn the behaviour or pattern. These
algorithms are often specific to a particular problem, e.g., Google's self-driving car, or
AlphaGo, where a bot competes with humans and even against itself to become a better and
better Go player. Each time we feed in data, the agent learns and adds the data to its knowledge,
which serves as training data. So, the more it learns, the better trained and more experienced it
becomes.

Here are some of the most common reinforcement learning algorithms:

Q-learning: Q-learning is a model-free RL algorithm that learns a Q-function, which maps
state–action pairs to values. The Q-function estimates the expected reward of taking a
particular action in a given state.
SARSA (State-Action-Reward-State-Action) : SARSA is another model-free RL algorithm
that learns a Q-function. However, unlike Q-learning, SARSA updates the Q-function for the
action that was actually taken, rather than the optimal action.
Deep Q-learning : Deep Q-learning is a combination of Q-learning and deep learning. Deep
Q-learning uses a neural network to represent the Q-function, which allows it to learn
complex relationships between states and actions.
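
As a minimal sketch (assuming Python with NumPy and a made-up toy environment of 5 states
and 2 actions), the core tabular Q-learning update Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]
looks like this; SARSA would use the action actually taken in the next state instead of the maximum.

```python
# Minimal sketch: the tabular Q-learning update rule on a toy problem.
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))   # Q-table: expected reward per (state, action)
alpha, gamma = 0.1, 0.9               # learning rate and discount factor

def q_update(state, action, reward, next_state):
    best_next = np.max(Q[next_state])               # Q-learning: best next action (off-policy)
    Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])

# One hypothetical experience: in state 0, action 1 gave reward 1.0 and led to state 2.
q_update(state=0, action=1, reward=1.0, next_state=2)
print(Q)
```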


Example: Consider that you are training an AI agent to play a game like chess. The agent
explores different moves and receives positive or negative feedback based on the outcome.
Reinforcement learning also finds applications in areas where agents learn to perform tasks
by interacting with their surroundings.
There are two main types of reinforcement learning:

Positive reinforcement
Rewards the agent for taking a desired action.
Encourages the agent to repeat the behavior.
Examples: Giving a treat to a dog for sitting, providing a point in a game for a correct answer.

Negative reinforcement
Removes an undesirable stimulus when the agent takes a desired action.
Also encourages the agent to repeat that behavior (unlike punishment, which discourages it).
Examples: Turning off a loud buzzer when a lever is pressed, avoiding a penalty by
completing a task.

2.4 CHALLENGES OF MACHINE LEARNING


Computers are better than humans at performing tasks like computation. For example, while
calculating the square root of large numbers, an average human may struggle, but computers
can display the result in seconds. Computers can play games like chess and Go, and even beat
professional players of that game. However, humans are better than computers in many
aspects like recognition. But, deep learning systems challenge human beings in this aspect as
well. Machines can recognize human faces in a second. Still, there are tasks where humans
are better as machine learning systems still require quality data for model construction. The
quality of a learning system depends on the quality of data. This is a challenge. Some of the
challenges are listed below:
1. Problems – Machine learning can deal with the ‘well-posed’ problems where
specifications are complete and available. Computers cannot solve ‘ill-posed’ problems.
Consider one simple example (shown in Table):

Input (x1, x2)   Output (y)
1, 1             1
2, 1             2
3, 1             3
4, 1             4
5, 1             5

Can the model for this data be multiplication, that is, y = x1 × x2? Well, it is true!
But it is equally true that y may be y = x1 ÷ x2, or y = x1^x2 (x1 raised to the power x2).
So, there are at least three functions that fit the data. This means that the problem is ill-posed.
To solve this problem, one needs more examples to check the model. Puzzles and games that
do not have sufficient specification may become ill-posed problems, and scientific
computation has many ill-posed problems.
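
A small plain-Python check that all three candidate functions fit every row of the table above
equally well, which is exactly what makes the problem ill-posed:

```python
# Minimal sketch: three different hypotheses all fit the same observations.
data = [((1, 1), 1), ((2, 1), 2), ((3, 1), 3), ((4, 1), 4), ((5, 1), 5)]

hypotheses = {
    "y = x1 * x2":  lambda x1, x2: x1 * x2,
    "y = x1 / x2":  lambda x1, x2: x1 / x2,
    "y = x1 ** x2": lambda x1, x2: x1 ** x2,
}

for name, f in hypotheses.items():
    fits = all(f(x1, x2) == y for (x1, x2), y in data)
    print(name, "fits the data:", fits)   # all three print True
```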

2. Huge data – This is a primary requirement of machine learning. Availability of quality
data is a challenge. Quality data means the data should be large and should not have
problems such as missing or incorrect values.


3. High computation power – With the availability of Big Data, the computational resource
requirement has also increased. Systems with Graphics Processing Unit (GPU) or even
Tensor Processing Unit (TPU) are required to execute machine learning algorithms.
Also, machine learning tasks have become complex and hence time complexity has
increased, and that can be solved only with high computing power.

4. Complexity of the algorithms – The selection of algorithms, describing the algorithms,
applying algorithms to solve a machine learning task, and comparing algorithms have
become necessary for machine learning engineers and data scientists now. Algorithms have
become a big topic of discussion, and it is a challenge for machine learning professionals
to design, select, and evaluate optimal algorithms.

5. Bias/Variance – Bias is the error due to overly simple assumptions in the model, while
variance is the error due to the model's sensitivity to the particular training data. This leads to
a problem called the bias/variance trade-off. A model that fits the training data correctly but
fails on test data lacks generalization; this is called overfitting. The reverse problem is called
underfitting, where the model fails even on the training data and therefore also generalizes
poorly. Overfitting and underfitting are great challenges for machine learning algorithms.

2.5 MACHINE LEARNING PROCESS / LIFE CYCLE


This process involves six steps. The steps are listed below.

1. Understanding the business – This step involves understanding the objectives and
requirements of the business organization. Generally, a single data mining algorithm is
enough for giving the solution. This step also involves the formulation of the problem
statement for the data mining process.
2. Understanding the data – It involves the steps like data collection, study of the
characteristics of the data, formulation of hypothesis, and matching of patterns to the selected
hypothesis.
3. Preparation of data – This step involves producing the final dataset by cleaning the raw
data and preparation of data for the data mining process. The missing values may cause
problems during both training and testing phases. Missing data forces classifiers to produce
inaccurate results. This is a perennial problem for the classification models. Hence, suitable
strategies should be adopted to handle the missing data.


4. Modelling – This step involves applying a data mining algorithm to the data to obtain a
model or pattern.
5. Evaluate – This step involves the evaluation of the data mining results using statistical
analysis and visualization methods. The performance of the classifier is determined by
evaluating the accuracy of the classifier. The process of classification is a fuzzy issue. For
example, classification of emails requires extensive domain knowledge and requires domain
experts. Hence, performance of the classifier is very crucial.
6. Deployment – This step involves the deployment of results of the data mining algorithm to
improve the existing process or for a new situation.

2.6 MACHINE LEARNING APPLICATIONS

Machine Learning technologies are used widely now in different domains. Machine learning
applications are everywhere! One encounters many machine learning applications in the day-
to-day life. Some applications are listed below:
1. Sentiment analysis – This is an application of natural language processing (NLP) where the
words of documents are converted to sentiments like happy, sad, and angry which are
captured by emoticons effectively. For movie reviews or product reviews, five stars or one
star are automatically attached using sentiment analysis programs.
2. Recommendation systems – These are systems that make personalized purchases possible.
For example, Amazon recommends related books or books bought by people who have the
same taste as you, and Netflix suggests shows or movies related to your taste. These
recommendation systems are based on machine learning.
3. Voice assistants – Products like Amazon Alexa, Microsoft Cortana, Apple Siri, and Google
Assistant are all examples of voice assistants. They take speech commands and perform
tasks. These chatbots are the result of machine learning technologies.
4. Navigation – Technologies like Google Maps and those used by Uber are examples of
machine learning that locate places and navigate the shortest paths to reduce travel time.


Understanding Data

2.7 What is Data?

Data is a crucial component in the field of Machine Learning. It refers to the set of
observations or measurements that can be used to train a machine-learning model. The quality
and quantity of data available for training and testing play a significant role in determining the
performance of a machine-learning model. Data can be in various forms such as numerical,
categorical, or time-series data, and can come from various sources such as databases,
spreadsheets, or APIs. Machine learning algorithms use data to learn patterns and relationships
between input variables and target outputs, which can then be used for prediction or
classification tasks.
Data is typically divided into two types:
1. Labeled data
2. Unlabeled data
Labeled data includes a label or target variable that the model is trying to predict, whereas
unlabeled data does not include a label or target variable. The data used in machine learning is
typically numerical or categorical. Numerical data includes values that can be ordered and
measured, such as age or income. Categorical data includes values that represent categories,
such as gender or type of fruit.
Data preprocessing is an important step in the machine learning pipeline. This step can include
cleaning and normalizing the data, handling missing values, and feature selection or
engineering.

DATA: It can be any unprocessed fact, value, text, sound, or picture that is not being
interpreted and analyzed. Data is the most important part of all Data Analytics, Machine
Learning, and Artificial Intelligence. Without data, we can't train any model, and all modern
research and automation would be in vain. Big enterprises spend a lot of money just to
gather as much data as possible.
INFORMATION: Data that has been interpreted and manipulated and now has some
meaningful inference for the users.
KNOWLEDGE: Combination of inferred information, experiences, learning, and insights.
Results in awareness or concept building for an individual or organization.

Consider an example:
There is a Shopping Mart owner who conducted a survey, for which he has a long list of
questions and answers that he asked the customers; this list of questions and answers is
DATA. Now, every time he wants to infer anything, he cannot just go through each and
every question of thousands of customers to find something relevant, as that would be time-
consuming and not helpful. In order to reduce this overhead and time wastage and to make
the work easier, the data is manipulated through software, calculations, graphs, etc. as per his
own convenience; this inference from the manipulated data is Information. So, Data is a must
for Information. Now, Knowledge has its role in differentiating between two individuals having
the same information. Knowledge is actually not technical content but is linked to the human
thought process.

How do we split data in Machine Learning?


 Training Data: The part of the data we use to train our model. This is the data that your model
actually sees (both input and output) and learns from.
 Validation Data: The part of the data that is used to do frequent evaluation of the model fit
on the training dataset, and to tune the involved hyperparameters (parameters set before the
model begins learning). This data plays its part while the model is actually being trained.
 Testing Data: Once our model is completely trained, testing data provides an unbiased
evaluation. When we feed in the inputs of the testing data, our model predicts some
values (without seeing the actual output). After prediction, we evaluate our model by comparing
its predictions with the actual outputs present in the testing data. This is how we evaluate how
much our model has learned from the experiences fed in as training data at the time of
training.
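
As a minimal sketch (assuming Python with scikit-learn), a dataset can be split into training,
validation, and testing parts; the 60/20/20 proportions below are only an example.

```python
# Minimal sketch: splitting a dataset into training, validation, and testing sets.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve out 20% for testing, then 25% of the remainder for validation
# (0.25 * 0.8 = 0.2), giving roughly a 60/20/20 split.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 90, 30, 30 for the 150-sample Iris data
```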

Properties of Data –
1. Volume: The scale of data. With the growing world population and increasing exposure to
technology, huge amounts of data are being generated every millisecond.
2. Variety: Different forms of data – healthcare, images, videos, audio clippings.
3. Velocity: Rate of data streaming and generation.
4. Value: Meaningfulness of data in terms of information that researchers can infer from it.
5. Accessibility: The ease of obtaining and utilizing data for decision-making purposes.
6. Integrity: The accuracy and completeness of data over its entire lifecycle.

Advantages of using data in Machine Learning:


1. Improved accuracy: With large amounts of data, machine learning algorithms can learn
more complex relationships between inputs and outputs, leading to improved accuracy in
predictions and classifications.
2. Automation: Machine learning models can automate decision-making processes and can
perform repetitive tasks more efficiently and accurately than humans.
3. Personalization: With the use of data, machine learning algorithms can personalize
experiences for individual users, leading to increased user satisfaction.
4. Cost savings: Automation through machine learning can result in cost savings for
businesses by reducing the need for manual labor and increasing efficiency.


Disadvantages of using data in Machine Learning:

1. Bias: Data used for training machine learning models can be biased, leading to biased
predictions and classifications.
2. Privacy: Collection and storage of data for machine learning can raise privacy concerns and
can lead to security risks if the data is not properly secured.
3. Quality of data: The quality of data used for training machine learning models is critical to
the performance of the model. Poor quality data can lead to inaccurate predictions and
classifications.
4. Lack of interpretability: Some machine learning models can be complex and difficult to
interpret, making it challenging to understand how they are making decisions.

2.7.1 Data Types In ML


Data types are broadly classified into quantitative and qualitative.

1. Quantitative Data Type:

This type of data consists of numerical values – anything which is measured by numbers.
Ex.: profit, quantity sold, height, weight, temperature, etc.
This is again of two types:

A.) Discrete Data Type: Numeric data which takes discrete values or whole numbers.
If this type of variable is expressed in decimal format, the value has no proper meaning. Its
values can be counted. Ex.: number of cars you have, number of marbles in a container,
students in a class, etc.

B.) Continuous Data Type: Numerical measures which can take any value within a certain
range. If this type of variable is expressed in decimal format, the value has a true meaning. Its
values cannot be counted but can be measured, and the number of possible values is infinite.
Ex.: height, weight, time, area, distance, measurement of rainfall, etc.

2. Qualitative Data Type:

These are the data types that cannot be expressed in numbers. They describe categories or
groups and are hence known as categorical data types. This can be divided into:

A. Structured Data:
This type of data is either numbers or words arranged in a tabular format. It can take
numerical codes, but mathematical operations cannot meaningfully be performed on them.
E.g., Sunny = 1, Cloudy = 2, Windy = 3, or binary-form data like 0 or 1, Good or Bad, etc.

Fig: Structured Data


B. Unstructured Data:
This type of data does not have a proper format and is therefore known as unstructured data.
It comprises textual data, sounds, images, videos, etc.

Besides this, there are also other types, referred to as data type preliminaries or data measures:
 Nominal
 Ordinal
 Interval
 Ratio

I. Nominal Data Type:

This is used to express names or labels which have no order and are not measurable.
Ex.: male or female (gender), blood type, eye color, nationality, etc.

II. Ordinal Data Type:

This is also like nominal data but has some natural ordering associated with it.
Ex.: rating scales, shirt sizes, ranks, grades, etc.

III. Interval Data Type:

Categorizes and ranks data, and introduces precise and continuous intervals, e.g., temperature
measurements in Fahrenheit and Celsius, or the pH scale. Interval data always lacks what is
known as a 'true zero'. In short, this means that interval data can contain negative values and
that a measurement of 'zero' still represents a quantifiable amount of something rather than
the absence of the quantity.
Ex.: temperature measured in degrees Celsius, time of day, SAT score, credit score, pH, etc.

IV. Ratio Data Type:

Categorizes and ranks data, and uses continuous intervals (like interval data). However, it also
has a true zero, which interval data does not. Essentially, this means that when a variable is
equal to zero, there is none of that variable. An example of ratio data is temperature measured
on the Kelvin scale, for which there is no measurement below absolute zero (which represents
a total absence of heat).
Ex.: temperature in Kelvin, height, weight, age, etc.

Types of data based on variables

In the case of univariate data, the dataset has only one variable. A variable is also called a
category. Bivariate data means that the number of variables used is two, and multivariate data
uses three or more variables.

2.7.2 Data Storage and Representation


Machine learning is all about data and works best when a large amount of data is used. Many
machine learning frameworks support formats for packaging your data more efficiently.
Common formats include WebDataset for PyTorch and TFRecord for TensorFlow. Other
examples include using HDF5 (Hierarchical Data Format version 5) or LMDB (Lightning
Memory-Mapped Database) formats, or even humble ZIP files, e.g., via Python's zipfile
library. The main point with all of these formats is that instead of many thousands of small
files you have one or a few bigger files, which are much more efficient to access and read
sequentially.

Data representation methods in machine learning refer to the techniques used to transform and
present input data in a format that is suitable for training and evaluating machine learning
models. Effective data representation is crucial for ensuring that models can learn meaningful
patterns and relationships from the input features. Different types of data, such as numerical,
categorical, and text, may require specific representation methods, and the choice of method
depends on the nature of the data and the requirements of the machine learning task at hand.
Effective data representation contributes significantly to the model's ability to extract
meaningful insights and make accurate predictions.

Here are key aspects of data representation methods:

Numerical Data - Scaling and Normalization: Numerical features often have different scales,
and models might be sensitive to these variations. Scaling methods, such as Min-Max scaling or
Z-score normalization, ensure that numerical features are on a similar scale, preventing certain
features from dominating the model training process.
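
As a minimal sketch (assuming Python with scikit-learn and NumPy), the two scaling methods
mentioned above can be applied to a small made-up feature matrix.

```python
# Minimal sketch: Min-Max scaling and Z-score normalization of numerical features.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales: [age, annual income].
X = np.array([[25, 30000.0],
              [40, 80000.0],
              [60, 45000.0]])

print(MinMaxScaler().fit_transform(X))    # each column rescaled to the [0, 1] range
print(StandardScaler().fit_transform(X))  # each column to zero mean and unit variance
```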

Categorical Data - One-Hot Encoding: Categorical variables, which represent discrete
categories, need to be encoded numerically for machine learning models. One-hot encoding is a
common method where each category is transformed into a binary vector, with a 1 indicating
the presence of the category and 0 otherwise.
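
As a minimal sketch (assuming Python with pandas), a made-up colour column can be one-hot
encoded as follows.

```python
# Minimal sketch: one-hot encoding a categorical column with pandas.
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})
encoded = pd.get_dummies(df, columns=["colour"])

# Each category becomes its own indicator column, set where that category is present.
print(encoded)
```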

Text Data - Vectorization: Text data needs to be converted into a numerical format for machine
learning models. Techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or
word embeddings, such as Word2Vec or GloVe, are used to represent words or documents as
numerical vectors.
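
As a minimal sketch (assuming Python with scikit-learn), two made-up sentences can be turned
into TF-IDF vectors as follows.

```python
# Minimal sketch: turning raw text into TF-IDF feature vectors.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["machine learning learns from data",
        "deep learning uses neural networks"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(X.toarray().round(2))                 # TF-IDF weight of each word in each document
```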

Time Series Data - Temporal Features: For time series data, relevant temporal features may be
extracted, such as day of the week, month, or time of day. Additionally, lag features can be
created to capture historical patterns in the data.
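
As a minimal sketch (assuming Python with pandas and a made-up daily sales series),
day-of-week, month, and lag features can be extracted like this.

```python
# Minimal sketch: temporal and lag features for a daily time series.
import pandas as pd

sales = pd.DataFrame(
    {"sales": [100, 120, 90, 130, 110, 150, 95]},
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)

sales["day_of_week"] = sales.index.dayofweek    # 0 = Monday, ..., 6 = Sunday
sales["month"] = sales.index.month
sales["sales_lag_1"] = sales["sales"].shift(1)  # previous day's sales as a feature
print(sales)
```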

Image Data - Pixel Values: Images are typically represented as grids of pixel values. Deep
learning models, particularly convolutional neural networks (CNNs), directly operate on these
pixel values to extract hierarchical features.


Composite Data - Combining Representations: In some cases, datasets may consist of a
combination of numerical, categorical, and text features. Representing such composite data
involves using a combination of the methods mentioned above, creating a comprehensive and
effective input format for the model.

Sparse Data - Sparse Matrix: In cases where data is sparse, such as in natural language
processing with a large vocabulary, a sparse matrix representation may be used. This is an
efficient way to represent data with a significant number of zero values.

Handling Missing Data - Imputation Techniques: Dealing with missing values is an essential
part of data representation. Imputation techniques, such as filling missing values with mean or
median, are commonly applied to ensure that models can still learn from instances with
incomplete information.
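
As a minimal sketch (assuming Python with scikit-learn and NumPy), missing values marked as
np.nan can be filled with the column mean.

```python
# Minimal sketch: imputing missing values with the column mean.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

imputer = SimpleImputer(strategy="mean")   # strategy="median" is a common alternative
print(imputer.fit_transform(X))            # NaNs replaced by each column's mean
```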

Feature Engineering - Creating Informative Features: Feature engineering is the process of
creating new features or transforming existing ones to provide the model with more informative
input. This process is critical for enhancing the model's ability to capture relevant patterns.

2.8 Big Data Analytics


In this new digital world, data is being generated in enormous amounts, which opens new
paradigms. As we have high computing power and a large amount of data, we can use this data
for data-driven decision making. The main benefit of data-driven decisions is that they are
based on observed past trends that have led to beneficial results.
In short, we can say that data analytics is the process of manipulating data to extract useful
trends and hidden patterns that can help us derive valuable insights to make business
predictions.

Types of Data Analytics


There are four major types of data analytics:
1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics

Predictive Analytics
Predictive analytics turns data into valuable, actionable information. Predictive analytics
uses data to determine the probable outcome of an event or the likelihood of a situation
occurring. Predictive analytics draws on a variety of statistical techniques from modeling,
machine learning, data mining and game theory that analyze current and historical facts to make
predictions about a future event. Techniques that are used for predictive analytics are:
 Linear Regression
 Time Series Analysis and Forecasting
 Data Mining


Descriptive Analytics
Descriptive analytics looks at data and analyzes past events for insight into how to approach
future events. It looks at past performance and understands it by mining historical data to
understand the causes of success or failure in the past. Almost all management reporting, such
as sales, marketing, operations, and finance, uses this type of analysis.
The descriptive model quantifies relationships in data in a way that is often used to classify
customers or prospects into groups. Unlike a predictive model that focuses on predicting the
behavior of a single customer, descriptive analysis identifies many different relationships
between customers and products.
Common examples of Descriptive analytics are company reports that provide historic
reviews like:
 Data Queries
 Reports
 Descriptive Statistics
 Data dashboard

Prescriptive Analytics
Prescriptive analytics automatically synthesizes big data, mathematical science, business rules,
and machine learning to make a prediction and then suggests decision options to take
advantage of the prediction.
Prescriptive analytics goes beyond predicting future outcomes by also suggesting actions to
benefit from the predictions and showing the decision maker the implications of each decision
option. Prescriptive analytics not only anticipates what will happen and when it will happen,
but also why it will happen. Further, prescriptive analytics can suggest decision options on how
to take advantage of a future opportunity or mitigate a future risk, and illustrate the implications
of each decision option.
For example, Prescriptive analysis can benefit healthcare strategic planning by using analytics
to leverage operational and usage data combined with data of external factors such as
economic data, population demography, etc.

Diagnostic Analytics
In this analysis, we generally use historical data to answer a question or to solve a problem.
We try to find dependencies and patterns in the historical data of the particular problem.
For example, companies go for this analysis because it gives great insight into a problem, and
it lets them keep detailed information at their disposal; otherwise, data collection may have to
be repeated for every individual problem, which would be very time-consuming. Common
techniques used for Diagnostic Analytics are:
 Data discovery
 Data mining
 Correlations

Usage of Data Analytics


There are some key domains and strategic planning techniques in which Data Analytics has
played a vital role:


 Improved Decision-Making – If we have supporting data in favour of a decision, then we
can implement it with an even higher probability of success. For example, if a certain decision
or plan has led to better outcomes, then there will be no doubt about implementing it again.
 Better Customer Service – Churn modeling is the best example of this, in which we try to
predict or identify what leads to customer churn and change those things accordingly so
that customer attrition is as low as possible, which is a most important factor in any
organization.
 Efficient Operations – Data analytics can help us understand the demands of the situation
and what should be done to get better results; we can then streamline our processes, which
in turn leads to efficient operations.
 Effective Marketing – Market segmentation techniques are implemented to target customers
effectively, helping us find the marketing techniques which will increase our sales and lead
to effective marketing strategies.

2.9 The Big Data Framework

Frameworks provide structure. The core objective of the Big Data Framework is to provide a
structure for enterprise organisations that aim to benefit from the potential of Big Data. In order
to achieve long-term success, Big Data is more than just the combination of skilled people and
technology – it requires structure and capabilities.

The Big Data Framework was developed because – although the benefits and business cases of
Big Data are apparent – many organizations struggle to embed a successful Big Data practice in
their organization. The structure provided by the Big Data Framework provides an approach for
organizations that takes into account all organizational capabilities of a successful Big Data
practice. All the way from the definition of a Big Data strategy, to the technical tools and
capabilities an organization should have.

The main benefits of applying a Big Data framework include:

1. The Big Data Framework provides a structure for organisations that want to start with
Big Data or aim to develop their Big Data capabilities further.
2. The Big Data Framework includes all organisational aspects that should be taken into
account in a Big Data organization.
3. The Big Data Framework is vendor independent. It can be applied to any organization
regardless of choice of technology, specialisation or tools.
4. The Big Data Framework provides a common reference model that can be used across
departmental functions or country boundaries.
5. The Big Data Framework identifies core and measurable capabilities in each of its
domains so that the organization can develop over time.

Big Data is a people business. Even with the most advanced computers and processors in the
world, organisations will not be successful without the appropriate knowledge and skills. The
Big Data Framework therefore aims to increase the knowledge of everyone who is interested in
Big Data. The modular approach and accompanying certification scheme aim to develop
knowledge about Big Data in a similarly structured fashion.

The Big Data framework provides a holistic structure toward Big Data. It looks at the various
components that enterprises should consider while setting up their Big Data organization. Every
element of the framework is of equal importance and organisations can only develop further if
they provide equal attention and effort to all elements of the Big Data framework.

The Structure of the Big Data Framework


The Big Data framework is a structured approach that consists of core capabilities that
organisations need to take into consideration when setting up their Big Data organization.

The Big Data Framework consists of the following main elements:

1. Big Data Strategy - Data has become a strategic asset for most organisations. The capability
to analyse large data sets and discern pattern in the data can provide organisations with a
competitive advantage. Netflix, for example, looks at user behaviour in deciding what movies or
series to produce. Alibaba, the Chinese sourcing platform, became one of the global giants by
identifying which suppliers to loan money to and which to recommend on their platform. Big
Data has become Big Business.

2. Big Data Architecture - In order to work with massive data sets, organisations should have
the capabilities to store and process large quantities of data. In order to achieve this, the
enterprise should have the underlying infrastructure to facilitate Big Data. Enterprises should
therefore have a comprehensive Big Data Architecture to facilitate Big Data analysis. How
should enterprises design and set up their architecture to facilitate Big Data? And what are the
requirements from a storage and processing perspective?

3. Big Data Algorithms - A fundamental capability of working with data is to have a thorough
understanding of statistics and algorithms. Algorithms are unambiguous specifications of how to
solve a class of problems. Algorithms can perform calculations, data processing and automated
reasoning tasks. By applying algorithms to large volumes of data, valuable knowledge and
insights can be obtained.

4. Big Data Processes - In order to make Big Data successful in enterprise organization, it is
necessary to consider more than just the skills and technology. Processes can help enterprises to
focus their direction. Processes bring structure, measurable steps and can be effectively managed
on a day-to-day basis. Additionally, processes embed Big Data expertise within the organization
by following similar procedures and steps, embedding it as 'a practice' of the organization.

5. Artificial Intelligence - The last element of the Big Data Framework addresses Artificial
Intelligence (AI). In this part of the framework, we address the relation between Big Data and
Artificial Intelligence and outline key characteristics of AI. The last section of the framework
therefore showcases how AI follows as a logical next step for organisations that have built up the
other capabilities of the Big Data Framework. Artificial Intelligence can start to continuously
learn from the Big Data in the organization in order to provide long lasting value.


2.9.1 Data Collection

Data collection means pooling data by scraping, capturing, and loading it from multiple
sources, including offline and online sources. High volumes of data collection or data
creation can be the hardest part of a machine learning project. Data collection allows you to
capture a record of past events so that we can use data analysis to find recurring patterns.
From those patterns, you build predictive models using machine learning algorithms that
look for trends and predict future changes. How much data is needed is an interesting question,
but it has no definite answer, because the amount of data you need depends on how many
features there are in the data set. It is recommended to collect as much data as possible for
good predictions.
You can begin with small batches of data and see the result of the model. The most
important thing to consider while data collection is diversity. Diverse data will help your
model cover more scenarios. So, when focusing on how much data you need, you should
cover all the scenarios in which the model will be used.
The quantity of data also depends on the complexity of your model. If the task is as simple as
license plate detection, then you can expect predictions with small batches of data. But if you
are working on higher levels of artificial intelligence, like medical AI, you need to consider
huge volumes of data.
The main purpose of the data collection is to gather information in a measured and
systematic way to ensure accuracy and facilitate data analysis. Since all collected data are
intended to provide content for analysis of the data, the information gathered must be of the
highest quality to have any value.
Here are the traits that you want your datasets to have:
Uniformity: Regardless of where data pieces come from, they must be uniformly verified,
depending on the model. For instance, a well-annotated video dataset would not be uniform
when coupled with audio datasets designed specifically for NLP models like chatbots and
voice assistants.
Consistency: If data sets are to be considered high quality, they must be consistent. Every unit
of data, complementing every other unit, must help make the model's decision-making process
faster.
Comprehensiveness: Plan out every aspect and characteristic of the model and ensure that
the sourced datasets cover all the bases. For instance, NLP-relevant data must adhere to the
semantic, syntactic, and even contextual requirements.
Relevance: If you want to achieve a specific result, make sure the data is homogenous and
relevant so that AI algorithms can process it quickly.

2.9.2 Data Preprocessing

Pre-processing refers to the transformations applied to our data before feeding it to the
algorithm. Data preprocessing is a technique that is used to convert the raw data into a clean
data set. In other words, whenever the data is gathered from different sources it is collected in
raw format which is not feasible for the analysis.
Data Preprocessing is an important step in the machine learning algorithm. Imagine a situation
where you are working on an assignment at your college, and the lecturer does not provide the
raw headings and the idea of the topic. In this case, it will be very difficult for you to complete
that assignment, because the raw data is not presented well to you. The same is the case in
machine learning: suppose the data preprocessing step is missing while implementing the
machine learning algorithm; in that case, it will definitely affect your work at the end, at the
final stage of applying the available data set to your algorithm.
While performing data preprocessing, it is important to ensure data accuracy so that it doesn't
affect your machine learning algorithm at the final stage.

Major Tasks in Data Preprocessing


There are 4 major tasks in data preprocessing – Data cleaning, Data integration, Data reduction,
and Data transformation.

1.Data Cleaning
Data cleaning is the process of removing incorrect data, incomplete data, and inaccurate data
from the datasets, and it also replaces the missing values. Here are some techniques for data
cleaning:
Handling Missing Values
Standard values like “Not Available” or “NA” can be used to replace the missing values.
Missing values can also be filled in manually, but it is not recommended when the dataset is big.
The attribute's mean value can be used to replace the missing value when the data is normally
distributed, whereas in the case of a non-normal distribution, the median value of the attribute
can be used. While using regression or decision tree algorithms, the missing value can be
replaced by the most probable value.
Handling Noisy Data
Noisy data generally means data containing random errors or unnecessary data points. Handling
noisy data is one of the most important steps, as it leads to the optimization of the model we are
using. Here are some of the methods to handle noisy data.
Binning: This method is used to smooth or handle noisy data. First, the data is sorted, and then
the sorted values are separated and stored in the form of bins. There are three methods for
smoothing data in the bins. Smoothing by bin mean: in this method, the values in the bin are
replaced by the mean value of the bin. Smoothing by bin median: in this method, the values in
the bin are replaced by the median value. Smoothing by bin boundary: in this method, the
minimum and maximum values of the bin are taken as the bin boundaries, and each value in the
bin is replaced by the closest boundary value.
Clustering: This is used for finding the outliers and also in grouping the data. Clustering is
generally used in unsupervised learning.
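
As a minimal sketch (assuming Python with NumPy and made-up values), the "smoothing by
bin mean" method described above sorts the values, groups them into equal-size bins, and
replaces every value with its bin's mean.

```python
# Minimal sketch: smoothing noisy values by bin means.
import numpy as np

values = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = values.reshape(4, 3)                    # 4 bins of 3 sorted values each

smoothed = np.repeat(bins.mean(axis=1), 3)     # replace each value by its bin's mean
print(bins)
print(smoothed)                                # e.g., the bin [4, 8, 9] becomes [7, 7, 7]
```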

2.Data Integration

The process of combining multiple sources into a single dataset. The Data integration process is
one of the main components of data management. There are some problems to be considered
during data integration.
Schema integration: Integrates metadata(a set of data that describes other data) from different
sources.


Entity identification problem: Identifying entities from multiple databases. For example, the
system or the user should know that the student_id of one database and the student_name of
another database belong to the same entity.
Detecting and resolving data value conflicts: The data taken from different databases may
differ while merging. The attribute values from one database may differ from another database.
For example, the date format may differ, like “MM/DD/YYYY” or “DD/MM/YYYY”.

3.Data Reduction

This process helps in the reduction of the volume of the data, which makes the analysis easier yet
produces the same or almost the same result. This reduction also helps to reduce storage space.
Some of the data reduction techniques are dimensionality reduction, numerosity reduction, and
data compression.
Dimensionality reduction: This process is necessary for real-world applications as the data size
is big. In this process, the number of random variables or attributes is reduced so that the
dimensionality of the data set is reduced, by combining and merging attributes of the data
without losing its original characteristics. This also helps in the reduction of storage space and
computation time. When the data is highly dimensional, a problem called the “Curse of
Dimensionality” occurs.
Data compression: Representing data in a compressed form is called data compression. This
compression can be lossless or lossy. When there is no loss of information during compression,
it is called lossless compression, whereas lossy compression discards some information, but it
removes only the information considered unnecessary.

4.Data Transformation

The change made in the format or the structure of the data is called data transformation. This step
can be simple or complex based on the requirements. There are some methods for data
transformation.
Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps in
knowing the important features of the dataset. By smoothing, we can find even a simple change
that helps in prediction.
Aggregation: In this method, the data is stored and presented in the form of a summary. The
data set, which may come from multiple sources, is integrated and described for data analysis.
This is an important step since the accuracy of the analysis depends on the quantity and quality
of the data. When the quality and the quantity of the data are good, the results are more relevant.
Discretization: The continuous data here is split into intervals. Discretization reduces the data
size. For example, rather than specifying the class time, we can set an interval like (3 pm-5 pm,
or 6 pm-8 pm).
Normalization: It is the method of scaling the data so that it can be represented in a smaller
range. Example ranging from -1.0 to 1.0.
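
As a minimal sketch (assuming Python with NumPy and made-up values), min-max
normalization into the range -1.0 to 1.0 mentioned above uses
x' = (x - min) / (max - min) * (new_max - new_min) + new_min.

```python
# Minimal sketch: min-max normalization of values into the range [-1.0, 1.0].
import numpy as np

x = np.array([10.0, 20.0, 35.0, 50.0])
new_min, new_max = -1.0, 1.0

x_norm = (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min
print(x_norm)   # 10 maps to -1.0, 50 maps to 1.0, values in between scale proportionally
```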


2.10 Univariate Data Analysis and Visualization


Univariate data –
This type of data consists of only one variable. The analysis of univariate data is thus the
simplest form of analysis, since the information deals with only one quantity that changes. It
does not deal with causes or relationships, and the main purpose of the analysis is to describe
the data and find patterns that exist within it. An example of univariate data is height.

Suppose that the heights of seven students of a class are recorded; there is only one variable,
which is height, and it does not deal with any cause or relationship. The description of patterns
found in this type of data can be made by drawing conclusions using central tendency measures
(mean, median and mode), dispersion or spread of data (range, minimum, maximum, quartiles,
variance and standard deviation) and by using frequency distribution tables, histograms, pie
charts, frequency polygons and bar charts.
Frequency Distribution Tables
The frequency distribution table reflects how often an occurrence has taken place in the data. It
gives a brief idea of the data and makes it easier to find patterns.
Example:
The list of IQ scores is: 118, 139, 124, 125, 127, 128, 129, 130, 130, 133, 136, 138, 141, 142,
149, 130, 154.
IQ Range Number
118-125 3
126-133 7
134-141 4
142-149 2
150-157 1
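
As a minimal sketch (assuming Python with pandas), the frequency distribution table above can
be reproduced from the listed IQ scores.

```python
# Minimal sketch: building a frequency distribution table with pandas.
import pandas as pd

iq = [118, 139, 124, 125, 127, 128, 129, 130, 130, 133, 136,
      138, 141, 142, 149, 130, 154]

bins = [117, 125, 133, 141, 149, 157]          # edges for 118-125, 126-133, ...
labels = ["118-125", "126-133", "134-141", "142-149", "150-157"]

groups = pd.Series(pd.cut(iq, bins=bins, labels=labels))
print(groups.value_counts().sort_index())      # 3, 7, 4, 2, 1 - matching the table above
```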

Ø Bar Charts


The bar graph is very convenient while comparing categories of data or different groups of data.
It helps to track changes over time. It is best for visualizing discrete data.

Ø Histograms
Histograms look similar to bar charts, but they display the distribution of a numerical variable.
Histograms group values into bins, where each bin indicates the number of data points falling
within a range. They are best for visualizing continuous data.

Ø Pie Charts
Pie charts are mainly used to comprehend how a group is broken down into smaller pieces. The
whole pie represents 100 percent, and the slices denote the relative size of that particular
category.
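
As a minimal sketch (assuming Python with Matplotlib), the three chart types above can be
drawn with made-up category counts and the IQ scores from the frequency-table example.

```python
# Minimal sketch: bar chart, histogram, and pie chart for univariate data.
import matplotlib.pyplot as plt

iq = [118, 139, 124, 125, 127, 128, 129, 130, 130, 133, 136,
      138, 141, 142, 149, 130, 154]
categories, counts = ["A", "B", "C"], [10, 15, 7]          # made-up discrete groups

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].bar(categories, counts)                            # discrete categories
axes[0].set_title("Bar chart")
axes[1].hist(iq, bins=5)                                   # continuous values grouped into bins
axes[1].set_title("Histogram")
axes[2].pie(counts, labels=categories, autopct="%1.0f%%")  # slices of a whole
axes[2].set_title("Pie chart")
plt.tight_layout()
plt.show()
```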

2.11 Bivariate Data Analysis and Visualization


This type of data involves two different variables. The analysis of this type of data deals with
causes and relationships and the analysis is done to find out the relationship among the two
variables. Example of bivariate data can be temperature and ice cream sales in summer season.


Suppose the temperature and ice cream sales are the two variables of a bivariate dataset. Here,
the relationship is visible from the data: temperature and sales are directly proportional to each
other and thus related, because as the temperature increases, the sales also increase. Thus,
bivariate data analysis involves comparisons, relationships, causes and explanations. These
variables are often plotted on the X and Y axes of a graph for better understanding of the data,
and one of these variables is independent while the other is dependent.
The type of bivariate analysis depends upon the types of variables or attributes being analysed. A variable can be numerical, categorical or ordinal. If the dependent variable is categorical, such as the choice of a particular brand of pen, then logit or probit regression can be used. If both the independent and dependent attributes are ordinal, meaning they have a position or ranking, then a rank correlation coefficient can be measured. If the dependent attribute is ordinal, ordered logit or ordered probit can be utilised. If the dependent attribute is on a ratio or interval scale, like temperature, then regression can be used. Based on this, the types of bivariate data analysis are:
Numerical and Numerical – Both variables of the bivariate data, independent and dependent, have numerical values.
Categorical and Categorical – When both the variables are categorical.
Numerical and Categorical – When one variable is numerical and one is categorical.
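As a small illustration of the numerical-and-numerical case, the sketch below computes the correlation between temperature and ice cream sales and fits a simple regression line; the data values are made up.

```python
# Bivariate analysis sketch: temperature (independent) vs. ice cream sales (dependent).
import numpy as np
import matplotlib.pyplot as plt

temperature = np.array([28, 30, 32, 34, 36, 38])    # degrees Celsius
sales = np.array([120, 150, 180, 220, 260, 300])    # units sold

# Pearson correlation coefficient: close to +1 means directly proportional behaviour.
r = np.corrcoef(temperature, sales)[0, 1]
print("correlation:", round(r, 3))

# Simple linear regression via a degree-1 polynomial fit.
slope, intercept = np.polyfit(temperature, sales, 1)
print("fitted line: sales =", round(slope, 2), "* temperature +", round(intercept, 2))

plt.scatter(temperature, sales)
plt.plot(temperature, slope * temperature + intercept)
plt.xlabel("Temperature"); plt.ylabel("Ice cream sales")
plt.show()
```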

2.12 Multivariate Data Analysis and Visualization


Multivariate data
When the data involves three or more variables, it is categorized as multivariate. For example, suppose an advertiser wants to compare the popularity of four advertisements on a website; the click rates could be measured for both men and women, and the relationships between the variables can then be examined. Multivariate data is similar to bivariate data but contains more than one dependent variable. How the analysis is performed depends on the goals to be achieved. Some of the techniques are regression analysis, path analysis, factor analysis and multivariate analysis of variance (MANOVA).
There are many different tools, techniques and methods that can be used to conduct the analysis, such as software libraries, visualization tools and statistical testing methods. The rest of this section compares univariate, bivariate and multivariate analysis and describes some common multivariate techniques.

Multivariate Analysis
Multivariate analysis is required when more than two variables have to be analyzed
simultaneously. It is a tremendously hard task for the human brain to visualize a relationship
among 4 variables in a graph and thus multivariate analysis is used to study more complex sets
of data. Types of Multivariate Analysis include Cluster Analysis, Factor Analysis, Multiple
Regression Analysis, Principal Component Analysis, etc. More than 20 different ways to


perform multivariate analysis exist, and which one to choose depends upon the type of data and the end goal to be achieved. The most common ways are:

• Cluster Analysis
Cluster analysis groups different objects into clusters in such a way that the similarity between two objects from the same cluster is maximal and the similarity between objects from different clusters is minimal. It is used when the rows and columns of the data table represent the same units and the measure is a distance or a similarity.
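A minimal k-means example with scikit-learn is sketched below; the six 2-D points and the choice of two clusters are assumptions made purely for illustration (k-means is one common clustering method, not the only one).

```python
# A minimal k-means clustering sketch with scikit-learn (illustrative data).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # one group of similar objects
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])  # another group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster labels :", kmeans.labels_)
print("cluster centres:", kmeans.cluster_centers_)
```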

• Principal Component Analysis (PCA)


Principal Component Analysis (or PCA) is used for reducing the dimensionality of a data table with a large number of interrelated measures. The original variables are converted into a new set of variables, which are known as the Principal Components.
PCA is useful for datasets that show multicollinearity. In such cases the least squares estimates, although unbiased, have large variances and may therefore lie far from their true values. By using the principal components as inputs, PCA introduces some bias but reduces the standard error of the regression model.

• Correspondence Analysis
Correspondence analysis uses the data from a contingency table to show the relative relationships between and among two groups of variables. A contingency table is a 2D table whose rows and columns correspond to the two groups of variables.
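As a small illustration, the sketch below builds such a contingency table with pandas.crosstab from invented advertisement/gender data; performing the correspondence analysis itself would require an additional library and is not shown here.

```python
# Building the contingency table that correspondence analysis starts from.
import pandas as pd

df = pd.DataFrame({"ad":     ["A", "A", "A", "B", "B", "C", "C", "C", "D", "D"],
                   "gender": ["M", "F", "F", "M", "M", "F", "M", "F", "F", "M"]})

contingency = pd.crosstab(df["ad"], df["gender"])  # rows/columns = the two variable groups
print(contingency)
```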

2.13 Feature Engineering and Dimensionality Reduction Techniques


In machine learning, subset selection (feature selection) is the process of selecting a subset of
relevant features for use in model construction. The central premise when using a feature
selection technique is that the data contains many features that are either redundant or irrelevant,
and can thus be removed without incurring much loss of information. The various techniques
used to reduce the dimensionality are defined below:

1. Stepwise Forward Selection


We use the following notation for forward selection of features:
n : number of input variables
x_1, ..., x_n : input variables
F_i : a subset of the set of input variables
E(F_i) : error incurred on the validation sample when only the inputs in F_i are used
Steps:
1. Set F_0 = { } (the empty set) and E(F_0) = ∞.
2. For i = 0, 1, ..., repeat the following until E(F_{i+1}) >= E(F_i):


(a) For each input variable x_j not in F_i, train the model with the inputs F_i ∪ {x_j} and calculate E(F_i ∪ {x_j}) on the validation set.
(b) Choose the input variable x_m that causes the least error:
        m = argmin_j E(F_i ∪ {x_j})
(c) Set F_{i+1} = F_i ∪ {x_m}.
3. Output the set F_i as the best subset.
In this procedure, we stop when adding any further feature does not decrease the error E. We may even decide to stop earlier if the decrease in error is too small, where the threshold is user-defined and depends on the application constraints. This process may be costly: to decrease the number of dimensions from n to k, we need to train and test the system n + (n-1) + (n-2) + ... + (n-k) times, which is O(n^2).
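The following Python sketch follows the steps above. The error E(F) is taken here to be the validation mean squared error of a linear regression model; that choice of model and error measure is an assumption made for illustration only.

```python
# Stepwise forward selection sketch. X_train, X_val are NumPy feature matrices,
# y_train, y_val the corresponding targets; features are referred to by column index.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def error(F, X_train, y_train, X_val, y_val):
    """E(F): validation error when only the input variables in F are used."""
    cols = sorted(F)
    model = LinearRegression().fit(X_train[:, cols], y_train)
    return mean_squared_error(y_val, model.predict(X_val[:, cols]))

def forward_selection(X_train, y_train, X_val, y_val):
    n = X_train.shape[1]
    F, best_err = set(), float("inf")                  # F_0 = { }, E(F_0) = infinity
    while True:
        # (a) try adding each remaining variable and evaluate on the validation set
        candidates = [(error(F | {j}, X_train, y_train, X_val, y_val), j)
                      for j in range(n) if j not in F]
        if not candidates:                             # every variable is already in F
            break
        new_err, m = min(candidates)                   # (b) x_m with the least error
        if new_err >= best_err:                        # stop: no addition decreases E
            break
        F, best_err = F | {m}, new_err                 # (c) F_{i+1} = F_i ∪ {x_m}
    return F, best_err
```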

2. Stepwise Backward Elimination

We use the same notation as for forward selection:
n : number of input variables
x_1, ..., x_n : input variables
F_i : a subset of the set of input variables
E(F_i) : error incurred on the validation sample when only the inputs in F_i are used
Steps:
1. Set F_0 = {x_1, ..., x_n} and compute E(F_0), the error with all input variables.
2. For i = 0, 1, ..., repeat the following until E(F_{i+1}) >= E(F_i):
(a) For each input variable x_j in F_i, train the model with the inputs F_i - {x_j} and calculate E(F_i - {x_j}) on the validation set.
(b) Choose the input variable x_m whose removal causes the least error:
        m = argmin_j E(F_i - {x_j})
(c) Set F_{i+1} = F_i - {x_m}.
3. Output the set F_i as the best subset.
Here we stop when removing any further feature does not decrease the error (or decreases it only marginally).
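A matching sketch for backward elimination, reusing the error() helper defined in the forward-selection sketch above, could look like this:

```python
# Stepwise backward elimination sketch, reusing error() from the previous sketch.
def backward_elimination(X_train, y_train, X_val, y_val):
    n = X_train.shape[1]
    F = set(range(n))                                  # F_0 = {x_1, ..., x_n}
    best_err = error(F, X_train, y_train, X_val, y_val)
    while len(F) > 1:
        # (a) try removing each variable and evaluate on the validation set
        new_err, m = min((error(F - {j}, X_train, y_train, X_val, y_val), j) for j in F)
        if new_err >= best_err:                        # stop: every removal increases E
            break
        F, best_err = F - {m}, new_err                 # (c) F_{i+1} = F_i - {x_m}
    return F, best_err
```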

3. Principal Component Analysis

Principal Component Analysis is an unsupervised learning algorithm that is used for the
dimensionality reduction in machine learning. It is a statistical process that converts the
observations of correlated features into a set of linearly uncorrelated features with the help of
orthogonal transformation. These new transformed features are called the Principal
Components.

Some common terms used in PCA algorithm:


Dimensionality: It is the number of features or variables present in the given dataset. More simply, it is the number of columns present in the dataset.
Correlation: It signifies how strongly two variables are related to each other; if one changes, the other variable also changes. The correlation value ranges from -1 to +1.


A value of -1 occurs when the variables are inversely proportional to each other, and +1 indicates that the variables are directly proportional to each other.
Orthogonal: It defines that variables are not correlated to each other, and hence the correlation
between the pair of variables is zero.
Eigenvectors: Given a square matrix M and a non-zero vector v, v is an eigenvector of M if Mv is a scalar multiple of v.
Covariance Matrix: A matrix containing the covariance between the pair of variables is called
the Covariance Matrix.
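The sketch below walks through these steps with NumPy on a small, made-up 2-D dataset: centre the data, form the covariance matrix, take its eigenvectors, and project onto the leading principal component.

```python
# Step-by-step PCA sketch with NumPy (illustrative 2-D data).
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

X_centered = X - X.mean(axis=0)                 # centre each feature (column)
cov = np.cov(X_centered, rowvar=False)          # covariance matrix of the features
eigenvalues, eigenvectors = np.linalg.eigh(cov) # orthogonal eigenvector directions

# Sort the directions by decreasing eigenvalue and keep the first principal component.
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order]
X_reduced = X_centered @ components[:, :1]      # project data onto the top component

print("explained variance ratio:", eigenvalues[order] / eigenvalues.sum())
print("reduced data shape:", X_reduced.shape)
```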

4. Singular Value Decomposition

The Singular-Value Decomposition, or SVD for short, is a matrix decomposition method for
reducing a matrix to its constituent parts in order to make certain subsequent matrix calculations
simpler.
For simplicity, we will focus on the SVD for real-valued matrices and ignore the case of complex numbers.
A = U . Sigma . V^T
where A is the real m x n matrix that we wish to decompose, U is an m x m matrix, Sigma (often represented by the uppercase Greek letter Σ) is an m x n diagonal matrix, and V^T is the transpose of an n x n matrix V.
The diagonal values in the Sigma matrix are known as the singular values of the original matrix
A. The columns of the U matrix are called the left-singular vectors of A, and the columns of V
are called the right-singular vectors of A.
The SVD is calculated via iterative numerical methods. We will not go into the details of these
methods. Every rectangular matrix has a singular value decomposition (for complex matrices, the resulting factors may contain complex numbers), although the limitations of floating-point arithmetic may cause the numerical decomposition of some matrices to be inexact.
The singular value decomposition (SVD) provides another way to factorize a matrix, into
singular vectors and singular values. The SVD allows us to discover some of the same kind of
information as the eigen decomposition. However, the SVD is more generally applicable.
The SVD is widely used both in the calculation of other matrix operations, such as the matrix inverse, and as a data reduction method in machine learning. SVD can also be used in least squares linear regression, image compression, and denoising data.
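A short NumPy sketch of the factorization and of a rank-1 approximation (a simple form of data reduction) is given below; the 3 x 2 matrix is arbitrary.

```python
# SVD sketch: A = U . Sigma . V^T, plus a rank-1 reconstruction.
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])                      # a real 3 x 2 matrix

U, s, Vt = np.linalg.svd(A)                     # s holds the singular values of A
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)                      # build the m x n diagonal Sigma matrix

print(np.allclose(A, U @ Sigma @ Vt))           # True: A is recovered exactly

# Keep only the largest singular value (rank-1 approximation) to reduce the data.
A_rank1 = s[0] * np.outer(U[:, 0], Vt[0, :])
print(np.round(A_rank1, 2))
```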
