
GROW MORE FACULTY OF DIPLOMA ENGINEERING

SUBJECT - FUNDAMENTALS OF MACHINE LEARNING


CODE – 4341603

QUESTION BANK

UNIT 2 - Preparing to Model

2.1.1 Machine Learning activities

1. Data Collection: The first step in any machine learning project is to collect
relevant data. This can be done using a variety of methods, including web
scraping, surveys, and data APIs.

2. Data Preprocessing: Once data has been collected, it must be cleaned and
preprocessed. This involves removing duplicates, filling in missing values, and
transforming the data into a format suitable for machine learning algorithms.

3. Feature Engineering: Feature engineering involves selecting and extracting the
most important features from the data, which will be used as inputs to the
machine learning algorithm.

4. Model Selection: There are various machine learning models available, each with
their own strengths and weaknesses. The choice of model will depend on the
specific problem being addressed and the characteristics of the data.

5. Model Training: This involves training the machine learning model on the data to
learn patterns and relationships between the features and the target variable.

6. Model Evaluation: Once the model has been trained, it must be evaluated to
determine its accuracy and performance. This can be done using various metrics
such as accuracy, precision, recall, and F1 score.

7. Model Deployment: After the model has been evaluated, it can be deployed in
production to make predictions on new data.

8. Model Monitoring and Maintenance: Finally, the deployed model must be
monitored and maintained to ensure it continues to perform well over time. This
involves monitoring the model's accuracy, retraining it on new data as needed,
and making updates or improvements as necessary.
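To make these activities concrete, here is a minimal, illustrative sketch in Python (assuming scikit-learn is installed; the built-in Iris dataset and the logistic regression model are only example choices) covering collection, preprocessing, model selection, training, and evaluation:

```python
# Illustrative sketch of the machine learning workflow (steps 1 to 6).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# 1-2. "Collect" the data (a built-in dataset here) and preprocess it by scaling.
#      (Step 3, feature engineering, is omitted because these features are already prepared.)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 4-5. Select a model and train it on the training data.
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# 6. Evaluate the trained model on unseen test data.
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred, average="macro"))
```

Deployment and monitoring (steps 7 and 8) would then wrap such a trained model in an application and track its accuracy over time.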

2.1.2 Types of data in Machine Learning

Almost anything can be turned into data. Building a deep understanding of the different
data types is a crucial prerequisite for doing Exploratory Data Analysis (EDA) and
Feature Engineering for Machine Learning models. You also need to convert the data
types of some variables in order to make appropriate choices for visual encodings in
data visualization and storytelling.


Most data can be categorized into four basic types from a Machine Learning perspective:
numerical data, categorical data, time-series data, and text.

Data Types From A Machine Learning Perspective

Numerical Data

Numerical data is any data where data points are exact numbers. Statisticians might also
call numerical data quantitative data. This data has meaning as a measurement, such as
house prices, or as a count, such as the number of residential properties in Los Angeles
or how many houses were sold in the past year.

Numerical data can be either continuous or discrete. Continuous data can assume any
value within a range, whereas discrete data has distinct values.

For example, the number of students taking a Python class would be a discrete data set.
You can only have discrete whole-number values like 10, 25, or 33. A class cannot have
12.75 students enrolled; a student either joins a class or doesn't. On the other hand,
continuous data are numbers that can fall anywhere within a range. For instance, a
student could have an average score of 88.25, which falls anywhere between 0 and 100.

The takeaway here is that numerical data is not ordered in time. They are just numbers
that we have collected.

Categorical Data

Categorical data represents characteristics, such as a hockey player's position, team, or
hometown. Categorical data can take numerical values. For example, maybe we would
use 1 for the colour red and 2 for blue. But these numbers don't have a mathematical
meaning; that is, we can't add them together or take their average.

In the context of supervised classification, categorical data would be the class label. This
would also be something like whether a person is a man or a woman, or whether a
property is residential or commercial.

There is also something called ordinal data, which in some sense is a mix of numerical
and categorical data. In ordinal data, the data still falls into categories, but those
categories are ordered or ranked in some particular way. An example would be class
difficulty, such as beginner, intermediate, and advanced. Those three labels are a way
that we could describe the classes, and they have a natural order of increasing difficulty.

Another example is when we take quantitative data and split it into groups, so that we
have bins or categories of another type of data.


Ordinal Data

For plotting purposes, ordinal data is treated in much the same way as categorical data,
but groups are usually ordered from lowest to highest so that we can preserve this
ordering.
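As a small illustration, the sketch below (assuming the pandas library is available; the values are made up) shows plain categorical data and ordered (ordinal) categories:

```python
# Categorical vs ordinal data with pandas (illustrative values).
import pandas as pd

# Plain categorical data: labels with no mathematical meaning and no order.
colour = pd.Series(["red", "blue", "red", "green"], dtype="category")
print(colour.value_counts())

# Ordinal data: categories with a natural order (class difficulty).
difficulty = pd.Categorical(
    ["beginner", "advanced", "intermediate", "beginner"],
    categories=["beginner", "intermediate", "advanced"],
    ordered=True,
)
print(difficulty.min(), "<", difficulty.max())  # ordering is preserved
```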

Time Series Data

Time series data is a sequence of numbers collected at regular intervals over some
period of time. It is very important, especially in particular fields like finance. Time series
data has a temporal value attached to it, such as a date or a timestamp, so that you can
look for trends over time.

For example, we might measure the average number of home sales over many years. The
difference between time series data and plain numerical data is that rather than having a
bunch of numerical values without any time ordering, time-series data does have some
implied ordering: there is a first data point collected and a last data point collected.
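A brief sketch (pandas assumed; the sales figures are invented) of time-series data, where every value carries a timestamp:

```python
# Time-series data: values indexed by regular timestamps.
import pandas as pd

months = pd.date_range(start="2023-01-01", periods=6, freq="MS")  # month-start dates
home_sales = pd.Series([120, 135, 150, 160, 155, 170], index=months)

print(home_sales)
print(home_sales.rolling(window=3).mean())  # a simple moving average to look for a trend
```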
Text

Text data is basically just words. A lot of the time, the first thing you do with text is turn
it into numbers using some interesting transformation like the bag-of-words formulation.
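The sketch below (scikit-learn assumed; the sentences are invented, and get_feature_names_out requires a recent scikit-learn version) shows the bag-of-words idea, turning text into word counts:

```python
# Bag-of-words: represent each document by how often each word occurs.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["machine learning is fun", "learning from data is machine learning"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the vocabulary that was learned
print(counts.toarray())                    # each row is one document as word counts
```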

These are the four types of data from a Machine Learning perspective.


2.1.3 Structures of data

Data Structure for Machine Learning

Machine Learning is one of the hottest technologies used by data scientists and ML
experts to deploy real-time projects.

However, machine learning skills alone are not sufficient for solving real-world problems
and designing a better product; you also need good exposure to data structures.

The data structures used for machine learning are quite similar to those used in other
fields of software development.

Machine Learning is a subset of artificial intelligence that includes various complex
algorithms to solve mathematical problems to a great extent.

Data structures help to build and understand these complex problems.

Understanding data structures also helps you to build ML models and algorithms in a
much more efficient way than other ML professionals.

What is Data Structure?

A data structure is defined as the basic building block of computer programming that
helps us to organize, manage, and store data for efficient search and retrieval.

In other words, a data structure is a collection of data type 'values' which are stored
and organized in such a way that it allows for efficient access and modification.

Types of Data Structure

The data structure is the ordered sequence of data, and it tells the compiler how a
programmer is using the data such as Integer, String, Boolean, etc.

There are two different types of data structures: Linear and Non-linear data structures.
1. Linear Data structure:

The linear data structure is a special type of data structure that helps to organize and
manage data in a specific order where the elements are attached adjacently.

There are mainly 4 types of linear data structure as follows:

Array:

An array is one of the most basic and common data structures used in Machine
Learning. It is also used in linear algebra to solve complex mathematical problems. You
will use arrays constantly in machine learning, whether it's:

o To convert the column of a data frame into a list format in pre-processing analysis.
o To order the frequency of words present in datasets.
o Using a list of tokenized words to begin clustering topics.
o In word embedding, by creating multi-dimensional matrices.

An array contains index numbers to represent an element starting from 0. The lowest
index is arr[0] and corresponds to the first element.

Let's take an example of a Python array used in machine learning. Although the Python
array is quite different from arrays in other programming languages, the Python list is
more popular as it offers flexibility in data types and length. If you are using Python for
ML algorithms, it's better to start your journey with arrays.

Python list methods:

Method    Description

append()  Adds an element at the end of the list.

clear()   Removes all elements from the list.

copy()    Returns a copy of the list.

count()   Returns the number of elements with the specified value.

extend()  Adds the elements of another list to the end of the current list.

index()   Returns the index of the first element with the specified value.

insert()  Adds an element at a specified position using an index number.

pop()     Removes (and returns) an element from a specified position using an index number.

remove()  Removes the first element with the specified value.

reverse() Reverses the order of the list.

sort()    Sorts the list.
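A quick demonstration of these methods in plain Python (the values are arbitrary):

```python
# Common list methods in action.
scores = [88, 92, 75]
scores.append(60)          # add at the end -> [88, 92, 75, 60]
scores.insert(1, 95)       # add at index 1 -> [88, 95, 92, 75, 60]
scores.extend([70, 85])    # add elements of another list at the end
scores.remove(75)          # remove the first occurrence of 75
print(scores.index(92))    # position of the value 92
print(scores.count(85))    # how many times 85 appears
scores.sort()              # ascending order
scores.reverse()           # now descending order
last = scores.pop()        # remove and return the last element
backup = scores.copy()     # an independent copy
scores.clear()             # empty the original list
print(backup, last)
```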

Stacks:

Stacks are based on the concept of LIFO (Last in First out) or FILO (First In Last Out).

It is used for binary classification in deep learning.

Although stacks are easy to learn and implement in ML models, having a good grasp of
them can help in many areas of computer science, such as parsing grammars.

Stacks enable the undo and redo buttons on your computer, as they function like a stack
of blog content: there is no sense in adding a blog post at the bottom of the stack, and we
can only check the most recent one that has been added. Addition and removal occur at
the top of the stack.
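A minimal stack sketch using a plain Python list (the "actions" are only an example), where append pushes onto the top and pop removes from the top:

```python
# LIFO behaviour with a list: the most recently added item is removed first.
undo_stack = []
undo_stack.append("typed heading")   # push
undo_stack.append("added image")     # push
undo_stack.append("fixed typo")      # push

print(undo_stack.pop())   # "fixed typo" -> the most recent action is undone first
print(undo_stack.pop())   # "added image"
print(undo_stack[-1])     # peek at what is now on top
```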
Linked List:

A linked list is a type of collection having several separately allocated nodes. In other
words, it is a collection of data elements in which each element consists of a value and a
pointer that points to the next node in the list.

In a linked list, insertion and deletion are constant time operations and are very efficient,
but accessing a value is slow and often requires scanning.

So, a linked list is a useful alternative to a dynamic array when insertions would
otherwise require shifting of elements.

Although insertion of an element can be done at the head, middle, or tail position, it is
relatively costly. However, linked lists are easy to splice together and split apart. Also,
a list can be converted to a fixed-length array for fast access.
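A bare-bones singly linked list sketch in plain Python (the Node class and the values are illustrative):

```python
# Each node holds a value and a pointer to the next node.
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None


# Build a small list: 10 -> 20 -> 30
head = Node(10)
head.next = Node(20)
head.next.next = Node(30)

# Insert 15 after the head in constant time: no shifting of elements is needed.
new_node = Node(15)
new_node.next = head.next
head.next = new_node

# Accessing a value is slower: it requires scanning from the head.
node = head
while node is not None:
    print(node.value)   # prints 10, 15, 20, 30
    node = node.next
```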

Queue:

A queue is defined as a "FIFO" (first in, first out) structure.

It is useful for modelling a queuing scenario in real-time programs, such as people waiting
in line to withdraw cash at the bank.

Hence, the queue is significant in a program where multiple lists of codes need to be
processed.

The queue data structure can be used to record the split time of a car in F1 racing.
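A short FIFO sketch using collections.deque from the Python standard library (the customer names are invented):

```python
# First in, first out: customers are served in the order they arrive.
from collections import deque

cash_queue = deque()
cash_queue.append("customer 1")   # join the back of the line
cash_queue.append("customer 2")
cash_queue.append("customer 3")

print(cash_queue.popleft())  # "customer 1" is served first
print(cash_queue.popleft())  # then "customer 2"
```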

2. Non-linear Data Structures

As the name suggests, in non-linear data structures, elements are not arranged in any
sequence.

All the elements are arranged and linked with each other in a hierarchical manner, where
one element can be linked with one or more elements.
1) Trees

Binary Tree:

The concept of a binary tree is very similar to that of a linked list; the only difference lies
in the nodes and their pointers.

In a linked list, each node contains a data value with a pointer that points to the next
node in the list, whereas in a binary tree, each node has two pointers to subsequent
nodes instead of just one.

Binary search trees keep their nodes in sorted order, so insertion and deletion operations
can be done with O(log N) time complexity (for a balanced tree).

Similar to the linked list, a binary tree can also be converted to an array on the basis of
tree sorting.

2) Graphs

A graph data structure is also very useful in machine learning for link prediction.

A graph consists of nodes connected by edges, which can be directed (ordered pairs of
nodes) or undirected (unordered pairs). Hence, you must have good exposure to the
graph data structure for machine learning and deep learning.
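A minimal sketch of a graph stored as an adjacency list using a plain Python dictionary (the node names and edges are made up); counting the common neighbours of two unconnected nodes is one simple signal used in link prediction:

```python
# An undirected graph as an adjacency list: each node maps to its neighbours.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}

for node, neighbours in graph.items():
    print(node, "->", neighbours)

# A and D are not directly connected; their shared neighbours hint at a possible link.
common = set(graph["A"]) & set(graph["D"])
print("Common neighbours of A and D:", common)   # {'B', 'C'}
```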

3) Maps

Maps are a popular data structure in the programming world, mostly useful for reducing
the run time of algorithms and searching data quickly.

A map stores data in the form of (key, value) pairs, where each key must be unique while
values can be duplicated. Each key corresponds to, or maps to, a value; hence the name
map.

In different programming languages, core libraries have built-in maps or, rather,
HashMaps with different names for each implementation.
o In Java: Maps
o In Python: Dictionaries
o C++: hash_map, unordered_map, etc.

Python dictionaries are very useful in machine learning and data science, as various
functions and algorithms return a dictionary as output. Dictionaries are also often used
for implementing sparse matrices, which are very common in Machine Learning.
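A small sketch of a sparse matrix kept in a plain Python dictionary (the values are arbitrary): only the non-zero entries are stored, keyed by (row, column):

```python
# Sparse matrix as a dictionary of {(row, column): value}.
sparse = {
    (0, 2): 3.5,
    (1, 0): 1.0,
    (4, 3): 7.2,
}

def get(matrix, row, col):
    """Return the stored value, or 0.0 for positions that were never set."""
    return matrix.get((row, col), 0.0)

print(get(sparse, 0, 2))  # 3.5
print(get(sparse, 2, 2))  # 0.0 (not stored, so treated as zero)
```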

4) Heap data structure:

A heap is a hierarchically ordered data structure. The heap data structure is very similar
to a tree, but it is based on vertical ordering rather than horizontal ordering.

Ordering in a heap is applied along the hierarchy but not across it: in a max-heap, the
value of the parent node is always greater than that of its child nodes on either the left or
the right side (in a min-heap, it is always smaller).

Here, insertion and deletion operations are performed on the basis of promotion. First,
the element is inserted at the next available position at the bottom of the heap. After
that, it is compared with its parent and promoted until it reaches its correct ranking
position. Most heap data structures can be stored in an array along with the
relationships between the elements.
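A brief sketch with heapq from the Python standard library (note that heapq maintains a min-heap, so the smallest value sits at the top; the numbers are arbitrary):

```python
# Insertions are "promoted" internally so the smallest element stays at index 0.
import heapq

heap = []
for value in [42, 7, 19, 3, 25]:
    heapq.heappush(heap, value)

print(heap[0])              # 3, the smallest element (the top of the heap)
print(heapq.heappop(heap))  # removes and returns 3; the heap re-orders itself
print(heap[0])              # 7 is now the smallest
```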

Dynamic array data structure:

This is one of the most important types of data structure, used in linear algebra to work
with 1-D, 2-D, 3-D, and even 4-D arrays for matrix arithmetic.

Working with it also requires good exposure to Python libraries such as NumPy for
programming in deep learning.
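A short sketch (NumPy assumed) of multi-dimensional arrays used for matrix arithmetic:

```python
# 1-D, 2-D and 3-D arrays with NumPy.
import numpy as np

a = np.array([[1, 2], [3, 4]])   # a 2-D array (matrix)
b = np.array([[5, 6], [7, 8]])

print(a + b)          # element-wise addition
print(a @ b)          # matrix multiplication
print(a.reshape(4))   # the same data viewed as a 1-D array

cube = np.zeros((2, 3, 4))       # a 3-D array of shape 2 x 3 x 4
print(cube.shape)
```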

2.1.3 Data quality and remediation


Principles for Data Quality

On starting a data quality project, an organisation needs to decide the principles that will
underpin its approach.

These guide any remediation project from beginning to end.

Of course, the rule for principles is that there should not be many, but they should be
unbreakable. The principles below are critical to the successful execution of a data
quality project.

Principle 1: The Business is in the driving seat

Clarity is required at the start that data quality is a business problem and must be
solved by the business.

The IT department cannot and should not be running a data quality project. At the very
start both business and IT need to understand:

 The business is responsible for the data.

 The business is responsible for the quality of the data.

 The business is responsible for the remediation of the data.

 The business is responsible for defining the quality of the data needed.

If a data quality project is run by IT then it is most likely to fail.

However, the business needs to work in concert with IT to achieve their aims.

A data quality implementation needs to bring together the business and IT professionals
to work together for the benefit of a common goal.

This brings me to the next principle.

Principle 2: We’re all in this together.

The business cannot do it alone.

Data sits on IT systems and they are normally the only department with direct access.
Data quality remediation work normally divides into that work done manually, and that
done either via a bulk update or via a data quality application.

If a data quality improvement requires a bulk update of data then IT are the only ones
placed to perform the work.

If an application needs installation and configuration then IT are likely the best placed to
do the work.

Involvement of IT will be critical when it comes to remediation.

Equally, technology underpins processes, as most processes run at least partly on
technology systems.

If the data quality remediation is looking to change processes, this will also require the
buy-in of IT.

A Data Quality project should be a healthy partnership between the business and IT.

Principle 3: Do it once, do it well

A Data Quality project should only ever be implemented once.

It is often a large endeavour which will draw on resources from all areas of the
organisation.

It is not something that any organisation should want to do twice.

There is no point cleaning up the data for it to revert to a poor state a few months later.

This is frustratingly common, and why data quality is often seen as an insurmountable
problem.

The reason poor data quality keeps coming back is precisely because organisations,
and data quality projects, fail to think about the problem holistically.

The whole point about quality is that doing things quickly, cheaply, and badly costs you
money. A mantra for any data quality project should be:

“Do Not Run A Poor Quality Data Quality Project.”

Not understanding the above means you’ll be running another data quality project in a
couple of years, and will have wasted a lot of resources, both time and money, along the
way.
Principle 4: Treat data as an asset

Data should be treated as an asset, but what does this mean for a data quality project?
It means treating every bit of data as if it were a valuable physical asset.

It has taken time and effort for the customer to tell you their address.

It has taken time and effort for the call centre agent or branch staff member to type it
into the information systems.

This data has then been lovingly preserved for years, religiously backed-up and used
many times for verification.

It has cost money, probably quite a lot of money. Do not discard unless you are certain
it will not be valuable now or in the future.

Take time to update data with care. Look to understand data and why it is in its present
state before deciding on a solution.

Even if you have an obvious error, do not rush to remediate as this data error may be
an example of a process or data failure that will affect many thousands of records, and
the other examples may not be as obvious.

Do not treat your existing data as trash that deserves obliteration and replacement with
something shiny and new.

Principle 5: People are the key

Many organisations treat data quality as a technical problem to be solved by technical
people in technical ways.

However, poor data quality is a people and process problem with technological elements,
not a technological problem.

What’s more, in order to solve data quality, it is necessary to win the hearts and minds of
the organisation. It is necessary to engage with people, not computers.

 It is necessary to persuade the executive that data quality is causing them to
lose money on a day-to-day basis.

 It is necessary to persuade business leaders that, despite the word “Data” being
in the phrase “Data Quality”, solving the problem needs them to define “what
data” and “what quality”.

 It is necessary to persuade the IT department that they really do want to help
solve the data quality problem, as it will make their lives easier too.

 It is about persuading people there is a business case for a data quality project.

 It is about training people to recognise poor quality when they see it.

 It is about empowering people to do something about data quality.

 It is about managing the knowledge (generally in people’s heads) of how
information flows through an organisation.

 It is about understanding how people manage, and use, information, and how
they make decisions.

A data quality project needs to understand both its people, and the people in the
organisation, what they are doing with information and how they are doing it. People are
not technology.

They have hopes, fears, and aspirations.

They are irrational and cantankerous, and are not always open to change. Their
involvement needs to be nurtured.

Principle 6: Embed data quality in the organisation


I have always advocated that data quality process should become embedded in the
organisation.

After the initial pain of cleaning historical data is complete, then the organisation must
embed a good-quality mindset - rather than poor-or-irrelevant quality mindset.

At this point the data quality project should disappear. It has now become part of the
organisation.

Whilst some degree of monitoring is necessary to gently steer the process onward, what
is not needed is a large data quality department.

The objective of the approach is to make data quality endemic in the organisation.

Quality management needs to be in place so that data quality issues can be identified
and addressed, but that is all. Data Quality should become just another operational
measurement, and only require a brief look at the dials to make sure they are in the
green.

Principle 7: Do as little work as possible

The objective of a data quality project is neither to boil the ocean, nor to make data
quality perfect.

The objective is to do the minimum possible that allows the organisation to meet its
information needs.

The end state of the data should be described as “good enough”, not “perfect”.

Once this state of affairs is reached, a data quality project is complete and should stop
work and stop spending the organisation's money.

Fundamentally, the approach should be based around minimum necessary work. Work
needs to be undertaken in as effective and efficient a manner as possible, and only
ever done once.
2.1.4 Data Pre-Processing
 Dimensionality reduction
 Feature subset selection

Data Pre-Processing

Data preprocessing is a process of preparing the raw data and making it suitable for a
machine learning model.

It is the first and crucial step while creating a machine learning model.

When creating a machine learning project, it is not always the case that we come across
clean and formatted data.

And while doing any operation with data, it is mandatory to clean it and put it in a
formatted way. For this, we use the data preprocessing task.

Why do we need Data Preprocessing?


Real-world data generally contains noise and missing values, and may be in an unusable
format that cannot be directly used for machine learning models.

Data preprocessing is the required task of cleaning the data and making it suitable for a
machine learning model, which also increases the accuracy and efficiency of the model.

It involves below steps:

o Getting the dataset


o Importing libraries
o Importing datasets
o Finding Missing Data
o Encoding Categorical Data
o Splitting dataset into training and test set
o Feature scaling
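As an illustration, here is a hedged sketch of these steps on a tiny hand-made dataset (pandas and scikit-learn assumed; the column names and values are invented, and the sparse_output parameter requires a recent scikit-learn version):

```python
# Data preprocessing steps on a small, invented dataset.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Getting / importing the dataset (hard-coded here for illustration).
df = pd.DataFrame({
    "Country": ["India", "Spain", "India", "Germany"],
    "Age": [34, None, 29, 41],
    "Salary": [52000, 61000, None, 72000],
    "Purchased": [1, 0, 0, 1],
})

# Finding missing data: fill numeric gaps with the column mean.
imputer = SimpleImputer(strategy="mean")
df[["Age", "Salary"]] = imputer.fit_transform(df[["Age", "Salary"]])

# Encoding categorical data.
encoder = OneHotEncoder(sparse_output=False)
country_encoded = encoder.fit_transform(df[["Country"]])

# Splitting the dataset into training and test sets.
X = np.hstack([country_encoded, df[["Age", "Salary"]].to_numpy()])
y = df["Purchased"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature scaling.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```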

 Dimensionality reduction

What is Dimensionality Reduction?

The number of input features, variables, or columns present in a given dataset is known
as dimensionality, and the process to reduce these features is called dimensionality
reduction.

A dataset may contain a huge number of input features in various cases, which makes the
predictive modeling task more complicated. Because it is very difficult to visualize or
make predictions for a training dataset with a high number of features, dimensionality
reduction techniques are required in such cases.

A dimensionality reduction technique can be defined as "a way of converting a
higher-dimensional dataset into a lower-dimensional dataset while ensuring that it
provides similar information." These techniques are widely used in machine learning
for obtaining a better-fitting predictive model while solving classification and regression
problems.

It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc.

It can also be used for data visualization, noise reduction, cluster analysis, etc.
The Curse of Dimensionality

Handling high-dimensional data is very difficult in practice; this difficulty is commonly
known as the curse of dimensionality.

As the dimensionality of the input dataset increases, any machine learning algorithm and
model becomes more complex.

As the number of features increases, the number of samples needed also increases
proportionally, and the chance of overfitting increases.

If a machine learning model is trained on high-dimensional data, it becomes overfitted
and results in poor performance.

Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.
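As a brief illustration, here is a sketch of dimensionality reduction with PCA in scikit-learn (the digits dataset and the choice of 10 components are only examples):

```python
# Reduce 64 input features to 10 principal components.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)         # 64 features per sample
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)

print(X.shape)                              # (1797, 64)
print(X_reduced.shape)                      # (1797, 10)
print(pca.explained_variance_ratio_.sum())  # fraction of information retained
```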
 Feature subset selection

Feature Selection is the most critical pre-processing activity in any machine learning
process.

It intends to select a subset of attributes or features that makes the most meaningful
contribution to a machine learning activity.

In order to understand it, let us consider a small example i.e. Predict the weight of
students based on the past information about similar students, which is captured
inside a ‘Student Weight’ data set.

The data set has 4 features: Roll Number, Age, Height, and Weight. Roll Number has
no effect on the weight of the students, so we eliminate this feature.

So now the new data set will have only 3 features.

This subset of the data set is expected to give better results than the full set.

Age   Height   Weight
12    1.1      23
11    1.05     21.6
13    1.2      24.7
11    1.07     21.3
14    1.24     25.2
12    1.12     23.4

The above data set is a reduced dataset.
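A tiny sketch (pandas assumed) of producing this reduced dataset by dropping the irrelevant Roll Number column:

```python
# Select a feature subset by removing the irrelevant feature.
import pandas as pd

students = pd.DataFrame({
    "Roll Number": [1, 2, 3, 4, 5, 6],
    "Age": [12, 11, 13, 11, 14, 12],
    "Height": [1.1, 1.05, 1.2, 1.07, 1.24, 1.12],
    "Weight": [23, 21.6, 24.7, 21.3, 25.2, 23.4],
})

reduced = students.drop(columns=["Roll Number"])  # Roll Number tells us nothing about Weight
print(reduced)
```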


Before proceeding further, we should look at why we have reduced the dimensionality of
the above dataset, or in other words, what the issues are with High Dimensional Data.

High Dimensional refers to the high number of variables or attributes or features
present in certain data sets, more so in domains like DNA analysis, geographic
information systems (GIS), etc.

It may sometimes have hundreds or thousands of dimensions, which is not good from
the machine learning aspect because it may be a big challenge for any ML algorithm to
handle.

Moreover, a high amount of computational resources and a large amount of time will be
required.

Also, a model built on an extremely high number of features may be very difficult to
understand.

For these reasons, it is necessary to take a subset of the features instead of the
full set.

So we can deduce that the objectives of feature selection are:


1. Having a faster and more cost-effective (less need for computational resources)
learning model
2. Having a better understanding of the underlying model that generates the data.
3. Improving the efficacy of the learning model.

Main Factors Affecting Feature Selection

a. Feature Relevance:
In the case of supervised learning, the input data set (which is the training data set),
has a class label attached.

A model is induced based on the training data set, so that the induced model can assign
class labels to new, unlabeled data.

Each of the predictor variables is expected to contribute information to decide the value
of the class label.

In case a variable is not contributing any information, it is said to be irrelevant. In case
the information contribution for prediction is very little, the variable is said to be weakly
relevant.

The remaining variables, which make a significant contribution to the prediction task
are said to be strongly relevant variables.
In the case of unsupervised learning, there is no training data set or labelled data.
Grouping of similar data instances is done, and the similarity of data instances is
evaluated based on the values of different variables.
Certain variables do not contribute any useful information for deciding the similarity of
data instances.
Hence, those variables make no significant contribution to the grouping process.
These variables are marked as irrelevant variables in the context of the unsupervised
machine learning task.
We can understand the concept by taking a real-world example:

Earlier, we took a sample dataset of students.

In that, Roll Number doesn’t contribute any significant information in predicting what
the Weight of a student would be.

Similarly, if we are trying to group together students with similar academic capabilities,
Roll No can really not contribute any information.

So, in the context of grouping students with similar academic merit, the variable Roll
No is quite irrelevant.

Any feature which is irrelevant in the context of a machine learning task is a candidate
for rejection when we are selecting a subset of features.

b. Feature Redundancy:

A feature may contribute information that is similar to the information contributed by
one or more other features.

For example, in the Student Data-set, both the features Age & Height contribute
similar information.

This is because, with an increase in age, weight is expected to increase. Similarly, with
an increase in height, weight is also expected to increase.

So, in the context of that problem, Age and Height contribute similar information. In
other words, irrespective of whether the feature Height is present or not, the learning
model will give more or less the same results.

In this kind of situation where one feature is similar to another feature, the feature is
said to be potentially redundant in the context of a machine learning problem.
All features having potential redundancy are candidates for rejection in the final
feature subset.

Only a few representative features out of a set of potentially redundant features are
considered for being a part of the final feature subset.
So in short, the main objective of feature selection is to remove all features which are
irrelevant and take a representative subset of the features which are potentially
redundant.
This leads to a meaningful feature subset in the context of a specific learning task.
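As an illustration of checking for redundancy, the sketch below (pandas assumed; the data is the same invented student sample) uses the correlation between features, since highly correlated features carry similar information:

```python
# A simple redundancy check: highly correlated features are candidates for removal.
import pandas as pd

students = pd.DataFrame({
    "Age": [12, 11, 13, 11, 14, 12],
    "Height": [1.1, 1.05, 1.2, 1.07, 1.24, 1.12],
    "Weight": [23, 21.6, 24.7, 21.3, 25.2, 23.4],
})

print(students.corr())  # Age and Height are strongly correlated with each other

# Keep one representative feature and drop the potentially redundant one.
subset = students.drop(columns=["Height"])
print(subset.columns.tolist())   # ['Age', 'Weight']
```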
