
Short Brief – Machine Learning

Robert Serena

August 2022

Page 1 of 10
Executive Summary
The purpose of this brief is to provide some basic background and terminology on a key technological
area that is widely mentioned in the news media and popular culture - Machine Learning (ML). ML and
a closely related technology – Artificial Intelligence (AI) – are often used interchangeably in the
modern-day technology lexicon, but there are some substantive differences. So before delving into greater
detail, we first need to define each term.

Alternative definitions of AI:


• TechTarget → The simulation of human intelligence processes by machines, especially computer
systems. Specific applications of AI include expert systems, natural language processing, speech
recognition and machine vision.
• Britannica → The ability of a digital computer or computer-controlled robot to perform tasks
commonly associated with intelligent beings.
• IBM → The science and engineering of making intelligent machines, especially intelligent computer
programs. It is related to the similar task of using computers to understand human intelligence, but
AI does not have to confine itself to methods that are biologically observable.

Alternative definitions of ML:


• MIT Sloan → A type of artificial intelligence (AI) that allows software applications to become more
accurate at predicting outcomes without being explicitly programmed to do so. Machine learning
algorithms use historical data as input to predict new output values.
• IBM → A branch of artificial intelligence (AI) and computer science which focuses on the use of data
and algorithms to imitate the way that humans learn, gradually improving its accuracy.
• DeepAI → A field of computer science that aims to teach computers how to learn and act without
being explicitly programmed. More specifically, machine learning is an approach to data analysis
that involves building and adapting models, which allow programs to "learn" through experience.
Machine learning involves the construction of algorithms that adapt their models to improve their
ability to make predictions.

We can see from these definitions that ML is a subset of AI. Whereas AI is a term that applies broadly
to any type of computer technology – both hardware and software – ML is specifically focused on
software technology and involves the iterative application of a mathematical model/algorithm to an input
data set to optimize the predictive accuracy of the model for a given output variable of interest.
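To make the idea of iterative model fitting concrete, here is a minimal sketch in Python – my own illustration, not a specific ML library – that repeatedly adjusts the two parameters of a one-feature linear model to reduce prediction error, which is the essence of what an ML training loop does:

```python
def fit_linear(xs, ys, lr=0.01, steps=5000):
    """Return (b0, b1) minimizing mean squared error via gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for each observation under the current parameters.
        errs = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to b0 and b1.
        g0 = 2.0 * sum(errs) / n
        g1 = 2.0 * sum(e * x for e, x in zip(errs, xs)) / n
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

# Synthetic data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_linear(xs, ys)
```

Because the synthetic points lie exactly on y = 1 + 2x, the loop recovers values close to b0 = 1 and b1 = 2; the learning rate and step count are arbitrary illustration choices.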

Basic concepts and terminology


Before delving into the specifics of ML, a good first step is to set the foundation by describing some
basic concepts and defining key terminology.

A model is a mathematical representation of some real-world process and can only be developed using
input data sets. Depending on the complexity of the real-world process being modeled, the input data
set used to develop the model can range from very simple in structure (all structured data, low
dimensionality → limited number of features/variables and observations) to very complex (a broad mix
of structured and unstructured data, high dimensionality → thousands or millions of observations and
features/variables). For the reader new to model development, while the complexity of input data sets

can vary widely, the basic layout of any data set will be the same – Exhibit 1 below illustrates the
structure of a basic data set. The input variables can be referred to as either features or independent
variables (I’ll use features), and the output variables can be referred to as targets or dependent
variables (I’ll use targets).

In our sample input data set, the columns represent both the features and the target(s). The rows
represent the observations or unique combinations of values for each feature or target variable.

EXHIBIT 1 – Illustration of a generic data set
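The layout in Exhibit 1 is easier to grasp with a concrete case. Here is a small hypothetical data set in Python that follows the same structure – the feature and target names (engine_size, weight, mpg) and all values are invented for illustration:

```python
# Each row is one observation; each key is a column. The final column ("mpg")
# plays the role of the target; the others are features. All values invented.
rows = [
    {"engine_size": 1.6, "weight": 1200, "mpg": 38.0},
    {"engine_size": 2.0, "weight": 1450, "mpg": 31.5},
    {"engine_size": 3.5, "weight": 1900, "mpg": 22.0},
]

# Separate the features (inputs) from the target (output of interest).
features = [{k: v for k, v in row.items() if k != "mpg"} for row in rows]
targets = [row["mpg"] for row in rows]
```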

The next important topic is what I’ll call “Traditional Statistical Modeling” (TSM). TSM has been around
for decades and involves the application of well-developed statistical techniques (e.g., linear
regression, polynomial regression, generalized linear modeling) to input data sets of low dimensionality
(limited number of features and target(s), limited number of observations).

Exhibit 2 below is a graphic that illustrates the TSM processing flow – the model developer provides
the input data set and the mathematical formulation of the “black box”, and then a statistical
methodology is used to both initially estimate the parameter values and calculate any measures of
predictive accuracy (Mean Absolute Percentage Error (MAPE) being one example).

EXHIBIT 2 – Process flow for a Traditional Statistical Model
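As a concrete illustration of the MAPE metric mentioned above, here is a short Python sketch; the actual and forecast values are made up:

```python
def mape(actuals, forecasts):
    """Mean Absolute Percentage Error, returned as a percentage."""
    n = len(actuals)
    return 100.0 / n * sum(abs(a - f) / abs(a)
                           for a, f in zip(actuals, forecasts))

# Made-up actual vs. forecast values: errors of 10%, 10%, and 0%.
error = mape([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])
```

Each observation contributes its absolute percentage error (10%, 10%, and 0% here), and MAPE is their mean – roughly 6.67% in this case.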

For a TSM, the model developer will typically split an input data set into two segments – (1) The training
set that is used to estimate the parameter values of the model, and (2) The testing or evaluation set
that the model developer will use to test the predictive accuracy of the model. This approach is a bit
different from that used for ML models – in addition to the training and testing sets, the model developer
will also specify a 3rd segment referred to as the validation set. We’ll briefly discuss the ML approach
to data segmentation in the ML section below.

The relationships amongst and between the features themselves, as well as the relationships between
the features and the target variables, are typically well-defined and relatively stable through time. The
goal of TSM is to fit a mathematical relationship to the data using the training set, and then evaluate
the predictive accuracy of the model using the testing set. Linear regression is a commonly used form
of TSM, and the model development process involves estimating the coefficients of the functional form
given below:

• Yi = β0 + β1 Xi,1 + β2 Xi,2 + β3 Xi,3 + ⋯ + βk Xi,k


• Yi = the modeled or estimated value of the target variable for observation i in the input data set.
• β0 , β1 , β2 , β3 , … , βk = the (k+1) estimated parameters that map the feature variables (X’s) to the
target variable (Y).
• Xi,1 , Xi,2 , Xi,3 , … , Xi,k = the values of the k feature variables for observation i.
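For the simplest case of one feature (k = 1), the coefficients have a well-known closed-form least-squares solution, sketched below in Python; multi-feature models solve the analogous normal equations in matrix form:

```python
def least_squares(xs, ys):
    """Closed-form least-squares fit of y = b0 + b1*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # Slope: covariance of (x, y) divided by the variance of x.
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar          # intercept from the two sample means
    return b0, b1

# Points lying exactly on y = 1 + 2x, so the fit recovers b0 = 1, b1 = 2.
b0, b1 = least_squares([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```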

Finally, models of any type are used to predict two different types of values – (1) Regression values or
(2) Classification values. Each of these is defined below.

• Regression
• Using historical data to estimate the future numerical value of a target variable.
• It’s important to note that the target is typically a continuous variable, meaning that it can take
on an infinite number of values in a range between the minimum and maximum permitted
values.
• EXAMPLE – John Smith is a data scientist who works for Ford Motor Company. John
specializes in developing mathematical models that predict the fuel economy of an
automobile given a range of input features (e.g., engine size, gross vehicle weight, tire size,
gasoline quality grade, etc). John uses historical fuel mileage data that Ford has compiled over
the past 30 years on a wide range of car models to develop the model.
• Classification
• Using historical data to classify the value of a categorical variable (i.e., a variable that can take
on a finite set of enumerated values).
• EXAMPLE – Jane Smith is a data scientist hired by a department store chain to develop a model
that can predict the types of purchases (across a fixed set of goods available in the store –
furniture, clothing, appliances, electronics, food, luxury goods, etc) that a given customer will
make based on a set of features (e.g., type of job, gender, income, where they live, age, etc).
The company has historical data on customer purchasing preferences for the past 30 years.

Machine Learning – Key Points
As a discrete technology, the usage of ML has increased dramatically over the past 10 years, a
trend that will likely accelerate over the next 10 years. There are several drivers for
this trend:

• Dramatic increases in the amount and variety of data available, particularly unstructured data.
• Increases in the amount of internet bandwidth and data storage options available, coupled with
more cost-effective pricing for both technologies.
• Finally, ongoing dramatic increases in the computational processing power of both desktop-grade
and server-grade computers.

ML techniques are particularly well suited to solving the following types of problems:
• Identifying classifications and causal relationships in large, complex data sets without specified
target values.
• The target values in a complex data set are known (labeled), but the phenomenon being modeled
has high dimensionality and complexity…there are many heterogeneous input variables that impact
the result, and the relationships amongst and between these input variables are not stable and
change over time.
• When the predictive accuracy of a constructed model is more important than the interpretability or
simplicity of the model structure.

There are several different classes of ML models…these are illustrated in Exhibit 3 below.

Model class: Reinforcement Learning
• Involves a "trial and error" approach with three components – (1) An agent that makes decisions,
(2) The environment that the agent interacts with, and (3) Specified actions that the agent can take.
• There are no features or target values provided.
• The intent of Reinforcement Learning is for the agent to perform actions that lead to the maximum
reward or drive the most optimal outcome.
• Much like the other types of ML models, this approach employs a "stopping rule" – the process
continues until the agent reaches a pre-specified optimal outcome value.
• Sub-classes: Agent-based methods.

Model class: Semi-Supervised Learning
• A combination of supervised and unsupervised learning techniques.
• Typically applied when the data acquisition cost to perform supervised learning is too high.
• Sub-classes: Image classification, text classification.

Model class: Supervised Learning
• The model is "trained" on an input data set that is complete with values of the target variables of
interest (there may be more than one target variable).
• The term used to describe input data sets used to develop supervised learning models is "labeled"
→ the values of the target are labeled and can immediately be classified.
• The trained model is then used to predict these variable(s) of interest, and these predictions are
compared to the actual values.
• The model developer will specify a target performance metric (one example is MAPE – Mean
Absolute Percentage Error) to gauge the predictive performance of the model.
• This iterative "train and refine" process is repeated until the value of the selected performance
metric is within tolerance.
• Sub-classes: Linear regression, logistic regression, decision trees, random forests, support vector
machines, neural networks, gradient boosting machines.

Model class: Unsupervised Learning
• The input data set does not have target variables, and there are no rules applied to the input data.
• The goal here is for the model to detect previously unknown patterns and relationships in the input
data set.
• One use case for unsupervised learning is in transaction monitoring for firms that trade securities
or derivatives in different organized markets (e.g., investment banks in bonds & equities, oil & gas
firms in natural gas and oil derivatives, hedge funds in structured mortgage securities). The model
developer will look to group traders into homogeneous groups based on specific behavioral
attributes.
• Sub-classes: K-means clustering, hierarchical clustering, spectral clustering, affinity analysis,
dimensionality reduction.
EXHIBIT 3 – Different classes of ML models
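To give a flavor of the unsupervised class, here is a toy Python sketch of K-means clustering (one of the sub-classes listed in Exhibit 3) on one-dimensional points; real work would use a library implementation over many features:

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Toy K-means on 1-D points; returns the sorted cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # pick k starting centers
    for _ in range(iters):
        # Step 1: assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Step 2: move each center to the mean of its assigned points.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

points = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]      # two visually obvious groups
centers = kmeans_1d(points, k=2)
```

No target values are supplied anywhere – the two group centers (near 1.0 and 9.0 for this made-up data) emerge purely from the structure of the inputs, which is the defining trait of unsupervised learning.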

Exhibit 4 below illustrates the processing flow for a typical ML model. Whereas a TSM model has the
input features, output targets, and the mathematical relationship pre-specified by the model developer,
the development process for an ML model is a bit different. The model developer will provide the input
data set (potentially without target variables), a set of potential mathematical relationships to test, a
learning algorithm, and initial values for the parameters. Then the exercise is for the model to “train
itself” to find the optimal parameter value settings and mathematical formulation.

EXHIBIT 4 - Process flow for an ML Model

As with the model development process for a TSM, the ML model developer will segment the
input data set, but into three segments rather than two – (1) Training, (2) Validation, and (3) Testing. Exhibit 5
illustrates this data segmentation process.
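The 3-way segmentation can be sketched in Python as follows – the 70/15/15 proportions are a common choice I've assumed for illustration, not a requirement:

```python
import random

def split_dataset(rows, train_frac=0.70, valid_frac=0.15, seed=42):
    """Shuffle the observations, then cut them into training/validation/testing."""
    shuffled = rows[:]                        # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)     # shuffle to avoid ordering bias
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    return (shuffled[:n_train],                   # (1) estimate parameters
            shuffled[n_train:n_train + n_valid],  # (2) tune and compare models
            shuffled[n_train + n_valid:])         # (3) final accuracy check

train_set, valid_set, test_set = split_dataset(list(range(100)))
```

Shuffling with a fixed seed keeps the split reproducible while still breaking any ordering in the raw data.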

EXHIBIT 5 – Data segmentation approach during ML model development

Finally, any model developer considering the use of ML techniques to model a real-world phenomenon
should consider the following basic questions:

• What phenomenon are we trying to model, and what are the target variable(s) of interest? Of the
target variables of interest, which are categorical, and which are numerical?
• How complete and accurate is the available data set that we’ll use to initially estimate the model?
Do we have enough internally generated data, or do we need to supplement the internal data with
external data?
• Has an appropriate level of exploratory data analysis (EDA) been performed on the input
data set to detect any meaningful relationships among the features in the data set?
• Does the input data set require large amounts of cleaning and transforming to make it useable?
• Has the input data set been split into training, validation, and testing segments to ensure that the
model with the best predictive accuracy is developed?
• Does the model’s structure robustly reflect the complexity and mechanics of the real-world
phenomenon being modeled, or were simplifying assumptions required to make the model
mathematically tractable?
• Are the model predictions accurate and interpretable? Can they be trusted, and by extension, does
the model lend itself to implementation into established business processes?

Terms of Reference

Term Source Definition


Categorical variable DeepAI.org • A variable that assumes a limited and fixed number of
possible values, allowing a data unit to be assigned to a
broad category for classification.

Clustering Nvidia.com • The grouping of objects such that objects in the same
cluster are more like each other than they are to objects in
another cluster.
• The classification into clusters is done using criteria such
as smallest distances, density of data points, graphs, or
various statistical distributions.

Continuous variable ScienceDirect.com • A variable that can take on an unlimited number of values
within the {minimum, maximum} range permitted.
• EXAMPLES – Speed, heart rate, body weight, income,
gross domestic product, etc.

Data Engineering Miscellaneous • Data engineering is the complex task of making raw data
usable to data scientists and groups within an organization.
• This topic covers multiple technical aspects of data
science, including hardware procurement and set-up, data
source identification and compilation, data preparation
(data cleaning, data transformation).

Data Model Wikipedia.org • An abstract model that organizes elements of data and
standardizes how they relate to one another and to the
properties of real-world entities.
• For instance, a data model may specify that the data
element representing a car be composed of several other
elements which, in turn, represent the color and size of the
car and define its owner.

Data Science IBM • A cross-functional specialty that combines math and statistics, specialized
programming, advanced analytics, artificial intelligence (AI), and machine learning with specific
subject matter expertise to uncover actionable insights hidden in an organization’s data.
• These insights can be used to guide decision making and
strategic planning.

Deep learning Techopedia • An iterative approach to artificial intelligence (AI) that stacks machine
learning algorithms in a hierarchy of increasing complexity and abstraction.
• Each deep learning level is created with knowledge gained
from the preceding layer of the hierarchy.

Dev Ops Amazon • The combination of cultural philosophies, practices, and tools that increases an
organization’s ability to deliver applications and services at high velocity: evolving and improving
products at a faster pace than organizations using traditional software development and
infrastructure management processes.
• This speed enables organizations to better serve their
customers and compete more effectively in the market.

Dimensionality MachineLearningMastery.com • The number of attributes/features that exist in a dataset.
• A dataset with many attributes, generally of the order of a hundred or more, is referred to as high
dimensional data.

Exploratory Data Analysis (EDA) Miscellaneous • The process of performing initial investigations on
an input data set being used to train an ML model.
• The goal of EDA is to use summary statistics and data visualizations to discover patterns and/or
anomalies in the input data set.

Feature DataRobot.com • A measurable property of the object you’re trying to analyze.
• In datasets, features appear as columns.

Information Governance Gartner • The specification of decision rights and an accountability
framework to ensure appropriate behavior in the valuation, creation, storage, use, archiving and
deletion of information.
• It includes the processes, roles and policies, standards and metrics that ensure the effective and
efficient use of information in enabling an organization to achieve its goals.

Mean Absolute Percentage Error (MAPE) • A measure of prediction accuracy of a forecasting
method in statistics. It usually expresses the accuracy as a ratio defined by the formula:
• MAPE = (100% / n) × Σt=1..n ( |At − Ft| / At )
• At = the actual value of the target variable.
• Ft = the forecasted or predicted value of the target
variable.
• n = the number of observations in the input data set.

Model Office of the Comptroller of the Currency (OCC) • Refers to a quantitative method, system,
or approach that applies statistical, economic, financial, or mathematical theories, techniques, and
assumptions to process input data into quantitative estimates.
• A model consists of three components: (1) An information input component, which delivers
assumptions and data to the model; (2) A processing component, which transforms inputs into
estimates; and (3) A reporting component, which translates the estimates into useful business
information.

Model hyperparameters MachineLearningMastery.com • A model hyperparameter is a configuration
that is external to the model and whose value cannot be estimated from data.
• They are often used in processes to help estimate model parameters.
• They are often specified by the practitioner.
• They can often be set using heuristics.
• They are often tuned for a given predictive modeling problem.

Model overfitting IBM • Overfitting is a concept in data science, which occurs when
a statistical model fits exactly against its training data.
• When this happens, the algorithm developed will typically
not perform well against any data set that is outside of its
initial training set.
• Generalization of a model to new data is ultimately what
allows us to use machine learning algorithms every day to
make predictions and classify data.

Model parameters Miscellaneous • Parameters in the model that are estimated using the
training data set and are then used to generate the model’s
predictions.
• Also referred to as the weights, coefficients, or fitted
parameters.

Model Training C3.ai • Model training is the phase in the data science development lifecycle where
practitioners try to fit the best combination of weights and bias to a machine learning algorithm to
minimize a loss function over the prediction range.
• The purpose of model training is to build the best mathematical representation of the relationship
between data features and a target label (in supervised learning) or among the features themselves
(unsupervised learning). Loss functions are a critical aspect of model training since they define how
to optimize the machine learning algorithms.

Natural Language Processing Investopedia.com • A field of artificial intelligence (AI) that enables
computers to analyze and understand human language, both written and spoken.

Neural Networks Merriam Webster • Computer architecture in which several processors are
interconnected in a manner suggestive of the connections
between neurons in a human brain and which can learn by
a process of trial and error.

Principal Component Analysis (PCA) Programmathically.com • A technique commonly used for
reducing the dimensionality of data while preserving as much as possible of the information
contained in the original data.
• PCA achieves this goal by projecting data onto a lower-dimensional subspace that retains most of
the variance among the data points.

Traditional statistical modeling Miscellaneous • An analytical approach that has been in use for many
years and one that works when applied to data sets of low to medium dimensionality (number of
features in the input data set) in which the relationships amongst and between the features and
outputs are relatively stable through time.

Unstructured data Wikipedia.org • Unstructured data (or unstructured information) is information
that either does not have a pre-defined data model or is not organized in a pre-defined manner.
• Unstructured information is typically text-heavy, but may
contain data such as dates, numbers, and facts as well.

