Robert Serena
August 2022
Executive Summary
The purpose of this brief is to provide some basic background and terminology on a key technological
area that is widely mentioned in the news media and popular culture – Machine Learning (ML). ML and
a closely related technology – Artificial Intelligence (AI) – are often used interchangeably in the modern-
day technology lexicon, but there are some substantive differences. So before delving into greater
detail, we first need to define each term.
We can see from these definitions that ML is a subset of AI. Whereas AI is a term that applies broadly
to any type of computer technology – both hardware and software – ML is specifically focused on
software technology and involves the iterative application of a mathematical model/algorithm to an input
data set to optimize the predictive accuracy of the model for a given output variable of interest.
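That iterative optimization loop can be sketched in miniature. The single-parameter model, data, and learning rate below are all illustrative assumptions, not taken from the brief – the point is only to show a model being applied repeatedly to an input data set to improve its predictive accuracy:

```python
# Minimal sketch: iteratively adjust a model parameter w so that
# predictions w * x approach the observed targets y (gradient descent).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]    # roughly y = 2x, with noise

w = 0.0                      # initial parameter guess
lr = 0.01                    # learning rate

for _ in range(500):
    # Gradient of mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad           # step in the direction that reduces error

print(round(w, 2))           # converges to about 1.99, the least-squares slope
```

Each pass through the loop is one "iteration" in the sense used above: the model's parameter is nudged so that the next round of predictions is slightly more accurate.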
A model is a mathematical representation of some real-world process and can only be developed using
input data sets. Depending on the complexity of the real-world process being modeled, the input data
set used to develop the model can range from very simple in structure (all structured data, low
dimensionality → limited number of features/variables and observations) to very complex (a broad mix
of structured and unstructured data, high dimensionality → thousands or millions of observations and
features/variables). For the reader new to model development, while the complexity of input data sets
can vary widely, the basic layout of any data set will be the same – Exhibit 1 below illustrates the
structure of a basic data set. The input variables can be referred to as either features or independent
variables (I’ll use features), and the output variables can be referred to as targets or dependent
variables (I’ll use targets).
In our sample input data set, the columns represent both the features and the target(s). The rows
represent the observations or unique combinations of values for each feature or target variable.
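The layout just described – columns as features and target(s), rows as observations – can be sketched with a small, made-up data set (the column names and values here are invented for illustration):

```python
# Hypothetical data set: each row is one observation; the first
# columns are features, the last column is the target.
columns = ["engine_size_l", "weight_kg", "mpg"]   # mpg is the target
rows = [
    [1.6, 1200, 38.0],
    [2.0, 1400, 32.5],
    [3.5, 1900, 24.0],
]

features = [row[:-1] for row in rows]   # independent variables
targets  = [row[-1]  for row in rows]   # dependent variable

print(len(rows), "observations,", len(columns) - 1, "features")
```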
The next important topic is what I’ll call “Traditional Statistical Modeling” (TSM). TSM has been around
for decades and involves the application of well-developed statistical techniques (e.g., linear
regression, polynomial regression, generalized linear modeling) to input data sets of low dimensionality
(limited number of features and target(s), limited number of observations).
Exhibit 2 below is a graphic that illustrates the TSM processing flow – the model developer provides
the input data set and the mathematical formulation of the “black box”, and then a statistical
methodology is used to both initially estimate the parameter values and calculate any measures of
predictive accuracy (Mean Absolute Percentage Error (MAPE) being one example).
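As a concrete illustration, MAPE can be computed in a few lines; the actual and predicted values below are invented:

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error: the average of |actual - predicted| / |actual|,
    expressed as a percentage. Assumes no actual value is zero."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)

# Illustrative values only
print(round(mape([100, 200, 400], [110, 190, 420]), 2))   # 6.67
```

A lower MAPE means the model's predictions sit closer, in percentage terms, to the observed values.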
For a TSM, the model developer will typically split an input data set into two segments – (1) The training
set that is used to estimate the parameter values of the model, and (2) The testing or evaluation set
that the model developer will use to test the predictive accuracy of the model. This approach is a bit
different from that used for ML models – in addition to the training and testing sets, the model developer
will also specify a third segment referred to as the validation set. We'll briefly discuss the ML approach
to data segmentation in the ML section below.
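A simple two-way split of the kind used for a TSM can be sketched as follows; the 80/20 ratio and fixed seed are common conventions assumed here, not prescribed by the brief:

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the observations, then hold out the last fraction for testing."""
    rows = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

data = list(range(10))                  # stand-in for 10 observations
train, test = train_test_split(data)
print(len(train), len(test))            # 8 2
```

Shuffling before the split matters: if the input data set is ordered (say, by date), an unshuffled split would train and test on systematically different observations.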
The relationships amongst and between the features themselves, as well as the relationships between
the features and the target variables, are typically well-defined and relatively stable through time. The
goal of TSM is to fit a mathematical relationship to the data using the training set, and then evaluate
the predictive accuracy of the model using the testing set. Linear regression is a commonly used form
of TSM, and the model development process involves estimating the coefficients of the functional form
given below:

y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε

where y is the target, x₁ … xₙ are the features, β₀ … βₙ are the coefficients to be estimated, and ε is
an error term.
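For the single-feature case, those coefficients can be estimated in closed form by ordinary least squares; the data points below are invented:

```python
# Ordinary least squares for a single feature: y ≈ b0 + b1 * x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 6.1, 7.9, 10.2]   # roughly y = 2x, with noise

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form OLS estimates of the slope and intercept
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
      / sum((x - mean_x) ** 2 for x in xs))
b0 = mean_y - b1 * mean_x

print(round(b0, 2), round(b1, 2))   # intercept ≈ 0, slope ≈ 2
```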
Finally, models of any type are used to predict two different types of values – (1) Regression values or
(2) Classification values. Each of these are defined below.
• Regression
• Using historical data to estimate the future numerical value of a target variable.
• It’s important to note that the target is typically a continuous variable, meaning that it can take
on an infinite number of values in a range between the minimum and maximum permitted
values.
• EXAMPLE – John Smith is a data scientist who works for Ford Motor Company. John
specializes in developing mathematical models that predict the fuel economy of a given
automobile given a range of input features (e.g., engine size, gross vehicle weight, tire size,
gasoline quality grade, etc.). John uses historical fuel mileage data that Ford has compiled over
the past 30 years on a wide range of car models to develop the model.
• Classification
• Using historical data to classify the value of a categorical variable (i.e., a variable that can take
on a finite set of enumerated values).
• EXAMPLE – Jane Smith is a data scientist hired by a department store chain to develop a model
that can predict the types of purchases (across a fixed set of goods available in the store –
furniture, clothing, appliances, electronics, food, luxury goods, etc.) that a given customer will
make based on a set of features (e.g., type of job, gender, income, where they live, age, etc.).
The company has historical data on customer purchasing preferences for the past 30 years.
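The difference between the two prediction types can be made concrete with a toy sketch. The data and the 1-nearest-neighbour rule below are illustrative assumptions only – one call predicts a continuous number, the other a category label:

```python
def nearest(train, x):
    """Return the training row whose feature value is closest to x."""
    return min(train, key=lambda row: abs(row[0] - x))

# Regression: the target is continuous (e.g., miles per gallon)
mpg_data = [(1.6, 38.0), (2.0, 32.5), (3.5, 24.0)]   # (engine size, mpg)
print(nearest(mpg_data, 1.7)[1])                     # 38.0 – a numeric prediction

# Classification: the target is categorical (e.g., purchase type)
purchase_data = [(25, "electronics"), (45, "furniture"), (70, "appliances")]  # (age, category)
print(nearest(purchase_data, 40)[1])                 # "furniture" – a label prediction
```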
Machine Learning – Key Points
As a discrete technology, the use of ML has increased dramatically over the past 10 years, and that
trend will likely accelerate over the next 10 years. There are several drivers for
this trend:
• Dramatic increases in the amount and variety of data available, particularly unstructured data.
• Increases in the amount of internet bandwidth and data storage options available, coupled with
more cost-effective pricing for both technologies.
• Finally, ongoing dramatic increases in the computational processing power of both desktop-grade
and server-grade computers.
ML techniques are particularly well suited to solving the following types of problems:
• Identifying classifications and causal relationships in large, complex data sets without specified
target values.
• The target values in a complex data set are known (labeled), but the phenomenon being modeled
has high dimensionality and complexity…there are many heterogeneous input variables that impact
the result, and the relationships amongst and between these input variables are not stable and
change over time.
• When the predictive accuracy of a constructed model is more important than the interpretability or
simplicity of the model structure.
There are several different classes of ML models…these are illustrated in Exhibit 3 below.
EXHIBIT 3 – Classes of ML models

Supervised Learning
• The model is "trained" on an input data set that is complete with values of the target variables of
interest (there may be more than one target variable).
• The term used to describe input data sets used to develop supervised learning models is "labeled"
→ the values of the target are labeled and can immediately be classified.
• The trained model is then used to predict these variable(s) of interest, and these predictions are
compared to the actual values.
• The model developer will specify a target performance metric (one example is MAPE – Mean
Absolute Percentage Error) to gauge the predictive performance of the model.
• This iterative "train and refine" process is repeated until the value of the selected performance
metric is within tolerance.
• Example algorithms: linear regression, logistic regression, decision trees, random forests, support
vector machines, neural networks, gradient boosting machines.

Unsupervised Learning
• The input data set does not have target variables, and there are no rules applied to the input data.
• The goal here is for the model to detect previously unknown patterns and relationships in the input
data set.
• One use case for unsupervised learning is transaction monitoring for firms that trade securities or
derivatives in different organized markets (e.g., investment banks in bonds & equities, oil & gas
firms in natural gas and oil derivatives, hedge funds in structured mortgage securities). The model
developer will look to group traders into homogeneous groups based on specific behavioral
attributes.
• Example algorithms: k-means clustering, hierarchical clustering, spectral clustering, affinity
analysis, dimensionality reduction.
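As a concrete taste of unsupervised learning, here is a minimal one-dimensional k-means sketch; the values and starting centroids are invented, and production implementations (e.g., scikit-learn's KMeans) handle many features and edge cases:

```python
# Minimal 1-D k-means: group values into k=2 clusters with no target labels.
values = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centroids = [0.0, 10.0]                 # illustrative starting guesses

for _ in range(10):                     # a few refinement passes
    # Assignment step: attach each value to its nearest centroid
    clusters = [[], []]
    for v in values:
        idx = min(range(2), key=lambda i: abs(v - centroids[i]))
        clusters[idx].append(v)
    # Update step: move each centroid to the mean of its cluster
    centroids = [sum(c) / len(c) for c in clusters]

print([round(c, 1) for c in centroids])   # [1.0, 8.1]
```

No target values were supplied: the algorithm discovered the two groups from the structure of the data alone, which is exactly the pattern-detection behavior described above.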
Exhibit 4 below illustrates the processing flow for a typical ML model. Whereas a TSM model has the
input features, output targets, and the mathematical relationship pre-specified by the model developer,
the development process for an ML model is a bit different. The model developer will provide the input
data set (potentially without target variables), a set of potential mathematical relationships to test, a
learning algorithm, and initial values for the parameters. Then the exercise is for the model to “train
itself” to find the optimal parameter value settings and mathematical formulation.
As with the model development process for a TSM model, the ML model developer will segment the
input data set, but into three segments rather than two – (1) Training, (2) Validation, and (3) Testing.
Exhibit 5 illustrates this data segmentation process.
EXHIBIT 5 – Data segmentation approach during ML model development
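That three-way segmentation can be sketched as follows; the 60/20/20 proportions are a common convention assumed here, not taken from the exhibit:

```python
import random

def train_val_test_split(rows, val_fraction=0.2, test_fraction=0.2, seed=7):
    """Shuffle, then carve the observations into training, validation,
    and testing segments."""
    rows = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_test = int(n * test_fraction)
    n_val = int(n * val_fraction)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

data = list(range(100))                 # stand-in for 100 observations
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))  # 60 20 20
```

The validation set is used during the iterative "train and refine" loop to tune the model, so the testing set stays untouched until the final accuracy check.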
Finally, any model developer considering the use of ML techniques to model a real-world phenomenon
should consider the following basic questions:
• What phenomenon are we trying to model, and what are the target variable(s) of interest? Of the
target variables of interest, which are categorical, and which are numerical?
• How complete and accurate is the available data set that we’ll use to initially estimate the model?
Do we have enough internally generated data, or do we need to supplement the internal data with
external data?
• Has an appropriate level of exploratory data analysis (EDA) been performed on the input
data set to detect any meaningful relationships among the features in the data set?
• Does the input data set require large amounts of cleaning and transforming to make it useable?
• Has the input data set been split into training, validation, and testing segments to ensure that the
model with the best predictive accuracy is developed?
• Does the model’s structure robustly reflect the complexity and mechanics of the real-world
phenomenon being modeled, or were simplifying assumptions required to make the model
mathematically tractable?
• Are the model predictions accurate and interpretable? Can they be trusted, and by extension, does
the model lend itself to implementation into established business processes?
Terms of Reference

Clustering (Nvidia.com)
• The grouping of objects such that objects in the same cluster are more similar to each other than
they are to objects in another cluster.
• The classification into clusters is done using criteria such as smallest distances, density of data
points, graphs, or various statistical distributions.

Continuous variable (ScienceDirect.com)
• A variable that can take on an unlimited number of values within the {minimum, maximum} range
permitted.
• EXAMPLES – Speed, heart rate, body weight, income, gross domestic product, etc.

Data Engineering (Miscellaneous)
• Data engineering is the complex task of making raw data usable to data scientists and groups
within an organization.
• This topic covers multiple technical aspects of data science, including hardware procurement and
set-up, data source identification and compilation, and data preparation (data cleaning, data
transformation).

Data Model (Wikipedia.org)
• An abstract model that organizes elements of data and standardizes how they relate to one another
and to the properties of real-world entities.
• For instance, a data model may specify that the data element representing a car be composed of
several other elements which, in turn, represent the color and size of the car and define its owner.

Information Governance (Gartner)
• The specification of decision rights and an accountability framework to ensure appropriate
behavior in the valuation, creation, storage, use, archiving and deletion of information.
• It includes the processes, roles and policies, standards and metrics that ensure the effective and
efficient use of information in enabling an organization to achieve its goals.

Model (Office of the Comptroller of the Currency (OCC))
• Refers to a quantitative method, system, or approach that applies statistical, economic, financial,
or mathematical theories, techniques, and assumptions to process input data into quantitative
estimates.
• A model consists of three components: (1) An information input component, which delivers
assumptions and data to the model; (2) A processing component, which transforms inputs into
estimates; and (3) A reporting component, which translates the estimates into useful business
information.

Model overfitting (IBM)
• Overfitting is a concept in data science that occurs when a statistical model fits exactly against its
training data.
• When this happens, the algorithm developed will typically not perform well against any data set
that is outside of its initial training set.
• Generalization of a model to new data is ultimately what allows us to use machine learning
algorithms every day to make predictions and classify data.

Model parameters (Miscellaneous)
• Parameters in the model that are estimated using the training data set and are then used to
generate the model's predictions.
• Also referred to as the weights, coefficients, or fitted parameters.

Model Training (C3.ai)
• Model training is the phase in the data science development lifecycle where practitioners try to fit
the best combination of weights and bias to a machine learning algorithm to minimize a loss
function over the prediction range.
• The purpose of model training is to build the best mathematical representation of the relationship
between data features and a target label (in supervised learning) or among the features
themselves (unsupervised learning). Loss functions are a critical aspect of model training since
they define how to optimize the machine learning algorithms.

Natural Language Processing (Investopedia.com)
• A field of artificial intelligence (AI) that enables computers to analyze and understand human
language, both written and spoken.

Neural Networks (Merriam-Webster)
• Computer architecture in which several processors are interconnected in a manner suggestive of
the connections between neurons in a human brain and which can learn by a process of trial and
error.

Traditional statistical modeling (Miscellaneous)
• An analytical approach that has been in use for many years and one that works well when applied
to data sets of low to medium dimensionality (number of features in the input data set) in which
the relationships amongst and between the features and outputs are relatively stable through time.