
ARTIFICIAL INTELLIGENCE AND

MACHINE LEARNING (AI)


NOTES

1. INTRODUCTION: LINEAR MODELS AND TREE MODELS

Topic Introduction: linear models and tree models


Readings MLP 2.3.3 (46-70), 2.3.5 (72-85), 2.3.6 (85-94)
Activities to be done before next class: Lab1, readings for next lecture

TAKING THIS CLASS

GOING FURTHER WITH MACHINE LEARNING

More Powerful Models


More complex models, a more diverse group of models

 Random forest and other ensemble models


 Multilayer Perceptron (MLP)/Neural Networks
 Pretrained Models (BERT, GPT, . . . )

Going in more depth


 More detail about models
 Tuning: search for the best hyperparameter settings for a model
 Systematic ways to understand the importance of model features
o When building a model for a certain domain, the most interesting thing is often not the solution itself, but the view of the domain that the model has taken. This can be seen by looking at feature importance. The tendency is to treat the model as a solution to a problem and use it as a black box, but it is better to open up the black box and learn from it, e.g. via feature importance.
 Customer churn: attributes on customers. The model is more valuable if it can show the most important features for whether a customer will churn -> look at the coefficient of each feature.

 Evaluating: exploring model metrics systematically.
o Classification: accuracy is most used; also precision and recall.
o The choice is related to the domain and problem
 Medical: FP and FN have very different costs
o Many metrics to evaluate model performance in sklearn

ML Models and Business Value


 Expected Value Framework: what is the right metric for a given business problem?
 Classifier Thresholds: how “careful” should a classifier be, based on the business context?
o For a given context, make it harder or easier to predict one class compared to another by modifying the
threshold.

Pre-trained models: a new opportunity


 Pre-trained Models in Language and Vision create a new opportunity to build very powerful models with a
small amount of training data
 Language:
o BERT (Google)
o GPT-3 (OpenAI)
 Image Classification:
o VGG-16
o ResNet

CLASS PROJECT

 Select/construct dataset
 Pose interesting questions, with societal/business relevance
o Pick a dataset, question etc. that we believe in, driven by our own curiosity
 Produce a research-style paper.
o Detailed comparisons to relevant recent work
o Propose ways to build on / improve recent work
o Present new results and discuss in light of previous work
o Find relevant recent work; the paper should compare to it or build on/improve it, and discuss our project
in light of this work. We are not expected to produce new research, but it is a step in that direction:
having a complete picture of what is happening in the domain and situating our work in this area – how
does it add on? – from a business view.

Final Project
 Interesting Datasets
 Interesting Questions
 Use ML techniques studied in class
 Best practices for training, tuning and evaluating
 Models
 Take advantage of large pre-trained models
 Systematically explore different model evaluation metrics to connect to business value
 Write research-style paper, where you connect with current research

Foundation to master thesis


Business idea

COLLECTING DATA

 The most important thing is not


o the model you use
o or how much training data there is or how you tune it
o ...
 It’s how interesting the dataset is!

Interesting Datasets
 Reddit
 Twitter
 Our World in Data
o Covid-19.

 Combining different datasets, to pose questions like:


o Can Twitter data predict Covid-19 case levels?
 ML could predict Covid better than the human experts.
o How are different travel destinations described on Reddit?
o Does weather affect the level of crime in different areas?
o Does Airbnb data in different areas tell you about social conditions in different cities?
 Combine with crime? Increase rental prices?
 Hidden insights in their collected data
o ...

THE CLASS

Weekly sessions
 Readings
 Lecture
 Activities
o Programming/building models
o Finding and constructing interesting datasets
o Developing project ideas
o Submit results regularly
 Feedback
o Written feedback on submitted results
o Feedback on project ideas and workplan
 Syllabus

Main Reading
 Introduction to Machine Learning with Python
AI, MACHINE LEARNING AND BUSINESS

AI AND MACHINE LEARNING


Machine learning is the power behind AI

AI is the power behind the most important companies in the world


 Big Tech: AI and ML:
o Google
o Facebook
o Amazon
 Intelligent Search
o Google
 Machine Translation
 Question Answering
 Image Recognition
 ChatGPT

Example of ChatGPT answering an exam question from AI and ML.

WHY AI/MACHINE LEARNING AT CBS?


Different ways of thinking of ML and AI

 Research
 Startup

Tech companies are running the world (with AI), for good and bad.

Everyone: understand how AI and ML will affect business, the economy, society and our common future.

This class builds on Big Data Management


SUPERVISED MACHINE LEARNING

Supervised Machine Learning: Supervised ML means there is a label or target value


 Regression: target is numerical
 Classification: target is selected among a finite set of choices

REVIEW: BASIC CONCEPTS

Generalization, Overfitting and Underfitting

 Generalization: how well model performs on data other than the training data
o This is the goal. The training data only summarizes the task; we want to go beyond these specific observations.
 Overfitting: Model is much better on training data than on test data
o Model is too complex, and too closely tied to specific details of training data
 Underfitting: Model doesn’t perform well on training data or test data
o Model is too simple

LINEAR MODELS

LINEAR REGRESSION

Linear Model: a weighted sum of input features. Basic model


Prediction is a line for a single feature, a plane for two features, or a hyperplane for more features

 A line (two dimensions)

 Linear Equation (p + 2 dimensions)

An example for 2 dimensions.

LINEAR CLASSIFIER

A sum of weighted features. Same computation as regression, but the result answers a binary question. The model searches for the best
combination of weights – which features contribute most to separating the classes.
LINEAR MODELS AND REGULARIZATION

 Regularization: make model simpler to reduce overfitting, by pushing coefficients closer to 0


 Support Vector Classifier
 Logistic Regression
 C parameter
o C = 1: default
o C = 100: more complex model – possibility of overfitting
o C = .01: simpler model – possibility of underfitting
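A minimal sketch (my own example, not from the lecture) of how C is set on a scikit-learn linear classifier; the dataset and values are only illustrative:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in [0.01, 1, 100]:  # simpler model ... default ... more complex model
    logreg = LogisticRegression(C=C, max_iter=5000)
    logreg.fit(X_train, y_train)
    print(C, logreg.score(X_train, y_train), logreg.score(X_test, y_test))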

DECISION TREE

 Widely used for classification and regression


 Builds a hierarchy of if/else questions, leading to a decision
 Intuitive

Similar to a linear model, but it takes each feature and makes a simple
assessment: how does it contribute to the target variable?

Decision Tree – Binary and Continuous Tests


Binary Test: is feature i True or False? Continuous Test: is feature i larger than value a?

Linear models are limited in the decision boundaries they can express – trees are more powerful than linear models in this respect.

Two Moons Dataset

Decision Tree – Controlling Complexity


 Build until leaves are pure
 If all leaves are pure, tree will be 100% accurate on training data
 To prevent overfitting:
o Limit the complexity
o limit depth of tree
o limit maximum number of leaf nodes (splits)
o require minimum number of points in a node to keep splitting
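A hedged sketch of how these complexity limits map onto scikit-learn parameters (the values here are just examples):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=4,           # limit depth of tree
    max_leaf_nodes=20,     # limit maximum number of leaf nodes
    min_samples_split=10,  # require a minimum number of points in a node to keep splitting
    random_state=0,
)
tree.fit(X_train, y_train)  # assumes X_train, y_train are already defined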

Decision Tree – Feature Importance


Linear models have coefficients for each feature, positive or negative, showing their importance; trees let you examine the importance of each feature directly.
The tree-building algorithm looks at each feature to see which has the biggest power in terms of increasing purity, and finds the
split that increases purity the most. Once the tree is built, look at each feature and see how much information it gives
about the target – this can provide information about the domain.
 Feature Importance: between 0 and 1 for each feature
 0 means the feature is not used at all, 1 means it perfectly predicts the target
 A weighted measure of impurity reduction

Example: Which features are most important

Feature importance
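A small sketch (own example) of reading feature importances off a fitted tree:

import pandas as pd
# tree is a fitted DecisionTreeClassifier; feature_names lists the column names (both assumed)
importances = pd.Series(tree.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))  # most important features first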

RANDOM FOREST

 Build many decision trees


o Address problem of overfitting of decision trees.
 Each tree differs in random ways.
o RF is extremely powerful and an improvement over single decision trees. It randomly varies the data used in the
different trees.
o Select different data points used to build tree
o Select different features in split tests

Data for Random Forest


 For each tree, create bootstrap sample
 Randomly select n items from the original dataset, allowing repetitions
 Each tree will have same size dataset, but randomly different, because of repetitions

Random Feature Subsets for Random Forest


 Parameter max_features
 Select random subset of features of size max_features
 If max_features is high, more chance of overfitting

Parameters for Random Forest


 Number of trees (n_estimators)
o The more trees, the more powerful
 max_features
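A hedged sketch of these parameters in scikit-learn (the values are illustrative):

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees: the more trees, the more powerful (and slower)
    max_features="sqrt",  # size of the random feature subset tried at each split
    random_state=0,
)
forest.fit(X_train, y_train)  # assumes X_train, y_train are already defined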

NEURAL NETWORKS

The Perceptron
Linear models are a weighted sum of inputs.

A perceptron is a weighted sum of inputs – a linear


model

Multilayer Perceptron/Neural Network


When putting perceptrons together, we get a neural network.

Connecting multiple linear models still gives one big linear model – in principle no more powerful than one.

Activation functions
Not linear functions
Chopping off parts of the input – this nonlinearity is what makes the network more powerful than a single linear model

Neural Network – Equation

Tuning Neural Networks


 Many parameters to adjust:
o Number of hidden layers
o Number of units in each hidden layer
o Regularization
 Scaling of inputs is important
o Not true for all models (decision trees etc.)
o Also important for KNN etc.
 We work with multilayer perceptrons
o The most straightforward kind of neural network
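A hedged sketch of these tuning knobs with scikit-learn's MLPClassifier (values are illustrative; inputs are scaled first because MLPs care about scaling):

from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)      # assumes X_train, y_train exist
X_train_scaled = scaler.transform(X_train)

mlp = MLPClassifier(
    hidden_layer_sizes=(10, 10),  # two hidden layers with 10 units each
    alpha=0.01,                   # regularization strength
    activation="relu",            # or "tanh"
    max_iter=1000,
    random_state=0,
)
mlp.fit(X_train_scaled, y_train)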

THE END OF PROGRAMMING


ML as a New Way to Program
Machine learning vs. rule based

 Traditional Approach: Rule based /programming approach


o Define Problem – ie input-output function
o Inspect data, reason about the problem – come up with rules based on your inspection. Use your intelligence
as a human.
o Construct program that solves problem
o Pros:
 Easier to see what you are doing, interact with data
 Better understanding when you write the program
 Ethics: easier to prevent bias.
 A human also brings knowledge to the process, whereas the machine learns from scratch from the
dataset.
o Cons:
 Human can create biased models.

 ML Approach:
o Define Problem – ie input-output function
o Collect Data
o Automatically build model based on data, that solves problem
o Skip the programming part. The programming is done by the machine learning algorithm: the solution is
produced by the algorithm, not by a human programmer.
o Cons:
 Can be a black box
 Bias. Reproduces bias.
 Starts from zero with the data. This is changing with large language models.

 Few-shot Learning: ML changing – not learning from scratch


o Define Problem – ie input-output function
o Pre-trained model brings understanding
o Collect small amount of data
o Quickly tune model to solve problem
o  Programming is changing

TAKEAWAYS

 Building on basics of ML from Big Data Management


 More powerful models
 Explore different ways to evaluate models – Expected Value, thresholds
 Large pre-trained models for language and vision
 Final project – research style paper
LAB1

Random state – for reproducible results when repeating.

There is also a requirement that the distribution be the same when splitting the data set: the same distribution of target values within both
training and test data (stratification).
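A small sketch (own example) of both points with scikit-learn's train_test_split:

from sklearn.model_selection import train_test_split

# random_state makes the split reproducible; stratify=y keeps the same
# distribution of target values in train and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y)   # assumes X, y are already loaded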

Q1
Dummy classifier
All 10 are equally frequent

3 way split: Train-val-test


Tuning: using the train-val set, that is split into train- and val.
Fit on train-set
Score on val-set

When the best parameters have been found:


Train the model on the entire training set, trainval consisting of both the train and val set.
Then score on the test-set.


2. NEURAL NETWORKS (MLPS) AND UNCERTAINTY ESTIMATES

Topic Neural networks; uncertainty estimates


Readings MLP 2.3.8 (106-120), 2.4 (121-129)
Activities to be done before next class: Lab2, readings for next lecture

1. Lab1
2. Neural Networks: Multilayer Perceptrons – History, Linear Model vs. Deep Learning, Examples
3. Uncertainty Estimates

NEURAL NETWORKS: MULTILAYER PERCEPTRONS

HISTORY

What a perceptron looks like

The perceptron – “the first machine which is capable of having an original idea”, according to Frank Rosenblatt

1. All inputs (similar to x[0]…x[3])

Weights, links with the weights

Feedback

LINEAR MODEL VS. DEEP LEARNING

LINEAR MODEL
A weighted sum of input features – learn the coefficients of the links

 ^y is a weighted sum of input features x[0] to x[p]


 Coefficients w[0] to w[p] are learned in training
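Written out (the standard linear-model form, consistent with the readings):

ŷ = w[0]·x[0] + w[1]·x[1] + … + w[p]·x[p] + b

where the weights w (and the intercept b) are learned in training.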

The Perceptron is a Linear Model


NEURAL NETWORK MODEL

 Perceptron computes weighted sums


 In MLP, process of Perceptron is repeated multiple times
o Repeating the process multiple times
 Hidden units are an intermediate processing step
o Hidden because you can’t observe them directly, the way you can the input or output
 These are in turn combined using weighted sums to produce output

Multiple Layers of Perceptrons


When putting perceptrons together, we get a neural network.

Connecting multiple linear models still gives one big linear model – in principle no more powerful than one.

It is still linear; only when the output passes through a non-linear function does the model become
more powerful.

Activation Functions
Non-linear in some way. When the weighted sum passes through this function, the model becomes more powerful.
Input is passed through an activation function.
A way of ignoring stuff

Not linear functions


Chopping off parts of the input

Each neuron figures out its weights differently. Its activation
function determines which parts of the possible output values it should
ignore.

In image recognition, by using these 'chopping off'
functions, some neurons start to specialize in finding particular
patterns. This allows each neuron to develop an expertise, so you
gradually get a neural network where each neuron is specialized.

Activation Functions: relu and tanh


 relu: cuts off values below zero
 tanh: goes to -1 for low values, +1 for high values

Neural Network – Equation

Where do the activation functions apply? They are applied to the hidden units; ŷ, the output, is then a weighted sum of the hidden units.
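Written out for one hidden layer with tanh activation (the form used in the readings; indices are illustrative):

h[0] = tanh(w[0,0]·x[0] + w[1,0]·x[1] + … + w[p,0]·x[p] + b[0])
h[1] = tanh(w[0,1]·x[0] + w[1,1]·x[1] + … + w[p,1]·x[p] + b[1])
…
ŷ = v[0]·h[0] + v[1]·h[1] + … + b

where w are the weights from input to hidden layer and v are the weights from hidden layer to output.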

MLP with Two Hidden Layers


EXAMPLES

Two Moons Dataset


 Two classes
 Two features

MLP Decision Boundary

Decision boundary for neural network with two features

10 hidden units
Much sharper boundary, not as many distinctions

2 hidden layers, and 10 hidden units, with relu activation function:


2 hidden layers, and 10 hidden units, with tanh activation function

MLP Different Settings


Controlling the size / complexity

Different Initializations
For smaller networks, random initialization can make a difference in the result.
Cancer Data: Scaling


Not scaled data

Scaling in general – neural networks care a lot about scaling!

The test score is markedly higher when data is scaled

If we scale the data: compute the mean value and the standard deviation for every feature, then center
all values around the mean and divide by the standard deviation.
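A small sketch (own example) of that scaling recipe with scikit-learn, fitting the scaler on the training data only:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)                       # learns mean and std per feature from the training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # test data scaled with the training statistics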

UNCERTAINTY ESTIMATES

Knowing what you know

Knowing what we know (or don’t know)


UNCERTAINTY IN CLASSIFICATION

How certain is classifier for a given prediction?

How close is output to decision boundary?


For humans, decisions require reflection.

Machine learning models generate a score and make a decision based on that output.

For any kind of model:

Uncertainty Estimates for Models


They can always be calculated for a model:
 Two methods: decision_function and predict_proba
 Most models have one or both of these

Example: decision_function
 decision_function returns a floating point number for each sample
 The value encodes how strongly the model believes a data point belongs to a class
o Strength of the classification, how strong the model believes in it

Examples of values we might get

Floating point numbers, unbounded in their values

Decision function values into classification

Example: predict_proba
 predict_proba outputs a probability for each class
 For binary classification, the shape is (n_samples, 2)

As they are probabilities, the values sum up to 1

How do we recover the classification from this?

 The class with probability above .50 is the one predicted – if binary
o For multiclass: the class with the highest probability is predicted
 A calibrated model is one where probabilities align with accuracy – predictions with probability .70 are correct
70% of the time.
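A small sketch (own example) of reading both kinds of uncertainty estimate from a fitted binary classifier clf (assumed to have both methods):

scores = clf.decision_function(X_test)   # unbounded floats; sign/threshold gives the class
proba = clf.predict_proba(X_test)        # shape (n_samples, 2); each row sums to 1
pred = proba[:, 1] > 0.5                 # default binary decision recovered from the probabilities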
MULTICLASS AND UNCERTAINTY
 decision_function and predict_proba have shape (n_samples, n_classes)
 High score means class is more likely and low score means class is less likely

Multiclass: Iris Example


Gradient boosting classifier on the Iris data.

Apply decision_function to the test data and look at the first entries; the largest value in each row indicates the predicted class.

Recover predictions from the decision_function scores with the argmax function, which finds the highest score per row.

Uncertainty estimates: following the same steps with predict_proba gives the same predictions as calling predict directly. Gradient boosting has both methods.

Why use the uncertainty estimates rather than just predict? Most models have only one of the two methods. A big problem with ML is that, no matter what, the model will predict
something. The estimates give us a tool: if the score is under a certain threshold, we might not want to predict anything at all.
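A small sketch (own example) of recovering multiclass predictions from decision_function scores with argmax, for a fitted GradientBoostingClassifier gbrt on the iris test set:

import numpy as np

scores = gbrt.decision_function(X_test)              # shape (n_samples, n_classes)
pred = np.argmax(scores, axis=1)                     # index of the largest score per row
print((gbrt.classes_[pred] == gbrt.predict(X_test)).all())  # True: same predictions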

PREDICTIONS WITH THRESHOLDS

 Can recover predictions from uncertainty estimates using different thresholds
 This can reflect the different costs and benefits of different errors
o In some domains, some errors are a lot worse than others (e.g. churn prediction)
 Lab2: cancer classification of benign vs. malignant
o Maximize recall or precision:
 Recall: minimize FN; Recall = TP / (TP + FN)
 Precision: minimize FP; Precision = TP / (TP + FP)
 By modifying the threshold, we can selectively improve precision or recall
 Relevant for social applications

TAKEAWAYS

 Perceptron is a linear model


 MLP generalizes this by linking perceptrons with nonlinear activation functions
 Deep neural network has multiple layers
 Many parameters to tune
 Uncertainty: models can output one or both of decision_function and predict_proba
 Relates to classification predictions by reference to threshold
 We can modify threshold to alter classifications in interesting ways

LAB2

Wisconsin Breast Cancer dataset – supervised ML

 What’s the point


o Inductive statistical modeling via ML
o Modeling in a highly-sensitive domain

 AI in mammography – it’s already applied in industry


3. UNSUPERVISED MACHINE LEARNING

Topic Unsupervised machine learning – representing data and feature engineering


Readings MLP 3.1-3.4, Ch. 4 (133-169)
Activities to be done before next class: Lab3, readings for next lecture
Agenda:
1. Unsupervised Machine Learning
o Preprocessing and Scaling
o Dimensionality Reduction
o Clustering
2. Representing Data and Engineering Features
o Categorical Variables and Dummy Values
o Automatic Feature Selection
3. Model Improvement
o Cross Validation
o Grid Search

UNSUPERVISED MACHINE LEARNING

PREPROCESSING AND SCALING

Supervised vs. Unsupervised Machine Learning


Supervised ML: there is a target value (also called a “label”)
Unsupervised ML: there is no target value

SCALING

Can be a problem if different features have very different ranges


For example, house price ranges from 50,000 to 5,000,000, while number of bathrooms ranges from 1 to 4.
Important for SVMs and Neural Networks
Can also be a problem in a supervised setting if you are trying to calculate the relative importance of features.

For example, when predicting something about houses: price goes up to 5 million, while the number of bathrooms only ranges from 1 to 4. House price might then dominate the output simply because of its range, and the number of bathrooms might look unimportant. Scaling gives a fair starting point: all features are treated as equally important by putting them on the same scale.

Scaling matters for support vector machines and neural networks, but not for tree models.

Can be done manually.

 Several different scalers in scikit-learn


 MinMaxScaler ensures that all features are between 0 and 1
 Scaling usually applied before doing supervised ML
 Scale training and test data the same way

Scale Test data same as training data

Scalers have the same syntax as models: instantiate one, fit the scaler to the training data, then transform. Once a scaler is fit to the training data,
it has examined all of it and found the min and max value of each feature; it then knows all the relative values and can
transform the data.
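A small sketch (own example) of that syntax with MinMaxScaler:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)                         # learns min and max of each feature from the training data
X_train_scaled = scaler.transform(X_train)  # training features now lie between 0 and 1
X_test_scaled = scaler.transform(X_test)    # test data uses the same min/max, so it may fall outside [0, 1]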

Scaling in Python
Scaled between 0 and 1

Relative differences are the same after scaling, which is the basic idea.

The scaler is fit to the training data only, not to the test data. We transform the training data so it lies between 0 and 1; the test data is transformed with the same scaler, so it is not necessarily between 0 and 1.

The test data does not have the same distribution as the training data – we don’t know what it will contain.

We retain the same scale learned from the training data when we apply it to the test data.

E.g. looking at houses: the max number of bathrooms in the training data is 4, which becomes 1 after scaling. The test data might have a house with 5 bathrooms. We want to treat it as bigger than 1 – better than the best house in the training data – so we want to retain that.

There are different types of scalers that do the same kind of thing.

DIMENSIONALITY REDUCTION

PCA
 Principal Component Analysis (PCA) – a popular form of dimensionality reduction. An unsupervised process of trying
to find the most interesting ways that the data varies.
o Some features might correlate with each other.
 Rotates the dataset to create features that are uncorrelated
 The first component contains the most information, i.e. accounts for most of the variance
 Can select a subset of the most informative principal components
o New versions of the features, ordered by how much information they provide.
o Converts the data so the first feature is the most informative, ordered by informativeness
 Visualizing high-dimensional datasets
o Useful for visualising – pick a couple of the most informative components

PCA and Synthetic Data

Start with the original dataset.

We can then transform the data and only look at the first or second
component.

Cancer Histogram
 Overlay two histograms for benign and malignant
 Texture error looks quite uninformative
 Mean Concave points looks much more informative
o How they vary in respect to the target
How PCA is applied to the cancer dataset: take each feature.

Features where the two histograms completely overlap do not help predict the
target; "mean concave points" is well separated, so it is better at predicting.

This gives an idea about how different features relate to the target value.

PCA itself doesn’t know what the target value is, and doesn’t care, but we can see how
much the features overlap.

PCA is used in the same way as we use models.

Here we select 2 components. Fit to the scaled data; then the data is transformed into X_pca, whose shape now has 2 features.

Convenient to visualize, as we now only have 2 features – gives a better sense of the dataset. PCA is good at finding 2 important directions.

A classifier might also do better when the features are reduced: it helps avoid overfitting. A subset of the components generalizes the data instead of fixating on specific details in it, so we can get better generalization. A good idea to try.
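A small sketch (own example) of that workflow on already-scaled data:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)        # keep the first 2 principal components
pca.fit(X_train_scaled)          # assumes the data has already been scaled
X_pca = pca.transform(X_train_scaled)
print(X_pca.shape)               # (n_samples, 2) – convenient to plot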
PCA AND IMAGES
Used for face recognition

Baseline – Without PCA


 Labeled Faces in the Wild
 Faces of celebrities from the early 2000s; 3,023 images
 87x65 pixels
 62 different people

PCA a natural thing to do with this kind of task


Low level input data; abstract features of the images.

Baseline – take the data as it is. Applying a KNN.

Look at every single pixel.

 Classification Results
o Accuracy of only 0.23
 Serves as a baseline; compared to random guessing over that many classes, it is actually relatively good.
o But it is a 62-way classification problem
 So it is learning something
 Gives us a starting point.

Classify using PCA


 Use PCA to construct first 100 Principal Components
 Use these features in KNN classification
When applying the same KNN model after PCA, accuracy improves from 23% to 31%.

The features say more about the faces even though we only use 100 of them.

First Component: contrast between face and background


Second Component: differences in lighting between left and
right side
...

Taking the individual components, reshaping them into image-shaped arrays and visualizing them: each component
captures a different aspect of the picture, some contrasting the left and right side.

 Classification Results
o By using 100 principal components instead of pixel features:
o Accuracy improves from 0.23 to 0.31
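A hedged sketch of that pipeline (parameter values follow the description above; whitening is an optional extra step):

from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

pca = PCA(n_components=100, whiten=True, random_state=0).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train_pca, y_train)
print(knn.score(X_test_pca, y_test))   # around 0.31 in the example above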

NMF
Other ways of reducing dimensionality
 Non-negative Matrix Factorization
 Like PCA, can extract useful features
 Can be used for dimensionality reduction
 Each feature must be non-negative

NMF and faces


Quality of back-transformed images similar to PCA but not as good
NMF can find interesting patterns

 Component 3 shows face rotated to the right


 Component 7 shows face rotated to the left

Our ability as humans to recognize faces.

Faces with the highest values for Component 3 are all from a similar angle – illustrating that these components focus on
some specific orientation.

Faces with the highest values for Component 7:


We do get benefits from dimensionality reduction: more informative features than the original pixel values. But what we really
want is a learning approach that finds such features in a targeted way.

This is why neural networks are good for this: they get feedback in the learning process, modifying the features through different versions, and the target helps them find the right types of abstractions. That is how they get really good at image recognition – they extract the right kinds of higher-level features from the picture.

CLUSTERING

 Partition data into clusters


o Make up classes as you go along – view of similarity. Come up with clusters in which similarity is
maximized within a cluster and minimized across clusters.
 Data items within a cluster should be similar, and items in different clusters should be different
 Clustering algorithm assigns a number to each data item
 Similar to classifier – but there is no ground truth

K-MEANS CLUSTERING
 Finds cluster centers that represent specific regions of the data
o For a certain number of clusters, find centers that maximize similarity within each cluster.
 Alternates between two steps:
o Assign each data point to the closest cluster center
o Recompute each center as the mean of the points assigned to it
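A small sketch (own example) of k-means in scikit-learn:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X)                      # unsupervised: no target value
print(kmeans.labels_[:10])         # cluster number assigned to each data item
print(kmeans.cluster_centers_)     # the learned cluster centers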
TAKEAWAYS: UNSUPERVISED ML

 Unsupervised ML often preparation for Supervised ML


o Scaling
o Dimensionality Reduction
o Prep – prelude for supervised machine learning.
 Clustering
o Truly unsupervised approach
REPRESENTING DATA AND ENGINEERING FEATURES

CATEGORICAL VARIABLES AND DUMMY VARIABLES


 So far: we’ve assumed that our data consists of floating-point numbers – continuous feature
 Also want categorical features
 Similar to distinction between classification and regression
 Continuous feature: size measurement of flower; income of individual
 Categorical feature: color of flower; gender of individual

Income Dataset
 Creating dummy values

Categorical Variables

 Modeling only makes sense if the features are numerical

Workclass Feature
 Workclass is categorical feature
 Has four possible values:
o Government Employee
o Private Employee
o Self-Employed
o Self-Employed Incorporated
 Create four new features

Dummy Variables
 Also called One-hot-encoding, or one-out-of-N encoding
 If a feature F has three values, a,b, and c
o Create three new features, Fa, Fb and Fc
o If sample i has value a for feature F, then Fa_i = 1, Fb_i = 0, Fc_i = 0

Dummy Variables with Pandas


The get_dummies method does it automatically: it ignores numerical features and applies dummy encoding to the categorical columns.

Check Values
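A small sketch (own example) of dummy encoding with pandas, assuming a DataFrame called data with a categorical workclass column as in the income example:

import pandas as pd

print(data['workclass'].value_counts())   # check the categorical values first
data_dummies = pd.get_dummies(data)       # numerical columns are left unchanged
print(list(data_dummies.columns))         # one new 0/1 column per category value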

One might think this is problematic: it creates a dataset with higher dimensionality, which can be challenging for the model.

Alternative: label encoding.

But this rules out something that the model might otherwise learn – e.g. with movie ratings.

AUTOMATIC FEATURE SELECTION

Three Feature Selection Methods


 Univariate: look for a statistically significant relationship between each feature and the target
o Looking at features one by one, and how they relate to the target
 Model-Based: uses a supervised model to judge the importance of each feature
o Asks the model what it thinks are the most important features
 Iterative: build a series of models, and try adding/subtracting features

More effective modeling, and more information about the domain.

UNIVARIATE FEATURE SELECTION


 Test each feature for how informative it is about the target (can be for classification or regression)
o Statistical correlation between features and target values.
 Threshold: discard features based on p-value (likelihood that feature is correlated with target)
 SelectKBest: selects best k features
 SelectPercentile: selects percentage of best features
Fit the selector to the training data. Here the selector uses the 50th percentile, i.e. keeps the top 50% of the
features – a statistical selection.

We get a better result with the logistic regression model on our test data using the reduced set of features.
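A small sketch (own example) of univariate selection with SelectPercentile:

from sklearn.feature_selection import SelectPercentile

select = SelectPercentile(percentile=50)   # keep the top 50% of features
select.fit(X_train, y_train)               # fit on training data only
X_train_selected = select.transform(X_train)
X_test_selected = select.transform(X_test)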

MODEL-BASED FEATURE SELECTION

 Uses supervised ML model to judge importance of each feature


 Tree Models compute feature importance
 Linear models have coefficients
 Unlike Univariate, Model-Based Feature Selection can capture interactions between features
o Understand more about importance of features.

Use a random forest classifier to select the features.

Then use logistic regression to actually model the data – you don’t have to do feature
selection with the same model you use for prediction; you can use a different model.
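A small sketch (own example) of model-based selection with SelectFromModel, using a random forest to select and logistic regression to predict:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

select = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0),
                         threshold="median")    # keep features above the median importance
select.fit(X_train, y_train)
X_train_sel = select.transform(X_train)
X_test_sel = select.transform(X_test)

lr = LogisticRegression(max_iter=5000).fit(X_train_sel, y_train)  # a different model does the prediction
print(lr.score(X_test_sel, y_test))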

ITERATIVE FEATURE SELECTION METHODS

Two General Methods:


1. Start with no features, add them one by one
2. Start with all features, remove them one by one – Recursive Feature Elimination (RFE)

It is a bad idea to look at the data and just drop certain features because they look unimportant. It is up to the
ML model to figure that out – use systematic ML methods; don’t assume ahead of time!
Select Features
Iterative selection is done using RFE.

Transform and Score

Then fit the model to the selected features.

Score with Model Inside RFE
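A small sketch (own example) of recursive feature elimination; the number of features to keep is illustrative:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

select = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
             n_features_to_select=20)    # remove features one by one until 20 remain
select.fit(X_train, y_train)
X_train_rfe = select.transform(X_train)
X_test_rfe = select.transform(X_test)
print(select.score(X_test, y_test))      # score with the model inside RFE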

TAKEAWAYS: REPRESENTING DATA


 Convert categorial data to dummy (0/1) values
 Several methods for feature selection Data

MODEL IMPROVEMENT

CROSS VALIDATION

Fitting Model to Data


 The point is to find models that generalize to data beyond the training data
 That’s why we split data into training and test data: the test data score gives a better assessment of the model than the training data score
 Cross Validation: do multiple splits between training and test data to get a better assessment of the model
 Grid Search: search for the best parameter values, to get a better model

Cross Validation
 Instead of one train-test split, multiple splits
 For example with Five-fold CV, pick one fifth of data as test, and the other four fifths as training data
o Each fifth used sequentially as test data
 Gives a better basis for assessing model – with one split, might be “lucky” or “unlucky” with test data

Five-fold Cross Validation


Cross Validation in SciKit-Learn
Gives scores of all the splits

Cross_val_score performs
 split of train and test data
 fits model to train data for each of the splits
 scores model on test data

CV doesn’t improve the model; it is a way to assess the model.
The more splits, the more accurate a view of the model’s performance.
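A small sketch (own example) of five-fold cross-validation:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)  # assumes X, y are loaded
print(scores)          # one test score per split
print(scores.mean())   # summary of model performance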

GRID SEARCH

Assessing vs Improving Models


 Cross validation is simply a method to assess a model – does nothing to improve the model
 Grid search is a way to improve your model
 Combines well with cross validation

Example: Tuning an SVM


 C controls regularization – higher values mean less regularization, just as with logistic regression
 Gamma controls complexity of model in a different way
o A low value of gamma means that the decision boundary will vary slowly, which yields a model of
low complexity, while a high value of gamma yields a more complex model. (p 102 in text)
 Want to try these values for both gamma and C: .001, .01, 0.1, 1, 10, 100 (regularization/complexity)
o Nested for loop to try each of the 36 combinations

Wrong: tuning on the test data! That is like cheating.

Alternatives:
1. Bayesian Search
2. Random Search

Why We Need a Validation Set


 It’s wrong to tune a model using scores from test set
 This won’t give valid indication of how model generalizes
 Need to define a separate Validation Set which is used to tune model
 Scores on test data are only given once tuning is finished, and best model is selected
 Used to check hyperparameters – applying the train-test split twice.
A Three-Way Split of Data

First split:
Train-val contains both training and validation set

Second split:
Splitting the train-val set into train and val.
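A hedged sketch of tuning on a validation set with a simple nested loop (SVC and the value grid follow the example above; the exact variable names are mine):

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)            # first split
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, random_state=0)  # second split

best_score, best_parameters = 0, {}
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C).fit(X_train, y_train)
        score = svm.score(X_val, y_val)            # score on validation, never on test
        if score > best_score:
            best_score, best_parameters = score, {'C': C, 'gamma': gamma}

svm = SVC(**best_parameters).fit(X_trainval, y_trainval)   # retrain on train + val
print(svm.score(X_test, y_test))                           # test score only reported once, at the end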

Note on kwargs
 best_parameters = {’C’: C, ’gamma’ : gamma}
 Define best_parameters as a dict
 svm = SVC(**best_parameters)
 The ** operator unpacks a dict into keyword arguments for a function

GRID SEARCH WITH CROSS-VALIDATION


 Model results can be very sensitive to how data is split
 Rather than use grid search with one split, can use it together with cross-validation for a better tuning process

The more folds, the smaller each validation set will be, but the more valid a picture you get. 3-5 folds would usually be
sufficient.

 Use cross_val_score instead of score


 Take the mean of the scores returned by cross_val_score
 Can use the GridSearchCV class to implement this; it finally retrains on both the train and validation data, so the best
model is trained on more data – important to know

Summary - tuning
Tuning on Validation Set

Tuning on Validation Set using Cross-Validation
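A small sketch (own example) of grid search with cross-validation in scikit-learn:

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
grid = GridSearchCV(SVC(), param_grid, cv=5)   # cross-validated tuning on the training data
grid.fit(X_train, y_train)                     # refits the best model on all training data at the end
print(grid.best_params_, grid.best_score_)
print(grid.score(X_test, y_test))              # final assessment on held-out test data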

TAKEAWAYS: MODEL IMPROVEMENT


 Split train and test data to assess a model
 Cross validation is a better way to assess a model
 Grid Search improves model by finding best hyperparameter values
 Need a three-way split of data: train, validation and test
 Can combine Grid Search and Cross Validation
LAB3

The method doesn’t care what the data is – whether it’s pictures or something else.

Important - method
- Tuning using training data, cv on part of data
- After finding our best classifier, we retrain on all the data – both val and train!
- Then create a classification report

Grid search is an exhaustive way to run through all combinations.

Alternatives:
- Random search – shown to perform about equally well
- Bayesian search
4. MODEL EVALUATION AND IMPROVEMENT

Topic Model Evaluation and Improvement


Readings MLP Ch. 5 (257-310)

Activities to be done before next class: Lab4, readings for next lecture

METRICS FOR MODEL EVALUATION

BASIC METRICS: BINARY CLASSIFICATION

Classification: Accuracy
 Default metric
 Accuracy = (correctly classified samples) / (total number of samples)

Regression: R²
 Coefficient of Determination
 R² = Explained Variation / Total Variation

BUSINESS IMPACT

What is the goal? What is the business impact of using the model?

Many different metrics for models

BASIC CLASSIFIER METRICS


Binary Classification

 Start with Binary Classification


 Can call two choices Positive and Negative
 This is an arbitrary choice
 By convention, Positive might be the choice we are most interested in

Positive: Yes – has disease


Negative: No – doesn’t have disease

Confusion Matrix

Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision
Being precise about the POSITIVE value: true positives divided by all the times the model says the sample has the disease.
Precision = TP / (TP + FP)
We can cheat here too, though not in the same way as with recall: a high-precision strategy would be to say "positive" less often, only choosing the cases with the highest confidence.

Recall
True positives divided by all the times the sample actually does have the disease.
Recall = TP / (TP + FN)
Easy way of cheating: say everyone has the disease – perfect recall, but precision might be bad.

F score
Because we can "cheat" with both recall and precision, we need the F score to balance things out:
F1 = 2 · (precision · recall) / (precision + recall)
It gives a value between 0 and 1. Evaluating a model with only recall or only precision is not good, as it might give a
skewed picture.

Digit Classification
Convert digits into unbalanced binary classification task

Target value y is true if digits.target is 9, false otherwise


Unbalanced: 10% of target values are true, 90% false

DummyClassifier
Confusion matrices

Dummy Classifier: Most frequent

Classification Report: Logreg


Logistic Regression Classifier

UNCERTAINTY IN PREDICTIONS

CLASSIFICATION AND THRESHOLDS


Linear Classifier


 Classifiers compute a value that is compared against threshold – for linear classifiers, default is 0

Lower Default Threshold


 Another tool to push the model in a direction that relates to our goal in building the model,
instead of picking the standard setup
 Problem with unbalanced data: with fewer examples of one class, the model tends not to pick that class

Default Threshold (0)

Lower Threshold (-.8)
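A small sketch (own example) of applying these two thresholds to a fitted classifier svc:

y_pred_default = svc.decision_function(X_test) > 0     # default threshold
y_pred_lower = svc.decision_function(X_test) > -0.8    # lower threshold: more points predicted positive,
                                                        # which raises recall at the cost of precision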


PRECISION-RECALL CURVE

 Tradeoff between precision and recall


 Can explore this with different thresholds
 Precision_recall_curve gives precision and recall values for different threshold values

If we lower the threshold enough, we get perfect recall (but poor precision).

Plot the curve for different thresholds – a way of exploring the tradeoff.
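A small sketch (own example) of computing the curve for a fitted classifier svc:

from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

precision, recall, thresholds = precision_recall_curve(
    y_test, svc.decision_function(X_test))   # one (precision, recall) pair per threshold
plt.plot(precision, recall)
plt.xlabel("precision"); plt.ylabel("recall")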

ROC CURVE

 Plots True Positive Rate (recall) against False Positive Rate


 Best values are higher (more true positives) and to the left (fewer false positives)

False Positive Rate = FP / (FP + TN)

True Positive Rate (Recall) = TP / (TP + FN)
o True positives against all actual positives

ROC Curve
Takes different thresholds and plots the two rates against each other.
Indicates where we might get the best results.

AUC

 AUC is Area Under (ROC) Curve


 Single number to summarize ROC curve
 Ranges from 0 to 1
 Random guessing always gives 0.5, even with unbalanced datasets
 Can be more revealing with unbalanced data
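A small sketch (own example) of the ROC curve and AUC for a fitted classifier svc:

from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_test, svc.decision_function(X_test))
print(roc_auc_score(y_test, svc.decision_function(X_test)))   # single-number summary of the curve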

METRICS FOR MULTICLASS CLASSIFICATION


Binary: in respect to the target class. For multiclass we look with respect to any class.

Confusion matrix

For one class: Recall: 37 / 37 = 1.00 – perfect; Precision: 37 / 37 = 1.00.

For another class: Precision: 43 / 46 = .93 – looking down the predicted column, not all predictions are correct.
Recall: 43 / 48 = .90.

Classification report
For every class we get the metrics – a way of putting it all together.

USING METRICS IN MODEL SELECTION


Using different metrics

 We select models by tuning hyperparameters


 Use GridSearchCV, cross_val_score
 By default, accuracy is optimized
 Can change this to other metrics, such as average precision

Example – using accuracy, which is the default.

We can change the scoring metric that the CV optimizes with respect to.

All the possible scorers that can be used in CV are listed in scikit-learn.

Ideally we optimize the model for the value/metric that is most valuable in our case, not necessarily accuracy.

Challenge: what is the most valuable and relevant metric in our case?

Aligning metrics to our case.
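A small sketch (own example) of changing the metric that is optimized:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.01, 0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="average_precision")  # instead of default accuracy
grid.fit(X_train, y_train)
# sklearn.metrics.get_scorer_names() lists all the available scorers (in recent scikit-learn versions)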

TAKEAWAYS
 Metrics for evaluation: there are lots of ways to evaluate models
 We should select a metric that corresponds to the goals and business impact of building the model
 Can explore thresholds for classifier models
 We select models by tuning hyperparameters – can do this wrt. the metric best suited to your case

Project
 Readings: find a relevant research paper
 Workplan: submit description in week 10
 Use techniques discussed in class:
o Diverse metrics
o Thresholds
o Pipelines
o Grid Search CV/Tuning

LAB4

Like sentiment analysis – look for positive words to predict a positive label, or vice versa for negative.

NB: large dataset; start with a small sample, e.g. 1,000 or 10,000 instances. Pandas has a great sample method that returns randomly selected rows.

Facebook posts labeled with certain emotions. Value counts on the emotion column show how many instances each emotion has – the emotion classification task is quite unbalanced.

A preprocessing function converts the text into a bag of features (words) – one feature for each word. There are many options for this.

Train test split.

CountVectorizer turns the text into features: use the bag-of-words algorithm to turn words into features when preprocessing the data. X_train_vec is the result of the CountVectorizer.

Build the model. Different options, e.g. for ngram length. Creating a LR model gives a score of 94% on train and 54% on test – a 5-way classification.

What are the right kinds of features for the data? The size of the ngrams (different lengths) matters; train and test results are rather different. Increasing the ngram size increases the number of features – whether a high number of ngrams is good varies with the size of the data.

Use a dummy classifier as a random-guessing baseline.

TFIDF
Use a different scoring: instead of how many times a word occurs in the text, use TFIDF. Apply the tfidf algorithm to the output of the vectorizer.

Create a pipeline and do grid search to explore arbitrary aspects. Making a pipeline matters because preprocessing is important for language processing and bag-of-words models. Look at characters instead of words (individual letters); the default analyzer is 'word'. Use the pipeline with GridSearchCV: specify different options for the model choices and hyperparameters, then do the grid search.

Two types were tried; one gets better scores using characters.

Classification report.
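A hedged sketch of that kind of pipeline (the column names and parameter grid are illustrative, not the lab's exact code):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("vect", CountVectorizer()),      # bag of words
    ("tfidf", TfidfTransformer()),    # rescale counts with TFIDF
    ("clf", LogisticRegression(max_iter=5000)),
])
param_grid = {
    "vect__analyzer": ["word", "char"],            # words vs. individual characters
    "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "clf__C": [0.01, 0.1, 1, 10],
}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)                         # X_train: raw texts, y_train: emotion labels
print(grid.best_params_, grid.score(X_test, y_test))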
5. ALGORITHM CHAINS AND PIPELINES

Topic Algorithm Chains and Pipelines


Readings MLP Ch. 6 (311-328)
Activities to be done before next class: Lab5, readings for next lecture

 Lab4
 2 NLP: Some Background
o The Revolution in NLP
o Language is Hard
 Some NLP Basics
o Bag of Words
o Additional Topics
o Movie Reviews and Sentiment Analysis
 Naive Bayes and Sentiment Classification
 Logistic Regression
 Lab 5

NLP: BACKGROUND

THE REVOLUTION IN NLP

The current revolution in NLP


Human language ability has often been seen as the defining feature of human beings, compared to e.g. animals.
 ChatGPT

ChatGPT: What it can do


 A model by OpenAI which interacts in a conversational way
 Answers follow-up questions
 Challenges incorrect premises and rejects inappropriate requests

Based on GPT-3
 a large language model
 trained on missing-word prediction with a transformer model – a certain type of neural network that is good at learning in this way
 Further training through Reinforcement Learning from Human Feedback

Growth of ChatGPT
Took 5 days to get 1 mil users

LANGUAGE IS HARD
Descartes and AI
 Could a machine imitate a human?
o No – you would always be able to tell the difference
 Rene Descartes: Discourse on the Method, Part V (1637)

Descartes: Machines can’t Imitate Humans


 “. . . they could never use speech or other signs as we do when placing our thoughts on record for the benefit of
others.”
 “. . . we can easily understand a machine’s being constituted so that it can utter words, and even emit some
responses to action on it . . . But it never happens that it arranges its speech in various ways, in order to reply
appropriately to everything that may be said in its presence, as even the lowest type of man can do.
o Like Siri: there can be fixed responses planned out, but only a limited repertoire, not the diversity humans have
 Descartes saw this right back then.

The Turing Test


 (1950) Computing Machinery and Intelligence
 Test of a machine’s ability to exhibit intelligent behaviour
 Human judge engages in a natural language conversation with a human and a machine
 If the judge cannot reliably tell the machine from the human, the machine passes the test
 The original question, "Can machines think?" I believe to be too meaningless to deserve discussion.
Nevertheless, I believe that at the end of the century the use of words and general educated opinion will have
altered so much that one will be able to speak of machines thinking without expecting to be contradicted.

A way to decide whether a machine can think like a human.


Show diversity in behavior like humans; same argument as Descartes.

Why is Language Hard?


 2 challenges: infinite and ambiguous
 Language is infinite – an infinite set of sentences
 Most sentences you hear, you have never heard before and will never hear again

Language is ambiguous
 Many words have multiple meanings
o Lexical Ambiguity
 Phrases and sentence can have multiple meanings
o Structural Ambiguity
 A single sentence can have many different meanings
 Need Context to resolve ambiguities
NLP BASICS

BAG OF WORDS

Types of Data
 Numerical Data
 Categorical Data
 Text data is different
o Content of an email
o A Headline
o Text of political speeches
Can text be treated as structured data?

Making Language into Structured Data


 The solution: Dummy values for words
o One feature for each word
o Value is 1 if word occurs in text, 0 otherwise
o Alternative values: number of word occurrences, or TFIDF score
o For any text, value of most features will be 0
 Gives a lot of information about what the text is about.

Bag of Words Processing


 Treat text as a Bag of Words
Basic steps

Start with a string.
Divide it into words (features) – the tokenizer splits on white space and similar boundaries.
Order the words into a vocabulary.

Unigrams
Bigrams
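A small sketch (own example) of these steps with scikit-learn's CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was great", "the movie was boring"]
vect = CountVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
X = vect.fit_transform(docs)                 # sparse matrix: one column per vocabulary entry
print(vect.get_feature_names_out())          # the ordered vocabulary
print(X.toarray())                           # counts per document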

ADDITIONAL TOPICS
Stopwords
 Words that are “too frequent to be informative”
 Built-in Stopwords List: above, elsewhere, into, well, fifteen, . . .
 Could also discard words that appear too frequently
 Common stopwords and less common

TF-IDF
 Words that are frequent in a document tell a lot about that document
 Words that appear in lots of documents are less interesting
 TF-IDF
o increases as term frequency increases, and decreases as document frequency increases
o term frequency and inverse document frequency – the latter based on how frequently the word occurs across all documents
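One common smoothed form, roughly what scikit-learn's TFIDF transformers use (my notation; N is the number of documents, N_w the number of documents containing word w):

tfidf(w, d) = tf(w, d) · ( log( (N + 1) / (N_w + 1) ) + 1 )

so a word that is frequent in a document but rare across documents gets a high score; the resulting vectors are then typically length-normalized.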

MOVIE REVIEWS AND SENTIMENT ANALYSIS


Rise of user generated data in web 2.0
Sentiment analysis became popular due to the large amount of review data online

Exploit Big Data for Text


 Use Supervised ML for text processing
 Can we get labeled text data?
 Build Classifiers
o Spam Detection
o Sentiment Analysis
o Topic Detection
o ...

Sentiment Analysis
 Is a text Positive or Negative?
 Used for Social Media Analysis
 Marketing
 Impact of new product

Movie Reviews as Data


 Online movie reviews
 Texts paired with ratings
 IMDB reviews
o Positive: Rating of 7-10
o Negative: Rating of 1-4

Bag of Words with More Than One Word


 nGrams
Look at the coefficients learned on top of the tfidf-scored features.
They are domain specific: 'boring' is bad for movies; 'predictable' might be bad in the movie domain but
good in others, maybe weather.

Language Modeling and Ngrams


 Assign probability to a sequence of words
 What is p(He went to the store)?

Language Modeling and Ngrams


He went to the store
 1-grams (unigrams): He, went, to, the, store (5)
 2-grams (bigrams): He went, went to, to the, the store (4)
 3-grams (trigrams): He went to, went to the, to the store (3)
 4-grams: He went to the, went to the store (2)
 5-grams: He went to the store (1)

Language Modeling and Ngrams


 Bigram approximation:
o p(He went to the store) =
o p(went|He) * p(to|went) *p(the|to) * p(store|the)
 Trigram approximation:
o p(He went to the store) =
o p(to|He went) *p(the|went to) * p(store|to the)

NAÏVE BAYES AND SENTIMENT CLASSIFICATION

TEXT CLASSIFICATION
 Sentiment analysis Spam detection
 Language identification

Explain the difference between generative and discriminative classifiers:

GENERATIVE VS. DISCRIMINATIVE CLASSIFIERS


 Naive Bayes: Generative
o Models how a class could generate data
o A certain class; what would be the features
 Logistic Regression: Discriminative
o Models which features are useful to discriminate
o What features do we need to discriminate?

BAG OF WORDS
Document Classification and Bayes’ Rule

 Pick most likely class c, given document d

 Bayes’ Rule:

Classification Using Bayes’ Rule


Features are Independent
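Spelled out (the standard naive Bayes formulation; notation mine):

ĉ = argmax_c P(c | d) = argmax_c P(d | c) · P(c)

and with the independence ("naive") assumption over the words w_1 … w_n of the document:

P(d | c) ≈ P(w_1 | c) · P(w_2 | c) · … · P(w_n | c)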

Prepend prefix NOT to every word following negation until the next punctuation mark

LOGISTIC REGRESSION

DISCIMINATIVE VS. GENERATIVE CLASSIFIER


Classifier: Cat or Dog?
 Generative Classifier (like NB) looks at all features to understand what dogs look like and what cats look like
 Discriminative Classifier (like logreg) would be satisfied with a single feature: “dogs have collars”

NAIVE BAYES VS. LOGISTIC REGRESSION


 Naive Bayes computes a likelihood and a prior:

 Discriminative model like Logistic Regression computes P(c|d) directly

Weighted sum of features:

Sigmoid function creates probabilities:

Decision threshold:
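Spelled out (standard logistic regression; notation mine):

z = w · x + b                               (weighted sum of features)
P(y = 1 | x) = σ(z) = 1 / (1 + e^(−z))      (the sigmoid squashes z into a probability)
decision: predict class 1 if P(y = 1 | x) > 0.5, else class 0 (and the 0.5 threshold can be moved)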

Designing Features for Sentiment Analysis

Designing Features for Period Disambiguation

TAKEAWAYS
 ChatGPT reflects a revolution in NLP and AI
 Language is Hard: infinite and ambiguous
 Basic NLP Techniques:
 Bag of Words, nGrams
 Naive Bayes and Logistic Regression: examples of generative and discriminative classifier models

LAB5
6. TECHNIQUES IN PRACTICAL ML

Topic Techniques in Practical ML


Readings
Activities to be done before next class: Lab6, readings for next lecture
Agenda: Market adoption, Data science profiles, Data science projects, Tools

QA

What are your reflections on AI Act?


In the dataset of face images we practiced with, all of the images are the same size. Let's say in our dataset we have
images of various sizes; how would we need to preprocess them and/or adjust the model?
Do you agree that if we as mankind were to sacrifice copyright law and/or GDPR and use protected data for training
ChatGPT-like conversational (or other) AI, we would be able to create even more powerful AI?
How can AI be used to better understand customers and their needs, as well as personalizing their experiences?
How can AI be used to optimize business processes and increase efficiency?
In which types of projects have you applied AI? For what reasons did you use AI?

INTRODUCTION

WHAT’S THE POINT?


An introduction to the aspects of data science that you don’t pick up in Jupyter notebooks (or on Kaggle)

Motivation
● Data science is mostly taught with a focus on the hard technical skills: Statistics, algorithms and coding
● There’s also a heavy focus of modelling and things like hyperparameter tuning.
● While these things are important, they are only part of the equation.
● This often leads to a culture shock when aspiring data scientists move to industry.
● Case: Myself, studied the data science programme at CBS
○ This is the lecture I would have liked to have had.
● So, today, I hope to give you guys an introduction to the aspects of data science that you don’t pick up in a
Jupyter notebook.
● I also hope to show you why a business background can be super valuable in this space.
● What if you don’t aspire to become a data scientist?
Well, if you are interested in how value is actually created from AI, I hope that you will find this useful too.

Practical definitions
So, there’s a lot of different definitions of data science. This field is constantly evolving, and so is the perception of it.
Notice in Arthur Samuel’s definition, there’s no mention of data.

 Artificial intelligence
o Artificial narrow intelligence (ANI)
 e.g., smart speaker, self-driving car, web search, AI in farming and factories

o Artificial general intelligence (AGI)


 Do anything a human can do

 Machine learning (In practice: ANI)


o “Field of study that gives computers the ability to learn without being explicitly programmed.” -
Arthur Samuel, 1959
 Data science
o “The science of extracting knowledge
and insights from data.” - Andrew Ng, DeepLearningAI, 2020
o Notice that Samuel doesn’t talk about how computers should learn
The modern answer: Data - and lots of it

It’s all analytics


Descriptive analytics: what happened (e.g. business intelligence, dashboards)
Diagnostic analytics: why did it happen?
Predictive analytics: prediction, classification
Prescriptive analytics: recommendations, pushing something in a certain direction

MLAI is a tool, not a solution.

 It might be important to think of how AI and ML fits into the ecosystem of analytics.
 Where do you think ML and AI fits in?
 I hope that you’ll agree that we can’t really start thinking about what will happen before we know what
happened and why it happened
 Really, the important takeaway here is that ML and AI is based on descriptive and diagnostic analytics.
o -> We can’t hope to model anything if we don’t understand it
 This is an important perspective.
 When speaking with a client, it can be very helpful to understand which kind of insights they are after.
 If diagnostic analytics is the solution, there is no reason to start thinking about predictive analytics.
MARKET ADOPTION
In terms of some research

ML-AI ADOPTION
AI adoption globally is 2.5x higher today than in 2017, but may have reached a plateau.

 A very common trend for new technology - Gartner’s Hype Cycle


 Every technology goes through a hype phase; then everyone gets disappointed and the hype goes down. At
that point the focus can turn to actually building the technology.
 Reality is sinking in: Organizations are starting to recognize the level of organizational change it takes to
successfully embed this technology.
 Some companies that get discouraged because they went into AI thinking it would be a quick exercise
 Those taking a longer view have made steady progress by transforming themselves into learning organizations
that build their AI muscles over time.
 MLAI maturity

Although global AI adoption is higher, it has reached a plateau.

What do AI achievers do differently?


Strategy and sponsorship as an example
● According to McKinsey research:
○ More indications that AI leaders are expanding their competitive advantage than evidence that others
are catching up

The achievers differ most in strategy and sponsorship – about 12% are doing very well.

They map as very good at business: they think about the business first, or at least about why they are doing it.

AI MATURITY
● AI maturity measures the degree to which organizations have mastered AI-related capabilities in the right
combination to achieve high performance for customers, shareholders and employees.
● What does all this mean for you?
It means that the critical component is not being or having the cleverest data scientists (as everyone thought a
few years ago)
● Instead, we need ambidexterity: Multiple people, with different skills.
 People that can understand the business and the technology.
Broad categories

● MLAI maturity is multivariate and varies significantly between organisations


● MLAI is not just technical know-how.
● Major barriers to success:
○ Data and data mgmt.
○ Strategy and vision
○ Sponsorship / buy-in
○ Governance
○ Talent
○ (Concise value propositions)

About AI maturity:
AI Innovators: have mature AI strategies, but struggle to operationalize.
AI Experimenters: lack mature AI strategies and the capabilities to operationalize – they don’t know what they want to do and lack technical know-how; most companies are here.
AI Builders: have mature foundational capabilities that exceed their AI strategies.
AI Achievers: have differentiated AI strategies and the ability to operationalize.

DATA SCIENCE PROFILES

SKILL GAPS IN DATA SCIENCE/MLAI


 Most important skills/areas of expertise missing

Anaconda made a huge survey on skill gaps in AI.

Engineering; strong coders

It seems like the field is looking for people that


can take this into a business context; from
notebooks into a business strategy

 The field needs data science professionals with different skill sets.
 Think about what you bring to the table.
 Too often, companies expect the superset of all elements, not the overlap in the middle.

The overlap in the middle: data science.

DATA SCIENCE PROJECTS


Many experimenters: making PoCs and MVPs -> establishing that they can get value from AI short term.
AI happens in PowerPoint – ML happens in Python.

How MLAI projects are different


● MLAI projects are IT projects - with an additional risk dimension: Uncertainty of signals in data.
● “The combination of some data and an aching desire for an answer does not ensure that a reasonable
answer can be extracted from a given body of data” - John Tukey, 1986
Traditional IT projects vs. MLAI projects:

Characteristics
- Traditional IT: rigid functional requirements; a defined product or service.
- MLAI: inherently experimental – all data science projects are experiments, therefore "science". Default hypothesis: "We can predict y from features X". Inductive reasoning: what can the data tell us? You may start out with one hypothesis and find that another also applies.

Timeline
- Traditional IT: (often) a clear timeline.
- MLAI: (often) an unclear timeline.

Iteration
- Traditional IT: (often) non-iterative*.
- MLAI: highly iterative – tasks depend on experiments.

Challenge
- Traditional IT: efficiency and implementation.
- MLAI: evaluating the hypothesis – then implementation.

SCOPING
Because of high uncertainty, careful scoping is critical - Don’t jump to modelling.
● Specifications
○ The WHY: How will this help?
○ What are the user stories?
○ What is the timeframe?
○ Evaluation criteria
■ Qualitative (Business) – end user’s idea,
■ Quantitative (Business, technical)
● Considerations
○ Is this a duct tape solve?
○ What level of analytics is required?
○ Are we reinventing the wheel?
○ The cold start problem (Logging the right data?)
○ Build vs. buy

● A small change to current ways-of-working may be better than a duct tape AI solution
● A user story:
○ “As a …. (persona)
○ I want to … (function)
○ in order to … (reason / user wish)”
● Qualitative: We would like to be able to detect credit card fraud cases more easily
● Quantitative: We would like to be able to detect at least 85% of credit card fraud cases

 Technical diligence - do not focus on this part


o Can AI system meet desired performance
o How much data is needed
o Engineering timeline
 Business diligence
o Lower costs [current business]
o Increase revenue [current business]
o Launch new product or business [new business]

Ask questions, quantify, approximate and discuss


 Low hanging fruits – email classification, sentiment analysis

DATA SCIENCE DEVELOPMENT

Three phases of data science projects


 MLOps:

MLAI lifecycle – MLOps


 Our clever models have zero value until we publish them.
 MLOps is engineering - Many data scientists struggle with these steps.

TOOLS
Let’s look into one of our projects

Which tools do we use?


https://github.dev/NTTDATAInnovation/documentai

Data handling
● Pandas
● NumPy
● DocArray

Deployment
● Docker
● DVC
● FastAPI
● Streamlit
● Gunicorn
● GitHub Actions
● Terraform

Testing
● PyTest

Environment
● PyEnv
● Poetry
● Pre-commit
● Jupyter notebooks
● Git + GitHub

Documentation
● MkDocs
● MkDocs Material

MLAI
● Scikit-learn
● PyTorch (or Keras)
● LightGBM
● SpaCy

WRAPPING UP

General
 MLAI is a tool, not a solution
Market adoption
 MLAI maturity is multivariate and varies significantly.
 The major barriers to success with MLAI are often not technical know-how.
Data science profiles
 Business knowledge is a sought-after skill in data science
 MLAI is not just technical know-how.
Data science projects
 MLAI projects are inherently experimental
 MLAI projects have an additional risk dimension: uncertainty of signals in data
 Because of high uncertainty, careful scoping is critical - Don’t jump to modelling
 When scoping, try to ask questions, quantify, approximate and discuss
 Our clever models have zero value until we publish them

7. IMAGE PROCESSING

Topic Image Processing: Pre-trained models


Readings Russakovsky et al. 2013
Activities to be Lab7, readings for next lecture
done before next
class

REVIEW OF BOW MODEL

A lawsuit against using ChatGPT in the US, because it can produce legal documents without paying high fees.
Who do you sue if it suggests something that turns out to be wrong?

MAKING LANGUAGE INTO STRUCTURED DATA

 The solution: Dummy values for words


o One feature for each word
o Value is 1 if word occurs in text, 0 otherwise
o Alternative values: number of word occurrences, or TFIDF score
o For any text, value of most features will be 0
o Treating each word as a feature, so a text can be input as a standard feature vector

BOW – COUNT VECTORIZER

Create bag of words representations
 Produces a sparse representation – a standard way of storing mostly-zero data in computer science.
 Default: it has a token pattern given as a regular expression (a pattern of characters). The pattern means a token has boundaries – something that limits a string, such as white space. Punctuation is ignored entirely.

Text Input
 Example with text – the last 3 texts in the collection, from movie reviews.
 Specify a max number of features and document frequencies – a way to restrict features. How does it choose the features? Probably the most frequent, according to Dan.
 max_df won’t allow features with a document frequency above a certain threshold – here 30%. It is a good idea to have a max_df; the idea is the same as with stopwords: if a word occurs in most documents it might not provide much information. Like in a newspaper, the byline “written by” is not informative, but might occur in all articles.

Restricting Features
 Texts represented with only 7 features.
 Shows the occurrence of words across documents.
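A minimal sketch of the CountVectorizer setup described above, using scikit-learn (the toy texts and exact parameter values are illustrative, not the lecture’s):

from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "A surprisingly touching movie, written with real warmth.",
    "The plot is written poorly and the acting is even worse.",
    "Touching performances in a warm and funny movie.",
    "A dull, poorly paced film.",
]

# max_features keeps only the most frequent terms;
# max_df drops terms that appear in more than the given share of documents
# (the lecture example used 0.3 on a much larger collection of movie reviews)
vect = CountVectorizer(max_features=7, max_df=0.7)
X = vect.fit_transform(texts)           # sparse matrix: documents x features
print(vect.get_feature_names_out())
print(X.toarray())                      # occurrence counts per document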

Word Ngrams vs. Char Ngrams
 The analyzer parameter default is ‘word’.
 There are interesting reasons why we might look at something different, like character ngrams instead of word ngrams:
o They ignore the boundaries of words.
o Subparts of words can be meaningful in their own right.
o Compound words – e.g. when words are put together.
o You need to use a longer range, since it is a sequence of characters.
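A small illustration of character n-grams with scikit-learn (the parameter values are just one reasonable choice, not the lecture’s exact settings):

from sklearn.feature_extraction.text import CountVectorizer

# 'char_wb' builds character n-grams but respects word boundaries;
# plain 'char' ignores word boundaries entirely.
# Character n-grams need a longer range than word n-grams.
char_vect = CountVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X = char_vect.fit_transform(["unhappy", "unhelpful", "happily"])
print(char_vect.get_feature_names_out()[:10])   # e.g. ' un', ' unh', 'hap', ...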

BOW – TFIDF VECTORIZER

A variant of the count vectorizer – it gives the tf-idf score instead of the raw count.
Good to be familiar with: the ability to use different options in a pipeline and use grid search to try out different combinations.

TF-IDF
 Words that are frequent in a document tell a lot about that document
 Words that appear in lots of documents are less interesting
 TF-IDF therefore:
o increases as term frequency increases
o decreases as document frequency increases
o numerator – term frequency, denominator – document frequency
o “the” has a high document frequency, so we should not focus on it; tf-idf is good at encapsulating that

Choose between tf-idf and CountVectorizer? Yes, pick one – it is an either/or for the bag-of-words representation.
Tf-idf is conceptually the more interesting of the two; put it into the grid search and explore it.
He has experienced worse results with tf-idf than with a simpler representation. You can’t explore all options, but do vary the
exploration of different options, consider what could be interesting to explore, and explain why you do as you do.

Do the same thing as before.
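A minimal sketch of the TfidfVectorizer, used exactly like the CountVectorizer above (the toy texts are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "the movie was wonderful",
    "the movie was terrible",
    "a wonderful, warm story",
]

# Same interface as CountVectorizer, but the cell values are tf-idf weights:
# high when a word is frequent in a document, lower when it appears in many documents.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))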

Tokenizer
 Standard token pattern. A simple approach: convert a string to a list of tokens.
 We don’t want words with symbols attached. Separate out punctuation – then add rules to make it more sensible.
 Done with standardly used tokenizers.

Subword Tokenizer
 A newer approach is subword tokenization. There are many reasons to look beyond the level of the word.
 An illustration of the BertTokenizer used to tokenize a string: most tokens are ordinary words and punctuation.
 What is cool about it: breaking words into interesting subparts – a complex linguistic process.
 Word morphology: how words are put together. A word has its own internal grammar, like happy and unhappy – “un” is a negation.

Building the subword vocabulary
 First look at counts for each token (word).
 Then split each word into separate characters, but remember the count of the word they occurred in.
 That is not the most interesting way to represent text – we need something in between the whole word and separate characters.
 Instead: take the most frequent pair and merge it, and repeat.
 The concept of white space.
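A rough sketch of what subword tokenization looks like with the Hugging Face BertTokenizer (the example sentence is made up, and the exact splits depend on the model’s WordPiece vocabulary):

from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("Tokenization handles unhappiness gracefully."))
# Common words stay whole; rarer words are split into subword pieces,
# e.g. something like ['token', '##ization', 'handles', 'un', '##hap', '##pi', '##ness', ...]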

BOW AND PIPELINE/GRIDSEARCH

 Large space of possibilities for text feature representations
 Design a GridSearch with a Pipeline to search the possibilities – these choices can only be explored in combination
with a pipeline and grid search (see the sketch at the end of this section). Which model to use is itself one of the choices:
o Word vs Char analyzer
o Binary, Counts, TfIDF
o Number of features
o Min/Max Doc Frequency
o Model: Logistic Regression, Naive Bayes, MLP, . . .
o BERT features
 This idea goes away with the current deep learning models
o Current models don’t look at different ways of defining features. We let the model itself choose
what to look at. It is given the sequence of tokens; we don’t define features.
o A lot of classical NLP is feature engineering
 This is taken over by the deep learning model, which engineers its own features – that is why
it has different layers. Each layer has different features at its own level of abstraction.
 It is becoming obsolete to do feature selection this way.
o Looking at what the layers learn is a way of getting a look into the black box of deep learning models
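A hedged sketch of how such a search could be set up with scikit-learn’s Pipeline and GridSearchCV. The parameter values and the training variables texts_train / y_train are illustrative assumptions, not the lab’s exact code:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    # binary presence, raw counts, or tf-idf weights
    "vect": [CountVectorizer(binary=True), CountVectorizer(), TfidfVectorizer()],
    "vect__analyzer": ["word", "char_wb"],
    "vect__max_features": [1000, 5000],
    "vect__max_df": [0.3, 0.7],
    # the classifier itself is also a hyperparameter
    "clf": [LogisticRegression(max_iter=1000), MultinomialNB()],
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
# grid.fit(texts_train, y_train)
# print(grid.best_params_, grid.best_score_)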

MORE ON LANGUAGE

SPACY AND LANGUAGE

A collection of convenient tools for all sorts of different topics: linguistic
annotations, finding structures in language – parts of speech, etc.

Linguistic Annotation
 Models can easily be downloaded and worked with in pipelines.
 Works with different languages.
 Convert the text with spaCy: it is turned into a list of annotated tokens.

Output of spaCy
 The basics of grammar: a recursive tree structure (the dependency parse).

Named Entities
 Doesn’t have perfect accuracy.

Text Similarity / Word Embeddings
 Take individual words or phrases and convert them into fixed-length vectors of real numbers. This is done in a training process similar to training a model.
 A way to get a sense of what a word is about.
 spaCy has a version of word embeddings – used to compute the similarity of texts.
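A small sketch of the spaCy features mentioned above. The model name "en_core_web_md" is an assumption (a medium English model that includes word vectors); it has to be downloaded first with: python -m spacy download en_core_web_md

import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for token in doc:
    print(token.text, token.pos_, token.dep_)   # part of speech and dependency label

for ent in doc.ents:
    print(ent.text, ent.label_)                 # named entities, e.g. Apple -> ORG

# Similarity between two texts, based on the word vectors
doc2 = nlp("Microsoft considers acquiring a British company.")
print(doc.similarity(doc2))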
LANGUAGE TASKS

SENTIMENT ANALYSIS

In the early days, the field worked with simple, well-defined tasks: take a text and
give it a 1 or 0 depending on whether it is positive or negative.

This is a simple classification task.

INFERENCE

A more interesting task.

In some ways this task is the whole thing in NLP: what you need to do is understand the meaning of sentences
and how they relate to each other.

This is also a classification task, divided into 3 categories:

1. Entailment: Sentence A entails sentence B; therefore if A is true, B has to be true.

2. Contradiction: The inverse of entailment; if sentence A is true, B has to be false.

3. Neutral

QUESTION ANSWERING
There are specific ways to set up a QA task.

One setup: there is some kind of text and some kind of questions; find the answers in the text.

A general type of task that seems to require deep understanding.

Think of it as finding a span – a limited view. It needs a text to provide a context in which it finds the answer.

No context: the system has to possess the answer from its training, like the current GPT systems – you don’t have to
provide texts.
Prompt engineering is a new area in the field: variations in how you pose a question to current models.

Different versions of the SQuAD dataset.

Leaderboards exist for the different tasks. Not as relevant anymore.

THE BERT MODEL

DEVLIN ET AL. 2019

You should read it, though you don’t necessarily need to understand all the details.

We will be using features that BERT can produce for a text, based on its pre-training.

MAIN POINTS

 Bidirectional Encoder Representations from Transformers.


 Pre-trained language model
o Trained on a large amount of text
o Not feasible for most students or organizations.
o Too resource intensive.
o Idea: the model has a lot of knowledge of language. Give it a text, and it makes a representation and
selects the features based on its pre-training. Then we can fine-tune the model.
 Can be fine-tuned for

o Single Sentence Classification: Sentiment Analysis, Emotion Classification
 There is unlimited data for it. The task encompasses all we care about in language.
o Pair Classification: Question-Answering, Inference
 Give it a pair of sentences. Sometimes they are random sentences, other times two sentences that
fit together in an actual text; BERT has to predict which is which – a supervised task, but we can
easily produce as much data as we want.
 Better for question answering: predicting related sentences.
 BERT is different from GPT models in the way it divides the above.
o BERT pre-training:

INPUT-OUTPUT REPRESENTATION
 Can represent a single sentence or pair of sentences
 “Sentence” can be any span of text
 Add [CLS] symbol to beginning of input sequence
 [SEP] symbol at end of first sentence
 If input is a pair of sentences, they are separated by [SEP]
 Use WordPiece embeddings – a variant of subword tokenizing. Then we have embeddings – representations with
numerical values that say something about the meaning of the words.
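A small sketch of what this input representation looks like in practice with the Hugging Face BertTokenizer (the example sentences are made up):

from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tok("The cat sat on the mat.", "It fell asleep there.")
print(tok.convert_ids_to_tokens(enc["input_ids"]))
# roughly: ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]',
#           'it', 'fell', 'asleep', 'there', '.', '[SEP]']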

PRE-TRAINING
 Input is pair of sentences
 Two Tasks:
o Masked LM
 Predict masked tokens, based on surrounding tokens
o Next Sentence Prediction
 Predict whether second sentence naturally follows first one
Missing words: predicting which word occurs in a specific context.
Supervised? You do have the actual answer.
Unsupervised in the sense that you have an unlimited amount of data – any text you can find.

FINE-TUNING
 Much less resource intensive
 Since the pretrained models know so much about language in advance we can use it
 Different for different tasks
 Sentence-pair tasks (Question Answering, Inference, . . . )
o Input is sentence pairs, with same representation as training
 Single-sentence tasks: (Sentiment analysis, emotion classification, . . . )
o Input is single sentence, ending with [SEP] symbol

Results from the article

LAB7

Based on an exercise that he linked to in Canvas.
The idea: take the BERT model  produce features for the texts  use those features to train a logistic
regression (no fine-tuning)  sentiment analysis.

Two Feature Sets
Compare it to a BoW model.
 DistilBERT
o Take a text and make a representation of features, based on the pre-training – a rich representation of
the text. Give the representation to the LogReg model for sentiment analysis; it tries to learn how to do
sentiment. It should be good due to the BERT representation.
 Bag of Words
o Instead of BERT features: give the LogReg model a bag-of-words representation as input.

DistilBERT
 Smaller version of BERT
 Almost matches BERT performance
 Produces sentence embedding – vector of size 768

Logistic Regression
 Classifies each sentence as Positive or Negative. Features are 768 real numbers – the sentence embedding
Dataset
 SST2
 Movie Review texts, classified as Positive or Negative

MODEL 1: DATA PREPARATION


 Tokenization
 Padding
 Masking
 Some pre-preparation is needed.

The BERT tokenizer breaks words into tokens. Not everything gets broken up, due to frequency.

For convenience, turn each sentence into a fixed length – make every sentence the same length.

That is done by padding the sentences with extra zeros: find the max length and add 0s to the rest.

Then tell BERT which parts of the sentences it can ignore – that is what masking does. It is handled automatically.
MODEL 2: SENTENCE EMBEDDINGS
 All sentences are input to BERT
 The output of interest corresponds to the first token ([CLS])
 Classifying a text: use the output at that position

Each sentence is labelled as either 1 or 0.

For each sentence we get a sentence embedding: an array of 768 numbers.

We use BERT as a black box to make an embedding for each sentence.

Model Produces Sentence Embeddings
 Last hidden state: an embedding for each position in the input.

Assign Features and Labels
 We are throwing away a lot of information. A BoW model can be a vector of 30,000 features; it knows every
word in a text. BERT boils the features down. There are advantages to both.

Logistic Regression Model
 Train as usual.
 Score the model.
 Best scores.
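A hedged sketch of the whole Lab7 pipeline, assuming the Hugging Face transformers library and a pandas DataFrame df with a ‘text’ and a ‘label’ column (e.g. the SST2 sentences); names and parameters are illustrative:

import torch
import pandas as pd
from transformers import DistilBertTokenizer, DistilBertModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
bert = DistilBertModel.from_pretrained("distilbert-base-uncased")

texts, labels = list(df["text"]), df["label"].values

# Tokenize, pad every sentence to the same length, and build the attention mask
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = bert(**enc)                                  # last hidden state: (n_sentences, n_tokens, 768)

features = out.last_hidden_state[:, 0, :].numpy()      # the [CLS] embedding for each sentence

X_train, X_test, y_train, y_test = train_test_split(features, labels)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))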
8. AI IN PRODUCTION – GUEST SPEAKER

Topic TBD
Readings SLP, Ch. 5, 6
Activities to be Lab8, readings for next lecture
done before next
class

GUEST SPEAKER: PRAYSON WILFRED DANIEL, NTT DATA

https://github.com/Proteusiq/hadithi/tree/main/talks

 Confirmation bias
 Ask questions before opening the dataset to avoid bias
 How was the data generated?
 How was it collected?
 What contains in the data
 EDA
 How to avoid domain experts’ bias?
 Highlight if there is differences
 Challenge their assumptions

 Ethics
o IN Europe moving to very controlled AI space
o Not in China or US
o Transparency – personal information?
 Some say ‘keep them’ – sometimes it does not make sense to remove
protected attributes
 E.g. predicting houses; The price is impacted by the number of non-danes in
the area. However, ‘race’ as an attribute can impose bias and should maybe
be removed.
 Solution: Highlight that you have kept the attribute to show that
you’re mindful about it
 Talk to your clients
o Iteratively improve and adjust the model making it more complex. But start very
simple, then talk to your clients and adjust it.
 Code to abstraction
o Reading any type of file
o Example from lab: Making classes with gridsearch that takes in an algorithm
 In setting where you need to be iterative and make experiments
 Have it running in functions or classes so we only have to change it one place
o Always abstract the things we do
 Do fit and predict proper
 Then check if
 Ideally build a system by having this abstraction

# Dependency injection: pass the changing parts (reader, estimator, parameters) in,
# and keep the stable code generic.
from pathlib import Path
from typing import Any, Callable

import pandas as pd
from sklearn.linear_model import LogisticRegression, SGDClassifier

def read(file: Path, reader: Callable[..., pd.DataFrame], **kwargs: Any) -> pd.DataFrame:
    # The reader (pd.read_csv, pd.read_json, ...) is injected, so this function never changes
    return reader(file, **kwargs)

def model(clf: Callable, **params: Any):
    # Doesn’t care which estimator or which parameters it gets
    return clf(**params)

clf = model(LogisticRegression, max_iter=1000)    # or model(SGDClassifier, loss="log_loss")
# clf.fit(X, y)

Dependency injection: e.g. a YAML config file is read in and decides which reader and model to run.

Separate what changes and what does not: e.g. a reader is not changing. Identify which elements of the code are
changing, and make them abstract.

Features typically change; the target changes.
9. RECENT DEVELOPMENTS: GPT-3 AND OTHER MODELS

Topic Recent Developments: GPT3 and other models


Readings Devlin et al. 2019
Activities to be Lab9, readings for next lecture
done before next
class

Where is the field heading?

The developments that are happening – we are in a position where we can engage with the state of the art of the field,
in research and in business applications.

Schedule an online meeting before the end of next week to get answers about your project.

Agenda:
1. The Revolution in AI/NLP: Large Language Models
a. GPT-3
2. BERT – Bidirectional Transformer Encoders
a. Training Bidirectional Encoders
b. Next Sentence Prediction
3. Transformers
4. GPT-3 and the OpenAI API

THE REVOLUTION IN AI/NLP: LARGE LANGUAGE MODELS

GPT-3

Using an LLM and optimizing it for interacting with people –
thus “chat”. GPT-3 is inside it.

The model is illustrated in a good way, displayed on their own website.

Illustrating the power of the model – correct grammar, formulation etc.

It is not connected to anything; it cannot go online, send mails, run programs etc. Why not? It would be
convenient – but this keeps responsibility with the user.

Why does OpenAI not think it is a good idea? There are add-ons that use the models.
Dan: “It wouldn’t be safe.”
ChatGPT: What it can do

 Model by OpenAI which interacts in a conversational way


 Answers follow-up questions
 Challenges incorrect premises and rejects inappropriate requests

ChatGPT: the Technology


 Based on GPT3
o transformer model
o trained on missing word prediction
 Further training through Reinforcement Learning from Human Feedback
o In short, ChatGPT is a version of GPT-3 which is optimized for dialogue. The GPT-3 model
itself is trained on information from the web with missing-word prediction, but ChatGPT is further trained
to optimize interaction with humans.

Growth of ChatGPT

It took only 5 days to reach 1 million users – a historic impact in terms of software. This reflects
how powerful this kind of AI is and is going to be.

Discussion: Have we reached genuine AI? – a deep question: is it actually intelligent?
To a large extent the software does represent a solution to the main AI problems.

Will it become inferior over time, within the next 10 years – as quantum computing develops?
- Dan: nobody knows. ChatGPT is already inferior compared with GPT-4. The software will continue to evolve.
He talks about how AI can match human intelligence.

The Turing test? It was used in the past to evaluate AI intelligence. We have blown past the Turing test.
The new goal is to reach AGI – Artificial General Intelligence: not a specific task,
but all the kinds of tasks humans solve with their intelligence.
There are still many things humans are better at than AI with our human intelligence; those are the new tasks.

GPT-4
More effective, and hopefully safer.

There is a report on how it is performing and how it has been tested – human tests, many different from ChatGPT/GPT-3.

Uniform Bar Exam: the exam taken after finishing law school to become a lawyer – the results indicate that GPT-4 would
do better on the exam than most human lawyers.

Better at translation, although it wasn’t trained for translation.

Good at coding, although nothing was done specifically for programming.

Khan Academy is integrating GPT-4 – very pedagogical: not giving answers directly but helping
students find the answer.

Current language models are not reliable at being truthful all the time. They are good at making things sound
plausible although they are not true; that is a concern in an educational setting. There is no solution for that yet.

GPT-3: BROWN ET AL. 2020 – LANGUAGE MODELS ARE FEW-SHOT LEARNERS


GPT-3 is old.

GPT-3 AND FEW SHOT LEARNING

 They observed: “scaling up language models greatly improves task-agnostic, few-shot performance,
sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches.”
o Task agnostic = the model itself is not built to do QA, sentiment analysis, summarization etc. – the specific
tasks that NLP systems are often trained to do, as humans can. Built in a task-agnostic way, the
models become competitive with state-of-the-art fine-tuning approaches.
o Pre-train, then fine-tune for a specific task, for example sentiment analysis (NOT task agnostic)
o As these models become really big, we may not have to make them task-specific at all.

GPT-3:
 transformer model – like other models like BERT
 175 billion parameters
 10x more than any previous large language model

A SHIFT IN NLP
Their point
 Shift from
o learning task-specific representations and designing task-specific architectures to
o using task-agnostic pre-training and task-agnostic architectures.
 To using task-agnostic, few-shot performance competitive with prior state-of-the-art fine-tuning approaches
o We can do as well for all tasks without tuning the model for a specific task

 Previous Approach:
o Pretrain model (GPT, BERT)
o fine-tune on a large dataset of examples to adapt a task agnostic model to perform a desired task
 Recent work (Radford et. al 2019) suggested this final step may not be necessary.
o Very exciting.

IN-CONTEXT LEARNING
In the paper they do not do any fine-tuning (which means lots of training data and updating the weights of the
network). Instead: “few-shot” learning.

 one- and few-shot performance is often much higher than zero-shot performance
o Zero-shot: e.g. sentiment analysis; classify sentences “zero-shot” – can the model do it out of the box?
o One/few-shot: also give it one or a few examples of a positive and a negative sentence first, then give it
the task of classifying, and test the model on that few-shot setting. The paper shows that this is competitive.
 This suggests that language models are “meta-learners” where slow outer-loop gradient descent based learning
is combined with fast in-context learning
o The standard way of training a neural network model (gradient descent) is a slow process: you gradually
fine-tune the weights. Very time consuming.
o Compared with fast in-context learning.

 . . . one notable pattern is that the gap between zero-, one-, and few-shot performance often grows with model
capacity, perhaps suggesting that larger models are more proficient meta-learners.
o As you give the model a few examples it improves, and the improvement is bigger for bigger
models. Larger models are more proficient meta-learners: better at general learning from few-shot,
in-context examples.
o A movement towards AGI? Being able to just say “do this thing for me”. You might think that larger models
would care less about in-context learning because they already know more, but instead they show a generalized
learning ability – if this continues to improve with larger models, it is an interesting movement.
o The suggestion is that in-context learning is what has made GPT-3 revolutionary (see the small prompt example below).
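A toy illustration of the difference between zero-shot and few-shot prompting (the prompt texts are invented for illustration; no gradient updates are made to the model in either case):

zero_shot = (
    "Classify the sentiment of this review as Positive or Negative.\n"
    "Review: The plot was dull and the acting was worse.\n"
    "Sentiment:"
)

few_shot = (
    "Classify the sentiment of each review as Positive or Negative.\n"
    "Review: A warm, funny and touching film. Sentiment: Positive\n"
    "Review: I walked out after twenty minutes. Sentiment: Negative\n"
    "Review: The plot was dull and the acting was worse. Sentiment:"
)
# In few-shot / in-context learning the examples are simply part of the prompt text.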

GPT-3 Training
 Common Crawl – common collection produced by crawling the internet.
 Known high-quality reference corpora: Combined with these
o WebText
o Books Corpora
o English Wikipedia

Tasks
Look at different tasks and compare with task-specific approaches.
 LAMBADA: predict last word of paragraph
 StoryCloze: select correct ending for five-sentence long stories
 HellaSwag: pick best ending to story or set of instructions

Results:
Compare with state-of-the-art results.
Sometimes GPT-3 performs better than the state of the art, even with the zero-shot approach.

Often there is a nice improvement from zero-shot to one- and few-shot.

Question Answering Tasks


 We evaluate in the “closed-book” setting (meaning no conditioning information/articles)
 TriviaQA
 Natural Questions (NQs): tests fine-grained Wikipedia knowledge
 ARC: a common sense reasoning dataset of multiple-choice questions collected from 3rd to 9th grade science
exams.

Results: Open-Domain QA

Results: QA and RC Tasks


BERT – BIDIRECTIONAL TRANSFORMER ENCODERS
BERT - Predecessor to GPT

Bidirectional Transformer Encoders


First big transformer model from 2018
 Computes contextualized representations of tokens in an input sequence
o With BERT this is done for every token
 Useful in many downstream applications
 Uses self-attention to map sequences of input embeddings
 (x1, . . . , xn) to sequences of output embeddings (y1, . . . , yn)
 Output vectors are contextualized – using information from the entire input sequence

BERT:
 Subword vocabulary consisting of 30,000 tokens generated using the Word-Piece algorithm
o Don’t deal with whole words all the time. Break up words in useful ways, common simple words are
left whole (like and, or etc), while other words have meaningful sub-parts (un-happy) – important part
of these models.
 Hidden layers of size of 768
 model with over 100M parameters
 Fixed input size of 512 subword tokens

TRAINING BIDIRECTIONAL ENCODERS

CLOZE TASK
All these models are trained on some variant of this word prediction task
 Predict missing words in input
o Please turn _____ homework in.

MASKED LANGUAGE MODELING (MLM)
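The masked-LM idea in a few lines, using the Hugging Face fill-mask pipeline (the model choice is an assumption):

from transformers import pipeline

# Predict the masked token from its surrounding context
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("Please turn [MASK] homework in."):
    print(pred["token_str"], round(pred["score"], 3))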

NEXT SENTENCE PREDICTION

BERT - NEXT SENTENCE PREDICTION


 Two new tokens to the input representation
o CLS prepended to the input sentence pair
o SEP placed between the sentences and after the final token of the second sentence
 NSP relevant for tasks relating pairs of sentences
o paraphrase detection
o inference
o discourse coherence

 Model is presented with pairs of sentences


 Task is to predict whether each pair consists of an actual pair of adjacent sentences from
training corpus or a pair of unrelated sentences
 50% of training pairs are positive pairs, and 50% are where second sentence was randomly
selected from elsewhere in the corpus

CONTEXTUAL EMBEDDINGS

 Output of the model is contextual embeddings for each token in the input
 Can be used as a contextual representation of the meaning of the input token for any task
requiring the meaning of word.
o Big improvement
o Words change meaning in context.

 Static embeddings represent the meaning of word types: vocabulary entries


 Contextual embeddings represent the meaning of word tokens: instances of a particular
word type in a particular context.

 Contextual embeddings used for tasks like measuring semantic similarity of two words in
context
 Useful in linguistic tasks that require models of word meaning
 Most common use: as embeddings of word or even entire sentences that are the inputs to
classifiers in the fine-tuning process for downstream applications.

TRANSFORMERS

Transformer: the basic model behind all large language models.

Key characteristics

Words can attend to all other words in the input
 How does it do that?
 Transformer models create attention – dependencies or linkages between any word and any other word. When it is
deciding how to treat a word, it can compare that word to any other word at any point, and decide about the word
based on any other word. It can decide to ignore certain words and find others interesting.

The first transformer was an encoder-decoder model.
 First: encoding into embeddings – looking at each word individually, then comparing.
 Then decoding, producing output. Each output becomes input for the next step – autoregressive.
 Think of a chatbot trying to answer the user.

 Convert each word into embeddings.
 Combined with positional encodings: a code for where the word is in the input.
 Crucial: as humans we take words one at a time, processing sequentially. Transformers do not do this; they take
all the words at the same time and process each word in parallel. The model knows where the words are in relation
to each other because of the positional encoding.

 Each word has linkages to every other word: for any given word embedding, how relevant is it to any other word?
 Words are relevant to themselves; then there is a matrix showing how much relevance words have to each other.
 The system learns to produce these linkages in self-attention.
 This is not well understood, but it is the key to why large language models work so well.
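A toy numpy sketch of the self-attention computation described above (a single head, no batching; the weight matrices would normally be learned):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (n_tokens, d_model). Project each token embedding to query, key and value vectors.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # relevance of every token to every other token
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # each output mixes information from all tokens

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                  # 5 tokens, embedding size 16
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (5, 16)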

GPT-3 AND THE OPENAI API

 Introduction to the API
o https://platform.openai.com/docs/api-reference/introduction?lang=python
 Examples:
o Q&A
 https://platform.openai.com/examples/default-qa
 Suggestion: few-shot learning. Give it a prompt with a general description.
o Classification:
 https://platform.openai.com/examples/default-classification
 No training needed.
There is potential for building useful apps with the API. You need to pay, but it is inexpensive.

Initially free, but then you need to pay?
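A minimal sketch of calling the API from Python, assuming the openai package with its older pre-1.0 Completion interface (the model name and prompt are illustrative):

import openai

openai.api_key = "YOUR_API_KEY"   # use your own key

response = openai.Completion.create(
    model="text-davinci-003",     # a GPT-3 family model; the exact name is an assumption
    prompt="Classify the sentiment as Positive or Negative.\n"
           "Review: The plot was dull and the acting was worse.\nSentiment:",
    max_tokens=5,
    temperature=0,
)
print(response["choices"][0]["text"].strip())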

QA

Do we have to talk about our projects at the exam?

Show that you master approaches to machine learning.
Talk about what you have done in your project.
Consider alternative approaches – models, metrics etc.

They can ask about specific topics.

He will do a review of the key topics of each session.

10. CONCLUSIONS

SESSION 2: MLPS AND UNCERTAINTY ESTIMATES

 Perceptron is a linear model


o Linking together
 MLP generalizes this by linking perceptrons with nonlinear activation functions (Neural networks)
o That’s what makes them different from linear models
 Deep neural network has multiple layers
o Interesting learning devices
 Many parameters to tune
 Uncertainty: models can output one or both of decision_function and predict_proba
o Interesting to look at the certainty of a model. Models can give how strong they think they are
 Relates to classification predictions by reference to threshold
o Thresholds are interesting for the practical implications of a model. In practice we often measure the model by
accuracy, but there are often big differences between types of errors. Putting in thresholds related to the certainty
or uncertainty of the model relates to the business relevance of the model
 We can modify the threshold to alter classifications in interesting ways – see the sketch below
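A small, self-contained sketch of modifying the classification threshold via predict_proba (the dataset and the 0.3 threshold are illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

probs = clf.predict_proba(X_test)[:, 1]      # probability of the positive class
default_pred = (probs >= 0.5).astype(int)    # what .predict() does
cautious_pred = (probs >= 0.3).astype(int)   # lower threshold: catch more positives,
                                             # at the cost of more false positives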

SESSION 3: UNSUPERVISED MACHINE LEARNING/ REPRESENTING DATA/ MODEL


IMPROVEMENT

 Unsupervised ML often preparation for Supervised ML


o Scaling
o Dimensionality Reduction
 Clustering
 In practice – although they are two different categories – unsupervised ML is often used as preprocessing for
supervised machine learning

SESSION 4: MODEL EVALUATION AND METRICS

 There are lots of ways to evaluate models


o Scikit-learn gives many built-in metrics
o Try to find the best one
 We should select a metric that corresponds to the goals and business impact of building the model
 Can explore thresholds for classifier models
 We select models by tuning hyperparameters – can do this wrt. the metric best suited to your case
o What metric are we trying to optimize; not use default – can pick a different or create your own wrt
what is important

SESSION 5: NATURAL LANGUAGE PROCESSING


Important area in machine learning
 ChatGPT reflects a revolution in NLP and AI
 Language is Hard: infinite and ambiguous
o Gives an appreciation of what NLP is all about and of the challenge of NLP: it is a profound challenge to process
language. There is no way to hardcode a system that can process language perfectly
o Languages are also infinite sets of expressions
 Basic NLP Techniques:
o Bag of Words model, looking at nGrams – take language which seems unstructured and turn it into
some kind of structured representation that we can address with standard ML.
 Limited in many ways but can be surprisingly effective
 Many applications apply – sentiment analysis, QA
 Naive Bayes and Logistic Regression: examples of generative and discriminative classifier models
SESSION 6: PRACTICAL ML

Guest lecture: Nicolai Blegvad Thomsen - NTT DATA


 General: MLAI is a tool, not a solution
 Market adoption
o MLAI maturity is multivariate and varies significantly.
o The major barriers to success with MLAI are often not technical know-how.
 Data science profiles
o Business knowledge is a sought-after skill in data science
o MLAI is not just technical know-how.
 Data science projects
o MLAI projects are inherently experimental
o Uncertainty of signals in data
o Careful scoping is critical
o Ask questions, quantify, approximate and discuss

SESSION 7: NATURAL LANGUAGE PROCESSING: BERT AND LARGE LANGUAGE


MODELS

 More about NLP


 Review of BoW Model
o Interesting to keep in mind when using a BoW model: there are interesting variations of it; normally a count
vectorizer with word features.
o Explore differences like looking at individual words (the default feature) or ngrams – it can make a big
difference
o Flaw – it doesn’t take context into account. Language is not just a bag of words; small changes in the order can
change the meaning a lot, but BoW doesn’t take this into account.
 E.g. “Sam killed Joe” and “Joe killed Sam” would be seen as the same in BoW
o Can use characters instead of words as the basic feature
 Ngrams between 1 and 3 are reasonable

 More on Language: SpaCy, linguistic annotation
o Look at linguistic features – tools like spaCy give functions to do different linguistic annotations: each word
labelled as adjective, verb etc., and a dependency graph produced for you. Fundamental to an
intelligent processing of language.
o Interesting to do linguistic annotation and take this as additional features.
 Language Tasks: sentiment analysis, inference, QA
o Sentiment is just classification – a limited aspect of language. Interesting because it does get at aspects of
understanding, but it is quite limited; there is much more to text.
o Inference: figure out whether one sentence logically entails another sentence.
 Interesting: this is converted to a classification task as well
 Entailment: “She saw a dog” and “she saw an animal” would be entailment
 Contradiction
 Independent
 Seems to get at fundamental aspects of language understanding
 BoW would not get us far with that
o QA:
 All tasks can be framed as QA.
 Related to the idea of the Turing Test. We have blown past the Turing test now.
 BERT model
o 2018. A huge advance over other approaches; it can take context into account. Built to learn arbitrary
connections between different words in the input.
o BoW: each word is an independent feature.
o BERT: looks at every word and how it relates to every other word. A transformer.
o In principle, we have a chance to address the fundamental problem of ambiguity in language.
SESSION 8: GUEST LECTURE – “THE HITCHHIKER’S GUIDE TO MACHINE
LEARNING PROJECTS”
Prayson Wilfred Daniel: https://github.com/Proteusiq

SESSION 9: RECENT DEVELOPMENTS: GPT-3 AND OTHER MODELS

 The Revolution in AI/NLP: Large Language Models


o GPT-3
o We are already on GPT-4, which is an amazing improvement over GPT-3
 BERT – Bidirectional Transformer Encoders
o Training Bidirectional Encoders
o Next Sentence Prediction
 Transformers
 GPT-3 and the OpenAI API - a great thing to get familiar with and experiment with
 Programming is disappearing
o Model is built by a machine.
o We can train and tune a model; interact with natural language
Concern:
 Lose control of AI?
 Alignment problem of AI
o As AI begins to pursue goals – it might not do so yet, but you can give it a goal and it can try to pursue it, and it
can interact with users
o Make sure it pursues goals that align with human goals
Be able to talk intelligently at the exam about:
 Project
 Cover other topics besides specific problem which might be narrow

11. FINAL PROJECT

Topic Area
We are focusing on two main topic areas in this class

Text Analysis/Natural Language Processing


Image Classification
We expect most of you to do a project within one of these topic areas -- all of which involve advanced machine learning
topics. If you wish to choose a different topic area, you should check with us, to make sure it also involves advanced
machine learning topics.

Relevant Literature
Find at least one recent research paper that deals with issues similar to your project. Summarize the main points of the
paper and compare it with your project.

Data Description
How many instances? What are the features? Give a few lines of the data and/or mention the range of values for
important features. This could be a dataset you have found, or that you have constructed by combining different datasets.

ML task
What is the target value? Is this classification, regression or some other type of problem? It is important that you use
techniques and concepts covered in this course, such as Pipelines, Grid Search, and different Metrics and Thresholds.

Relevance
Why is it interesting?

The oral exam will include a discussion of the project, and can also cover other important topics from the course
readings and lectures.
Python resources
 GitHub page with notebooks and code for the book "Introduction to Machine Learning with Python"
 Pandas tutorials
 Tips and tricks for Jupyter Notebook
 Learn Python

Install Anaconda
If you don't have a Python installation, you should install Anaconda.
(You can use mamba, poetry, pyenv, virtualenv and a lot more. If in doubt, please use Anaconda.)
Make sure to select Python 3.6 or above.
When the download is complete, find the .exe or .dmg file, and install.
Find Jupyter Notebook. In Windows look for recently added software; on Mac look in the Applications folder.

Datasets
 Top 23 Best Public Datasets for Practicing Machine Learning
 GoEmotions
 Reddit News
 Here is a general introduction to Pushshift, which you can use to get Reddit data.
 Ekstra Bladet
 Danish Parliament (Folketinget)
 AirBnB data
 COVID-19 data from OurWorldInData

EXAM

FINAL PROJECT

 Use techniques and concepts covered in this course, such as Pipelines, Grid Search, and different Metrics and
Thresholds.
o Improving and experimenting with the setup of your model
o Ideally done in a way that reflects the relevance of your model: what would be most relevant?
 Pretrained models
o Include pretrained models like BERT
 Relevant research paper(s)
o Use as comparison or inspiration
 Show how your dataset is interesting!

ORAL EXAM

 Group exam: optional brief presentation


o Coordinate in the group
o Approx. 5 minutes – nice to keep it short
o Think about it as a sales pitch – why is what we have done nice and interesting?
o What he doesn’t find interesting: what could we have done differently, or what did we do wrong?
 He doesn’t care about this
 Discuss Final Project – have your paper available during exam!
 Questions about other topics from course
o Refer to the paper,
 Still graded individually
o It is common that the group gets the same grade.
 Make sure that everyone can talk about everything – including the parts done by other group members
 Do we have to provide our code?
o Probably in the appendix
o They are not obligated to read the appendix
o Or put it on GitHub – data and project
o They will NOT ask directly about the code
o Reproducibility and validating results: describe what you did in the paper to an extent that someone could
reproduce it
 Describe and focus on what/how and why we did what we have done
o Concise representation of the results
 Make a table with results
 Questions about the topics of the course
o Sometimes there won’t be any
o If the project is very specific and narrow towards certain areas of the course, they will probably ask about
other areas
o Broad projects: they will probably not ask about other topics
o They might ask basic things, to make sure you know the basics
o Often the project is the main focus of the exam
 If we have made a project on text, would they expect that we talk/know about image processing?
o He is a “text” person – doesn’t know much about image processing
o They will probably focus on what the project is on
o But of course its good if we know about image processing as well
o Only to the extent to what we have been talking about in class

The paper: what should we focus on?

Look at at least one research paper.
Ideally, write as if you would publish at a conference: what would people NEED to know? Don’t leave out key information.
What is interesting about this? (contributions to the field)
