
ARTIFICIAL INTELLIGENCE AND

MACHINE LEARNING (AI)


NOTES

1. INTRODUCTION: LINEAR MODELS AND TREE MODELS

Topic Introduction: linear models and tree models


Readings MLP 2.3.3 (46-70), 2.3.5 (72-85), 2.3.6 (85-94)
Activities to be done before next class: Lab1, readings for next lecture

TAKING THIS CLASS

GOING FURTHER WITH MACHINE LEARNING

More Powerful Models


More complex models, a more diverse group of models

 Random forest and other ensemble models


 Multilayer Perceptron (MLP)/Neural Networks
 Pretrained Models (BERT, GPT, . . . )

Going in more depth


 More detail about models
 Tuning: search for the best hyperparameter settings for a model
 Systematic ways to understand the importance of model features
o When building a model for a certain domain, the most interesting thing is often not the solution itself, but the view of the domain that the model has taken. This can be seen by looking at feature importance. The tendency is to treat the model as a solution to a problem and use it as a black box, but it is better to open up the black box and learn from it, e.g. via feature importance.
 Customer churn: attributes on customers. The model is more valuable if it can show the most important features for whether a customer will churn -> look at the coefficient of each feature.

 Evaluating: exploring model metrics systematically.
o Classification: accuracy is most used; also precision and recall.
o The choice is related to the domain and problem
 Medical: FP and FN have very different costs
o Many metrics to evaluate model performance in sklearn

ML Models and Business Value


 Expected Value Framework: what is the right metric for a given business problem?
 Classifier Thresholds: how “careful” should a classifier be, based on the business context?
o For a given context, make it harder or easier to predict one class compared to another by modifying the
threshold.

Pre-trained models: a new opportunity


 Pre-trained Models in Language and Vision create a new opportunity to build very powerful models with a
small amount of training data
 Language:
o BERT (Google)
o GPT-3 (OpenAI)
 Image Classification:
o VGG-16
o ResNet

CLASS PROJECT

 Select/construct dataset
 Pose interesting questions, with societal/business relevance
o Pick a dataset, question etc. that we believe in, driven by our own curiosity
 Produce a research-style paper.
o Detailed comparisons to relevant recent work
o Propose ways to build on / improve recent work
o Present new results and discuss in light of previous work
o Find relevant recent work; the paper should compare to it or build on/improve it, and discuss our project
in light of this work. We are not expected to produce new research, but it is a step in that direction:
having a complete picture of what is happening in the domain and situating our work in this area – how
does it add on? – from a business view.

Final Project
 Interesting Datasets
 Interesting Questions
 Use ML techniques studied in class
 Best practices for training, tuning and evaluating
 Models
 Take advantage of large pre-trained models
 Systematically explore different model evaluation metrics to connect to business value
 Write research-style paper, where you connect with current research

Foundation to master thesis


Business idea

COLLECTING DATA

 The most important thing is not


o the model you use
o or how much training data there is or how you tune it
o ...
 It’s how interesting the dataset is!

Interesting Datasets
 Reddit
 Twitter
 Our World in Data
o Covid-19.

 Combining different datasets, to pose questions like:


o Can Twitter data predict Covid-19 case levels?
 ML could predict Covid better than the human experts.
o How are different travel destinations described on Reddit?
o Does weather affect the level of crime in different areas?
o Does Airbnb data in different areas tell you about social conditions in different cities?
 Combine with crime? Increase rental prices?
 Hidden insights in their collected data
o ...

THE CLASS

Weekly sessions
 Readings
 Lecture
 Activities
o Programming/building models
o Finding and constructing interesting datasets
o Developing project ideas
o Submit results regularly
 Feedback
o Written feedback on submitted results
o Feedback on project ideas and workplan
 Syllabus

Main Reading
 Introduction to Machine Learning with Python
AI, MACHINE LEARNING AND BUSINESS

AI AND MACHINE LEARNING


Machine learning is the power behind AI

AI is the power behind the most important companies in the world


 Big Tech: AI and ML:
o Google
o Facebook
o Amazon
 Intelligent Search
o Google
 Machine Translation
 Question Answering
 Image Recognition
 ChatGPT

Example of ChatGPT answering an exam question from AI and ML.

WHY AI/MACHINE LEARNING AT CBS?


Different ways of thinking of ML and AI

 Research
 Startup

Tech companies are running the world (with AI), for good and bad.

Everyone: understand how AI and ML will affect business, the economy, society and our common future.

This class builds on Big Data Management


SUPERVISED MACHINE LEARNING

Supervised Machine Learning: Supervised ML means there is a label or target value


 Regression: target is numerical
 Classification: target is selected among a finite set of choices

REVIEW: BASIC CONCEPTS

Generalization, Overfitting and Underfitting

 Generalization: how well model performs on data other than the training data
o This is the goal. The training data only summarizes the task; we want to go beyond these specific observations.
 Overfitting: Model is much better on training data than on test data
o Model is too complex, and too closely tied to specific details of training data
 Underfitting: Model doesn’t perform well on training data or test data
o Model is too simple

LINEAR MODELS

LINEAR REGRESSION

Linear Model: a weighted sum of input features. Basic model


Prediction is a line for a single feature, a plane for two features, or a hyperplane for more features

 A line (two dimensions)

 Linear Equation (p + 2 dimensions)

An example for 2 dimensions.

LINEAR CLASSIFIER

A sum of weighted features. Same computation as regression, but the result answers a binary question. The model searches for the best
combination of weights – which features contribute most to separating the classes.
LINEAR MODELS AND REGULARIZATION

 Regularization: make model simpler to reduce overfitting, by pushing coefficients closer to 0


 Support Vector Classifier
 Logistic Regression
 C parameter
o C = 1: default
o C = 100: more complex model – possibility of overfitting
o C = .01: simpler model – possibility of underfitting
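A minimal sketch (my own example, not from the lecture) of how C is set on a scikit-learn linear classifier; the dataset and values are only illustrative:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in [0.01, 1, 100]:  # simpler model ... default ... more complex model
    logreg = LogisticRegression(C=C, max_iter=5000)
    logreg.fit(X_train, y_train)
    print(C, logreg.score(X_train, y_train), logreg.score(X_test, y_test))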

DECISION TREE

 Widely used for classification and regression


 Builds a hierarchy of if/else questions, leading to a decision
 Intuitive

Similar to a linear model, but it takes each feature and makes a simple
assessment: how does it contribute to the target variable?

Decision Tree – Binary and Continuous Tests


Binary Test: is feature i True or False? Continuous Test: is feature i larger than value a?

Linear models are limited in the decision boundaries they can express – trees are more powerful than linear models in this respect.

Two Moons Dataset

Decision Tree – Controlling Complexity


 Build until leaves are pure
 If all leaves are pure, tree will be 100% accurate on training data
 To prevent overfitting:
o Limit the complexity
o limit depth of tree
o limit maximum number of leaf nodes (splits)
o require minimum number of points in a node to keep splitting
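A hedged sketch of how these complexity limits map onto scikit-learn parameters (the values here are just examples):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=4,           # limit depth of tree
    max_leaf_nodes=20,     # limit maximum number of leaf nodes
    min_samples_split=10,  # require a minimum number of points in a node to keep splitting
    random_state=0,
)
tree.fit(X_train, y_train)  # assumes X_train, y_train are already defined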

Decision Tree – Feature Importance


Linear models have coefficients for each feature, positive or negative, showing their importance; trees let you examine the importance of each feature directly.
The tree-building algorithm looks at each feature to see which has the biggest power in terms of increasing purity, and finds the
split that increases purity the most. Once the tree is built, look at each feature and see how much information it gives
about the target – this can provide information about the domain.
 Feature Importance: between 0 and 1 for each feature
 0 means the feature is not used at all, 1 means it perfectly predicts the target
 A weighted measure of impurity reduction

Example: Which features are most important

Feature importance
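A small sketch (own example) of reading feature importances off a fitted tree:

import pandas as pd
# tree is a fitted DecisionTreeClassifier; feature_names lists the column names (both assumed)
importances = pd.Series(tree.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(10))  # most important features first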

RANDOM FOREST

 Build many decision trees


o Address problem of overfitting of decision trees.
 Each tree differs in random ways.
o RF is extremely powerful and an improvement over single decision trees. It randomly varies the data used in the
different trees.
o Select different data points used to build tree
o Select different features in split tests

Data for Random Forest


 For each tree, create bootstrap sample
 Randomly select n items from the original dataset, allowing repetitions
 Each tree will have same size dataset, but randomly different, because of repetitions

Random Feature Subsets for Random Forest


 Parameter max_features
 Select random subset of features of size max_features
 If max_features is high, more chance of overfitting

Parameters for Random Forest


 Number of trees (n_estimators)
o The more trees, the more powerful
 max_features
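A hedged sketch of these parameters in scikit-learn (the values are illustrative):

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees: the more trees, the more powerful (and slower)
    max_features="sqrt",  # size of the random feature subset tried at each split
    random_state=0,
)
forest.fit(X_train, y_train)  # assumes X_train, y_train are already defined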

NEURAL NETWORKS

The Perceptron
Linear models are a weighted sum of inputs.

A perceptron is a weighted sum of inputs – a linear


model

Multilayer Perceptron/Neural Network


When putting perceptrons together, we get a neural network.

Connecting multiple linear models still gives one big linear model – in principle no more powerful than one.

Activation functions
Not linear functions
Chopping off parts of the input – this nonlinearity is what makes the network more powerful than a single linear model

Neural Network – Equation

Tuning Neural Networks


 Many parameters to adjust:
o Number of hidden layers
o Number of units in each hidden layer
o Regularization
 Scaling of inputs is important
o Not true for all models (decision trees etc.)
o Also important for KNN etc.
 We work with multilayer perceptrons
o The most straightforward kind of neural network
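A hedged sketch of these tuning knobs with scikit-learn's MLPClassifier (values are illustrative; inputs are scaled first because MLPs care about scaling):

from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)      # assumes X_train, y_train exist
X_train_scaled = scaler.transform(X_train)

mlp = MLPClassifier(
    hidden_layer_sizes=(10, 10),  # two hidden layers with 10 units each
    alpha=0.01,                   # regularization strength
    activation="relu",            # or "tanh"
    max_iter=1000,
    random_state=0,
)
mlp.fit(X_train_scaled, y_train)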

THE END OF PROGRAMMING


ML as a New Way to Program
Machine learning vs. rule based

 Traditional Approach: Rule based /programming approach


o Define Problem – ie input-output function
o Inspect data, reason about the problem – come up with rules based on your inspection. Use your intelligence
as a human.
o Construct program that solves problem
o Pros:
 Easier to see what you are doing, interact with data
 Better understanding when you write the program
 Ethics: easier to prevent bias.
 A human also brings knowledge to the process, whereas the machine learns from scratch from the
dataset.
o Cons:
 Human can create biased models.

 ML Approach:
o Define Problem – ie input-output function
o Collect Data
o Automatically build model based on data, that solves problem
o Skip the programming part. The programming is done by the machine learning algorithm: the solution is
produced by the algorithm, not by a human programmer.
o Cons:
 Can be a black box
 Bias. Reproduces bias.
 Starts from zero with the data. This is changing with large language models.

 Few-shot Learning: ML changing – not learning from scratch


o Define Problem – ie input-output function
o Pre-trained model brings understanding
o Collect small amount of data
o Quickly tune model to solve problem
o  Programming is changing

TAKEAWAYS

 Building on basics of ML from Big Data Management


 More powerful models
 Explore different ways to evaluate models – Expected Value, thresholds
 Large pre-trained models for language and vision
 Final project – research style paper
LAB1

Random state – for reproducible results when repeating.

There is also a requirement that the distribution be the same when splitting the data set: the same distribution of target values within both
training and test data (stratification).
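A small sketch (own example) of both points with scikit-learn's train_test_split:

from sklearn.model_selection import train_test_split

# random_state makes the split reproducible; stratify=y keeps the same
# distribution of target values in train and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y)   # assumes X, y are already loaded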

Q1
Dummy classifier
All 10 are equally frequent

3 way split: Train-val-test


Tuning: using the train-val set, that is split into train- and val.
Fit on train-set
Score on val-set

When the best parameters have been found:


Train the model on the entire training set, trainval consisting of both the train and val set.
Then score on the test-set.


2. NEURAL NETWORKS (MLPS) AND UNCERTAINTY ESTIMATES

Topic Neural networks; uncertainty estimates


Readings MLP 2.3.8 (106-120), 2.4 (121-129)
Activities to be done before next class: Lab2, readings for next lecture

1. Lab1
2. Neural Networks: Multilayer Perceptrons – History, Linear Model vs. Deep Learning, Examples
3. Uncertainty Estimates

NEURAL NETWORKS: MULTILAYER PERCEPTRONS

HISTORY

What a perceptron looks like

The perceptron – “the first machine which is capable of having an original idea”, according to Frank Rosenblatt

1. All inputs (similar to x[0]…x[3])

Weights, links with the weights

Feedback

LINEAR MODEL VS. DEEP LEARNING

LINEAR MODEL
A weighted sum of input features – learn the coefficients of the links

 ^y is a weighted sum of input features x[0] to x[p]


 Coefficients w[0] to w[p] are learned in training
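Written out (the standard linear-model form, consistent with the readings):

ŷ = w[0]·x[0] + w[1]·x[1] + … + w[p]·x[p] + b

where the weights w (and the intercept b) are learned in training.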

The Perceptron is a Linear Model


NEURAL NETWORK MODEL

 Perceptron computes weighted sums


 In MLP, process of Perceptron is repeated multiple times
o Repeating the process multiple times
 Hidden units are an intermediate processing step
o Hidden because you can’t observe them directly, the way you can the input or output
 These are in turn combined using weighted sums to produce output

Multiple Layers of Perceptrons


When putting perceptrons together, we get a neural network.

Connecting multiple linear models still gives one big linear model – in principle no more powerful than one.

It is still linear; only when the output passes through a non-linear function does the model become
more powerful.

Activation Functions
Non-linear in some way. When the weighted sum passes through this function, the model becomes more powerful.
Input is passed through an activation function.
A way of ignoring stuff

Not linear functions


Chopping off parts of the input

Each neuron figures out its weights differently. Its activation
function determines which parts of the possible output values it should
ignore.

In image recognition, by using these 'chopping off'
functions, some neurons start to specialize in finding particular
patterns. This allows each neuron to develop an expertise, so you
gradually get a neural network where each neuron is specialized.

Activation Functions: relu and tanh


 relu: cuts off values below zero
 tanh: goes to -1 for low values, +1 for high values

Neural Network – Equation

Where do the activation functions apply? They are applied to the hidden units; ŷ, the output, is then a weighted sum of the hidden units.
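Written out for one hidden layer with tanh activation (the form used in the readings; indices are illustrative):

h[0] = tanh(w[0,0]·x[0] + w[1,0]·x[1] + … + w[p,0]·x[p] + b[0])
h[1] = tanh(w[0,1]·x[0] + w[1,1]·x[1] + … + w[p,1]·x[p] + b[1])
…
ŷ = v[0]·h[0] + v[1]·h[1] + … + b

where w are the weights from input to hidden layer and v are the weights from hidden layer to output.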

MLP with Two Hidden Layers


EXAMPLES

Two Moons Dataset


 Two classes
 Two features

MLP Decision Boundary

Decision boundary for neural network with two features

10 hidden units
Much sharper boundary, not as many distinctions

2 hidden layers, and 10 hidden units, with relu activation function:


2 hidden layers, and 10 hidden units, with tanh activation function

MLP Different Settings


Controlling the size / complexity

Different Initializations
For smaller networks, random initialization can make a difference in the result.
Cancer Data: Scaling


Not scaled data

Scaling in general – neural networks care a lot about scaling!

The test score is markedly higher when data is scaled

If we scale the data: compute the mean value and the standard deviation for every feature, then center
all values around the mean and divide by the standard deviation.
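A small sketch (own example) of that scaling recipe with scikit-learn, fitting the scaler on the training data only:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)                       # learns mean and std per feature from the training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # test data scaled with the training statistics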

UNCERTAINTY ESTIMATES

Knowing what you know

Knowing what we know (or don’t know)


UNCERTAINTY IN CLASSIFICATION

How certain is classifier for a given prediction?

How close is output to decision boundary?


For humans, decisions require reflection.

Machine learning models generate a score and make a decision based on that output.

For any kind of model:

Uncertainty Estimates for Models


They can always be calculated for a model:
 Two methods: decision_function and predict_proba
 Most models have one or both of these

Example: decision_function
 decision_function returns a floating point number for each sample
 The value encodes how strongly the model believes a data point belongs to a class
o Strength of the classification, how strong the model believes in it

Examples of values we might get

Floating point numbers, unbounded in their values

Decision function values into classification

Example: predict_proba
 predict_proba outputs a probability for each class
 For binary classification, the shape is (n_samples, 2)

As they are probabilities, the values sum up to 1

How do we recover the classification from this?

 The class with probability above .50 is the one predicted – if binary
o For multiclass: the class with the highest probability is predicted
 A calibrated model is one where probabilities align with accuracy – predictions with probability .70 are correct
70% of the time.
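A small sketch (own example) of reading both kinds of uncertainty estimate from a fitted binary classifier clf (assumed to have both methods):

scores = clf.decision_function(X_test)   # unbounded floats; sign/threshold gives the class
proba = clf.predict_proba(X_test)        # shape (n_samples, 2); each row sums to 1
pred = proba[:, 1] > 0.5                 # default binary decision recovered from the probabilities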
MULTICLASS AND UNCERTAINTY
 decision_function and predict_proba have shape (n_samples, n_classes)
 High score means class is more likely and low score means class is less likely

Multiclass: Iris Example


Gradient boosting classifier on the Iris data.

Apply decision_function to the test data and look at the first entries; the largest value in each row indicates the predicted class.

Recover predictions from the decision_function scores with the argmax function, which finds the highest score per row.

Uncertainty estimates: following the same steps with predict_proba gives the same predictions as calling predict directly. Gradient boosting has both methods.

Why use the uncertainty estimates rather than just predict? Most models have only one of the two methods. A big problem with ML is that, no matter what, the model will predict
something. The estimates give us a tool: if the score is under a certain threshold, we might not want to predict anything at all.
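A small sketch (own example) of recovering multiclass predictions from decision_function scores with argmax, for a fitted GradientBoostingClassifier gbrt on the iris test set:

import numpy as np

scores = gbrt.decision_function(X_test)              # shape (n_samples, n_classes)
pred = np.argmax(scores, axis=1)                     # index of the largest score per row
print((gbrt.classes_[pred] == gbrt.predict(X_test)).all())  # True: same predictions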

PREDICTIONS WITH THRESHOLDS

 Can recover predictions from uncertainty estimates using different thresholds
 This can reflect the different costs and benefits of different errors
o In some domains, some errors are a lot worse than others (e.g. churn prediction)
 Lab2: cancer classification of benign vs. malignant
o Maximize recall or precision:
 Recall: minimize FN; Recall = TP / (TP + FN)
 Precision: minimize FP; Precision = TP / (TP + FP)
 By modifying the threshold, we can selectively improve precision or recall
 Relevant for social applications

TAKEAWAYS

 Perceptron is a linear model


 MLP generalizes this by linking perceptrons with nonlinear activation functions
 Deep neural network has multiple layers
 Many parameters to tune
 Uncertainty: models can output one or both of decision_function and predict_proba
 Relates to classification predictions by reference to threshold
 We can modify threshold to alter classifications in interesting ways

LAB2

Wisconsin Breast Cancer dataset – supervised ML

 What’s the point


o Inductive statistical modeling via ML
o Modeling in a highly-sensitive domain

 AI in mammography – it’s already applied in industry


3. UNSUPERVISED MACHINE LEARNING

Topic Unsupervised machine learning – representing data and feature engineering


Readings MLP 3.1-3.4, Ch. 4 (133-169)
Activities to be done before next class: Lab3, readings for next lecture
Agenda:
1. Unsupervised Machine Learning
o Preprocessing and Scaling
o Dimensionality Reduction
o Clustering
2. Representing Data and Engineering Features
o Categorical Variables and Dummy Values
o Automatic Feature Selection
3. Model Improvement
o Cross Validation
o Grid Search

UNSUPERVISED MACHINE LEARNING

PREPROCESSING AND SCALING

Supervised vs. Unsupervised Machine Learning


Supervised ML: there is a target value (also called a “label”)
Unsupervised ML: there is no target value

SCALING

Can be a problem if different features have very different ranges


For example, house price ranges from 50,000 to 5,000,000, while number of bathrooms ranges from 1 to 4.
Important for SVMs and Neural Networks
Can also be a problem in a supervised setting if you are trying to calculate the relative importance of features.

For example, when predicting something about houses: price goes up to 5 million, while the number of bathrooms only ranges from 1 to 4. House price might then dominate the output simply because of its range, and the number of bathrooms might look unimportant. Scaling gives a fair starting point: all features are treated as equally important by putting them on the same scale.

Scaling matters for support vector machines and neural networks, but not for tree models.

Can be done manually.

 Several different scalers in scikit-learn


 MinMaxScaler ensures that all features are between 0 and 1
 Scaling usually applied before doing supervised ML
 Scale training and test data the same way

Scale Test data same as training data

Scalers have the same syntax as models: instantiate one, fit the scaler to the training data, then transform. Once a scaler is fit to the training data,
it has examined all of it and found the min and max value of each feature; it then knows all the relative values and can
transform the data.
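A small sketch (own example) of that syntax with MinMaxScaler:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)                         # learns min and max of each feature from the training data
X_train_scaled = scaler.transform(X_train)  # training features now lie between 0 and 1
X_test_scaled = scaler.transform(X_test)    # test data uses the same min/max, so it may fall outside [0, 1]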

Scaling in Python
Scaled between 0 and 1

Relative differences are the same after scaling, which is the basic idea.

The scaler is fit to the training data only, not to the test data. We transform the training data so it lies between 0 and 1; the test data is transformed with the same scaler, so it is not necessarily between 0 and 1.

The test data does not have the same distribution as the training data – we don’t know what it will contain.

We retain the same scale learned from the training data when we apply it to the test data.

E.g. looking at houses: the max number of bathrooms in the training data is 4, which becomes 1 after scaling. The test data might have a house with 5 bathrooms. We want to treat it as bigger than 1 – better than the best house in the training data – so we want to retain that.

There are different types of scalers that do the same kind of thing.

DIMENSIONALITY REDUCTION

PCA
 Principal Component Analysis (PCA) – a popular form of dimensionality reduction. An unsupervised process of trying
to find the most interesting ways that the data varies.
o Some features might correlate with each other.
 Rotates the dataset to create features that are uncorrelated
 The first component contains the most information, i.e. accounts for most of the variance
 Can select a subset of the most informative principal components
o New versions of the features, ordered by how much information they provide.
o Converts the data so the first feature is the most informative, ordered by informativeness
 Visualizing high-dimensional datasets
o Useful for visualising – pick a couple of the most informative components

PCA and Synthetic Data

Start with the original dataset.

We can then transform the data and only look at the first or second
component.

Cancer Histogram
 Overlay two histograms for benign and malignant
 Texture error looks quite uninformative
 Mean Concave points looks much more informative
o How they vary in respect to the target
How PCA is applied to the cancer dataset: take each feature.

Features where the two histograms completely overlap do not help predict the
target; "mean concave points" is well separated, so it is better at predicting.

This gives an idea about how different features relate to the target value.

PCA itself doesn’t know what the target value is, and doesn’t care, but we can see how
much the features overlap.

PCA is used in the same way as we use models.

Here we select 2 components. Fit to the scaled data; then the data is transformed into X_pca, whose shape now has 2 features.

Convenient to visualize, as we now only have 2 features – gives a better sense of the dataset. PCA is good at finding 2 important directions.

A classifier might also do better when the features are reduced: it helps avoid overfitting. A subset of the components generalizes the data instead of fixating on specific details in it, so we can get better generalization. A good idea to try.
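A small sketch (own example) of that workflow on already-scaled data:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)        # keep the first 2 principal components
pca.fit(X_train_scaled)          # assumes the data has already been scaled
X_pca = pca.transform(X_train_scaled)
print(X_pca.shape)               # (n_samples, 2) – convenient to plot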
PCA AND IMAGES
Used for face recognition

Baseline – Without PCA


 Labeled Faces in the Wild
 Faces of celebrities from the early 2000s; 3,023 images
 87x65 pixels
 62 different people

PCA a natural thing to do with this kind of task


Low level input data; abstract features of the images.

Baseline – take the data as it is. Applying a KNN.

Look at every single pixel.

 Classification Results
o Accuracy of only 0.23
 Serves as a baseline; compared to random guessing over that many classes, it is actually relatively good.
o But it is a 62-way classification problem
 So it is learning something
 Gives us a starting point.

Classify using PCA


 Use PCA to construct first 100 Principal Components
 Use these features in KNN classification
When applying the same KNN model after PCA, accuracy improves from 23% to 31%.

The features say more about the faces even though we only use 100 of them.

First Component: contrast between face and background


Second Component: differences in lighting between left and
right side
...

Taking the individual components, reshaping them into image-shaped arrays and visualizing them: each component
captures a different aspect of the picture, some contrasting the left and right side.

 Classification Results
o By using 100 principal components instead of pixel features:
o Accuracy improves from 0.23 to 0.31
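A hedged sketch of that pipeline (parameter values follow the description above; whitening is an optional extra step):

from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

pca = PCA(n_components=100, whiten=True, random_state=0).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train_pca, y_train)
print(knn.score(X_test_pca, y_test))   # around 0.31 in the example above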

NMF
Other ways of reducing dimensionality
 Non-negative Matrix Factorization
 Like PCA, can extract useful features
 Can be used for dimensionality reduction
 Each feature must be non-negative

NMF and faces


Quality of back-transformed images similar to PCA but not as good
NMF can find interesting patterns

 Component 3 shows face rotated to the right


 Component 7 shows face rotated to the left

Our ability as humans to recognize faces.

Faces with the highest values for Component 3 are all from a similar angle – illustrating that these components focus on
some specific orientation.

Faces with the highest values for Component 7:


We do get benefits from dimensionality reduction: more informative features than the original pixel values. But what we really
want is a learning approach that finds such features in a targeted way.

This is why neural networks are good for this: they get feedback in the learning process, modifying the features through different versions, and the target helps them find the right types of abstractions. That is how they get really good at image recognition – they extract the right kinds of higher-level features from the picture.

CLUSTERING

 Partition data into clusters


o Make up classes as you go along – view of similarity. Come up with clusters in which similarity is
maximized within a cluster and minimized across clusters.
 Data items within a cluster should be similar, and items in different clusters should be different
 Clustering algorithm assigns a number to each data item
 Similar to classifier – but there is no ground truth

K-MEANS CLUSTERING
 Finds cluster centers that represent specific regions of the data
o For a certain number of clusters, find centers that maximize similarity within each cluster.
 Alternates between two steps:
o Assign each data point to the closest cluster center
o Recompute each center as the mean of the points assigned to it
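A small sketch (own example) of k-means in scikit-learn:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X)                      # unsupervised: no target value
print(kmeans.labels_[:10])         # cluster number assigned to each data item
print(kmeans.cluster_centers_)     # the learned cluster centers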
TAKEAWAYS: UNSUPERVISED ML

 Unsupervised ML often preparation for Supervised ML


o Scaling
o Dimensionality Reduction
o Prep – prelude for supervised machine learning.
 Clustering
o Truly unsupervised approach
REPRESENTING DATA AND ENGINEERING FEATURES

CATEGORICAL VARIABLES AND DUMMY VARIABLES


 So far: we’ve assumed that our data consists of floating-point numbers – continuous feature
 Also want categorical features
 Similar to distinction between classification and regression
 Continuous feature: size measurement of flower; income of individual
 Categorical feature: color of flower; gender of individual

Income Dataset
 Creating dummy values

Categorical Variables

 Modeling only makes sense if the features are numerical

Workclass Feature
 Workclass is categorical feature
 Has four possible values:
o Government Employee
o Private Employee
o Self-Employed
o Self-Employed Incorporated
 Create four new features

Dummy Variables
 Also called One-hot-encoding, or one-out-of-N encoding
 If a feature F has three values, a,b, and c
o Create three new features, Fa, Fb and Fc
o If sample i has value a for feature F, then Fa_i = 1, Fb_i = 0, Fc_i = 0

Dummy Variables with Pandas


The get_dummies method does it automatically: it ignores numerical features and applies dummy encoding to the categorical columns.

Check Values
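A small sketch (own example) of dummy encoding with pandas, assuming a DataFrame called data with a categorical workclass column as in the income example:

import pandas as pd

print(data['workclass'].value_counts())   # check the categorical values first
data_dummies = pd.get_dummies(data)       # numerical columns are left unchanged
print(list(data_dummies.columns))         # one new 0/1 column per category value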

One might think this is problematic: it creates a dataset with higher dimensionality, which can be challenging for the model.

Alternative: label encoding.

But this rules out something that the model might otherwise learn – e.g. with movie ratings.

AUTOMATIC FEATURE SELECTION

Three Feature Selection Methods


 Univariate: look for a statistically significant relationship between each feature and the target
o Looking at features one by one, and how they relate to the target
 Model-Based: uses a supervised model to judge the importance of each feature
o Asks the model what it thinks are the most important features
 Iterative: build a series of models, and try adding/subtracting features

More effective modeling, and more information about the domain.

UNIVARIATE FEATURE SELECTION


 Test each feature for how informative it is about the target (can be for classification or regression)
o Statistical correlation between features and target values.
 Threshold: discard features based on p-value (likelihood that feature is correlated with target)
 SelectKBest: selects best k features
 SelectPercentile: selects percentage of best features
Fit the selector to the training data. Here the selector uses the 50th percentile, i.e. keeps the top 50% of the
features – a statistical selection.

We get a better result with the logistic regression model on our test data using the reduced set of features.
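A small sketch (own example) of univariate selection with SelectPercentile:

from sklearn.feature_selection import SelectPercentile

select = SelectPercentile(percentile=50)   # keep the top 50% of features
select.fit(X_train, y_train)               # fit on training data only
X_train_selected = select.transform(X_train)
X_test_selected = select.transform(X_test)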

MODEL-BASED FEATURE SELECTION

 Uses supervised ML model to judge importance of each feature


 Tree Models compute feature importance
 Linear models have coefficients
 Unlike Univariate, Model-Based Feature Selection can capture interactions between features
o Understand more about importance of features.

Use a random forest classifier to select the features.

Then use logistic regression to actually model the data – you don’t have to do feature
selection with the same model you use for prediction; you can use a different model.
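A small sketch (own example) of model-based selection with SelectFromModel, using a random forest to select and logistic regression to predict:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

select = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0),
                         threshold="median")    # keep features above the median importance
select.fit(X_train, y_train)
X_train_sel = select.transform(X_train)
X_test_sel = select.transform(X_test)

lr = LogisticRegression(max_iter=5000).fit(X_train_sel, y_train)  # a different model does the prediction
print(lr.score(X_test_sel, y_test))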

ITERATIVE FEATURE SELECTION METHODS

Two General Methods:


1. Start with no features, add them one by one
2. Start with all features, remove them one by one – Recursive Feature Elimination (RFE)

It is a bad idea to look at the data and just drop certain features because they look unimportant. It is up to the
ML model to figure that out – use systematic ML methods; don’t assume ahead of time!
Select Features
Iterative selection is done using RFE.

Transform and Score

Then fit the model to the selected features.

Score with Model Inside RFE
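A small sketch (own example) of recursive feature elimination; the number of features to keep is illustrative:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

select = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
             n_features_to_select=20)    # remove features one by one until 20 remain
select.fit(X_train, y_train)
X_train_rfe = select.transform(X_train)
X_test_rfe = select.transform(X_test)
print(select.score(X_test, y_test))      # score with the model inside RFE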

TAKEAWAYS: REPRESENTING DATA


 Convert categorial data to dummy (0/1) values
 Several methods for feature selection Data

MODEL IMPROVEMENT

CROSS VALIDATION

Fitting Model to Data


 The point is to find models that generalize to data beyond the training data
 That’s why we split data into training and test data: the test data score gives a better assessment of the model than the training data score
 Cross Validation: do multiple splits between training and test data to get a better assessment of the model
 Grid Search: search for the best parameter values, to get a better model

Cross Validation
 Instead of one train-test split, multiple splits
 For example with Five-fold CV, pick one fifth of data as test, and the other four fifths as training data
o Each fifth used sequentially as test data
 Gives a better basis for assessing model – with one split, might be “lucky” or “unlucky” with test data

Five-fold Cross Validation


Cross Validation in SciKit-Learn
Gives scores of all the splits

Cross_val_score performs
 split of train and test data
 fits model to train data for each of the splits
 scores model on test data

CV doesn’t improve the model; it is a way to assess the model.
The more splits, the more accurate a view of the model’s performance.
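A small sketch (own example) of five-fold cross-validation:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)  # assumes X, y are loaded
print(scores)          # one test score per split
print(scores.mean())   # summary of model performance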

GRID SEARCH

Assessing vs Improving Models


 Cross validation is simply a method to assess a model – does nothing to improve the model
 Grid search is a way to improve your model
 Combines well with cross validation

Example: Tuning an SVM


 C controls regularization – higher values mean less regularization, just as with logistic regression
 Gamma controls complexity of model in a different way
o A low value of gamma means that the decision boundary will vary slowly, which yields a model of
low complexity, while a high value of gamma yields a more complex model. (p 102 in text)
 Want to try these values for both gamma and C: .001, .01, 0.1, 1, 10, 100 (regularization/complexity)
o Nested for loop to try each of the 36 combinations

Wrong: tuning on the test data! That is like cheating.

Alternatives:
1. Bayesian Search
2. Random Search

Why We Need a Validation Set


 It’s wrong to tune a model using scores from test set
 This won’t give valid indication of how model generalizes
 Need to define a separate Validation Set which is used to tune model
 Scores on test data are only given once tuning is finished, and best model is selected
 Used to check hyperparameters – applying the train-test split twice.
A Three-Way Split of Data

First split:
Train-val contains both training and validation set

Second split:
Splitting the train-val set into train and val.
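A hedged sketch of tuning on a validation set with a simple nested loop (SVC and the value grid follow the example above; the exact variable names are mine):

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)            # first split
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, random_state=0)  # second split

best_score, best_parameters = 0, {}
for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
    for C in [0.001, 0.01, 0.1, 1, 10, 100]:
        svm = SVC(gamma=gamma, C=C).fit(X_train, y_train)
        score = svm.score(X_val, y_val)            # score on validation, never on test
        if score > best_score:
            best_score, best_parameters = score, {'C': C, 'gamma': gamma}

svm = SVC(**best_parameters).fit(X_trainval, y_trainval)   # retrain on train + val
print(svm.score(X_test, y_test))                           # test score only reported once, at the end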

Note on kwargs
 best_parameters = {’C’: C, ’gamma’ : gamma}
 Define best_parameters as a dict
 svm = SVC(**best_parameters)
 The ** operator unpacks a dict into keyword arguments for a function

GRID SEARCH WITH CROSS-VALIDATION


 Model results can be very sensitive to how data is split
 Rather than use grid search with one split, can use it together with cross-validation for a better tuning process

The more folds, the smaller each validation set will be, but the more valid a picture you get. 3-5 folds would usually be
sufficient.

 Use cross_val_score instead of score


 Take the mean of the scores returned by cross_val_score
 Can use the GridSearchCV class to implement this; it finally retrains on both the train and validation data, so the best
model is trained on more data – important to know

Summary - tuning
Tuning on Validation Set

Tuning on Validation Set using Cross-Validation
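A small sketch (own example) of grid search with cross-validation in scikit-learn:

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
grid = GridSearchCV(SVC(), param_grid, cv=5)   # cross-validated tuning on the training data
grid.fit(X_train, y_train)                     # refits the best model on all training data at the end
print(grid.best_params_, grid.best_score_)
print(grid.score(X_test, y_test))              # final assessment on held-out test data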

TAKEAWAYS: MODEL IMPROVEMENT


 Split train and test data to assess a model
 Cross validation is a better way to assess a model
 Grid Search improves model by finding best hyperparameter values
 Need a three-way split of data: train, validation and test
 Can combine Grid Search and Cross Validation
LAB3

The method doesn’t care what the data is – whether it’s pictures or something else.

Important - method
- Tuning using training data, cv on part of data
- After finding our best classifier, we retrain on all the data – both val and train!
- Then create a classification report

Grid search is an exhaustive way to run through all combinations.

Alternatives:
- Random search – shown to perform about equally well
- Bayesian search
4. MODEL EVALUATION AND IMPROVEMENT

Topic Model Evaluation and Improvement


Readings MLP Ch. 5 (257-310)

Activities to be done before next class: Lab4, readings for next lecture

METRICS FOR MODEL EVALUATION

BASIC METRICS: BINARY CLASSIFICATION

Classification: Accuracy
 Default metric
 Accuracy = (correctly classified samples) / (total number of samples)

Regression: R²
 Coefficient of Determination
 R² = Explained Variation / Total Variation

BUSINESS IMPACT

What is the goal? What is the business impact of using the model?

Many different metrics for models

BASIC CLASSIFIER METRICS


Binary Classification

 Start with Binary Classification


 Can call two choices Positive and Negative
 This is an arbitrary choice
 By convention, Positive might be the choice we are most interested in

Positive: Yes – has disease


Negative: No – doesn’t have disease

Confusion Matrix

Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision
Being precise about the POSITIVE value: true positives divided by all the times the model says the sample has the disease.
Precision = TP / (TP + FP)
We can cheat here too, though not in the same way as with recall: a high-precision strategy would be to say "positive" less often, only choosing the cases with the highest confidence.

Recall
True positives divided by all the times the sample actually does have the disease.
Recall = TP / (TP + FN)
Easy way of cheating: say everyone has the disease – perfect recall, but precision might be bad.

F score
Because we can "cheat" with both recall and precision, we need the F score to balance things out:
F1 = 2 · (precision · recall) / (precision + recall)
It gives a value between 0 and 1. Evaluating a model with only recall or only precision is not good, as it might give a
skewed picture.

Digit Classification
Convert digits into unbalanced binary classification task

Target value y is true if digits.target is 9, false otherwise


Unbalanced: 10% of target values are true, 90% false

DummyClassifier
Confusion matrices

Dummy Classifier: Most frequent

Classification Report: Logreg


Logistic Regression Classifier

UNCERTAINTY IN PREDICTIONS

CLASSIFICATION AND THRESHOLDS


Linear Classifier


 Classifiers compute a value that is compared against threshold – for linear classifiers, default is 0

Lower Default Threshold


 Another tool to push the model in a direction that relates to our goal in building the model,
instead of picking the standard setup
 Problem with unbalanced data: with fewer examples of one class, the model tends not to pick that class

Default Threshold (0)

Lower Threshold (-.8)
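A small sketch (own example) of applying these two thresholds to a fitted classifier svc:

y_pred_default = svc.decision_function(X_test) > 0     # default threshold
y_pred_lower = svc.decision_function(X_test) > -0.8    # lower threshold: more points predicted positive,
                                                        # which raises recall at the cost of precision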


PRECISION-RECALL CURVE

 Tradeoff between precision and recall


 Can explore this with different thresholds
 Precision_recall_curve gives precision and recall values for different threshold values

If we lower the threshold enough, we get perfect recall (but poor precision).

Plot the curve for different thresholds – a way of exploring the tradeoff.
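A small sketch (own example) of computing the curve for a fitted classifier svc:

from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

precision, recall, thresholds = precision_recall_curve(
    y_test, svc.decision_function(X_test))   # one (precision, recall) pair per threshold
plt.plot(precision, recall)
plt.xlabel("precision"); plt.ylabel("recall")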

ROC CURVE

 Plots True Positive Rate (recall) against False Positive Rate


 Best values are higher (more true positives) and to the left (fewer false positives)

False Positive Rate = FP / (FP + TN)

True Positive Rate (Recall) = TP / (TP + FN)
o True positives against all actual positives

ROC Curve
Takes different thresholds and plots the two rates against each other.
Indicates where we might get the best results.

AUC

 AUC is Area Under (ROC) Curve


 Single number to summarize ROC curve
 Ranges from 0 to 1
 Random guessing always gives 0.5, even with unbalanced datasets
 Can be more revealing with unbalanced data
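A small sketch (own example) of the ROC curve and AUC for a fitted classifier svc:

from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_test, svc.decision_function(X_test))
print(roc_auc_score(y_test, svc.decision_function(X_test)))   # single-number summary of the curve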

METRICS FOR MULTICLASS CLASSIFICATION


Binary: in respect to the target class. For multiclass we look with respect to any class.

Confusion matrix

For one class: Recall: 37 / 37 = 1.00 – perfect; Precision: 37 / 37 = 1.00.

For another class: Precision: 43 / 46 = .93 – looking down the predicted column, not all predictions are correct.
Recall: 43 / 48 = .90.

Classification report
For every class we get the metrics – a way of putting it all together.

USING METRICS IN MODEL SELECTION


Using different metrics

 We select models by tuning hyperparameters


 Use GridSearchCV, cross_val_score
 By default, accuracy is optimized
 Can change this to other metrics, such as average precision

Example – using accuracy, which is the default.

We can change the scoring metric that the CV optimizes with respect to.

All the possible scorers that can be used in CV are listed in scikit-learn.

Ideally we optimize the model for the value/metric that is most valuable in our case, not necessarily accuracy.

Challenge: what is the most valuable and relevant metric in our case?

Aligning metrics to our case.
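A small sketch (own example) of changing the metric that is optimized:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.01, 0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="average_precision")  # instead of default accuracy
grid.fit(X_train, y_train)
# sklearn.metrics.get_scorer_names() lists all the available scorers (in recent scikit-learn versions)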

TAKEAWAYS
 Metrics for evaluation: there are lots of ways to evaluate models
 We should select a metric that corresponds to the goals and business impact of building the model
 Can explore thresholds for classifier models
 We select models by tuning hyperparameters – can do this wrt. the metric best suited to your case

Project
 Readings: find a relevant research paper
 Workplan: submit description in week 10
 Use techniques discussed in class:
o Diverse metrics
o Thresholds
o Pipelines
o Grid Search CV/Tuning

LAB4

Like sentiment analysis – look for positive words to predict a positive label, or vice versa for negative.

NB: large dataset; start with a small sample, e.g. 1,000 or 10,000 instances. Pandas has a great sample method that returns randomly selected rows.

Facebook posts labeled with certain emotions. Value counts on the emotion column show how many instances each emotion has – the emotion classification task is quite unbalanced.

A preprocessing function converts the text into a bag of features (words) – one feature for each word. There are many options for this.

Train test split.

CountVectorizer turns the text into features: use the bag-of-words algorithm to turn words into features when preprocessing the data. X_train_vec is the result of the CountVectorizer.

Build the model. Different options, e.g. for ngram length. Creating a LR model gives a score of 94% on train and 54% on test – a 5-way classification.

What are the right kinds of features for the data? The size of the ngrams (different lengths) matters; train and test results are rather different. Increasing the ngram size increases the number of features – whether a high number of ngrams is good varies with the size of the data.

Use a dummy classifier as a random-guessing baseline.

TFIDF
Use a different scoring: instead of how many times a word occurs in the text, use TFIDF. Apply the tfidf algorithm to the output of the vectorizer.

Create a pipeline and do grid search to explore arbitrary aspects. Making a pipeline matters because preprocessing is important for language processing and bag-of-words models. Look at characters instead of words (individual letters); the default analyzer is 'word'. Use the pipeline with GridSearchCV: specify different options for the model choices and hyperparameters, then do the grid search.

Two types were tried; one gets better scores using characters.

Classification report.
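A hedged sketch of that kind of pipeline (the column names and parameter grid are illustrative, not the lab's exact code):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("vect", CountVectorizer()),      # bag of words
    ("tfidf", TfidfTransformer()),    # rescale counts with TFIDF
    ("clf", LogisticRegression(max_iter=5000)),
])
param_grid = {
    "vect__analyzer": ["word", "char"],            # words vs. individual characters
    "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "clf__C": [0.01, 0.1, 1, 10],
}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)                         # X_train: raw texts, y_train: emotion labels
print(grid.best_params_, grid.score(X_test, y_test))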
5. ALGORITHM CHAINS AND PIPELINES

Topic Algorithm Chains and Pipelines


Readings MLP Ch. 6 (311-328)
Activities to be done before next class: Lab5, readings for next lecture

 Lab4
 2 NLP: Some Background
o The Revolution in NLP
o Language is Hard
 Some NLP Basics
o Bag of Words
o Additional Topics
o Movie Reviews and Sentiment Analysis
 Naive Bayes and Sentiment Classification
 Logistic Regression
 Lab 5

NLP: BACKGROUND

THE REVOLUTION IN NLP

The current revolution in NLP


Human language ability has often been seen as the defining feature of human beings, compared to e.g. animals.
 ChatGPT

ChatGPT: What it can do


 A model by OpenAI which interacts in a conversational way
 Answers follow-up questions
 Challenges incorrect premises and rejects inappropriate requests

Based on GPT-3
 a large language model
 trained on missing-word prediction with a transformer model – a certain type of neural network that is good at learning in this way
 Further training through Reinforcement Learning from Human Feedback

Growth of ChatGPT
Took 5 days to get 1 mil users

LANGUAGE IS HARD
Descartes and AI
 Could a machine imitate a human?
o No – you would always be able to tell the difference
 Rene Descartes: Discourse on the Method, Part V (1637)

Descartes: Machines can’t Imitate Humans


 “. . . they could never use speech or other signs as we do when placing our thoughts on record for the benefit of
others.”
 “. . . we can easily understand a machine’s being constituted so that it can utter words, and even emit some
responses to action on it . . . But it never happens that it arranges its speech in various ways, in order to reply
appropriately to everything that may be said in its presence, as even the lowest type of man can do.
o Like Siri: there can be fixed responses planned out, but only a limited repertoire, not the diversity humans have
 Descartes saw this right back then.

The Turing Test


 (1950) Computing Machinery and Intelligence
 Test of a machine’s ability to exhibit intelligent behaviour
 Human judge engages in a natural language conversation with a human and a machine
 If the judge cannot reliably tell the machine from the human, the machine passes the test
 The original question, "Can machines think?" I believe to be too meaningless to deserve discussion.
Nevertheless, I believe that at the end of the century the use of words and general educated opinion will have
altered so much that one will be able to speak of machines thinking without expecting to be contradicted.

A way to decide whether a machine can think like a human.


Show diversity in behavior like humans; same argument as Descartes.

Why is Language Hard?


 2 challenges: infinite and ambiguous
 Language is infinite – an infinite set of sentences
 Most sentences you hear, you have never heard before and will never hear again

Language is ambiguous
 Many words have multiple meanings
o Lexical Ambiguity
 Phrases and sentence can have multiple meanings
o Structural Ambiguity
 A single sentence can have many different meanings
 Need Context to resolve ambiguities
NLP BASICS

BAG OF WORDS

Types of Data
 Numerical Data
 Categorical Data
 Text data is different
o Content of an email
o A Headline
o Text of political speeches
Can text be treated as structured data?

Making Language into Structured Data


 The solution: Dummy values for words
o One feature for each word
o Value is 1 if word occurs in text, 0 otherwise
o Alternative values: number of word occurrences, or TFIDF score
o For any text, value of most features will be 0
 Gives a lot of information about what the text is about.

Bag of Words Processing


 Treat text as a Bag of Words
Basic steps

Start with a string.
Divide it into words (features) – the tokenizer splits on white space and similar boundaries.
Order the words into a vocabulary.

Unigrams
Bigrams
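A small sketch (own example) of these steps with scikit-learn's CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was great", "the movie was boring"]
vect = CountVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
X = vect.fit_transform(docs)                 # sparse matrix: one column per vocabulary entry
print(vect.get_feature_names_out())          # the ordered vocabulary
print(X.toarray())                           # counts per document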

ADDITIONAL TOPICS
Stopwords
 Words that are “too frequent to be informative”
 Built-in Stopwords List: above, elsewhere, into, well, fifteen, . . .
 Could also discard words that appear too frequently
 Common stopwords and less common

TF-IDF
 Words that are frequent in a document tell a lot about that document
 Words that appear in lots of documents are less interesting
 TF-IDF
o increases as term frequency increases, and decreases as document frequency increases
o term frequency and inverse document frequency – the latter based on how frequently the word occurs across all documents
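One common smoothed form, roughly what scikit-learn's TFIDF transformers use (my notation; N is the number of documents, N_w the number of documents containing word w):

tfidf(w, d) = tf(w, d) · ( log( (N + 1) / (N_w + 1) ) + 1 )

so a word that is frequent in a document but rare across documents gets a high score; the resulting vectors are then typically length-normalized.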

MOVIE REVIEWS AND SENTIMENT ANALYSIS


Rise of user generated data in web 2.0
Sentiment analysis became popular due to the large amount of review data online

Exploit Big Data for Text


 Use Supervised ML for text processing
 Can we get labeled text data?
 Build Classifiers
o Spam Detection
o Sentiment Analysis
o Topic Detection
o ...

Sentiment Analysis
 Is a text Positive or Negative?
 Used for Social Media Analysis
 Marketing
 Impact of new product

Movie Reviews as Data


 Online movie reviews
 Texts paired with ratings
 IMDB reviews
o Positive: Rating of 7-10
o Negative: Rating of 1-4

Bag of Words with More Than One Word


 nGrams
Look at the coefficients learned on top of the tfidf-scored features.
They are domain specific: 'boring' is bad for movies; 'predictable' might be bad in the movie domain but
good in others, maybe weather.

Language Modeling and Ngrams


 Assign probability to a sequence of words
 What is p(He went to the store)?

Language Modeling and Ngrams


He went to the store
 1-grams (unigrams): He, went, to, the, store (5)
 2-grams (bigrams): He went, went to, to the, the store (4)
 3-grams (trigrams): He went to, went to the, to the store (3)
 4-grams: He went to the, went to the store (2)
 5-grams: He went to the store (1)

Language Modeling and Ngrams


 Bigram approximation:
o p(He went to the store) =
o p(went|He) * p(to|went) *p(the|to) * p(store|the)
 Trigram approximation:
o p(He went to the store) =
o p(to|He went) *p(the|went to) * p(store|to the)

NAÏVE BAYES AND SENTIMENT CLASSIFICATION

TEXT CLASSIFICATION
 Sentiment analysis Spam detection
 Language identification

Explain the difference between generative and discriminative classifiers:

GENERATIVE VS. DISCRIMINATIVE CLASSIFIERS


 Naive Bayes: Generative
o Models how a class could generate data
o A certain class; what would be the features
 Logistic Regression: Discriminative
o Models which features are useful to discriminate
o What features do we need to discriminate?

BAG OF WORDS
Document Classification and Bayes’ Rule

 Pick most likely class c, given document d

 Bayes’ Rule:

Classification Using Bayes’ Rule


Features are Independent
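Spelled out (the standard naive Bayes formulation; notation mine):

ĉ = argmax_c P(c | d) = argmax_c P(d | c) · P(c)

and with the independence ("naive") assumption over the words w_1 … w_n of the document:

P(d | c) ≈ P(w_1 | c) · P(w_2 | c) · … · P(w_n | c)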

Prepend prefix NOT to every word following negation until the next punctuation mark

LOGISTIC REGRESSION

DISCIMINATIVE VS. GENERATIVE CLASSIFIER


Classifier: Cat or Dog?
 Generative Classifier (like NB) looks at all features to understand what dogs look like and what cats look like
 Discriminative Classifier (like logreg) would be satisfied with a single feature: “dogs have collars”

NAIVE BAYES VS. LOGISTIC REGRESSION


 Naive Bayes computes a likelihood and a prior:

 Discriminative model like Logistic Regression computes P(c|d) directly

Weighted sum of features:

Sigmoid function creates probabilities:

Decision threshold:
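Spelled out (standard logistic regression; notation mine):

z = w · x + b                               (weighted sum of features)
P(y = 1 | x) = σ(z) = 1 / (1 + e^(−z))      (the sigmoid squashes z into a probability)
decision: predict class 1 if P(y = 1 | x) > 0.5, else class 0 (and the 0.5 threshold can be moved)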

Designing Features for Sentiment Analysis

Designing Features for Period Disambiguation

TAKEAWAYS
 ChatGPT reflects a revolution in NLP and AI
 Language is Hard: infinite and ambiguous
 Basic NLP Techniques:
 Bag of Words, nGrams
 Naive Bayes and Logistic Regression: examples of generative and discriminative classifier models

LAB5
6. TECHNIQUES IN PRACTICAL ML

Topic Techniques in Practical ML


Readings
Activities to be done before next class: Lab6, readings for next lecture
Agenda: Market adoption, Data science profiles, Data science projects, Tools

QA

What are your reflections on AI Act?


In the dataset of face images we practiced with, all of the images are the same size. Let's say in our dataset we have
images of various sizes; how would we need to preprocess them and/or adjust the model?
Do you agree that if we as mankind were to sacrifice copyright law and/or GDPR and use protected data for training
ChatGPT-like conversational (or other) AI, we would be able to create even more powerful AI?
How can AI be used to better understand customers and their needs, as well as personalizing their experiences?
How can AI be used to optimize business processes and increase efficiency?
In which types of projects have you applied AI? For what reasons did you use AI?

INTRODUCTION

WHAT’S THE POINT?


An introduction to the aspects of data science that you don’t pick up in Jupyter notebooks (or on Kaggle)

Motivation
● Data science is mostly taught with a focus on the hard technical skills: Statistics, algorithms and coding
● There’s also a heavy focus of modelling and things like hyperparameter tuning.
● While these things are important, they are only part of the equation.
● This often leads to a culture shock when aspiring data scientists move to industry.
● Case: Myself, studied the data science programme at CBS
○ This is the lecture I would have liked to have had.
● So, today, I hope to give you guys an introduction to the aspects of data science that you don’t pick up in a
Jupyter notebook.
● I also hope to show you why a business background can be super valuable in this space.
● What if you don’t aspire to become a data scientist?
Well, if you are interested in how value is actually created from AI, I hope that you will find this useful too.

Practical definitions
So, there’s a lot of different definitions of data science. This field is constantly evolving, and so is the perception of it.
Notice in Arthur Samuel’s definition, there’s no mention of data.

 Artificial intelligence
o Artificial narrow intelligence (ANI)
 e.g., smart speaker, self-driving car, web search, AI in farming and factories

o Artificial general intelligence (AGI)


 Do anything a human can do

 Machine learning (In practice: ANI)


o “Field of study that gives computers the ability to learn without being explicitly programmed.” -
Arthur Samuel, 1959
 Data science
o “The science of extracting knowledge
and insights from data.” - Andrew Ng, DeepLearningAI, 2020
o Notice that Samuel doesn’t talk about how computers should learn
The modern answer: Data - and lots of it

It’s all analytics


Descriptive analytics: what happened (e.g. business intelligence, dashboards)
Diagnostic analytics: why did it happen?
Predictive analytics: prediction, classification
Prescriptive analytics: recommendations, pushing something in a certain direction

MLAI is a tool, not a solution.

 It might be important to think of how AI and ML fits into the ecosystem of analytics.
 Where do you think ML and AI fits in?
 I hope that you’ll agree that we can’t really start thinking about what will happen before we know what
happened and why it happened
 Really, the important takeaway here is that ML and AI is based on descriptive and diagnostic analytics.
o -> We can’t hope to model anything if we don’t understand it
 This is an important perspective.
 When speaking with a client, it can be very helpful to understand which kind of insights they are after.
 If diagnostic analytics is the solution, there is no reason to start thinking about predictive analytics.
MARKET ADOPTION
In terms of some research

ML-AI ADOPTION
AI adoption globally is 2.5x higher today than in 2017, but may have reached a plateau.

 A very common trend for new technology - Gartner’s Hype Cycle


 Every technology goes through a hype phase; then everyone gets disappointed and the hype goes down. At
that point the focus can turn to actually building the technology.
 Reality is sinking in: Organizations are starting to recognize the level of organizational change it takes to
successfully embed this technology.
 Some companies that get discouraged because they went into AI thinking it would be a quick exercise
 Those taking a longer view have made steady progress by transforming themselves into learning organizations
that build their AI muscles over time.
 MLAI maturity

Although global AI adoption is higher, it has reached a plateau.

What do AI achievers do differently?


Strategy and sponsorship as an example
● According to McKinsey research:
○ More indications that AI leaders are expanding their competitive advantage than evidence that others
are catching up

The achievers differ most in strategy and sponsorship – about 12% are doing very well.

They map as very good at business: they think about the business first, or at least about why they are doing it.

AI MATURITY
● AI maturity measures the degree to which organizations have mastered AI-related capabilities in the right
combination to achieve high performance for customers, shareholders and employees.
● What does all this mean for you?
It means that the critical component is not being or having the cleverest data scientists (as everyone thought a
few years ago)
● Instead, we need ambidexterity: Multiple people, with different skills.
 People that can understand the business and the technology.
Broad categories

● MLAI maturity is multivariate and varies significantly between organisations


● MLAI is not just technical know-how.
● Major barriers to success:
○ Data and data mgmt.
○ Strategy and vision
○ Sponsorship / buy-in
○ Governance
○ Talent
○ (Concise value propositions)

About AI maturity:
AI Innovators: have mature AI strategies, but struggle to operationalize.
AI Experimenters: lack mature AI strategies and the capabilities to operationalize – they don’t know what they want to do and lack technical know-how; most companies are here.
AI Builders: have mature foundational capabilities that exceed their AI strategies.
AI Achievers: have differentiated AI strategies and the ability to operationalize.

DATA SCIENCE PROFILES

SKILL GAPS IN DATA SCIENCE/MLAI


 Most important skills/areas of expertise missing

Anaconda made a huge survey on skill gaps in AI.

Engineering; strong coders

It seems like the field is looking for people that


can take this into a business context; from
notebooks into a business strategy

 The field needs data science professionals with different skill sets.
 Think about what you bring to the table.
 Too often, companies expect the superset of all elements, not the overlap in the middle.

The overlap in the middle: data science.

DATA SCIENCE PROJECTS


Many experimenters: making PoCs and MVPs -> establishing that they can get value from AI short term.
AI happens in PowerPoint – ML happens in Python.

How MLAI projects are different


● MLAI projects are IT projects - with an additional risk dimension: Uncertainty of signals in data.
● “The combination of some data and an aching desire for an answer does not ensure that a reasonable
answer can be extracted from a given body of data” - John Tukey, 1986
Traditional IT projects vs. MLAI projects:

Characteristics
- Traditional IT: rigid functional requirements; a defined product or service.
- MLAI: inherently experimental – all data science projects are experiments, therefore "science". Default hypothesis: "We can predict y from features X". Inductive reasoning: what can the data tell us? You may start out with one hypothesis and find that another also applies.

Timeline
- Traditional IT: (often) a clear timeline.
- MLAI: (often) an unclear timeline.

Iteration
- Traditional IT: (often) non-iterative*.
- MLAI: highly iterative – tasks depend on experiments.

Challenge
- Traditional IT: efficiency and implementation.
- MLAI: evaluating the hypothesis – then implementation.

SCOPING
Because of high uncertainty, careful scoping is critical - Don’t jump to modelling.
● Specifications
○ The WHY: How will this help?
○ What are the user stories?
○ What is the timeframe?
○ Evaluation criteria
■ Qualitative (Business) – end user’s idea,
■ Quantitative (Business, technical)
● Considerations
○ Is this a duct tape solve?
○ What level of analytics is required?
○ Are we reinventing the wheel?
○ The cold start problem (Logging the right data?)
○ Build vs. buy

● A small change to current ways-of-working may be better than a duct tape AI solution
● A user story:
○ “As a …. (persona)
○ I want to … (function)
○ in order to … (reason / user wish)”
● Qualitative: We would like to be able to detect credit card fraud cases more easily
● Quantitative: We would like to be able to detect at least 85% of credit card fraud cases

 Technical diligence - do not focus on this part


o Can AI system meet desired performance
o How much data is needed
o Engineering timeline
 Business diligence
o Lower costs [current business]
o Increase revenue [current business]
o Launch new product or business [new business]

Ask questions, quantify, approximate and discuss


 Low hanging fruits – email classification, sentiment analysis

DATA SCIENCE DEVELOPMENT

Three phases of data science projects


 MLOps:

MLAI lifecycle – MLOps


 Our clever models have zero value until we publish them.
 MLOps is engineering - Many data scientists struggle with these steps.

TOOLS
Let’s look into one of our projects

Which tools do we use?


https://github.dev/NTTDATAInnovation/documentai

Data handling
● Pandas
● NumPy
● DocArray

Deployment
● Docker
● DVC
● FastAPI
● Streamlit
● Gunicorn
● GitHub Actions
● Terraform

Testing
● PyTest

Environment
● PyEnv
● Poetry
● Pre-commit
● Jupyter notebooks
● Git + GitHub

Documentation
● MkDocs
● MkDocs Material

MLAI
● Scikit-learn
● PyTorch (or Keras)
● LightGBM
● SpaCy

WRAPPING UP

General
 MLAI is a tool, not a solution
Market adoption
 MLAI maturity is multivariate and varies significantly.
 The major barriers to success with MLAI are often not technical know-how.
Data science profiles
 Business knowledge is a sought-after skill in data science
 MLAI is not just technical know-how.
Data science projects
 MLAI projects are inherently experimental
 MLAI projects have an additional risk dimension: uncertainty of signals in data
 Because of high uncertainty, careful scoping is critical - Don’t jump to modelling
 When scoping, try to ask questions, quantify, approximate and discuss
 Our clever models have zero value until we publish them

7. IMAGE PROCESSING

Topic Image Processing: Pre-trained models


Readings Russakovsky et al. 2013
Activities to be Lab7, readings for next lecture
done before next
class

REVIEW OF BOW MODEL

A lawsuit against using ChatGPT in the US, because it can produce legal documents without paying high fees.
Who do you sue if it suggests something that turns out to be wrong?

MAKING LANGUAGE INTO STRUCTURED DATA

 The solution: Dummy values for words


o One feature for each word
o Value is 1 if word occurs in text, 0 otherwise
o Alternative values: number of word occurrences, or TFIDF score
o For any text, value of most features will be 0
o Treating each word as a feature, so a text can be input as a standard feature vector

BOW – COUNT VECTORIZER

Create bag of words representations
 Produces a sparse representation – a standard way of storing mostly-zero data in computer science.
 Default: it has a token pattern given as a regular expression (a pattern of characters). The pattern means a token has boundaries – something that limits a string, such as white space. Punctuation is ignored entirely.

Text Input
 Example with text – the last 3 texts in the collection, from movie reviews.
 Specify a max number of features and document frequencies – a way to restrict features. How does it choose the features? Probably the most frequent, according to Dan.
 max_df won’t allow features with a document frequency above a certain threshold – here 30%. It is a good idea to have a max_df; the idea is the same as with stopwords: if a word occurs in most documents it might not provide much information. Like in a newspaper, the byline “written by” is not informative, but might occur in all articles.

Restricting Features
 Texts represented with only 7 features.
 Shows the occurrence of words across documents.
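A minimal sketch of the CountVectorizer setup described above, using scikit-learn (the toy texts and exact parameter values are illustrative, not the lecture’s):

from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "A surprisingly touching movie, written with real warmth.",
    "The plot is written poorly and the acting is even worse.",
    "Touching performances in a warm and funny movie.",
    "A dull, poorly paced film.",
]

# max_features keeps only the most frequent terms;
# max_df drops terms that appear in more than the given share of documents
# (the lecture example used 0.3 on a much larger collection of movie reviews)
vect = CountVectorizer(max_features=7, max_df=0.7)
X = vect.fit_transform(texts)           # sparse matrix: documents x features
print(vect.get_feature_names_out())
print(X.toarray())                      # occurrence counts per document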

Word Ngrams vs. Char Ngrams
 The analyzer parameter default is ‘word’.
 There are interesting reasons why we might look at something different, like character ngrams instead of word ngrams:
o They ignore the boundaries of words.
o Subparts of words can be meaningful in their own right.
o Compound words – e.g. when words are put together.
o You need to use a longer range, since it is a sequence of characters.
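A small illustration of character n-grams with scikit-learn (the parameter values are just one reasonable choice, not the lecture’s exact settings):

from sklearn.feature_extraction.text import CountVectorizer

# 'char_wb' builds character n-grams but respects word boundaries;
# plain 'char' ignores word boundaries entirely.
# Character n-grams need a longer range than word n-grams.
char_vect = CountVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X = char_vect.fit_transform(["unhappy", "unhelpful", "happily"])
print(char_vect.get_feature_names_out()[:10])   # e.g. ' un', ' unh', 'hap', ...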

BOW – TFIDF VECTORIZER

A variant of the count vectorizer – it gives the tf-idf score instead of the raw count.
Good to be familiar with: the ability to use different options in a pipeline and use grid search to try out different combinations.

TF-IDF
 Words that are frequent in a document tell a lot about that document
 Words that appear in lots of documents are less interesting
 TF-IDF therefore:
o increases as term frequency increases
o decreases as document frequency increases
o numerator – term frequency, denominator – document frequency
o “the” has a high document frequency, so we should not focus on it; tf-idf is good at encapsulating that

Choose between tf-idf and CountVectorizer? Yes, pick one – it is an either/or for the bag-of-words representation.
Tf-idf is conceptually the more interesting of the two; put it into the grid search and explore it.
He has experienced worse results with tf-idf than with a simpler representation. You can’t explore all options, but do vary the
exploration of different options, consider what could be interesting to explore, and explain why you do as you do.

Do the same thing as before.
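A minimal sketch of the TfidfVectorizer, used exactly like the CountVectorizer above (the toy texts are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "the movie was wonderful",
    "the movie was terrible",
    "a wonderful, warm story",
]

# Same interface as CountVectorizer, but the cell values are tf-idf weights:
# high when a word is frequent in a document, lower when it appears in many documents.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))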

Tokenizer
 Standard token pattern. A simple approach: convert a string to a list of tokens.
 We don’t want words with symbols attached. Separate out punctuation – then add rules to make it more sensible.
 Done with standardly used tokenizers.

Subword Tokenizer
 A newer approach is subword tokenization. There are many reasons to look beyond the level of the word.
 An illustration of the BertTokenizer used to tokenize a string: most tokens are ordinary words and punctuation.
 What is cool about it: breaking words into interesting subparts – a complex linguistic process.
 Word morphology: how words are put together. A word has its own internal grammar, like happy and unhappy – “un” is a negation.

Building the subword vocabulary
 First look at counts for each token (word).
 Then split each word into separate characters, but remember the count of the word they occurred in.
 That is not the most interesting way to represent text – we need something in between the whole word and separate characters.
 Instead: take the most frequent pair and merge it, and repeat.
 The concept of white space.
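A rough sketch of what subword tokenization looks like with the Hugging Face BertTokenizer (the example sentence is made up, and the exact splits depend on the model’s WordPiece vocabulary):

from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("Tokenization handles unhappiness gracefully."))
# Common words stay whole; rarer words are split into subword pieces,
# e.g. something like ['token', '##ization', 'handles', 'un', '##hap', '##pi', '##ness', ...]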

BOW AND PIPELINE/GRIDSEARCH

 Large space of possibilities for text feature representations
 Design a GridSearch with a Pipeline to search the possibilities – these choices can only be explored in combination
with a pipeline and grid search (see the sketch at the end of this section). Which model to use is itself one of the choices:
o Word vs Char analyzer
o Binary, Counts, TfIDF
o Number of features
o Min/Max Doc Frequency
o Model: Logistic Regression, Naive Bayes, MLP, . . .
o BERT features
 This idea goes away with the current deep learning models
o Current models don’t look at different ways of defining features. We let the model itself choose
what to look at. It is given the sequence of tokens; we don’t define features.
o A lot of classical NLP is feature engineering
 This is taken over by the deep learning model, which engineers its own features – that is why
it has different layers. Each layer has different features at its own level of abstraction.
 It is becoming obsolete to do feature selection this way.
o Looking at what the layers learn is a way of getting a look into the black box of deep learning models
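A hedged sketch of how such a search could be set up with scikit-learn’s Pipeline and GridSearchCV. The parameter values and the training variables texts_train / y_train are illustrative assumptions, not the lab’s exact code:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    # binary presence, raw counts, or tf-idf weights
    "vect": [CountVectorizer(binary=True), CountVectorizer(), TfidfVectorizer()],
    "vect__analyzer": ["word", "char_wb"],
    "vect__max_features": [1000, 5000],
    "vect__max_df": [0.3, 0.7],
    # the classifier itself is also a hyperparameter
    "clf": [LogisticRegression(max_iter=1000), MultinomialNB()],
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
# grid.fit(texts_train, y_train)
# print(grid.best_params_, grid.best_score_)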

MORE ON LANGUAGE

SPACY AND LANGUAGE

A collection of convenient tools for all sorts of different topics: linguistic
annotations, finding structures in language – parts of speech, etc.

Linguistic Annotation
 Models can easily be downloaded and worked with in pipelines.
 Works with different languages.
 Convert the text with spaCy: it is turned into a list of annotated tokens.

Output of spaCy
 The basics of grammar: a recursive tree structure (the dependency parse).

Named Entities
 Doesn’t have perfect accuracy.

Text Similarity / Word Embeddings
 Take individual words or phrases and convert them into fixed-length vectors of real numbers. This is done in a training process similar to training a model.
 A way to get a sense of what a word is about.
 spaCy has a version of word embeddings – used to compute the similarity of texts.
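A small sketch of the spaCy features mentioned above. The model name "en_core_web_md" is an assumption (a medium English model that includes word vectors); it has to be downloaded first with: python -m spacy download en_core_web_md

import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for token in doc:
    print(token.text, token.pos_, token.dep_)   # part of speech and dependency label

for ent in doc.ents:
    print(ent.text, ent.label_)                 # named entities, e.g. Apple -> ORG

# Similarity between two texts, based on the word vectors
doc2 = nlp("Microsoft considers acquiring a British company.")
print(doc.similarity(doc2))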
LANGUAGE TASKS

SENTIMENT ANALYSIS

In the early days, the field worked with simple, well-defined tasks: take a text and
give it a 1 or 0 depending on whether it is positive or negative.

This is a simple classification task.

INFERENCE

A more interesting task.

In some ways this task is the whole thing in NLP: what you need to do is understand the meaning of sentences
and how they relate to each other.

This is also a classification task, divided into 3 categories:

1. Entailment: Sentence A entails sentence B; therefore if A is true, B has to be true.

2. Contradiction: The inverse of entailment; if sentence A is true, B has to be false.

3. Neutral

QUESTION ANSWERING
There are specific ways to set up a QA task.

One setup: there is some kind of text and some kind of questions; find the answers in the text.

A general type of task that seems to require deep understanding.

Think of it as finding a span – a limited view. It needs a text to provide a context in which it finds the answer.

No context: the system has to possess the answer from its training, like the current GPT systems – you don’t have to
provide texts.
Prompt engineering is a new area in the field: variations in how you pose a question to current models.

Different versions of the SQuAD dataset.

Leaderboards exist for the different tasks. Not as relevant anymore.

THE BERT MODEL

DEVLIN ET AL. 2019

You should read it, though you don’t necessarily need to understand all the details.

We will be using features that BERT can produce for a text, based on its pre-training.

MAIN POINTS

 Bidirectional Encoder Representations from Transformers.


 Pre-trained language model
o Trained on a large amount of text
o Not feasible for most students or organizations.
o Too resource intensive.
o Idea: the model has a lot of knowledge of language. Give it a text, and it makes a representation and
selects the features based on its pre-training. Then we can fine-tune the model.
 Can be fine-tuned for

o Single Sentence Classification: Sentiment Analysis, Emotion Classification
 There is unlimited data for it. The task encompasses all we care about in language.
o Pair Classification: Question-Answering, Inference
 Give it a pair of sentences. Sometimes they are random sentences, other times two sentences that
fit together in an actual text; BERT has to predict which is which – a supervised task, but we can
easily produce as much data as we want.
 Better for question answering: predicting related sentences.
 BERT is different from GPT models in the way it divides the above.
o BERT pre-training:

INPUT-OUTPUT REPRESENTATION
 Can represent a single sentence or pair of sentences
 “Sentence” can be any span of text
 Add [CLS] symbol to beginning of input sequence
 [SEP] symbol at end of first sentence
 If input is a pair of sentences, they are separated by [SEP]
 Use WordPiece embeddings – a variant of subword tokenizing. Then we have embeddings – representations with
numerical values that say something about the meaning of the words.
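A small sketch of what this input representation looks like in practice with the Hugging Face BertTokenizer (the example sentences are made up):

from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tok("The cat sat on the mat.", "It fell asleep there.")
print(tok.convert_ids_to_tokens(enc["input_ids"]))
# roughly: ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]',
#           'it', 'fell', 'asleep', 'there', '.', '[SEP]']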

PRE-TRAINING
 Input is pair of sentences
 Two Tasks:
o Masked LM
 Predict masked tokens, based on surrounding tokens
o Next Sentence Prediction
 Predict whether second sentence naturally follows first one
Missing words: predicting which word occurs in a specific context.
Supervised? You do have the actual answer.
Unsupervised in the sense that you have an unlimited amount of data – any text you can find.

FINE-TUNING
 Much less resource intensive
 Since the pretrained models know so much about language in advance we can use it
 Different for different tasks
 Sentence-pair tasks (Question Answering, Inference, . . . )
o Input is sentence pairs, with same representation as training
 Single-sentence tasks: (Sentiment analysis, emotion classification, . . . )
o Input is single sentence, ending with [SEP] symbol

Results from the article

LAB7

Based on an exercise that he linked to in Canvas.
The idea: take the BERT model  produce features for the texts  use those features to train a logistic
regression (no fine-tuning)  sentiment analysis.

Two Feature Sets
Compare it to a BoW model.
 DistilBERT
o Take a text and make a representation of features, based on the pre-training – a rich representation of
the text. Give the representation to the LogReg model for sentiment analysis; it tries to learn how to do
sentiment. It should be good due to the BERT representation.
 Bag of Words
o Instead of BERT features: give the LogReg model a bag-of-words representation as input.

DistilBERT
 Smaller version of BERT
 Almost matches BERT performance
 Produces sentence embedding – vector of size 768

Logistic Regression
 Classifies each sentence as Positive or Negative. Features are 768 real numbers – the sentence embedding
Dataset
 SST2
 Movie Review texts, classified as Positive or Negative

MODEL 1: DATA PREPARATION


 Tokenization
 Padding
 Masking
 Some pre-preparation is needed.

The BERT tokenizer breaks words into tokens. Not everything gets broken up, due to frequency.

For convenience, turn each sentence into a fixed length – make every sentence the same length.

That is done by padding the sentences with extra zeros: find the max length and add 0s to the rest.

Then tell BERT which parts of the sentences it can ignore – that is what masking does. It is handled automatically.
MODEL 2: SENTENCE EMBEDDINGS
 All sentences are input to BERT
 The output of interest corresponds to the first token ([CLS])
 Classifying a text: use the output at that position

Each sentence is labelled as either 1 or 0.

For each sentence we get a sentence embedding: an array of 768 numbers.

We use BERT as a black box to make an embedding for each sentence.

Model Produces Sentence Embeddings
 Last hidden state: an embedding for each position in the input.

Assign Features and Labels
 We are throwing away a lot of information. A BoW model can be a vector of 30,000 features; it knows every
word in a text. BERT boils the features down. There are advantages to both.

Logistic Regression Model
 Train as usual.
 Score the model.
 Best scores.
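A hedged sketch of the whole Lab7 pipeline, assuming the Hugging Face transformers library and a pandas DataFrame df with a ‘text’ and a ‘label’ column (e.g. the SST2 sentences); names and parameters are illustrative:

import torch
import pandas as pd
from transformers import DistilBertTokenizer, DistilBertModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
bert = DistilBertModel.from_pretrained("distilbert-base-uncased")

texts, labels = list(df["text"]), df["label"].values

# Tokenize, pad every sentence to the same length, and build the attention mask
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = bert(**enc)                                  # last hidden state: (n_sentences, n_tokens, 768)

features = out.last_hidden_state[:, 0, :].numpy()      # the [CLS] embedding for each sentence

X_train, X_test, y_train, y_test = train_test_split(features, labels)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(clf.score(X_test, y_test))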
8. AI IN PRODUCTION – GUEST SPEAKER

Topic TBD
Readings SLP, Ch. 5, 6
Activities to be Lab8, readings for next lecture
done before next
class

GUEST SPEAKER: PRAYSON WILFRED DANIEL, NTT DATA

https://github.com/Proteusiq/hadithi/tree/main/talks

 Confirmation bias
 Ask questions before opening the dataset to avoid bias
 How was the data generated?
 How was it collected?
 What contains in the data
 EDA
 How to avoid domain experts’ bias?
 Highlight if there is differences
 Challenge their assumptions

 Ethics
o IN Europe moving to very controlled AI space
o Not in China or US
o Transparency – personal information?
 Some say ‘keep them’ – sometimes it does not make sense to remove
protected attributes
 E.g. predicting houses; The price is impacted by the number of non-danes in
the area. However, ‘race’ as an attribute can impose bias and should maybe
be removed.
 Solution: Highlight that you have kept the attribute to show that
you’re mindful about it
 Talk to your clients
o Iteratively improve and adjust the model making it more complex. But start very
simple, then talk to your clients and adjust it.
 Code to abstraction
o Reading any type of file
o Example from lab: Making classes with gridsearch that takes in an algorithm
 In setting where you need to be iterative and make experiments
 Have it running in functions or classes so we only have to change it one place
o Always abstract the things we do
 Do fit and predict proper
 Then check if
 Ideally build a system by having this abstraction

# Dependency injection: pass the changing parts (reader, estimator, parameters) in,
# and keep the stable code generic.
from pathlib import Path
from typing import Any, Callable

import pandas as pd
from sklearn.linear_model import LogisticRegression, SGDClassifier

def read(file: Path, reader: Callable[..., pd.DataFrame], **kwargs: Any) -> pd.DataFrame:
    # The reader (pd.read_csv, pd.read_json, ...) is injected, so this function never changes
    return reader(file, **kwargs)

def model(clf: Callable, **params: Any):
    # Doesn’t care which estimator or which parameters it gets
    return clf(**params)

clf = model(LogisticRegression, max_iter=1000)    # or model(SGDClassifier, loss="log_loss")
# clf.fit(X, y)

Dependency injection: e.g. a YAML config file is read in and decides which reader and model to run.

Separate what changes and what does not: e.g. a reader is not changing. Identify which elements of the code are
changing, and make them abstract.

Features typically change; the target changes.
9. RECENT DEVELOPMENTS: GPT-3 AND OTHER MODELS

Topic Recent Developments: GPT3 and other models


Readings Devlin et al. 2019
Activities to be Lab9, readings for next lecture
done before next
class

Where is the field heading?

The developments that are happening – we are in a position where we can engage with the state of the art of the field,
in research and in business applications.

Schedule an online meeting before the end of next week to get answers about your project.

Agenda:
1. The Revolution in AI/NLP: Large Language Models
a. GPT-3
2. BERT – Bidirectional Transformer Encoders
a. Training Bidirectional Encoders
b. Next Sentence Prediction
3. Transformers
4. GPT-3 and the OpenAI API

THE REVOLUTION IN AI/NLP: LARGE LANGUAGE MODELS

GPT-3

Using an LLM and optimizing it for interacting with people –
thus “chat”. GPT-3 is inside it.

The model is illustrated in a good way, displayed on their own website.

Illustrating the power of the model – correct grammar, formulation etc.

It is not connected to anything; it cannot go online, send mails, run programs etc. Why not? It would be
convenient – but this keeps responsibility with the user.

Why does OpenAI not think it is a good idea? There are add-ons that use the models.
Dan: “It wouldn’t be safe.”
ChatGPT: What it can do

 Model by OpenAI which interacts in a conversational way


 Answers follow-up questions
 Challenges incorrect premises and rejects inappropriate requests

ChatGPT: the Technology


 Based on GPT3
o transformer model
o trained on missing word prediction
 Further training through Reinforcement Learning from Human Feedback
o In short, ChatGPT is a version of GPT-3 which is optimized for dialogue. The GPT-3 model
itself is trained on information from the web with missing-word prediction, but ChatGPT is further trained
to optimize interaction with humans.

Growth of ChatGPT

It took only 5 days to reach 1 million users – a historic impact in terms of software. This reflects
how powerful this kind of AI is and is going to be.

Discussion: Have we reached genuine AI? – a deep question: is it actually intelligent?
To a large extent the software does represent a solution to the main AI problems.

Will it become inferior over time, within the next 10 years – as quantum computing develops?
- Dan: nobody knows. ChatGPT is already inferior compared with GPT-4. The software will continue to evolve.
He talks about how AI can match human intelligence.

The Turing test? It was used in the past to evaluate AI intelligence. We have blown past the Turing test.
The new goal is to reach AGI – Artificial General Intelligence: not a specific task,
but all the kinds of tasks humans solve with their intelligence.
There are still many things humans are better at than AI with our human intelligence; those are the new tasks.

GPT-4
More effective, and hopefully safer.

There is a report on how it is performing and how it has been tested – human tests, many different from ChatGPT/GPT-3.

Uniform Bar Exam: the exam taken after finishing law school to become a lawyer – the results indicate that GPT-4 would
do better on the exam than most human lawyers.

Better at translation, although it wasn’t trained for translation.

Good at coding, although nothing was done specifically for programming.

Khan Academy is integrating GPT-4 – very pedagogical: not giving answers directly but helping
students find the answer.

Current language models are not reliable at being truthful all the time. They are good at making things sound
plausible although they are not true; that is a concern in an educational setting. There is no solution for that yet.

GPT-3: BROWN ET AL. 2020 – LANGUAGE MODELS ARE FEW-SHOT LEARNERS


GPT-3 is old.

GPT-3 AND FEW SHOT LEARNING

 They observed: “scaling up language models greatly improves task-agnostic, few-shot performance,
sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches.”
o Task agnostic = the model itself is not built to do QA, sentiment analysis, summarization etc. – the specific
tasks that NLP systems are often trained to do, as humans can. Built in a task-agnostic way, the
models become competitive with state-of-the-art fine-tuning approaches.
o Pre-train, then fine-tune for a specific task, for example sentiment analysis (NOT task agnostic)
o As these models become really big, we may not have to make them task-specific at all.

GPT-3:
 transformer model – like other models like BERT
 175 billion parameters
 10x more than any previous large language model

A SHIFT IN NLP
Their point
 Shift from
o learning task-specific representations and designing task-specific architectures to
o using task-agnostic pre-training and task-agnostic architectures.
 To using task-agnostic, few-shot performance competitive with prior state-of-the-art fine-tuning approaches
o We can do as well for all tasks without tuning the model for a specific task

 Previous Approach:
o Pretrain model (GPT, BERT)
o fine-tune on a large dataset of examples to adapt a task agnostic model to perform a desired task
 Recent work (Radford et. al 2019) suggested this final step may not be necessary.
o Very exciting.

IN-CONTEXT LEARNING
In the paper they do not do any fine-tuning (which means lots of training data and updating the weights of the
network). Instead: “few-shot” learning.

 one- and few-shot performance is often much higher than zero-shot performance
o Zero-shot: e.g. sentiment analysis; classify sentences “zero-shot” – can the model do it out of the box?
o One/few-shot: also give it one or a few examples of a positive and a negative sentence first, then give it
the task of classifying, and test the model on that few-shot setting. The paper shows that this is competitive.
 This suggests that language models are “meta-learners” where slow outer-loop gradient descent based learning
is combined with fast in-context learning
o The standard way of training a neural network model (gradient descent) is a slow process: you gradually
fine-tune the weights. Very time consuming.
o Compared with fast in-context learning.

 . . . one notable pattern is that the gap between zero-, one-, and few-shot performance often grows with model
capacity, perhaps suggesting that larger models are more proficient meta-learners.
o As you give the model a few examples it improves, and the improvement is bigger for bigger
models. Larger models are more proficient meta-learners: better at general learning from few-shot,
in-context examples.
o A movement towards AGI? Being able to just say “do this thing for me”. You might think that larger models
would care less about in-context learning because they already know more, but instead they show a generalized
learning ability – if this continues to improve with larger models, it is an interesting movement.
o The suggestion is that in-context learning is what has made GPT-3 revolutionary (see the small prompt example below).
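A toy illustration of the difference between zero-shot and few-shot prompting (the prompt texts are invented for illustration; no gradient updates are made to the model in either case):

zero_shot = (
    "Classify the sentiment of this review as Positive or Negative.\n"
    "Review: The plot was dull and the acting was worse.\n"
    "Sentiment:"
)

few_shot = (
    "Classify the sentiment of each review as Positive or Negative.\n"
    "Review: A warm, funny and touching film. Sentiment: Positive\n"
    "Review: I walked out after twenty minutes. Sentiment: Negative\n"
    "Review: The plot was dull and the acting was worse. Sentiment:"
)
# In few-shot / in-context learning the examples are simply part of the prompt text.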

GPT-3 Training
 Common Crawl – common collection produced by crawling the internet.
 Known high-quality reference corpora: Combined with these
o WebText
o Books Corpora
o English Wikipedia

Tasks
Look at different tasks and compare with task-specific approaches.
 LAMBADA: predict last word of paragraph
 StoryCloze: select correct ending for five-sentence long stories
 HellaSwag: pick best ending to story or set of instructions

Results:
Compare with state-of-the-art results.
Sometimes GPT-3 performs better than the state of the art, even with the zero-shot approach.

Often there is a nice improvement from zero-shot to one- and few-shot.

Question Answering Tasks


 We evaluate in the “closed-book” setting (meaning no conditioning information/articles)
 TriviaQA
 Natural Questions (NQs): tests fine-grained Wikipedia knowledge
 ARC: a common sense reasoning dataset of multiple-choice questions collected from 3rd to 9th grade science
exams.

Results: Open-Domain QA

Results: QA and RC Tasks


BERT – BIDIRECTIONAL TRANSFORMER ENCODERS
BERT - Predecessor to GPT

Bidirectional Transformer Encoders


First big transformer model from 2018
 Computes contextualized representations of tokens in an input sequence
o With BERT this is done for every token
 Useful in many downstream applications
 Uses self-attention to map sequences of input embeddings
 (x1, . . . , xn) to sequences of output embeddings (y1, . . . , yn)
 Output vectors are contextualized – using information from the entire input sequence

BERT:
 Subword vocabulary consisting of 30,000 tokens generated using the Word-Piece algorithm
o Don’t deal with whole words all the time. Break up words in useful ways, common simple words are
left whole (like and, or etc), while other words have meaningful sub-parts (un-happy) – important part
of these models.
 Hidden layers of size of 768
 model with over 100M parameters
 Fixed input size of 512 subword tokens

TRAINING BIDIRECTIONAL ENCODERS

CLOZE TASK
All these models are trained on some variant of this word prediction task
 Predict missing words in input
o Please turn _____ homework in.

MASKED LANGUAGE MODELING (MLM)
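The masked-LM idea in a few lines, using the Hugging Face fill-mask pipeline (the model choice is an assumption):

from transformers import pipeline

# Predict the masked token from its surrounding context
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("Please turn [MASK] homework in."):
    print(pred["token_str"], round(pred["score"], 3))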

NEXT SENTENCE PREDICTION

BERT - NEXT SENTENCE PREDICTION


 Two new tokens to the input representation
o CLS prepended to the input sentence pair
o SEP placed between the sentences and after the final token of the second sentence
 NSP relevant for tasks relating pairs of sentences
o paraphrase detection
o inference
o discourse coherence

 Model is presented with pairs of sentences


 Task is to predict whether each pair consists of an actual pair of adjacent sentences from
training corpus or a pair of unrelated sentences
 50% of training pairs are positive pairs, and 50% are where second sentence was randomly
selected from elsewhere in the corpus

CONTEXTUAL EMBEDDINGS

 Output of the model is contextual embeddings for each token in the input
 Can be used as a contextual representation of the meaning of the input token for any task
requiring the meaning of word.
o Big improvement
o Words change meaning in context.

 Static embeddings represent the meaning of word types: vocabulary entries


 Contextual embeddings represent the meaning of word tokens: instances of a particular
word type in a particular context.

 Contextual embeddings used for tasks like measuring semantic similarity of two words in
context
 Useful in linguistic tasks that require models of word meaning
 Most common use: as embeddings of word or even entire sentences that are the inputs to
classifiers in the fine-tuning process for downstream applications.

TRANSFORMERS

Transformer: the basic model behind all large language models.

Key characteristics

Words can attend to all other words in the input
 How does it do that?
 Transformer models create attention – dependencies or linkages between any word and any other word. When it is
deciding how to treat a word, it can compare that word to any other word at any point, and decide about the word
based on any other word. It can decide to ignore certain words and find others interesting.

The first transformer was an encoder-decoder model.
 First: encoding into embeddings – looking at each word individually, then comparing.
 Then decoding, producing output. Each output becomes input for the next step – autoregressive.
 Think of a chatbot trying to answer the user.

 Convert each word into embeddings.
 Combined with positional encodings: a code for where the word is in the input.
 Crucial: as humans we take words one at a time, processing sequentially. Transformers do not do this; they take
all the words at the same time and process each word in parallel. The model knows where the words are in relation
to each other because of the positional encoding.

 Each word has linkages to every other word: for any given word embedding, how relevant is it to any other word?
 Words are relevant to themselves; then there is a matrix showing how much relevance words have to each other.
 The system learns to produce these linkages in self-attention.
 This is not well understood, but it is the key to why large language models work so well.
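A toy numpy sketch of the self-attention computation described above (a single head, no batching; the weight matrices would normally be learned):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (n_tokens, d_model). Project each token embedding to query, key and value vectors.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # relevance of every token to every other token
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # each output mixes information from all tokens

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                  # 5 tokens, embedding size 16
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (5, 16)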

GPT-3 AND THE OPENAI API

 Introduction to the API
o https://platform.openai.com/docs/api-reference/introduction?lang=python
 Examples:
o Q&A
 https://platform.openai.com/examples/default-qa
 Suggestion: few-shot learning. Give it a prompt with a general description.
o Classification:
 https://platform.openai.com/examples/default-classification
 No training needed.
There is potential for building useful apps with the API. You need to pay, but it is inexpensive.

Initially free, but then you need to pay?
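A minimal sketch of calling the API from Python, assuming the openai package with its older pre-1.0 Completion interface (the model name and prompt are illustrative):

import openai

openai.api_key = "YOUR_API_KEY"   # use your own key

response = openai.Completion.create(
    model="text-davinci-003",     # a GPT-3 family model; the exact name is an assumption
    prompt="Classify the sentiment as Positive or Negative.\n"
           "Review: The plot was dull and the acting was worse.\nSentiment:",
    max_tokens=5,
    temperature=0,
)
print(response["choices"][0]["text"].strip())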

QA

Do we have to talk about our projects at the exam?

Show that you master approaches to machine learning.
Talk about what you have done in your project.
Consider alternative approaches – models, metrics etc.

They can ask about specific topics.

He will do a review of the key topics of each session.

10. CONCLUSIONS

SESSION 2: MLPS AND UNCERTAINTY ESTIMATES

 Perceptron is a linear model


o Linking together
 MLP generalizes this by linking perceptrons with nonlinear activation functions (Neural networks)
o That’s what makes them different from linear models
 Deep neural network has multiple layers
o Interesting learning devices
 Many parameters to tune
 Uncertainty: models can output one or both of decision_function and predict_proba
o Interesting to look at the certainty of a model. Models can give how strong they think they are
 Relates to classification predictions by reference to threshold
o Thresholds are interesting for the practical implications of a model. In practice we often measure the model by
accuracy, but there are often big differences between types of errors. Putting in thresholds related to the certainty
or uncertainty of the model relates to the business relevance of the model
 We can modify the threshold to alter classifications in interesting ways – see the sketch below
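A small, self-contained sketch of modifying the classification threshold via predict_proba (the dataset and the 0.3 threshold are illustrative):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

probs = clf.predict_proba(X_test)[:, 1]      # probability of the positive class
default_pred = (probs >= 0.5).astype(int)    # what .predict() does
cautious_pred = (probs >= 0.3).astype(int)   # lower threshold: catch more positives,
                                             # at the cost of more false positives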

SESSION 3: UNSUPERVISED MACHINE LEARNING/ REPRESENTING DATA/ MODEL


IMPROVEMENT

 Unsupervised ML often preparation for Supervised ML


o Scaling
o Dimensionality Reduction
 Clustering
 In practice – although they are two different categories – unsupervised ML is often used as preprocessing for
supervised machine learning

SESSION 4: MODEL EVALUATION AND METRICS

 There are lots of ways to evaluate models


o Scikit-learn gives many built-in metrics
o Try to find the best one
 We should select a metric that corresponds to the goals and business impact of building the model
 Can explore thresholds for classifier models
 We select models by tuning hyperparameters – can do this wrt. the metric best suited to your case
o What metric are we trying to optimize; not use default – can pick a different or create your own wrt
what is important

SESSION 5: NATURAL LANGUAGE PROCESSING


Important area in machine learning
 ChatGPT reflects a revolution in NLP and AI
 Language is Hard: infinite and ambiguous
o Gives an appreciation of what NLP is all about and of the challenge of NLP: it is a profound challenge to process
language. There is no way to hardcode a system that can process language perfectly
o Languages are also infinite sets of expressions
 Basic NLP Techniques:
o Bag of Words model, looking at nGrams – take language which seems unstructured and turn it into
some kind of structured representation that we can address with standard ML.
 Limited in many ways but can be surprisingly effective
 Many applications apply – sentiment analysis, QA
 Naive Bayes and Logistic Regression: examples of generative and discriminative classifier models
SESSION 6: PRACTICAL ML

Guest lecture: Nicolai Blegvad Thomsen - NTT DATA


 General: MLAI is a tool, not a solution
 Market adoption
o MLAI maturity is multivariate and varies significantly.
o The major barriers to success with MLAI are often not technical know-how.
 Data science profiles
o Business knowledge is a sought-after skill in data science
o MLAI is not just technical know-how.
 Data science projects
o MLAI projects are inherently experimental
o Uncertainty of signals in data
o Careful scoping is critical
o Ask questions, quantify, approximate and discuss

SESSION 7: NATURAL LANGUAGE PROCESSING: BERT AND LARGE LANGUAGE


MODELS

 More about NLP


 Review of BoW Model
o Interesting to keep in mind when using a BoW model: there are interesting variations of it; normally a count
vectorizer with word features.
o Explore differences like looking at individual words (the default feature) or ngrams – it can make a big
difference
o Flaw – it doesn’t take context into account. Language is not just a bag of words; small changes in the order can
change the meaning a lot, but BoW doesn’t take this into account.
 E.g. “Sam killed Joe” and “Joe killed Sam” would be seen as the same in BoW
o Can use characters instead of words as the basic feature
 Ngrams between 1 and 3 are reasonable

 More on Language: SpaCy, linguistic annotation
o Look at linguistic features – tools like spaCy give functions to do different linguistic annotations: each word
labelled as adjective, verb etc., and a dependency graph produced for you. Fundamental to an
intelligent processing of language.
o Interesting to do linguistic annotation and take this as additional features.
 Language Tasks: sentiment analysis, inference, QA
o Sentiment is just classification – a limited aspect of language. Interesting because it does get at aspects of
understanding, but it is quite limited; there is much more to text.
o Inference: figure out whether one sentence logically entails another sentence.
 Interesting: this is converted to a classification task as well
 Entailment: “She saw a dog” and “she saw an animal” would be entailment
 Contradiction
 Independent
 Seems to get at fundamental aspects of language understanding
 BoW would not get us far with that
o QA:
 All tasks can be framed as QA.
 Related to the idea of the Turing Test. We have blown past the Turing test now.
 BERT model
o 2018. A huge advance over other approaches; it can take context into account. Built to learn arbitrary
connections between different words in the input.
o BoW: each word is an independent feature.
o BERT: looks at every word and how it relates to every other word. A transformer.
o In principle, we have a chance to address the fundamental problem of ambiguity in language.
SESSION 8: GUEST LECTURE – “THE HITCHHIKER’S GUIDE TO MACHINE
LEARNING PROJECTS”
Prayson Wilfred Daniel: https://github.com/Proteusiq

SESSION 9: RECENT DEVELOPMENTS: GPT-3 AND OTHER MODELS

 The Revolution in AI/NLP: Large Language Models


o GPT-3
o We are already on GPT-4, which is an amazing improvement over GPT-3
 BERT – Bidirectional Transformer Encoders
o Training Bidirectional Encoders
o Next Sentence Prediction
 Transformers
 GPT-3 and the OpenAI API - a great thing to get familiar with and experiment with
 Programming is disappearing
o Model is built by a machine.
o We can train and tune a model; interact with natural language
Concern:
 Lose control of AI?
 Alignment problem of AI
o As AI begins to pursue goals – it might not do so yet, but you can give it a goal and it can try to pursue it, and it
can interact with users
o Make sure it pursues goals that align with human goals
Be able to talk intelligently at the exam about:
 Project
 Cover other topics besides specific problem which might be narrow

11. FINAL PROJECT

Topic Area
We are focusing on two main topic areas in this class

Text Analysis/Natural Language Processing


Image Classification
We expect most of you to do a project within one of these topic areas -- all of which involve advanced machine learning
topics. If you wish to choose a different topic area, you should check with us, to make sure it also involves advanced
machine learning topics.

Relevant Literature
Find at least one recent research paper that deals with issues similar to your project. Summarize the main points of the
paper and compare it with your project.

Data Description
How many instances? What are the features? Give a few lines of the data and/or mention the range of values for
important features. This could be a dataset you have found, or that you have constructed by combining different datasets.

ML task
What is the target value? Is this classification, regression or some other type of problem? It is important that you use
techniques and concepts covered in this course, such as Pipelines, Grid Search, and different Metrics and Thresholds.

Relevance
Why is it interesting?

The oral exam will include a discussion of the project, and can also cover other important topics from the course
readings and lectures.
Python resources
 GitHub page with notebooks and code for the book "Introduction to Machine Learning with Python"
 Pandas tutorials
 Tips and tricks for Jupyter Notebook
 Learn Python

Install Anaconda
If you don't have a Python installation, you should install Anaconda.
(You can use mamba, poetry, pyenv, virtualenv and a lot more. If in doubt, please use Anaconda.)
Make sure to select Python 3.6 or above.
When the download is complete, find the .exe or .dmg file, and install.
Find Jupyter Notebook. In Windows look for recently added software; on Mac look in the Applications folder.

Datasets
 Top 23 Best Public Datasets for Practicing Machine Learning
 GoEmotions
 Reddit News
 Here is a general introduction to Pushshift, which you can use to get Reddit data.
 Ekstra Bladet
 Danish Parliament (Folketinget)
 AirBnB data
 COVID-19 data from OurWorldInData

EXAM

FINAL PROJECT

 Use techniques and concepts covered in this course, such as Pipelines, Grid Search, and different Metrics and
Thresholds.
o Improving and experimenting with the setup of your model
o Ideally done in a way that reflects the relevance of your model: what would be most relevant?
 Pretrained models
o Include pretrained models like BERT
 Relevant research paper(s)
o Use as comparison or inspiration
 Show how your dataset is interesting!

ORAL EXAM

 Group exam: optional brief presentation


o Coordinate in the group
o Approx. 5 minutes – nice to keep it short
o Think about it as a sales pitch – why is what we have done nice and interesting?
o What he doesn’t find interesting: what could we have done differently, or what did we do wrong?
 He doesn’t care about this
 Discuss Final Project – have your paper available during exam!
 Questions about other topics from course
o Refer to the paper,
 Still graded individually
o It is common that the group gets the same grade.
 Make sure that everyone can talk about everything – including the parts done by other group members
 Do we have to provide our code?
o Probably in the appendix
o They are not obligated to read the appendix
o Or put it on GitHub – data and project
o They will NOT ask directly about the code
o Reproducibility and validating results: describe what you did in the paper to an extent that someone could
reproduce it
 Describe and focus on what/how and why we did what we have done
o Concise representation of the results
 Make a table with results
 Questions about the topics of the course
o Sometimes there won’t be any
o If the project is very specific and narrow towards certain areas of the course, they will probably ask about
other areas
o Broad projects: they will probably not ask about other topics
o They might ask basic things, to make sure you know the basics
o Often the project is the main focus of the exam
 If we have made a project on text, would they expect that we talk/know about image processing?
o He is a “text” person – doesn’t know much about image processing
o They will probably focus on what the project is on
o But of course its good if we know about image processing as well
o Only to the extent to what we have been talking about in class

The paper: what should we focus on?

Look at at least one research paper.
Ideally, write as if you would publish at a conference: what would people NEED to know? Don’t leave out key information.
What is interesting about this? (contributions to the field)
