CLASS PROJECT
Select/construct dataset
Pose interesting questions, with societal/business relevance
o Pick a dataset, question, etc. that we believe in, driven by our own curiosity
Produce a research-style paper.
o Detailed comparisons to relevant recent work
o Propose ways to build on / improve recent work
o Present new results and discuss in light of previous work
o Find relevant recent work; the paper should compare with it, or build on or improve it, and discuss our project
in the light of this work. We are not expected to produce new research! But a step in that direction: having a
complete picture of what is happening within the domain, and situating our work in this area – how
does it add on? – A business view.
Final Project
Interesting Datasets
Interesting Questions
Use ML techniques studied in class
Best practices for training, tuning and evaluating
Models
Take advantage of large pre-trained models
Systematically explore different model evaluation metrics to connect to business value
Write research-style paper, where you connect with current research
COLLECTING DATA
Interesting Datasets
Reddit
Twitter
Our World in Data
o Covid-19.
THE CLASS
Weekly sessions
Readings
Lecture
Activities
o Programming/building models
o Finding and constructing interesting datasets
o Developing project ideas
o Submit results regularly
Feedback
o Written feedback on submitted results
o Feedback on project ideas and workplan
Syllabus
Main Reading
Introduction to Machine Learning with Python
AI, MACHINE LEARNING AND BUSINESS
Research
Startup
Generalization: how well model performs on data other than the training data
o The goal. The training data set summarizes the training; we want to go beyond these specific observations.
Overfitting: Model is much better on training data than on test data
o Model is too complex, and too closely tied to specific details of training data
Underfitting: Model doesn’t perform well on training data or test data
o Model is too simple
LINEAR MODELS
LINEAR REGRESSION
LINEAR CLASSIFIER
Sum of weighted features. Same computation as regression, but the result answers a binary question. The model searches for the best
combination of weights – which features contribute to the best prediction.
LINEAR MODELS AND REGULARIZATION
DECISION TREE
Similar to a linear model, but takes each feature and makes a simple
assessment: how does it contribute to the target variable?
Linear models are limited in the decision boundaries they can draw – trees are more powerful than linear models.
Feature importance
RANDOM FOREST
NEURAL NETWORKS
The Perceptron
Linear models are a weighted sum of inputs.
Activation functions
Not linear functions
Chopping off parts of the input
ML Approach:
o Define Problem – ie input-output function
o Collect Data
o Automatically build model based on data, that solves problem
o Skip the programming part: the programming is done by the machine learning algorithm. The solutions are
produced by the algorithms, not by a human programmer.
o Cons:
Can be a black box
Bias: reproduces bias in the data.
Starts from zero with each new dataset. This is changing with large language models.
TAKEAWAYS
Q1
Dummy classifier
All 10 are equally frequent
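A minimal sketch of this baseline with scikit-learn (the digits dataset and split are illustrative):

from sklearn.datasets import load_digits
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# With 10 roughly equally frequent classes, this baseline scores about 0.10
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
print(dummy.score(X_test, y_test))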
2. NEURAL NETWORKS (MLPS) AND UNCERTAINTY ESTIMATES
1. Lab1
2. Neural Networks: Multilayer Perceptrons
o History
o Linear Model vs. Deep Learning Examples
3. Uncertainty Estimates
HISTORY
The perceptron – “the first machine which is capable of having an original idea”, according to Frank Rosenblatt
Feedback
LINEAR MODEL
A weighted sum of input features – learn the coefficients of the links
Still linear
When the input goes through a non-linear function, the model becomes more powerful
Activation Functions
Non-linear in some way: when the linear input passes through this function, the model becomes more powerful.
Input is passed through an activation function.
A way of ignoring parts of the input
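A minimal sketch of two common activation functions (pure NumPy, for illustration):

import numpy as np

def relu(x):
    # "Chops off" the negative part of the input: max(0, x)
    return np.maximum(0, x)

x = np.linspace(-3, 3, 7)
print(relu(x))     # negative values become 0
print(np.tanh(x))  # tanh squashes values into (-1, 1)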
10 hidden units
Much sharper boundary, not as many distinctions
Different Initializations
For smaller networks, random initialization can make a difference.
If we scale the data: compute the mean value and standard deviation for every feature, and center
all values around the mean.
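Putting the pieces together – a sketch of a small MLP on scaled data (the dataset and sizes are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale: subtract the per-feature mean, divide by the per-feature standard deviation
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# 10 hidden units; random_state fixes the random weight initialization
mlp = MLPClassifier(hidden_layer_sizes=(10,), random_state=0, max_iter=1000)
mlp.fit(X_train_s, y_train)
print(mlp.score(X_test_s, y_test))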
UNCERTAINTY ESTIMATES
Machine learning models generate a score, and a decision is made based on that output
Example: decision_function
decision_function returns a floating point number for each sample
Values encode how strongly the model believes a data point belongs to a class
o Strength of the classification, how strong the model believes in it
Example: predict_proba
predict_proba outputs a probability for each class
For binary classification, shape is (n_samples, 2)
The class with probability above .50 is the one predicted – if binary
o Multiclass: the class with the highest probability is predicted
A calibrated model is one where probabilities align with accuracy – predictions with probability .70 are correct
70% of the time.
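The two uncertainty interfaces side by side – a minimal sketch (the classifier and dataset are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)

print(clf.decision_function(X_test)[:3])  # signed score: sign gives the class, magnitude the confidence
print(clf.predict_proba(X_test)[:3])      # shape (n_samples, 2); each row sums to 1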
MULTICLASS AND UNCERTAINTY
decision_function and predict_proba have shape (n_samples, n_classes)
High score means class is more likely and low score means class is less likely
First entries
Largest values
Why use predict_proba rather than decision_function? Most models have only
one of them. A big problem with ML: no matter what, it will predict
something. The prediction probabilities give us a tool to decide that below a
certain threshold we might not want to predict anything at all.
Can recover predictions from uncertainty estimates using different thresholds
This can reflect the different costs and benefits of different errors
o In some domains, some errors are a lot worse than others, e.g., churn prediction.
Lab2: cancer classification of benign vs. malignant
o Maximize recall or precision:
Recall: Minimize FN
TP / (TP + FN)
Precision: Minimize FP
TP / (TP + FP)
By modifying threshold, can selectively improve precision/recall of one or the other
Relevant for social applications
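A sketch of moving the threshold on predict_proba to trade precision against recall (the threshold values are illustrative):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
proba = LogisticRegression(max_iter=5000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

for threshold in [0.3, 0.5, 0.8]:
    y_pred = (proba >= threshold).astype(int)  # lower threshold: more positives, higher recall
    print(threshold,
          "precision:", round(precision_score(y_test, y_pred), 2),
          "recall:", round(recall_score(y_test, y_pred), 2))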
TAKEAWAYS
LAB2
SCALING
Scalers have the same syntax as models: instantiate one, then fit the scaler to the training data. Once a scaler is fit to the training data,
it has examined all of it, found the min and max value for each feature, and can then transform the data. The scaler knows all the relative values, and then
transforms them.
Scaling in Python
Scaled between 0 and 1
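A sketch of the scaler syntax described above (MinMaxScaler gives the 0-1 scaling; the dataset is illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = MinMaxScaler()
scaler.fit(X_train)                         # learns per-feature min and max from the training data only
X_train_scaled = scaler.transform(X_train)  # training values now lie between 0 and 1
X_test_scaled = scaler.transform(X_test)    # same transformation, still fit on train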
DIMENSIONALITY REDUCTION
PCA
Principal Component Analysis (PCA) – a popular form of dimensionality reduction. An
unsupervised process of trying to find the most interesting ways the data varies across its features.
o Some features might correlate with each other.
Rotates dataset, create features that are uncorrelated
First component contains the most information, ie accounts for most of the variance
Can select a subset of the most informative principal components
o New versions of the features, ordered by how much information they provide.
o Converts the data so the first feature is the most informative, with the rest ordered by informativeness
Visualizing high-dimensional datasets
o Useful for visualizing – pick a couple of the most informative components.
We can then transform the data and look only at the first and second
components
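A sketch of PCA for 2-D visualization (scaling first, as PCA is sensitive to feature scales; the dataset is illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)             # keep only the two most informative components
X_pca = pca.fit_transform(X_scaled)
print(X_pca.shape)                    # (569, 2) – ready to scatter-plot, colored by target
print(pca.explained_variance_ratio_)  # share of variance captured by each component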
Cancer Histogram
Overlay two histograms for benign and malignant
Texture error looks quite uninformative
Mean Concave points looks much more informative
o How they vary with respect to the target
How PCA is applied to the cancer dataset
PCA doesn’t know what the target value is – it doesn’t care – but we can see how
much the features overlap.
Classification Results
o Accuracy of only 0.23
Serves as a “random guessing” baseline.
For that many classes, this is relatively good compared to random guessing.
o But, it is a 62-way classification problem
It is learning something
Gives us a starting point.
The features say more about the faces even though
we use only 100 of them.
Classification Results
o By using 100 principal components instead of pixel features:
o Accuracy improves from 0.23 to 0.31
NMF
Other ways of reducing dimensionality
Non-negative Matrix Factorization
Like PCA, can extract useful features
Can be used for dimensionality reduction
Each feature must be non-negative
Gets really good at image recognition by extracting the right kinds of higher-level features from the picture.
CLUSTERING
K-MEANS CLUSTERING
Finds cluster centers that represent specific regions of the data
o For a given number of clusters, find the cluster centers that maximize similarity within each cluster.
Alternates between two steps:
o Assign each data point to closest cluster center
o Compute each cluster center as the mean of the points assigned to it
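A sketch of the two alternating steps in scikit-learn (the synthetic blobs are illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)                   # alternates point assignment and center re-computation
print(kmeans.cluster_centers_)  # one center per cluster
print(kmeans.labels_[:10])      # cluster assignment for each data point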
TAKEAWAYS: UNSUPERVISED ML
Income Dataset
Creating dummy values
Categorical Variables
Workclass Feature
Workclass is categorical feature
Has four possible values:
o Government Employee
o Private Employee
o Self-Employed
o Self-Employed Incorporated
Create four new features
Dummy Variables
Also called One-hot-encoding, or one-out-of-N encoding
If a feature F has three values a, b, and c:
o Create three new features: Fa, Fb, and Fc
o If F has value a for sample i, then Fa_i = 1, Fb_i = 0, Fc_i = 0
Check Values
The get_dummies method does it automatically: it ignores numerical features and
applies dummy encoding to the categorical ones.
GetDummies method
It is a bad idea to look at the data and get rid of certain features just because they might look unimportant. It is up to the
ML model to figure that out – use systematic ML methods. Don’t assume ahead of time!
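A sketch of get_dummies (the tiny DataFrame is made up for illustration):

import pandas as pd

df = pd.DataFrame({
    "age": [39, 50, 38],  # numeric column is left as-is
    "workclass": ["Government Employee", "Private Employee", "Self-Employed"],
})
print(pd.get_dummies(df))  # one new 0/1 column per workclass value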
Select Features
Does iterative selection – using RFE (Recursive Feature Elimination)
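A sketch of iterative selection with RFE (the model and number of features to keep are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
# Repeatedly fits the model and drops the weakest feature until 10 remain
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of the selected features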
MODEL IMPROVEMENT
CROSS VALIDATION
Cross Validation
Instead of one train-test split, multiple splits
For example with Five-fold CV, pick one fifth of data as test, and the other four fifths as training data
o Each fifth used sequentially as test data
Gives a better basis for assessing model – with one split, might be “lucky” or “unlucky” with test data
cross_val_score performs:
o the split into train and test data
o fitting the model to the training data for each of the splits
o scoring the model on the test data
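A sketch of cross_val_score (the model and dataset are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
print(scores)         # one test score per fold
print(scores.mean())  # more stable estimate than a single split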
GRID SEARCH
Alternatives:
1. Bayesian Search
2. Random Search
First split:
Train-val contains both training and validation set
Second split:
Splitting the train-val set into train and val.
Note on kwargs
best_parameters = {'C': C, 'gamma': gamma}
Define best_parameters as a dict
svm = SVC(**best_parameters)
Can pass a dict to a function expecting keyword arguments by unpacking it with **
The more folds, the smaller the validation set; but the more folds, the more valid a picture you get. 3-5 folds would usually be
sufficient.
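A sketch of the two splits with GridSearchCV (the grid values are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
# First split: hold out a final test set
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0)

param_grid = {"C": [0.01, 0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]}
grid = GridSearchCV(SVC(), param_grid, cv=5)  # second split happens inside, via CV
grid.fit(X_trainval, y_trainval)              # refits the best model on all of train+val
print(grid.best_params_)
print(grid.score(X_test, y_test))             # scored once, on the untouched test set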
Summary - tuning
Tuning on Validation Set
Important - method
- Tune using the training data, with cross-validation on part of it
- After finding our best classifier, we retrain on all the data – both val and train!
- Then create a classification report
Classification: Accuracy
Default metric
BUSINESS IMPACT
What is the goal? What is the business impact of using the model?
Confusion Matrix
Accuracy
Precision
Being precise about the POSITIVE value: true positives
divided by all the times the model says it has the disease
We can cheat here too, though not the same way as with recall. A high-precision strategy is to say less: only predict
positive in the cases with the highest confidence.
Precision
Recall:
True positives
divided by ALL the TIMES the sample
actually does have the disease
Easy way of cheating: say everyone has the disease > perfect recall, but precision might be bad.
Recall
F score
Because we can “cheat” with both recall and precision, we need the F score to balance things out.
It gives a value between 0 and 1. Evaluating a model with only recall or only precision is not good, as it might give a
skewed picture.
F = 2 · (precision · recall) / (precision + recall)
Digit Classification
Convert digits into unbalanced binary classification task
DummyClassifier
Confusion matrices
UNCERTAINTY IN PREDICTIONS
Classifiers compute a value that is compared against threshold – for linear classifiers, default is 0
ROC CURVE
o Plots the true positive rate against the false positive rate
ROC Curve
Takes different thresholds and plots the two rates against
each other.
Indicates where we might get the best results
AUC
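A sketch of computing the ROC curve and AUC from predicted probabilities (the setup is illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
proba = LogisticRegression(max_iter=5000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, proba)  # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_test, proba))      # area under the whole curve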
Confusion matrix
Recall: 37 / 37 = 1.00
- Perfect
Precision: 37 / 37 = 1.00
Precision: 43 / 46 = .93
Recall: 43 / 48 = .90
Classification report
For every class we get the metrics.
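A sketch of classification_report (the target names follow the breast cancer dataset's 0/1 coding):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Precision, recall, and F1 per class, plus averages
print(classification_report(y_test, clf.predict(X_test), target_names=["malignant", "benign"]))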
TAKEAWAYS
Metrics for Evaluation
There are lots of ways to evaluate models
We should select a metric that corresponds to the goals and business impact of building the model
Can explore thresholds for classifier models
We select models by tuning hyperparameters – can do this wrt. the metric best suited to your case
Project
Readings: find a relevant research paper
Workplan: submit description in week 10
Use techniques discussed in class:
o Diverse metrics
o Thresholds
o Pipelines
o Grid Search CV/Tuning
LAB4
Facebook posts labeled with certain emotions.
Value counts on the emotion column: diversity in how many instances per emotion – the classification task is quite unbalanced.
A preprocessing function converts the text into a bag of features (words) – one feature for each word.
Increasing ngrams increases the number of features; whether a high number of ngrams is good varies with the size of the data.
TFIDF
Use a different scoring: instead of how many times a word occurs in a text, use TF-IDF. Apply the tfidf transformation to the output of the vectorizer.
Create a pipeline, do a grid search, and explore arbitrary aspects.
Make a pipeline; preprocessing is important for language processing and bag-of-words models.
Can look at characters instead of words (individual letters); the default analyzer is ‘word’.
Use the pipeline with GridSearchCV: specify the model choices and different options for the hyperparameters, then do the grid search.
Two types were tried; see the sketch after these notes.
Classification report.
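A sketch of the Lab4 recipe (the tiny texts and labels are made up for illustration; Lab4 uses emotion classes):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = ["so happy today!", "this is terrible", "feeling great", "awful news"]
labels = ["pos", "neg", "pos", "neg"]  # hypothetical labels

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    "vect__analyzer": ["word", "char"],     # words vs. individual characters
    "clf__C": [0.1, 1, 10],
}
grid = GridSearchCV(pipe, param_grid, cv=2)
grid.fit(texts, labels)
print(grid.best_params_)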
5. ALGORITHM CHAINS AND PIPELINES
Lab4
2 NLP: Some Background
o The Revolution in NLP
o Language is Hard
Some NLP Basics
o Bag of Words
o Additional Topics
o Movie Reviews and Sentiment Analysis
Naive Bayes and Sentiment Classification
Logistic Regression
Lab 5
NLP: BACKGROUND
Based on GPT-3:
a large language model
trained on missing-word prediction with a transformer model – a type of neural network good at learning in this
way
Further training through Reinforcement Learning from Human Feedback
Growth of ChatGPT
Took 5 days to get 1 mil users
LANGUAGE IS HARD
Descartes and AI
Could a machine imitate a human?
o No – you would always be able to tell the difference
Rene Descartes: Discourse on the Method, Part V (1637)
Language is ambiguous
Many words have multiple meanings
o Lexical Ambiguity
Phrases and sentences can have multiple meanings
o Structural Ambiguity
A single sentence can have many different meanings
Need Context to resolve ambiguities
NLP BASICS
BAG OF WORDS
Types of Data
Numerical Data
Categorical Data
Text data is different
o Content of an email
o A Headline
o Text of political speeches
Can text be treated as structured data?
Unigrams
Bigrams
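A sketch of unigram + bigram features with CountVectorizer (the two documents are made up):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was great", "the movie was not great"]
vect = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams as features
X = vect.fit_transform(docs)
print(vect.get_feature_names_out())         # includes "not great" as a single feature
print(X.toarray())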
ADDITIONAL TOPICS
Stopwords
Words that are “too frequent to be informative”
Built-in Stopwords List: above, elsewhere, into, well, fifteen, . . .
Could also discard words that appear too frequently
Common stopwords and less common
TF-IDF
Words that are frequent in a document tell a lot about that document
Words that appear in lots of documents are less interesting
TF-IDF
o Increases as term frequency increases; decreases as document frequency increases
o Term frequency and inverse document frequency – how often a word occurs in the document versus across all documents
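A sketch of TF-IDF weighting (the three documents are made up):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran home"]
vect = TfidfVectorizer()
vect.fit(docs)
# "the" appears in every document, so its inverse document frequency is the lowest
print(dict(zip(vect.get_feature_names_out(), vect.idf_.round(2))))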
Sentiment Analysis
Is a text Positive or Negative?
Used for Social Media Analysis
Marketing
Impact of new product
TEXT CLASSIFICATION
Sentiment analysis
Spam detection
Language identification
BAG OF WORDS
Document Classification and Bayes’ Rule
Bayes’ Rule: P(class | document) = P(document | class) · P(class) / P(document)
Prepend prefix NOT to every word following negation until the next punctuation mark
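A sketch of that negation-marking rule (the negation word list and tokenization are simplified for illustration):

import re

def mark_negation(text):
    # Prepend NOT_ to every word after a negation word, up to the next punctuation mark
    result, negated = [], False
    for token in re.findall(r"\w+|[.,!?;]", text.lower()):
        if token in {".", ",", "!", "?", ";"}:
            negated = False
            result.append(token)
        elif token in {"not", "no", "never"}:
            negated = True
            result.append(token)
        else:
            result.append("NOT_" + token if negated else token)
    return " ".join(result)

print(mark_negation("I did not like this movie, but the acting was good."))
# -> i did not NOT_like NOT_this NOT_movie , but the acting was good .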
LOGISTIC REGRESSION
Decision Threshold:
TAKEAWAYS
ChatGPT reflects a revolution in NLP and AI
Language is Hard: infinite and ambiguous
Basic NLP Techniques:
Bag of Words, nGrams
Naive Bayes and Logistic Regression: examples of generative and discriminative classifier models
LAB5
6. TECHNIQUES IN PRACTICAL ML
QA
INTRODUCTION
Motivation
● Data science is mostly taught with a focus on the hard technical skills: Statistics, algorithms and coding
● There’s also a heavy focus on modelling and things like hyperparameter tuning.
● While these things are important, they are only part of the equation.
● This often leads to a culture shock when aspiring data scientists move to industry.
● Case: myself – I studied the data science programme at CBS
○ This is the lecture I would have liked to have had.
● So, today, I hope to give you guys an introduction to the aspects of data science that you don’t pick up in a
Jupyter notebook.
● I also hope to show you why a business background can be super valuable in this space.
● What if you don’t aspire to become a data scientist?
Well, if you are interested in how value is actually created from AI, I hope that you will find this useful too.
Practical definitions
So, there’s a lot of different definitions of data science. This field is constantly evolving, and so is the perception of it.
Notice in Arthur Samuel’s definition, there’s no mention of data.
Artificial intelligence
o Artificial narrow intelligence (ANI)
e.g., smart speaker, self-driving car, web search, AI in farming and factories
It might be important to think of how AI and ML fits into the ecosystem of analytics.
Where do you think ML and AI fits in?
I hope that you’ll agree that we can’t really start thinking about what will happen before we know what
happened and why it happened
Really, the important takeaway here is that ML and AI are based on descriptive and diagnostic analytics.
o -> We can’t hope to model anything if we don’t understand it
This is an important perspective.
When speaking with a client, it can be very helpful to understand which kind of insights they are after.
If diagnostic analytics is the solve, there is no reason to start thinking about predictive analytics.
MARKET ADOPTION
In terms of some research
ML-AI ADOPTION
AI adoption globally is 2.5x higher today than in 2017 but may have reached a plateau.
AI MATURITY
● AI maturity measures the degree to which organizations have mastered AI-related capabilities in the right
combination to achieve high performance for customers, shareholders and employees.
● What does all this mean for you?
It means that the critical component is not being or having the cleverest data scientists (as everyone thought a
few years ago)
● Instead, we need ambidexterity: Multiple people, with different skills.
People that can understand the business and the technology.
Broad categories
About ai maturity:
AI Innovators: Have mature AI strategies,
but struggle to operationalize
SCOPING
Because of high uncertainty, careful scoping is critical - Don’t jump to modelling.
● Specifications
○ The WHY: How will this help?
○ What are the user stories?
○ What is the timeframe?
○ Evaluation criteria
■ Qualitative (Business) – end user’s idea,
■ Quantitative (Business, technical)
● Considerations
○ Is this a duct tape solve?
○ What level of analytics is required?
○ Are we reinventing the wheel?
○ The cold start problem (Logging the right data?)
○ Build vs. buy
● A small change to current ways-of-working may be better than a duct tape AI solution
● A user story:
○ “As a …. (persona)
○ I want to … (function)
○ in order to … (reason / user wish)”
● Qualitative: We would like to be able to detect credit card fraud cases more easily
● Quantitative: We would like to be able to detect at least 85% of credit card fraud cases
TOOLS
Let’s look into one of our projects
Deployment
● Docker
● DVC
● FastAPI
● Streamlit
● Gunicorn
● GitHub Actions
● Terraform
Testing
● PyTest
Environment
● PyEnv
● Poetry
● Pre-commit
● Jupyter notebooks
● Git + GitHub
Documentation
● MkDocs
● MkDocs-materials
MLAI
● Scikit-learn
● PyTorch (or Keras)
● LightGBM
● SpaCy
WRAPPING UP
General
MLAI is a tool, not a solution
Market adoption
MLAI maturity is multivariate and varies significantly.
The major barriers to success with MLAI are often not technical know-how.
Data science profiles
Business knowledge is a sought-after skill in data science
MLAI is not just technical know-how.
Data science projects
MLAI projects are inherently experimental
MLAI projects have an additional risk dimension: uncertainty of signals in the data
Because of high uncertainty, careful scoping is critical - Don’t jump to modelling
When scoping, try to ask questions, quantify, approximate and discuss
Our clever models have zero value until we publish them
7. IMAGE PROCESSING
A lawsuit in the US against using ChatGPT, because it can produce legal documents without the high fees
Who do you sue if it suggests something but it is wrong?
Word
Ngrams
MORE ON LANGUAGE
Linguistic Annotation
Linguistic annotations can easily be downloaded and worked with in the pipelines.
Works with different languages.
Convert the text with spaCy into a list of annotated
tokens
Output of Spacy
Named Entities
named entity recognition isn’t perfect
Text Similarity
Word embeddings
SENTIMENT ANALYSIS
INFERENCE
QUESTION ANSWERING
Specific ways to set up a QA task.
MAIN POINTS
INPUT-OUTPUT REPRESENTATION
Can represent a single sentence or pair of sentences
“Sentence” can be any span of text
Add [CLS] symbol to beginning of input sequence
[SEP] symbol at end of first sentence
If input is a pair of sentences, they are separated by [SEP]
Use WordPiece embeddings – a variant of subword tokenization. Then we have embeddings – representations that
give numerical values that say something about the meaning of the words.
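A sketch of this input representation using the Hugging Face tokenizer (assumes the transformers package is installed; the sentence pair is made up):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("How old are you?", "I am six years old.")
# [CLS] at the start, [SEP] after each sentence; rare words would be split into WordPieces
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))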
PRE-TRAINING
Input is pair of sentences
Two Tasks:
o Masked LM
Predict masked tokens, based on surrounding tokens
o Next Sentence Prediction
Predict whether second sentence naturally follows first one
Missing words: predicting which word occurs in a specific context
Supervised? You have the actual answer
Unsupervised? You have an unlimited amount of data – any text you find.
FINE-TUNING
Much less resource intensive
Since the pretrained models know so much about language in advance we can use it
Different for different tasks
Sentence-pair tasks (Question Answering, Inference, . . . )
o Input is sentence pairs, with same representation as training
Single-sentence tasks: (Sentiment analysis, emotion classification, . . . )
o Input is single sentence, ending with [SEP] symbol
LAB7
DistilBERT
Smaller version of BERT
Almost matches BERT performance
Produces sentence embedding – vector of size 768
Logistic Regression
Classifies each sentence as Positive or Negative
Features are 768 real numbers – the sentence embedding
Dataset
SST2
Movie Review texts, classified as Positive or Negative
Train as usual
Scoring the model
Best Scores
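A sketch of the Lab7 recipe: DistilBERT sentence embeddings feeding a logistic regression (assumes the transformers and torch packages; the two sentences and labels are made up, SST2 provides the real ones):

import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

sentences = ["a touching and funny film", "a boring mess"]
labels = [1, 0]  # hypothetical positive/negative labels

inputs = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# The hidden state at the first token position serves as a 768-dim sentence embedding
embeddings = outputs.last_hidden_state[:, 0, :].numpy()
print(embeddings.shape)  # (2, 768)

clf = LogisticRegression().fit(embeddings, labels)  # then train as usual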
8. AI IN PRODUCTION – GUEST SPEAKER
Topic: TBD
Readings: SLP, Ch. 5, 6
Activities to be done before next class: Lab8, readings for next lecture
https://github.com/Proteusiq/hadithi/tree/main/talks
Confirmation bias
Ask questions before opening the dataset to avoid bias
How was the data generated?
How was it collected?
What does the data contain?
EDA
How to avoid domain experts’ bias?
Highlight if there are differences
Challenge their assumptions
Ethics
o In Europe, moving to a very controlled AI space
o Not in China or US
o Transparency – personal information?
Some say ‘keep them’ – sometimes it does not make sense to remove
protected attributes
E.g., predicting house prices: the price is impacted by the number of non-Danes in
the area. However, ‘race’ as an attribute can impose bias and should maybe
be removed.
Solution: Highlight that you have kept the attribute to show that
you are mindful about it
Talk to your clients
o Iteratively improve and adjust the model, making it more complex. But start very
simple, then talk to your clients and adjust.
Code to abstraction
o Reading any type of file
o Example from the lab: making classes with grid search that take in an algorithm
In settings where you need to be iterative and run experiments
Have it running in functions or classes so we only have to change it in one place
o Always abstract the things we do
Do fit and predict properly
Then check if
Ideally build a system by having this abstraction
Import reader
Import Path
Dependency injection
YAML > read in the file and run
Schedule an online meeting before the end of next week to get answers about our project
Agenda:
1. The Revolution in AI/NLP: Large Language Models
a. GPT-3
2. BERT – Bidirectional Transformer Encoders
a. Training Bidirectional Encoders
b. Next Sentence Prediction
3. Transformers
4. GPT-3 and the OpenAI API
GPT-3
Growth of ChatGPT
Took only 5 days to reach 1 million users – a historic impact in terms of software. Reflects
how powerful true AI is and is going to be.
Will it become inferior over time within the next 10 years, as
quantum computing develops?
- DAN: Nobody knows. ChatGPT is already inferior compared with GPT-4.
Software will continue to evolve. He talks about how AI can match
human intelligence.
Turing test? Used in the past to evaluate AI intelligence. We have blown past the
Turing test.
The new goal is to reach AGI – Artificial General Intelligence: not one specific task,
but all kinds of tasks humans solve with their intelligence.
There are still many things humans do better than AI with our human
intelligence; those are the new target tasks.
GPT-4
More effective, hopefully safer
They observed: “scaling up language models greatly improves task-agnostic, few-shot performance,
sometimes even becoming competitive with prior state-of-the-art fine-tuning approaches.”
o Task-agnostic = the model itself is not built to do QA, sentiment analysis, summarization, etc. – the specific
tasks that NLP systems are often trained to do, as humans can. Think of it in a task-agnostic way – the
models become competitive with state-of-the-art fine-tuning approaches.
o Pre-train, then fine-tune for a specific task, for example sentiment analysis (NOT task agnostic)
o As these models become really big, do we no longer have to fine-tune them for specific tasks?
GPT-3:
transformer model – like other models like BERT
175 billion parameters
10x more than any previous large language model
A SHIFT IN NLP
Their point
Shift from
o learning task-specific representations and designing task-specific architectures to
o using task-agnostic pre-training and task-agnostic architectures.
To using task-agnostic, few-shot performance competitive with prior state-of-the-art fine-tuning approaches
o We can do as well on all tasks without tuning the model for a specific task
Previous Approach:
o Pretrain model (GPT, BERT)
o fine-tune on a large dataset of examples to adapt a task agnostic model to perform a desired task
Recent work (Radford et al. 2019) suggested this final step may not be necessary.
o Very exciting.
IN-CONTEXT LEARNING
In the paper they are not doing any fine-tuning – no large training sets and no updating of the weights of the network. Instead, “few-shot training”.
one- and few-shot performance is often much higher than zero-shot performance
o 0-shot: e.g., sentiment analysis – classify sentences “0-shot”: can it do it out of the box?
o One/few-shot: also give it an example first, e.g., one positive and one negative sentence. Then you give it
the task of classifying. Testing the model on few-shot learning shows that this approach is competitive
This suggests that language models are “meta-learners” where slow outer-loop gradient descent based learning
is combined with fast in-context learning
o The standard way of training a neural network model (gradient descent) is a slow process: each
classification gradually fine-tunes the weights. Very time-consuming.
o Compared with fast in-context learning
. . . one notable pattern is that the gap between zero-, one-, and few-shot performance often grows with model
capacity, perhaps suggesting that larger models are more proficient meta-learners.
o As you give the model a few examples, it improves. The improvement is bigger for bigger
models. Large models are more proficient meta-learners: better at general learning when trained with few-shot
in-context learning.
o Movement towards AGI? Able to say “do this thing for me”. One might think that larger models would
care less about in-context learning because they already know more stuff, but they are gaining a generalized
learning ability – if this keeps improving with larger models, it is an interesting development.
o The suggestion is that in-context learning is what has made GPT-3 revolutionary.
GPT-3 Training
Common Crawl – a collection produced by crawling the internet.
Combined with known high-quality reference corpora:
o WebText
o Books Corpora
o English Wikipedia
Tasks
Look at different tasks and compare with task-specific approaches.
LAMBADA: predict last word of paragraph
StoryCloze: select correct ending for five-sentence long stories
HellaSwag: pick best ending to story or set of instructions
Results:
compare with state of the art results
Sometimes GPT performs better than the state of the art, even with the 0-shot
approach.
Results: Open-Domain QA
BERT:
Subword vocabulary consisting of 30,000 tokens generated using the Word-Piece algorithm
o We don’t deal with whole words all the time: words are broken up in useful ways. Common simple words are
left whole (like “and”, “or”, etc.), while other words are split into meaningful sub-parts (un-happy) – an important part
of these models.
Hidden layers of size of 768
model with over 100M parameters
Fixed input size of 512 subword tokens
CLOZE TASK
All models are trained on a version of this word prediction task
Predict missing words in input
o Please turn _____ homework in.
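A sketch of the cloze task with a pretrained masked language model (assumes the transformers package):

from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill("Please turn [MASK] homework in."):
    # Each candidate word comes with its probability; "your" should rank high
    print(prediction["token_str"], round(prediction["score"], 3))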
CONTEXTUAL EMBEDDINGS
Output of the model is contextual embeddings for each token in the input
Can be used as a contextual representation of the meaning of the input token for any task
requiring the meaning of words.
o Big improvement
o Words change meaning in context.
Contextual embeddings used for tasks like measuring semantic similarity of two words in
context
Useful in linguistic tasks that require models of word meaning
Most common use: as embeddings of word or even entire sentences that are the inputs to
classifiers in the fine-tuning process for downstream applications.
TRANSFORMERS
Key characteristics
Autoregressive.
Introduction to API
o https://platform.openai.com/docs/api-reference/introduction?lang=python
Examples:
o Q&A
https://platform.openai.com/examples/default-qa
Suggestion: few-shot learning. Give it a prompt with a general description.
o Classification:
https://platform.openai.com/examples/default-classification
no training needed
Potential for building useful apps with the API. You need to pay, but it is inexpensive.
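A sketch of few-shot classification through the API (assumes the openai package >= 1.0 and an API key in OPENAI_API_KEY; the model name is illustrative):

from openai import OpenAI

client = OpenAI()
prompt = (
    "Classify the sentiment of each sentence as Positive or Negative.\n"
    "Sentence: I loved this film. Sentiment: Positive\n"  # few-shot examples in the prompt
    "Sentence: The plot made no sense. Sentiment: Negative\n"
    "Sentence: The acting was superb. Sentiment:"
)
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # expected: Positive, with no training needed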
QA
Topic Area
We are focusing on two main topic areas in this class
Relevant Literature
Find at least one recent research paper that deals with issues similar to your project. Summarize the main points of the
paper and compare it with your project.
Data Description
How many instances? What are the features? Give a few lines of the data and/or mention the range of values for
important features. This could be a dataset you have found, or one you have constructed by combining different datasets.
ML task
What is the target value? Is this classification, regression or some other type of problem? It is important that you use
techniques and concepts covered in this course, such as Pipelines, Grid Search, and different Metrics and Thresholds.
Relevance
Why is it interesting?
The oral exam will include a discussion of the project, and can also cover other important topics from the course
readings and lectures.
Python resources
GitHub Page with notebooks and code for the book "Introduction to Machine Learning with Python".
Pandas Tutorials
Tips and Tricks for Jupyter Notebook
Learn Python
Install Anaconda
If you don't have a Python installation, you should install Anaconda.
(You can use mamba, poetry, pyenv, virtualenv and a lot more. If in doubt, please use Anaconda)
Make sure to select Python 3.6 or above
When the download is complete, find the .exe or .dmg file, and install
Find Jupyter Notebook. In Windows look for Recently added software; on Mac look in the Applications folder
Datasets
Top 23 Best Public Datasets for Practicing Machine Learning
GoEmotions
Reddit News
Here is a general introduction to Pushshift, which you can use to get Reddit data.
Ekstra Bladet
Danish Parliament (Folketinget)
AirBnB data
COVID-19 data from OurWorldInData
EXAM
FINAL PROJECT
Use techniques and concepts covered in this course, such as Pipelines, Grid Search, and different Metrics and
Thresholds.
o Improving and experimenting with the setup of your model
o Ideally done in a way that reflects the perceived relevance of your model: what would be most relevant?
Pretrained models
o Include pretrained models like BERT
Relevant research paper(s)
o Use as comparison or inspiration
Show how your dataset is interesting!
ORAL EXAM