
COM1011

Fundamentals of Machine Learning


Dr. Nicolas Verschueren van Rees

Lecture 5: 7/10/2019
Housekeeping (reminder)
• Lectures
• Monday 11:35 – 12:25 Laver 212
• Tuesday 13:35 – 14:25 Queens 1B
• Labs
• Wednesday 9:35-10:25 Harrison 207
• Office Hours Monday after class.
• Assessment
• 2 courseworks, 20% each: due 8 November and 13 December
• 1 exam, winter term, 60%: date in January
• https://vle.exeter.ac.uk/course/view.php?id=8310
Course Content
• Introduction: History of Artificial Intelligence and Machine Learning
• Data:
• the nature of data,
• how to represent data: text, sound, images, networks;
• AI and ML applications to real world cases
• Data Representation:
• feature selection,
• feature construction;
• Machine Learning Paradigms
• supervised,
• unsupervised,
• reinforcement learning;
• Error Measures for Different Machine Learning Tasks:
• classification, regression, ranking, clustering;
• Algorithms: k-nearest neighbours, linear models, naïve Bayes, k-means, neural networks;
• Theoretical Notions in Machine Learning:
• model capacity and overfitting,
• curse of dimensionality.
What is data?

Qualitative
• Descriptive information
• Difficult/impossible to encode as numerical values
• Examples: names, feelings, aesthetics, subjective interpretation

Quantitative
• Quantifiable
• Able to encode as numerical values
• Examples: quantities (length, mass), categories
Representing quantitative data
Quantitative data
• Numerical
  • Integers: signed or unsigned
  • Real numbers
• Not numerical
  • Text: free text or categories
  • Images: colour or B&W
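Before such data can be fed to most algorithms, the non-numerical parts have to be encoded. A minimal sketch with pandas, using made-up column names, of the common one-hot encoding of a categorical variable:

# One-hot encoding sketch: turn a categorical column into binary columns.
# The column names ("length_cm", "colour") are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "length_cm": [4.2, 5.1, 3.9],        # quantitative, already numerical
    "colour":    ["red", "blue", "red"], # categorical, not numerical
})

# pd.get_dummies adds one binary column per category value
# (colour_blue, colour_red), printed as 0/1 or True/False
# depending on the pandas version.
encoded = pd.get_dummies(df, columns=["colour"])
print(encoded)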
OptimiseRx
A prescribing decision support solution used in primary care, deployed in over 4,000 GP practices (England and Wales).
● How does it work?
○ When a prescription is issued (during a medical consultation), OptimiseRx may trigger a message.
○ The GP can accept or reject the message (these are not the only options).
● What data is stored?
○ The time and practice of the medical consultation
○ The decision (accept or reject) and the message ID
○ Feedback from the GP (free text)
Relational database (tables)

● Tidy data
● How many variables, and of what type (numerical or not numerical)?
● Using the rest of the tables we can add new variables, e.g. Practice Code -> GPs, coordinates, CCG (see the join sketch below)
● In reality we have a large number of variables (~30)
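A hedged sketch of the "add new variables from other tables" step: a join in pandas, with invented column names (practice_code, n_gps, ccg) standing in for the real schema.

# Enrich each prescribing event with practice-level variables via a left join.
import pandas as pd

events = pd.DataFrame({
    "practice_code": ["A1", "A1", "B2"],
    "decision":      ["accept", "reject", "accept"],
})
practices = pd.DataFrame({
    "practice_code": ["A1", "B2"],
    "n_gps":         [6, 11],               # number of GPs in the practice
    "ccg":           ["CCG-North", "CCG-South"],
})

# how="left": every event keeps its row and gains the practice variables.
enriched = events.merge(practices, on="practice_code", how="left")
print(enriched)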
Types of data (for some of the variables)
● Numerical
○ Rejects / accepts / hits
○ Date?
○ Number of GPs
○ Coordinates?
○ Index of multiple deprivation
● Not numerical
○ Type, intent and focus of the message
○ Rejection reason (text)
○ Clinical Commissioning Group
What is the goal of this project?
● Provide a description of the data (identify variables affecting the number of rejections)
● Suggest ways to improve the system

Exploratory Data Analysis


One possibility…
● Define a measure of rejections: the rejection rate, i.e. the fraction of rejected messages.
● Group the data by practice/message and aggregate over time (see the sketch below).
● Study the temporal behaviour.
● Study the feedback provided by GPs.
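One way the grouping and aggregation could look in pandas; the data and column names here are made up, so this is only a sketch of the idea, not the project's actual code.

# Rejection rate = fraction of rejected messages within each group.
import pandas as pd

events = pd.DataFrame({
    "practice_code": ["A1", "A1", "A1", "B2", "B2"],
    "message_id":    [101, 101, 102, 101, 102],
    "month":         ["2019-01", "2019-01", "2019-02", "2019-01", "2019-02"],
    "rejected":      [True, False, True, False, False],
})

# Aggregate over time at practice level ...
by_practice = events.groupby("practice_code")["rejected"].mean()
# ... and at message level per month, to study temporal behaviour.
by_message_month = (events
                    .groupby(["message_id", "month"])["rejected"]
                    .mean()
                    .rename("rejection_rate"))
print(by_practice)
print(by_message_month)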
Practice Level (aggregated over time) [figure]
Message Level [figures]
Message Level: Rejection Rate [figure]
Representing data
● So far, we have represented the numerical data graphically (e.g. histograms, scatter plots, violin plots) and seen how its distributions change across different non-numerical variables (nothing obvious was observed); a plotting sketch follows below.
● We can use unsupervised learning algorithms to find patterns in the data.
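A minimal plotting sketch of the kind described above, assuming matplotlib and seaborn; the data are synthetic and the names rejection_rate / message_type are only illustrative.

# Histogram of a numerical variable and a violin plot of the same variable
# split by a non-numerical one.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "rejection_rate": rng.beta(2, 5, size=300),
    "message_type":   rng.choice(["dose", "interaction", "cost"], size=300),
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(df["rejection_rate"], bins=30)   # overall distribution
axes[0].set_xlabel("rejection rate")
sns.violinplot(data=df, x="message_type", y="rejection_rate", ax=axes[1])
plt.tight_layout()
plt.show()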
Clustering of time series (example)
● What is clustering? Can we always do clustering?
● Example: consider the rejection rate over time (a time series). Every line corresponds to a different client. Are time series numerical?
Clustering of time series (example)
Two clusters can be distinguished.
Clusters Summary
● Split the data into groups with similar characteristics.
○ Algorithms: k-means (sketched below), partitioning around medoids, hierarchical clustering, DBSCAN.
● Cons
○ The number of clusters is often not clear in advance (what is the "optimal" number of clusters?).
● Is it always possible to do clustering?
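A small k-means sketch with scikit-learn, treating each (synthetic) rejection-rate time series as one row of a matrix; it only illustrates the API and the "two clusters" picture, not the OptimiseRx analysis itself.

# k-means on rows of a matrix, where each row is a 12-month time series.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
months = 12
# Two artificial behaviours: low and stable vs. high and rising rejection rate.
low  = 0.1 + 0.02 * rng.normal(size=(30, months))
high = 0.5 + 0.03 * np.arange(months) + 0.02 * rng.normal(size=(30, months))
X = np.vstack([low, high])               # one row per series

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(np.bincount(km.labels_))           # sizes of the two recovered clusters
print(km.cluster_centers_.shape)         # one 12-month "average series" each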
Principal Component Analysis
● An unsupervised learning algorithm used to reduce the dimension (number of variables).
● Idea: we look for the minimum number of new variables (linear combinations of the original ones) that capture most of the variability (see the sketch below).
● "Tidy data" (data in a matrix) is assumed.
● Widely used. Examples: face recognition, image/video compression.
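A minimal NumPy sketch of the idea (not a library implementation): centre the tidy data matrix, take its covariance, and the eigenvectors give the new variables ordered by how much of the variability they capture.

import numpy as np

def pca(X):
    """Principal component analysis of a tidy data matrix X
    (one row per sample, one column per variable).
    Returns (components, proportion_of_variance_explained)."""
    Xc = X - X.mean(axis=0)                # centre each variable
    cov = np.cov(Xc, rowvar=False)         # covariance between variables
    eigval, eigvec = np.linalg.eigh(cov)   # eigh: for symmetric matrices
    order = np.argsort(eigval)[::-1]       # largest variance first
    return eigvec[:, order], eigval[order] / eigval.sum()

The columns of the returned component matrix play the role of PC1, PC2, ... in the examples on the next slides.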
Examples
Example 1 (artificial)
Loadings (principal component directions):
          PC1         PC2
x   0.4472136   0.8944272
y   0.8944272  -0.4472136
Proportion of variance explained:
PC1: 1.000000e+00   PC2: 3.825623e-33

Example 2 (artificial)
Loadings (principal component directions):
            PC1             PC2             PC3             PC4
x   -1.985519e-01    8.910165e-01    1.674755e-05    4.082483e-01
y    3.570864e-01    4.536768e-01   -2.714529e-05   -8.164966e-01
z   -9.127247e-01   -1.633702e-02    7.103814e-05   -4.082483e-01
w    7.785674e-05   -1.446607e-06    1.000000e+00    3.112839e-15
Proportion of variance explained:
PC1: 8.457358e-01   PC2: 1.542630e-01   PC3: 1.130843e-06   PC4: 1.008977e-32
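The Example 1 loadings are exactly (1, 2)/√5 ≈ (0.447, 0.894), which is what PCA returns when the points lie on the line y = 2x. A scikit-learn sketch that reproduces this pattern (not the script actually used for the slides; signs may flip):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x])         # the two variables are collinear

pca = PCA(n_components=2).fit(X)
print(pca.components_)                  # rows: loadings of PC1 and PC2
print(pca.explained_variance_ratio_)    # ≈ [1.0, ~0.0]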
Summary
● Data representation
● Example of real data (Optimise Rx, FDB)
○ How to represent data (Exploratory Data Analysis)
● 2 popular unsupervised learning algorithms
○ Clustering <- recognise groups
○ Principal Component Analysis (PCA) <- reduce dimension
