
Notes

Matthew Slight

July 2022

This is a collection of notes that I believe to be a minimal subset of the essential mathematics required for the Machine Learning enthusiast.

Contents
0.1 Topics

1 Study Resources
1.1 Books
1.2 Cheat Sheets
1.3 Video Series

2 Examples

3 Information Theory

4 Questions

0.1 Topics

A) Linear Algebra
i. First subitem
ii. Second subitem
iii. Third subitem
B) Calculus
C) Probability
i. Bayes' Theorem and Independence (TODO: add Bayes' Theorem and independence equations)
D) Information Theory

Todo list
• Add Bayes' Theorem and independence equations
• Chapter 5, manifold hypothesis
• Chapter 4.2 Data Compression
• Standard Error, p60
• Chapter 4.2
• 2.4 Statistical Decision Theory
• Look at Chapter 2 Fourier and Chapter 3 Sparsity
• Watch this

1 Study Resources
When I embarked on my Bachelor's at Queen Mary College 20 years ago, the internet was very
much in its infancy. The dot-com boom and bust of 2000 had just kicked in. Wikipedia was
an upstart. Google had only just established a reliable method for searching the web. Most
academic discussion was still happening on Usenet and Internet Relay Chat (IRC). This time
around, whilst preparing for my Master's, things are vastly different. Many textbooks are made
available by their authors entirely free online, either as PDFs or as websites in their own right.
The idea of this book is to be a spring-board, not to replace the work of others, which is vastly
superior to anything I have been able to produce.

Here, then, are the works I relied upon so heavily, including both books and video series,
which will provide many more hours of detailed explanation than the much shorter and more
concise summaries I provide in this book. At each point along the way I link back to specific
resources. The web changes at a pace much quicker than printed media, so I expect resources to
come and go, constantly evolving, and I encourage the reader to seek out new and better resources
as needed. If one teacher isn't able to get the penny to drop for you, then try a different route;
sometimes the more visual approach taken by the likes of 3Blue1Brown can achieve more in
10 minutes than the best lecturer could in a lifetime. After all, there is only so much you can do
on a whiteboard.

1.1 Books

Items marked with a ∗ are more math-heavy.


• Deep Learning with Python, Second Edition by François Chollet –
https://livebook.manning.com/book/deep-learning-with-python-second-edition/ (TODO: Chapter 5, manifold hypothesis)
• ∗ Deep Learning Book, by Ian Goodfellow – https://www.deeplearningbook.org [1]
• TO READ: Understanding Deep Learning – https://udlbook.github.io/udlbook/
• Information Theory, Inference and Learning Algorithms – David J.C. MacKay
http://www.inference.org.uk/itprnn/book.pdf (TODO: Chapter 4.2 Data Compression)
• Practical Statistics for Data Scientists, by Peter Bruce and Andrew Bruce –
https://www.oreilly.com/library/view/practical-statistics-for/9781491952955/ (TODO: Standard Error, p60)
• ∗ An Introduction to Statistical Learning – Trevor Hastie et al.
https://hastie.su.domains/ISLR2/ISLRv2_website.pdf (TODO: Chapter 4.2)
• ∗∗ The Elements of Statistical Learning (Data Mining, Inference, and Prediction) – Trevor Hastie et al.
https://hastie.su.domains/Papers/ESLII.pdf (TODO: 2.4 Statistical Decision Theory)
• Deep Learning for NLP and Speech Recognition
https://link.springer.com/content/pdf/bfm%3A978-3-030-14596-5%2F1.pdf
• Data Driven Science and Engineering, by Prof. Steve Brunton
http://databookuw.com/databook.pdf (TODO: look at Chapter 2 Fourier and Chapter 3 Sparsity)

1.2 Cheat Sheets

• Statistics – http://web.mit.edu/~csvoss/Public/usabo/stats_handout.pdf
• LaTeX – http://tug.ctan.org/info/undergradmath/undergradmath.pdf
• Probability – https://medium.com/data-comet/probability-rules-cheat-sheet-e24b92a9017f
• Stanford CS221 Artificial Intelligence – https://github.com/afshinea/stanford-cs-221-artificial-intelligence/blob/master/en/super-cheatsheet-artificial-intelligence.pdf
• Stanford CS229 Machine Learning – https://github.com/afshinea/stanford-cs-229-machine-learning/blob/master/en/super-cheatsheet-machine-learning.pdf
• Stanford CS230 Deep Learning – https://github.com/afshinea/stanford-cs-230-deep-learning/blob/master/en/super-cheatsheet-deep-learning.pdf
• Stanford CME 106 Probability and Statistics for Engineers – https://github.com/shervinea/stanford-cme-106-probability-and-statistics

1.3 Video Series

SVD and PCA by Prof. Steve Brunton –
https://www.youtube.com/playlist?list=PLMrJAkhIeNNRpsRhXTMt8uJdIGz9-X_1-
Linear Algebra by 3Blue1Brown –
https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab
Jacobian – https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/jacobian/v/jacobian-prerequisite-knowledge
Image compression and the vastness of space (TODO: watch this)

2 Examples
Here is an example of the summation:-
\[
  \sum_{j=0}^{N} x^2 = \prod_{k=1}^{N} \log_2(x)
\]

Here is an example matrix:-
\[
  \begin{pmatrix} 1 & 2 & 3 \\ a & b & c \end{pmatrix}
  \longrightarrow
  \begin{pmatrix}
    x & y & z & w \\
    x_i & y_i & z_i & w_i \\
    i & j & k & l \\
    m_{ij} & n_i & o_i & p_i
  \end{pmatrix}
\]

This is an example function:-


\[
  f(x) := x^2 : x \in \mathbb{R}^2 \tag{2.1}
\]

Taking derivatives:-
\[
  \frac{\partial f}{\partial x_i} = \frac{\partial f}{\partial x_i}\,\frac{\partial f}{\partial x_i}
\]
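
As a small numerical check (a sketch in plain Python; the function names here are illustrative, not from any particular library), the derivative of $f(x) = x^2$ can be approximated with a central finite difference and compared against the analytic answer $2x$:

    def f(x):
        return x ** 2

    def central_difference(func, x, h=1e-6):
        """Approximate the derivative of func at x with a symmetric finite difference."""
        return (func(x + h) - func(x - h)) / (2 * h)

    # Analytically f'(x) = 2x, so at x = 3.0 we expect roughly 6.0.
    print(central_difference(f, 3.0))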

3 Information Theory
Recall from probability theory that an ensemble $X = (x, A, P_X)$ consists of an outcome $x$, the value
of a random variable taking one of the possible values $A = \{a_0, a_1, \ldots, a_i, \ldots, a_I\}$ with the
corresponding probabilities $P_X = \{p_0, p_1, \ldots, p_i, \ldots, p_I\}$. For example, a fair six-sided die
gives the ensemble with $A = \{1, 2, 3, 4, 5, 6\}$ and $p_i = \tfrac{1}{6}$ for every outcome.

Definition 3.1 (Shannon information content). The amount of information (Shannon information content) $h(x)$ conveyed by an outcome $x$ is

\[
  h(x) = \log_2 \frac{1}{P(x)} \tag{3.1}
\]
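
For example, a fair coin toss with $P(x) = \tfrac{1}{2}$ conveys $h(x) = \log_2 2 = 1$ bit, whereas an outcome with probability $\tfrac{1}{64}$ conveys $\log_2 64 = 6$ bits: the rarer the outcome, the more information it carries.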

Definition 3.2 (Entropy). The entropy of an ensemble $X$, written $H(X)$, is the sum of the Shannon
information contents $h(x)$ weighted by the corresponding probabilities $P(x)$.
\[
  H(X) = \sum_{x \in A} P(x)\, h(x) = \sum_{x \in A} P(x) \log_2 \frac{1}{P(x)} \tag{3.2}
\]

Or more simply:
\[
  H(X) = -\sum_{x \in A} P(x) \log_2 P(x) \tag{3.3}
\]
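
A minimal sketch of (3.3) in code (plain Python; the entropy helper is illustrative, not taken from any particular library):

    import math

    def entropy(probabilities):
        """Shannon entropy H(X) in bits for a discrete probability distribution."""
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
    print(entropy([0.9, 0.1]))  # biased coin: about 0.47 bits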

Lemma 3.4. In compression theory, the limit of lossless compression will always be H(X) bits per outcome.
Lossy compression, by contrast, removes only what is considered to be unnecessary for the purposes of the
communication.
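
A rough illustration of this limit (a sketch assuming a standard Python installation; zlib is a general-purpose lossless compressor rather than an optimal code, so it typically sits a little above the entropy bound):

    import math
    import random
    import zlib

    # Draw a long message from a skewed four-symbol alphabet.
    symbols, probs = b"abcd", [0.7, 0.15, 0.1, 0.05]
    message = bytes(random.choices(symbols, weights=probs, k=100_000))

    entropy_bits = -sum(p * math.log2(p) for p in probs)                  # H(X) per symbol
    compressed_bits = 8 * len(zlib.compress(message, 9)) / len(message)   # achieved rate

    print(f"H(X)            = {entropy_bits:.3f} bits/symbol")
    print(f"zlib (lossless) = {compressed_bits:.3f} bits/symbol")

However hard the compressor tries, no lossless scheme can average fewer than H(X) bits per symbol.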

4 Questions
1. What other sorts of languages exist? Natural Language (Human to Human), Mathematics,
Programming languages (Human to Machine), Protocols (Machine to Machine), Animal
Language, Signalling
2. How else is information encoded? E.g. a photograph is a Width × Height × Colour Depth
snapshot, Audio, Video
3. Historic methods of information capture have moved from analog → digital → naive encodings
(ASCII, Morse, etc.) → compressed / decomposed encodings (PCA / SVD / Fourier /
Word2Vec).
4. To what extent does recorded analog or digital data / information storage map to the physical
world (whether biological, like eyes and ears or written language, or artificial constructions,
i.e. art/photos, or machines, like sound recording or digital storage)?
5. To what extent is the brain an information encoder ⇐⇒ decoder?
6. Thought experiment: could we invent (construct) a natural language that was unambiguous
and could convey everything possible in English (so that all ideas, senses, meanings, etc.
could be retained and conveyed)?
(a) What could be the minimum vocabulary size of this language?
7. To what extent is unsupervised machine learning a learned representation / encoding of real-world
("physical") phenomena? Or, more generally, what does machine learning tell us
about information representation and encoding in the real world?
(a) Manifold hypothesis (dead end?)
http://www.mit.edu/~mitter/publications/121_Testing_Manifold.pdf
(b) Information bottleneck theory
i. Naftali Tishby lecture (what can deep learning tell us about the brain) –
https://www.youtube.com/watch?v=EQTtBRM0sIs
ii. Original paper by Tishby et al. https://arxiv.org/pdf/physics/0004057.pdf
(c) Take the two classical compression types, 'lossless' and 'lossy'. Even though some
of the information is lost in lossy encoding, the original sense of the message is still
conveyed. What theory talks about this in terms of retaining the original essence of
the message through communication?

References
[1] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
http://www.deeplearningbook.org.
