
APS1070

Foundations of Data Analytics and


Machine Learning
Fall 2022

Lecture 1:
• Introduction
• Course Overview
• Machine Learning Overview
• K-nearest Neighbour Classifier

Samin Aref 1
https://native-land.ca/

Traditional Land
Acknowledgement
I wish to acknowledge the Indigenous
Peoples of all the lands which we call
home including the land on which
the University of Toronto operates.
For thousands of years, it has been
the traditional land of the Huron-
Wendat, the Seneca, and the
Mississaugas of the Credit.
Today, this meeting place is still the
home to many Indigenous people
from across Turtle Island and I am
grateful to have the opportunity to
work on this land.
2
3
Instruction Team

Instructor: Prof. Samin Aref Head-TA: Ali Hadi Zadeh

TA: Haoyan (Max) Jiang TA: Chris Lucasius TA: Nisarg Patel TA: Mustafa Ammous
Get to know the instruction team on Quercus (course contacts page)
4
Communication
➢ Preferred contact method for a quick response: Piazza
1. Via a Piazza post to the “Entire Class”
2. Via a Piazza question posted to “Instructor(s)”; select “instructors” to
include us all

➢ Communication via email (APS1070 in the subject line) is fine if you
have a reason for not using Piazza for that question.

5
Emails

Head-TA: Ali Hadi Zadeh a.hadizadeh@mail.utoronto.ca


TA: Chris Lucasius christopher.lucasius@mail.utoronto.ca
TA: Mustafa Ammous mustafa.ammous@mail.utoronto.ca
TA: Haoyan (Max) Jiang haoyanhy.jiang@mail.utoronto.ca
TA: Nisarg Patel npnisarg.patel@mail.utoronto.ca

Email the course head-TA (one person only) and he will escalate it to me if
needed.

➢Please prefix email subject with ‘APS1070’

6
A little bit about your instructor …
(Map slide with location markers)
I was born here. I studied Industrial Engineering here. I studied Computer
Science here. And worked here as a data scientist. I worked for a while here
as a visiting researcher. Then, I worked here as a research scientist. And, I
am a new professor here at the UofT.
7
Networks

8
9
Balanced network

10
In a balanced network:

Enemy of an enemy = friend


Enemy of a friend = enemy
Friend of an enemy = enemy
Friend of a friend = friend

11
Balanced network

In a balanced network:

Enemy of an enemy = friend


Enemy of a friend = enemy
Friend of an enemy = enemy
Friend of a friend = friend

12
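The four rules above collapse into a single check: a signed cycle is balanced exactly when it has an even number of negative edges, i.e., when the product of its edge signs is positive. A minimal Python sketch (the function name and examples are illustrative):

```python
def is_balanced(signs):
    """Return True if a signed cycle is balanced.

    signs: sequence of +1 (friend) / -1 (enemy) edge signs.
    A cycle is balanced iff the product of its signs is positive,
    i.e. it has an even number of negative edges.
    """
    product = 1
    for s in signs:
        product *= s
    return product > 0

# Friend of a friend = friend: the triangle (+, +, +) is balanced
print(is_balanced([+1, +1, +1]))   # True
# Two friends who share an enemy: (+, -, -) is also balanced
print(is_balanced([+1, -1, -1]))   # True
# One negative edge among friends: (+, +, -) is unbalanced
print(is_balanced([+1, +1, -1]))   # False
```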
Meet Mike, a friend of and .

13
Here is George, another friend of .

does not really know George that well.

knows George and they hate each other!

14
Unbalanced network

Enemy of an enemy = friend


Enemy of a friend = enemy
Friend of an enemy = enemy
Friend of a friend = friend

15
Unbalanced network
Balanced subgraph
It is 1 edge away from balance.
16
Social Networks
Partitioning signed networks
- positive relationship (green edge)
- negative relationship (red edge)

Nodes: people
Edges: positive or negative ties

17
Social Networks
New Guinean Tribes
2^16 = 65,536 possible communities

Nodes: tribes
Edges: alliance and antagonism
18
Social Networks
Monks
2^18 = 262,144 possible communities

Nodes: people
Edges: positive or negative ties
19
Biology
Biological Network of a Protein
2^329 ≈ 1 duotrigintillion possible communities!

Nodes: biological molecules


Edges: activation or inhibition relations
20
Political
Science

Partisan polarization
in the
US Senate over time

Node colors=party affiliation:


Republican
Democrat

Edge colors:
Collaboration
Avoidance

21
Clustering networked data
(unsupervised learning)
Community detection by
maximizing modularity

Nodes:
American college football teams

Edge between i and j:


Match between i and j

Clusters (vs. leagues as ground truth):
Exact method (Bayan algorithm): accurate retrieval of clusters = 95.4%
Heuristic method (Louvain algorithm): accurate retrieval of clusters = 89.1%
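Modularity-based clustering like the above can be tried in a few lines with NetworkX. The sketch below uses the library's greedy modularity heuristic on Zachary's karate club graph as a stand-in dataset; it is only an illustration, not the Bayan or Louvain algorithms from the slide:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

# Zachary's karate club: a classic small social network bundled with NetworkX
G = nx.karate_club_graph()

# Greedy agglomerative modularity maximization (a heuristic)
communities = greedy_modularity_communities(G)

print(len(communities))                       # number of detected communities
print(round(modularity(G, communities), 3))   # quality of the partition
```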

22
About you…

➢ What is your area of study?


➢ Previous experience?
➢ Why did you take this course?
➢ What do you want to get out of this course?

23
Survey: Undergraduate Degree?

24
Survey: Undergraduate Studies in Toronto?

62.6% No

37.4% Yes
25
Survey: I am interested in…

➢ Becoming a Machine Learning Researcher (~22.9%)
➢ Becoming a Data Scientist (~76.2%)
➢ Starting a Machine Learning Startup (~25.7%)
➢ Learning ML for fun (~16.2%)
➢…

26
Survey: Programming Languages?

27
Survey: Rate Programming Abilities?

28
Why are you taking this course?

29
What kind of Projects would you like to do?

30
Is there anything else you would like me to know?

➢ I carefully read all your responses.

31
Part 1

Course Overview

32
Course Description
APS1070 is a prerequisite to the core courses in
the Emphasis in Analytics. This course covers
topics fundamental to data analytics and machine
learning, including an introduction to Python and
common packages, analysis of algorithms,
probability and statistics, matrix representations
and fundamental linear algebra operations, basic
algorithms and data structures and optimization.
The course is structured with both weekly lectures
and tutorials/Q&A sessions.
33
Primary Learning Outcomes
By the end of the course, students will be able to:
1. Describe and contrast machine learning models,
concepts and performance metrics
2. Analyze the complexity of algorithms and common
abstract data types.
3. Perform fundamental linear algebra operations, and
recognize their role in machine learning algorithms
4. Apply machine learning models and statistical
methods to datasets using Python and associated
libraries.
34
Course Components
➢ Lectures: (3 hrs per week)
➢ Tutorials/Q&A Sessions: (2 hrs per week)
➢ Four projects (submitted via Quercus and GitHub)
➢ Eight tasks/quizzes for reading assignments (submitted via Quercus)

➢ Any material covered in lectures / tutorials / projects / Piazza is fair


game for the midterm and final assessments.

35
APS1070 Course Information
➢ Course Website: http://q.utoronto.ca
➢ Access course materials on Quercus
➢ Verify using UTORid
➢ Textbook:
➢ “Mathematics for Machine Learning” by Marc P. Deisenroth et
al., 2020 (free)
➢ Recommended books:
➢ “The Elements of Statistical Learning”, 2nd Edition, by Trevor
Hastie et al., 2009 (free)
➢ “Introduction to Algorithms and Data Structures”, 4th Ed, by
Michael J. Dinneen, Georgy Gimel’farb, and Mark C. Wilson, 2016
(free)
36
Computer Requirements and Online Tools
You will need access to:
➢ Jupyter Notebook, preferably through Google Colab, to be able to
complete the projects and follow along with coding in tutorials
➢ Quercus for submitting projects and reading assignment tasks/quizzes
and course announcements

37
Computer Requirements and Online Tools
You will need access to:
➢ Piazza to ask questions, communicate with the teaching team, and
participate in course discussions
➢ Top 10 endorsed answerers (across the two sections of APS1070) on
Piazza with at least 3 endorsed answers get 2 points added to their final
course grade.
➢ Questions of the general form “is this the correct answer?”, “what
is wrong with my code?”, “why does my code not compile?”, and the
like will not receive a response.

38
Grade Breakdown
Projects/Quizzes                  Weight (%)  Due dates
Reading assignment tasks/quizzes  5%          Reading assignment one: Sep. 21st at 21:00;
                                              others: as per course schedule
Project 1                         10%         Due Oct. 6th at 21:00
Midterm Assessment                14%         As per course schedule
Project 2                         13%         Due Oct. 27th at 21:00
Project 3                         14%         Due Nov. 17th at 21:00
Project 4                         14%         Due Dec. 1st at 21:00
Final Assessment*                 30%         TBD
Bonus for Piazza                  +2 points   Dec. 1st at 21:00
*The Final Assessment is mandatory; not completing it will result in a grade of FZ
(failing the course) assigned on the transcript.
39
Penalty for late Submissions
Quercus submission time will be used. Late projects and reading
assignments will incur a penalty as follows:
➢ -30% (of project maximum mark) if submitted within 72 hours past
the deadline.
➢ A mark of zero will be given if the submission is 72 hours late or
more.

40
Class Representatives
If you have any complaints or suggestions about this course, please email me directly.
Alternatively, talk to one of the class reps who will then talk to us and the teaching team.

We need 2 class reps. Volunteers can send me an email (with “APS1070 class rep” in
subject line) by 21 Sep.

Class reps are asked to keep in touch with the instruction team about any feedback they
receive from students and to attend a staff-student meeting.

This can be a great opportunity to develop your leadership skills.

41
Academic Integrity
➢ All the work you submit must be your own and no part of your
submitted work should be prepared by someone else. Plagiarism or
any other form of cheating in examinations, tests, assignments, or
projects, is subject to serious academic penalty (e.g., suspension or
expulsion from the faculty or university).
➢ A person who supplies an assignment or project to be copied will be
penalized in the same way as the one who makes the copy.
➢ Several plagiarism detection tools will be used to assist in the
evaluation of the originality of the submitted work for both text and
code. They are quite sophisticated and difficult to defeat.
42
Software
➢ Python 3
➢ NumPy, Matplotlib, Pandas and many more
➢ Google Colaboratory
➢ Jupyter notebook in the cloud
➢ no installation required
➢ requires Google Drive

All project handouts will be Jupyter notebooks

43
Why Python for Data Analysis?
➢ Very popular interpreted programming language
➢ Write scripts (use an interpreter)
➢ Large and active scientific computing and data
analysis community
➢ Powerful data science libraries – NumPy, pandas,
matplotlib, scikit-learn, dask
➢ Open-source, active community
➢ Encourages logical and clear code
➢ Created by Guido van Rossum
44
Course Philosophy
➢ Top-down approach
➢ Learn by doing
➢ Explain the motivations first
➢ Mathematical details second
➢ Focus on implementation skills
➢ Connect concepts using the theme of End-to-End Machine Learning
My goal is to have everyone leave the course with a strong
understanding of the mathematical and programming fundamentals
necessary for future courses.
45
Tentative Schedule (Weeks 1 – 3)

46
Tentative Schedule (Weeks 4-6)

47
Midterm
➢ Date and time : October 21st 9:00 – 13:00 (post on Piazza for conflicts)
➢ Location: EX 300 (exam centre)
➢ Bring a type-3 calculator (info on type-3 calculators)

➢ Consists of multiple choice, short answer, math and programming


questions as well as analytical and reasoning questions
➢ Covers all material before the midterm
➢ There is no make-up midterm assessment
➢ More details will be provided 1 – 2 weeks before the midterm
48
Tentative Schedule (Weeks 7 – 10)

49
Tentative Schedule (Weeks 11 – 13)

50
Slide Attribution

The slides used for APS1070 contain materials from various sources.
Special thanks to the following people:
• Jason Riordon
• Sinisa Colic
• Ali Hadi Zadeh
• Mark C. Wilson (Lecture 2)

51
Part 2

Machine Learning Overview

52
Why are we here?
➢ Machine Learning skills are in demand
➢ Neural Networks are evolving and leading to new performance limits
➢ Deep Neural Networks are at the cutting-edge of applied computing

53
Rising Tide of AI Capacity
➢ Jobs requiring creativity seem to be a safe career choice…
➢ Programming and AI design seem safe

Source: Hans-Moravecs-rising-tide-of-AI-capacity
54
Why Machine Learning?
Q: How can we solve a programming problem?
➢ Evaluate all conditions and write a set of rules
that efficiently address those conditions
➢ e.g., a robot navigating a maze turns towards an opening

Q: How could we write a set of rules to


determine if a goat is in the image?

Requires systems that can learn from examples…

55
Examples of Applications
➢ Finance and banking
➢ Production
➢ Health
➢ Economics
➢ Data -> Knowledge -> Insight
➢ Marketing
➢ Intelligent assistants

56
Why Machine Learning?
➢ Problems for which it is difficult to formulate rules that
cover all the conditions that we are expected to see, or that
require a lot of fine-tuning.
➢ Complex problems for which no good solution exists: state-of-
the-art Machine Learning techniques may be able to
succeed.
➢ Fluctuating environments: a Machine Learning system can
adapt to new data.
➢ Obtaining insights from large amounts of data or complex
problems.
57
Why Machine Learning?
➢ Instead of writing programs by hand, the focus shifts to collecting
quality examples that highlight the correct output
➢ A machine learning algorithm then takes the “training data” and
produces a model that generates the correct output
➢ If done correctly, the program will generalize to cases not
observed… more on this later

58
What is Machine Learning?

59
What is Machine Learning?

60
What is Data Science?
➢ Data Science
➢ Multidisciplinary
➢ Digital revolution
➢ Data-driven discovery
➢ Includes:
—Data Mining
—Machine Learning
—Big Data
—Databases
—…
http://www.martinhilbert.net/
Source: Brendan Tierney, 2012
61
ML, AI, and Data Science over the years

62
Types of Machine Learning Systems
It is useful to classify machine learning systems into broad
categories based on the following criteria:
➢supervised, unsupervised, semi-supervised, and
reinforcement learning
➢classification versus regression
➢online versus batch learning
➢instance-based versus model-based learning
➢parametric versus nonparametric

63
Supervised/Unsupervised Learning
➢ Machine Learning systems can be classified according
to the amount and type of supervision they get during
training.
➢ Supervised
➢ k-Nearest Neighbours, Linear Regression, Logistic Regression,
Decision Trees, Neural Networks, and many more
➢ Unsupervised
➢ K-Means, Principal Component Analysis
➢ Semi-supervised
➢ Reinforcement Learning

64
Instance-Based/Model-Based Learning
➢ Instance-Based: system learns the examples by heart,
then generalizes to new cases by using a
similarity/distance measure to compare them to the
learned examples.
➢ more details in week 3, an example in today’s lecture

➢ Model-Based: build a model of these examples and


then use that model to make predictions.
➢ more details in weeks 9 to 11

65
Challenges of Machine Learning
➢ Insufficient Data
➢ Quality Data
➢ Representative Data
➢ Irrelevant Features
➢ Overfitting the Training Data
➢ Underfitting the Training Data
➢ Testing and Validation
➢ Hyperparameter Tuning and Model Selection
➢ Data Mismatch
➢ Fairness, Societal Concerns
66
Part 3

k-nearest neighbours classifier

67
Nearest neighbour classifier
• Output is a class (here, blue
or yellow)
• Instance-based learning, or
lazy learning: computation
only happens once the
classifier is queried
• Flexible approach: no
assumptions about the data
distribution
What class is this?

68
k-Nearest Neighbour Classification
➢ Works on multidimensional data
➢ For visualization purposes we will use a 2-dimensional example
➢ + represents unknown test samples that we want to classify

+
+

69
K-nearest neighbour classifier

(Figure: the same query point classified with k = 1, k = 4, and k = 11.)
70
How to compute distance?

Euclidean distance between x: (0, 0) and y: (1, −5):
d(x, y) = √((0 − 1)² + (0 − (−5))²) = √(1 + 25) = √26 ≈ 5.10

(Figure: the query point classified with k = 1, k = 4, and k = 11.)
71
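The slide's worked example, x = (0, 0) and y = (1, −5), in plain Python:

```python
import math

x = (0.0, 0.0)
y = (1.0, -5.0)

# Euclidean distance: square root of the sum of squared coordinate differences
dist = math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))
print(dist)  # sqrt(26) ≈ 5.0990
```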
Simple Algorithm
➢ We can make the assumption that similar samples
will be located close together
➢ Simple Algorithm:
Calculate distance → Obtain the nearest neighbour → Determine labels
72
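The three steps above can be sketched in a few lines of NumPy; the function name and the toy data below are illustrative:

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=1):
    """Majority-vote k-NN prediction for a single query point."""
    # 1) Calculate the distance from the query to every training sample
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # 2) Obtain the k nearest neighbours
    nearest = np.argsort(dists)[:k]
    # 3) Determine the label by majority vote among those neighbours
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy 2-D data: two samples of class 0, two of class 1
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.2]), k=3))  # 0
```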
Algorithm

➢ For a single nearest neighbour
73
How do we measure distance?

Euclidean distance: d(x, y) = √( Σⱼ (xⱼ − yⱼ)² ), where j runs over the dimensions of x and y.

74
Decision Boundary

➢ Can generate arbitrary


test points on the plane
and apply kNN
➢ The boundary between
regions of input space
assigned to different
categories.

75
Voronoi Diagrams (k=1)
Euclidean distance

Source: Wikipedia
Spherical Voronoi – Source: Jason Davies
76
Selection of k
➢ Q: What happens if we let k be very small?

➢ Q: What happens if we let k be very large?

77
Tradeoffs in choosing k?
➢ Small k
➢ Good at capturing fine-grained patterns
➢ May overfit, i.e. be sensitive to random noise
Excellent for training data, not that good for new data, too complex

➢ Large k
➢ Makes stable predictions by averaging over lots of
examples
➢ May underfit, i.e. fail to capture important regularities
Not that good for training data, not good for new data, too simple

78
What is the best k?
➢ Select the k based on the best performance on the
validation set, report the results on the test set

➢ Generally, the performance on the validation set


will be better than on the test set

➢ Q: How do you think it will perform on the training


set?

79
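The selection procedure above can be sketched with scikit-learn on the iris dataset: hold out a validation set, pick the k with the best validation score, then report the test score once. The split sizes and the k range here are arbitrary choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 60% train, 20% validation, 20% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Pick the k with the best accuracy on the validation set
best_k, best_acc = 1, 0.0
for k in range(1, 16):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc

print(best_k, best_acc)  # k chosen on the validation set
# Report once on the held-out test set
print(KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train).score(X_test, y_test))
```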
Normalization
➢ Nearest Neighbours can be sensitive to the ranges
of different features
➢ Often, the units are arbitrary
➢ Simple fix: normalize each dimension to be zero
mean and unit variance (i.e., compute the mean µj
and standard deviation 𝜎𝑗 of each feature j, and take
x′j = (xj − µj) / 𝜎𝑗)
➢ Caution: depending on the problem, the scale might


be important!
80
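A NumPy sketch of the standardization above; the toy feature matrix is made up (one column on a centimetre scale, one a small ratio) to show why the ranges matter:

```python
import numpy as np

X = np.array([[150.0, 0.2],
              [160.0, 0.4],
              [170.0, 0.9]])

mu = X.mean(axis=0)      # per-feature mean
sigma = X.std(axis=0)    # per-feature standard deviation
X_norm = (X - mu) / sigma

print(X_norm.mean(axis=0))  # ~[0, 0]
print(X_norm.std(axis=0))   # [1, 1]
```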
K-nearest neighbour classifier
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

81
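A minimal usage sketch of the linked `KNeighborsClassifier` on the iris dataset; the value of k and the split are just example choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5)  # k = 5 neighbours, Euclidean by default
clf.fit(X_train, y_train)

print(clf.predict(X_test[:3]))    # predicted classes for three test points
print(clf.score(X_test, y_test))  # mean accuracy on the test set
```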
K-nearest neighbour classifier
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

(Figure: classification of the query point for k = 1, k = 4, and k = 11.)

82
Important definitions
➢ Task : Flower classification
➢ Target (label, Class)
➢ Features

Iris versicolor Iris virginica


Iris setosa

83
Important definitions
➢ Task : Flower classification
➢ Target (label, Class) : Setosa, Versicolor, Virginica.
➢ Features: petal length, petal width, sepal length, sepal width.
➢ Model
➢ Prediction
➢ Data point (sample)
➢ Dataset

(Figure: a data point's feature columns and its target; the model's prediction is Versicolor.)
84
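These definitions map directly onto how scikit-learn exposes the iris data:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)  # the four features (sepal/petal length and width, in cm)
print(iris.target_names)   # the three target classes
print(iris.data.shape)     # (150, 4): 150 data points (samples), 4 features each
```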
Next Time
➢ Reading assignment 1 Due - Sep. 21 at 21:00
➢ Complete/review a Kaggle course on Python and submit a quiz on Quercus by the deadline

➢ Week 2 Lab: Tutorial 1


➢ Basic data science

➢ Week 2 Lecture – Analysis of Algorithms


➢ Algorithms and Big O Notation
➢ Sorting
➢ Hashing

➢ Project 1 Due - Oct. 6th at 21:00

85
KNN Code Example (Google Colab)

86
