
12/3/23, 8:38 PM CS246 | Home


CS246: Mining Massive Data Sets
The course will be offered next in Winter 2023.

Logistics
Lectures: Tuesday/Thursday 3:00-4:20 PM PDT, in person in the NVIDIA Auditorium.
Lecture Videos: available on Canvas (https://canvas.stanford.edu/courses/173895) to all
enrolled Stanford students. You can also check our past Coursera MOOC
(https://www.youtube.com/channel/UC_Oao2FYkLAUlUVkBfze4jg/videos).
Public resources: The lecture slides and assignments will be posted online as the course
progresses. We are happy for anyone to use these resources, but we cannot grade the work of
any students who are not officially enrolled in the class.
Contact: Students should ask all course-related questions on Ed
(https://edstem.org/us/courses/38149/discussion/), where you will also find all the
announcements. For external enquiries, personal matters, or in emergencies, you can email us
at cs246-spr2223-staff@lists.stanford.edu.
Academic accommodations: If you need an academic accommodation based on a disability,
you should initiate the request with the Office of Accessible Education (OAE)
(https://oae.stanford.edu/students/getting-started/requesting-new-or-additional-accommodations).
The OAE will evaluate the request, recommend accommodations, and
prepare a letter for faculty. Students should contact the OAE as soon as possible since timely
notice is needed to coordinate accommodations.

https://web.stanford.edu/class/cs246/ 1/13

Instructor

Jure Leskovec
(https://profiles.stanford.edu/jure-leskovec)

Co-Instructor

Mina Ghashami
(https://mina-ghashami.github.io/)

Course Coordinator

Lata Nair


Course Assistants

Aman Bansal (Head TA) (http://bansalaman.com)

Zhuoyi Huang (https://www.linkedin.com/in/zhuoyi-huang/)

Paridhi Maheshwari (https://paridhimaheshwari2708.github.io/)

Luca Pistor (https://www.linkedin.com/in/luca-pistor/)


Nirali Parekh (https://www.linkedin.com/in/nirali25parekh)

Edmund Chen (https://edhschen.info/)

Jinang Shah (https://in.linkedin.com/in/jinang-shah-72229317b)

Ana Selvaraj (https://www.linkedin.com/in/ana-selvaraj-915a1a161/)

Content
What is this course about? [Info Handout (handouts/CS246_Info_Handout.pdf)]
The course will discuss data mining and machine learning algorithms for analyzing very large
amounts of data. The emphasis will be on MapReduce and Spark (http://spark.apache.org) as
tools for creating parallel algorithms that can process very large amounts of data.
Topics include: Frequent itemsets and Association rules, Near Neighbor Search in High
Dimensional Data, Locality Sensitive Hashing (LSH), Dimensionality reduction, Recommendation
Systems, Clustering, Link Analysis, Large-scale Supervised Machine Learning, Data streams,
Mining the Web for Structured Data, Web Advertising.

 Previous offerings
The previous version of the course was CS345A: Data Mining
(http://www.stanford.edu/class/cs345a/), which also included a course project. CS345A has now
been split into two courses, CS246 and CS341.

You can access class notes and slides of previous versions of the course here:


CS246 Websites: CS246: Winter 2022 (http://snap.stanford.edu/class/cs246-2022/) / CS246:
Spring 2021 (http://snap.stanford.edu/class/cs246-2021/) / CS246: Winter 2020
(http://snap.stanford.edu/class/cs246-2020/) / CS246: Winter 2019
(http://snap.stanford.edu/class/cs246-2019) / CS246: Winter 2018
(http://snap.stanford.edu/class/cs246-2018) / CS246: Winter 2017
(http://snap.stanford.edu/class/cs246-2017) / CS246: Winter 2016
(http://snap.stanford.edu/class/cs246-2016) / CS246: Winter 2015
(http://snap.stanford.edu/class/cs246-2015) / CS246: Winter 2014
(http://snap.stanford.edu/class/cs246-2014) / CS246: Winter 2013
(http://snap.stanford.edu/class/cs246-2013) / CS246: Winter 2012
(http://snap.stanford.edu/class/cs246-2012) / CS246: Winter 2011
(http://snap.stanford.edu/class/cs246-2011)

CS345a Website: CS345a: Winter 2010 (http://snap.stanford.edu/class/cs345a-2010)

 Prerequisites
Students are expected to have the following background:

Knowledge of basic computer science principles and skills, at a level sufficient to write a
reasonably non-trivial computer program (e.g., CS107 or CS145 or equivalent are
recommended).
Good knowledge of Java and Python will be extremely helpful since most assignments will
require the use of Spark.
Familiarity with basic probability theory (CS109 or Stat116 or equivalent is sufficient but not
necessary).
Familiarity with writing rigorous proofs (at a minimum, at the level of CS 103).
Familiarity with basic linear algebra (e.g., any of Math 51, Math 103, Math 113, CS 205, or EE
263 would be much more than necessary).
Familiarity with algorithmic analysis (e.g., CS 161 would be much more than necessary).

The recitation sessions in the first weeks of the class will give an overview of the expected
background.

 Reference Text
The following text is useful, but not required. It can be downloaded for free, or purchased from
Cambridge University Press.
Leskovec-Rajaraman-Ullman: Mining of Massive Datasets (http://www.mmds.org/)

 Schedule
Lecture slides will be posted here shortly before each lecture. If you wish to view slides further in
advance, refer to the 2022 course offering's slides (http://snap.stanford.edu/class/cs246-2022/),
which are mostly similar.


This schedule is subject to change. All deadlines are at 11:59pm PST.

Date: Tue Apr 4

Description: Introduction; MapReduce and Spark


[slides (slides/01-intro.pdf)]

Ch1: Data Mining (http://infolab.stanford.edu/~ullman/mmds/ch1n.pdf)


Ch2: Large-Scale File Systems and Map-Reduce (http://infolab.stanford.edu/~ullman/mmds/ch2n.pdf)

Events:

Deadlines:

Date: Thu Apr 6

Description: Frequent Itemsets Mining


[slides (slides/02-assocrules.pdf)]

Ch6: Frequent itemsets (http://infolab.stanford.edu/~ullman/mmds/ch6.pdf)

Events: Colab 0 (https://colab.research.google.com/drive/11cUeITEsvv-c4YBU0ZmYqzrUlgG0XIeq?usp=sharing),
Colab 1 (https://colab.research.google.com/drive/1NUy6HpAixV0JER8kJDeWDNlI0hodv2wX?usp=sharing),
Homework 1 (homework/hw1-bundle.zip) out

Deadlines:

Date: Fri Apr 7

Description: Recitation: Spark tutorial


[Colab (https://colab.research.google.com/drive/11cUeITEsvv-c4YBU0ZmYqzrUlgG0XIeq?usp=sharing)]

Events:

Deadlines:


Date: Mon Apr 10

Description: Recitation: Probability and Proof Techniques


[handout (handouts/CS246_Proof_Probability.pdf)]

Events:

Deadlines:

Date: Tue Apr 11

Description: Locality-Sensitive Hashing I


[slides (slides/03-lsh.pdf)]

Ch3: Finding Similar Items (http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf) (Sect. 3.1-3.4)

Events:

Deadlines:

Date: Thu Apr 13

Description: Locality-Sensitive Hashing II


[slides (slides/04-lsh_theory.pdf)]

Ch3: Finding Similar Items (http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf) (Sect. 3.5-3.8)

Events: Colab 2 (https://colab.research.google.com/drive/196A8aVVXgLmKp44faOHs2Sfau-sGb8I4?usp=sharing) out

Deadlines: Colab 0, Colab 1 due

Date: Fri Apr 14

Description: Recitation: Linear Algebra


[handout (handouts/CS246_LinAlg_review.pdf)]

Events:

Deadlines:


Date: Tue Apr 18

Description: Clustering
[slides (slides/05-clustering.pdf)]

Ch7: Clustering (http://infolab.stanford.edu/~ullman/mmds/ch7.pdf) (Sect. 7.1-7.4)

Events:

Deadlines:

Date: Thu Apr 20

Description: Dimensionality Reduction


[slides (slides/06-dim_red.pdf)]

Ch11: Dimensionality Reduction (http://infolab.stanford.edu/~ullman/mmds/ch11.pdf) (Sect. 11.4)

Events: Colab 3 (https://colab.research.google.com/drive/1dQVCGe4YR_RsERZ7VZHJIhI1sdd_9adv?usp=sharing),
Homework 2 (homework/hw2-bundle.zip) out

Deadlines: Colab 2, Homework 1 due

Date: Tue Apr 25

Description: Recommender Systems I


[slides (slides/07-recsys1.pdf)]

Ch9: Recommendation systems (http://infolab.stanford.edu/~ullman/mmds/ch9.pdf)

Events:

Deadlines:


Date: Thu Apr 27

Description: Recommender Systems II


[slides (slides/08-recsys2.pdf)]

Ch9: Recommendation systems (http://infolab.stanford.edu/~ullman/mmds/ch9.pdf)

Events: Colab 4 (https://colab.research.google.com/drive/16qUBvtFCEH5WoxJUgJWueKUc2tooEOrf?usp=sharing) out

Deadlines: Colab 3 due

Date: Tue May 2

Description: PageRank
[slides (slides/09-pagerank.pdf)]

Ch5: Link Analysis (http://infolab.stanford.edu/~ullman/mmds/ch5.pdf) (Sect. 5.1-5.3, 5.5)

Events:

Deadlines:

Date: Thu May 4

Description: Extensions of PageRank to Recommendations and Spam


[slides (slides/10-spam.pdf)]

Ch5: Link Analysis (http://infolab.stanford.edu/~ullman/mmds/ch5.pdf) (Sect. 5.4)


Ch10: Analysis of Social Networks (http://infolab.stanford.edu/~ullman/mmds/ch10n.pdf) (Sect. 10.1-10.2, 10.6)

Events: Colab 5 (https://colab.research.google.com/drive/1uujHUhjQAn-yqEf8X2_8yuoSKmwhiL1J?usp=sharing),
Homework 3 (homework/hw3-bundle.zip) out

Deadlines: Colab 4, Homework 2 due


Date: Tue May 9

Description: Community Detection in Graphs


[slides (slides/11-graphs1.pdf)]

Ch10: Analysis of Social Networks (http://infolab.stanford.edu/~ullman/mmds/ch10n.pdf) (Sect. 10.3-10.5)

Events:

Deadlines:

Date: Thu May 11

Description: Graph Representation Learning


[slides (slides/12-graphs2.pdf)]

Ch10: Analysis of Social Networks (http://infolab.stanford.edu/~ullman/mmds/ch10n.pdf) (Sect. 10.7-10.8)

Events: Colab 6 (https://colab.research.google.com/drive/1Jidj7VXEyn-1zOHaWrV9Pj0Iuv_fhatK?usp=sharing) out

Deadlines: Colab 5 due

Date: Tue May 16

Description: Graph Neural Networks


[slides (slides/13-GNNs.pdf)]

Events:

Deadlines:


Date: Thu May 18

Description: Learning Embeddings


[slides (slides/14-emb.pdf)]

Events: Colab 7 (https://colab.research.google.com/drive/1pVbjVRZN94Fs2jziUqa-CTkrUlim0T_S?usp=sharing),
Homework 4 (homework/hw4-bundle.zip) out

Deadlines: Colab 6, Homework 3 due

Date: Tue May 23

Description: Decision Trees


[slides (slides/15-dt.pdf)]

Ch12: Large-Scale Machine Learning (http://infolab.stanford.edu/~ullman/mmds/ch12.pdf)

Events:

Deadlines:

Date: Thu May 25

Description: Mining Data Streams I & II


[slides (slides/16-streams.pdf)]

Ch4: Mining data streams (http://infolab.stanford.edu/~ullman/mmds/ch4.pdf)

Events: Colab 8 (https://colab.research.google.com/drive/1pXoQtiJPwJ8NVkkQcrMWjOEM4r4oIWeB?usp=sharing) out

Deadlines: Colab 7 due


Date: Tue May 30

Description: Matrix Sketching


[slides (slides/17-matrix_sketching.pdf)]

Events:

Deadlines:

Date: Thu Jun 1

Description: Computational Advertising


[slides (slides/18-advertising.pdf)]

Ch8: Advertising on the Web (http://infolab.stanford.edu/~ullman/mmds/ch8.pdf)

Events: Colab 9 (https://colab.research.google.com/drive/19MFfwBysfxYNPsSgUhnHZ-3v4IYYqpQB?usp=sharing) out

Deadlines: Colab 8, Homework 4 due

Date: Mon Jun 5

Description: Exam

Events:

Deadlines:

Date: Tue Jun 6

Description: Optimizing Submodular Functions


[slides (slides/19-submodular-conclusion.pdf)]

Turning Down the Noise in the Blogosphere (http://www.cs.cmu.edu/~kbe/tdn_kdd09.pdf) by
El-Arini, Veda, Shahaf, Guestrin. KDD 2009.

Events:

Deadlines:


Date: Thu Jun 8

Description: No class

Events:

Deadlines: Colab 9 due

