
DS-GA 1004: Big Data

Brian McFee
Spring, 2020

E-mail: brian.mcfee@nyu.edu
Web: bmcfee.github.io

Office hours: T 15:00–17:00
Office: 60 5th Ave., room 621
Class hours: M 18:45–20:25
Class room: GCASL, room C95
Lab hours: Th 18:45–19:35
Lab room: GCASL C95

Pre- and Co-requisites


Pre-requisites: DS-GA 1001 or equivalent OR prior coursework in databases
Co-requisites: DS-GA 1004 section 002 (Big Data Lab)

What is Big Data all about?


This class will introduce you to modern tools for working with large datasets. Specifically, you will
learn about:
• Relational databases
• Distributed storage
• Distributed computation
• Applications and algorithms

Course Structure
Your grade for the class will be determined by the following breakdown:
• 10% class participation
• 35% lab programming assignments
• 35% exams
• 20% final project
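As a rough illustration of how the weights above combine, here is a minimal sketch. The function name `course_score` and the 0-100 scale for each component are assumptions made for this example; the syllabus does not specify how scores are aggregated within a component.

```python
# Weights from the breakdown above, in percent.
WEIGHTS = {
    "participation": 10,
    "labs": 35,
    "exams": 35,
    "project": 20,
}


def course_score(components):
    """Combine per-component scores (each assumed to be on a 0-100 scale).

    `components` maps every key of WEIGHTS to that component's score.
    How each component score is itself computed is not modeled here.
    """
    if set(components) != set(WEIGHTS):
        raise ValueError("expected exactly these components: %s" % sorted(WEIGHTS))
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS) / 100
```

For example, `course_score({"participation": 80, "labs": 90, "exams": 70, "project": 100})` gives 84.0 under these assumptions.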
Additionally, the following policies will be in place:
• Most weeks in the calendar (listed below) have assigned reading. Complete this reading before
coming to class.
• Class participation will be measured by activity through an interactive polling system.
Participation will be counted; correctness will not.

DS-GA 1004, S19

• Lab assignments will be due at 23:55 (Eastern time) on the day noted in the schedule below.
(Margin note: Due dates are indicated like this.)
• Lab assignments must be done by each student individually, unless otherwise noted.
• For lab assignments, you will have two (2) slip days to use however you see fit over the entire
semester. For example, you can turn in one assignment two days late, or two assignments one
day late (each), without penalty. After that, there will be a 20% reduction per additional day
late. No assignment will be accepted more than 5 days after the due date. Slip days cannot be
used past the last day of the term (2020–05–11).
• There will be two (2) in-class midterm exams, and the better of your two scores will count for
both. There will be no make-up exams. No technology (phones, calculators, etc.) is permitted
during exams, unless express permission is granted by the disability center.
• Exceptions to the above may be granted in case of emergencies, but any requests must be cleared
by the instructor prior to the original due date.

Course Policies
Who should I contact?
Miscellaneous policy questions (e.g., when is the midterm? how do assignment due dates work?):
please re-read this course information document and the course website. If you still have
questions, please post them to the discussion forum on the course website.
Help with assignments or course topics: post on the discussion forum, or ask the instructor or
section leaders during office hours.
Anything sensitive or confidential (e.g., health issues, emergencies): email brian.mcfee@nyu.edu.
Anything else: I’m happy to talk to any student during office hours about various topics (course
advising, questions about research, and so on).

Class environment
It is my job to make this course inclusive and equitable for all students. Please do your part by seeking
to promote the success of others, and by treating each other in ways that respect and celebrate the
diversity of talent that is drawn to this field. Here are a few specific things that you should know
about my policies on creating an inclusive and equitable class environment (both in the classroom
and on the course website/forum):
• Preparation: Students come to this class from a wide range of backgrounds, and greatly varying
previous exposure to mathematics, programming, and data science more generally. I want to
assure students who may feel out of place here that you are indeed prepared to succeed in this
class! If you feel that there are gaps in your knowledge, please speak with the course staff and
we will help you find additional materials as needed.
• Classroom environment: For some reason, it is common in technical or programming-oriented
classes that some students ask “questions” that are not really questions so much as opportunities
to demonstrate knowledge of jargon, or facts that are beyond the scope of the topic at hand. This
can have discouraging effects on other students who may not be familiar with those terms, and
worry that this indicates that they are less prepared to do well in the class. (Note: this is rarely
the case: knowing terms outside the scope of the course is not a good predictor of success.) If
you find yourself wanting to make such a question or comment in lecture, I encourage you to
consider whether office hours would be a better venue for exploring that topic. The course staff
are more than happy to discuss tangentially related topics in office hours, when they would not
distract from lecture or alienate other students.


• Accessibility: If you have any accessibility requirements, please present a letter from the Moses
Center to me at your earliest convenience, so that I can ensure that materials and staff comply
with your needs. I am always willing to do what it takes to support you, but I ask that you have
your exam scheduling requests submitted no later than 1 week prior to the exam so that we have
sufficient time to make any necessary arrangements.

• Names and pronouns: If you have a name and/or pronoun that doesn’t match the class
roster delivered from the registrar, please let me know and I will ensure that you are addressed
correctly in our class. You are always welcome to use your preferred form of address on all
class assignments and exams; just be sure to include your NYU netID number to make sure that
records link properly.

• Class expenses: If obtaining any material for use in our class presents a financial hardship for
you, please let me know and I will do my best to arrange for loaner materials.
• Feedback: I will solicit (anonymous) feedback from students throughout the course, but if you
have pressing or specific issues, please do not hesitate to let me know if any aspect of our course
or class community can be improved.

Academic Integrity and Honesty


All students are expected to do their own work on problem sets. Students may discuss problem
sets with each other, as well as with the course staff. Any discussion of problem set questions with
others must be noted on a student’s final write-up. Each student must turn in their own write-up of
the problem set solutions. Excessive collaboration (i.e., beyond discussing problem set questions) can
result in honor code violations. Questions regarding acceptable collaboration should be directed to the
class instructor prior to the collaboration. It is a violation of the honor code to copy or derive problem
set or exam question solutions from other students (or anyone at all), textbooks, previous instances of
this course, or other courses covering the same topics. Copying solutions from other students, or from
students who previously took a similar course, is also clearly a violation of the honor code. Finally,
a good point to keep in mind is that you must be able to explain and/or re-derive anything that you
submit.
Please also refer to the general NYU academic integrity statement.

Technology infrastructure
During this class, students are encouraged (but not required) to use GitHub Classroom as a part
of course studies, and those who do will be required to agree to the Terms of Use (TOU) associated
with GitHub Classroom. GitHub Classroom requires users to be over the age of 18. Personally
identifiable information is required to create an account. This information includes name, email
address, and IP address. This information will identify users to GitHub and companies with whom it
shares data. GitHub Classroom is not an NYU service. Therefore, the user should not use their NYU
login and password. Login and password information should be unique.
You should read carefully the GitHub Terms of Use [1] and Privacy Policy [2] regarding the impact on
your privacy rights and intellectual property rights. If you have any questions or objections regarding
those Terms of Use or the impact on the class, you are encouraged to speak to the instructor prior to
enrollment.

[1] https://help.github.com/en/github/site-policy/github-terms-of-service
[2] https://help.github.com/en/github/site-policy/github-privacy-statement


Calendar
This calendar is tentative, and content may be rearranged as needed. All assigned readings are
available through the class website.

Week 01, 01/27: Course introduction

• Lab 0: Environment setup and git basics. 2020/02/05

Week 02, 02/03: Relational databases

• Reading: Hector Garcia-Molina, Jeffrey D Ullman, and Jennifer Widom. Database systems: the
complete book. Pearson Education, 2009. URL http://infolab.stanford.edu/~ullman/dscb.html,
chapter 2.
• Lab 1: working with RDBMS and SQL. 2020/02/19

Week 03, 02/10: Map-reduce


• Reading: Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large
clusters. Communications of the ACM, 51(1):107–113, 2008.

• Reading: David DeWitt and Michael Stonebraker. MapReduce: A major step backwards. The
Database Column, 1:23, 2008.

Week 04, 02/17: Presidents’ Day (no lecture this week; lab meets as usual)

• Lab 2: Hadoop and map reduce. 2020/03/05

Week 05, 02/24: The Hadoop distributed file system (HDFS)

• Reading: Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop
distributed file system. In Mass storage systems and technologies (MSST), 2010 IEEE 26th symposium
on, pages 1–10. IEEE, 2010.

Week 06, 03/02: Introduction to Spark


• Reading: Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica.
Spark: Cluster computing with working sets. HotCloud, 10(10-10):95, 2010.
• Reading: Michael Armbrust, Reynold S Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K Bradley,
Xiangrui Meng, Tomer Kaftan, Michael J Franklin, Ali Ghodsi, et al. Spark SQL: Relational
data processing in spark. In Proceedings of the 2015 ACM SIGMOD International Conference on
Management of Data, pages 1383–1394. ACM, 2015.
• Lab 3: Spark and Spark-SQL. 2020/03/26

Week 07, 03/09: Exam 1


• Covers weeks 1–6.

Spring break, 03/16: no lecture or lab this week


Week 09, 03/23: Column-oriented storage


• Reading: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar,
Matt Tolton, and Theo Vassilakis. Dremel: interactive analysis of web-scale datasets. Proceedings
of the VLDB Endowment, 3(1-2):330–339, 2010.

• Lab 4: storage comparison 2020/04/09

Week 10, 03/30: Text and similarity search


• Reading: Jure Leskovec, Anand Rajaraman, and Jeffrey D Ullman. Mining of massive datasets.
Cambridge university press, 2nd edition, 2014. URL http://www.mmds.org/, chapter 3.

Week 11, 04/06: Responsible data science

• Guest lecture from Prof. Stoyanovich


• Lab: Final project lab 2020/05/13

Week 12, 04/13: Recommender systems


• Reading: Jure Leskovec, Anand Rajaraman, and Jeffrey D Ullman. Mining of massive datasets.
Cambridge university press, 2nd edition, 2014. URL http://www.mmds.org/, chapter 9.
• Reading: Charles Duhigg. How companies learn your secrets. The New York Times, 16:2012, 2012.

Week 13, 04/20: Graph processing


• Reading: Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank
citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.

• Reading: Jure Leskovec, Anand Rajaraman, and Jeffrey D Ullman. Mining of massive datasets.
Cambridge university press, 2nd edition, 2014. URL http://www.mmds.org/, chapter 5.
• Reading: GraphFrames user guide:
https://graphframes.github.io/graphframes/docs/_site/user-guide.html.

Week 14, 04/27: Exam 2


• Covers weeks 9–12.

Week 15, 05/04: Graphics processing units (GPUs)


• Reading: TBD.

Week 16, 05/11: Advanced topics and wrap-up
