You are on page 1of 5

Harvard University, Statistics 117

Data Analysis in Modern Biostatistics


Syllabus, Spring 2020

Lectures: Tue/Thu, 9:00am–10:15pm, Science Ctr 705


Instructor: Giovanni Parmigiani
E-mail: gp@jimmy.harvard.edu
Course web site: https://canvas.harvard.edu/courses/50403
Office Hours: Thu 10:30-12:30pm or by appointment, Science Ctr 400k.
Section Time: See poll on canvas website
TF: Albert Wu albertwu@g.harvard.edu
TF Office Hours: Tuesdays, 3-5 PM, 7th Floor Lounge Science Center.

Overview: The course is an introduction to the application of statistical concepts in biomedical


research, through the lens of biomarker research in cancer biology and medicine. The foundation is
a comprehensive collection of gene expression and clinical characteristics for patients with cancer of
the ovaries, called CuratedOvarianData [LINK]. Table 1 provides and overview of goals and methods
covered.

Students will learn the methodology in lectures, sections, and individual homeworks. They will
then use it to carry out data analyses through individual and group projects. After projects are
completed, there will be structured opportunities for peer-to-peer discussion and critique.

Research Goal Quantitative Method


Odds ratios and relative risks, sensitivity, specificity,
ROC analysis, effect sizes, priors & posteriors, MCMC
Biomarker Evaluation
(as implementable in Rjags), value of information
analysis, decision analysis, time-to-event outcomes
High dimensional biomarker searches, multiple testing,
Biomarker Discovery
False Discovery Rates, regression to the mean
Biomarker Validation Cross-study concordance, r-values
Basic hierarchical models, MCMC for hierarchical
Biomarker Meta-analysis
models
Selected basic machine learning algorithms (e.g.
Biomarker-based Prognosis clipping, cart, top scoring pairs), cross-validation versus
cross-study validation.

Table 1: Summary of Topics and Related Methodologies.

1
Objectives: To develop a sense for practical applications of statistical thinking and tools in biomed-
ical sciences. To learn how to make and defend statistical modeling choices, and to constructively
critique the analyses of others. To develop a working knowledge of the R programming language,
as relevant for biomedical data science applications.

Prerequisites:
Required: Stat 110 and (AP Stat or 102 or 104)/111/139.
Recommended: Stat 115.

Lectures and Sections: Lectures will cover general concepts and project critiques. Sections will
cover hands-on examples and programming. Attendance is mandatory for both.

One-hour weekly sections will begin the second week of the course. Sections will be the place to
work through examples, review lecture material in a more detailed way, solve problems, and learn
how to use the statistics packages for implementing the various methods taught in the course.

Table 2 is a draft lecture-by-lecture outline of material covered in the course. Of course we will
leave some wiggle room and make variations.

Homework: The workload will consist of four homework assignments and three projects. Home-
work will help you master analysis and conding skills to carry out the projects.

Homework assignments will be made available on the course web page. You will be told when the
assignment is posted online. Homework solutions are due by 4:00pm on the due date. You are
free to discuss and work on homework problems with other students, but you should write up your
solutions independently (see the collaboration policy statement).

The official course policy is that no late homework will be accepted. In return for your timely
submission of homework, we will make every effort to return graded homework promptly. See
Table 2 for key dates.

Homework Grading. Homework assignments will be graded by a grader or TF on a scale from


1 to 5. Homeworks are graded in large part on the clarity of your presentation of the solutions,
not just their correctness. Homeworks that are generally clear and correct will earn scores of 4
or 5; those less so will earn a 3. Sloppy and/or incomplete homeworks will receive a 1 or 2. All
homeworks will count toward your course grade – we will not drop any homework grades.

Project Assessment and Grading. The course requires three projects. See Table 2 for key
dates.

Projects will be graded by the TF on the same scale as the homework. Separately, projects will also
be reviewed by a peer group, and some will be discussed in class. After projects are turned in, we
will form mini-groups (4-5 students) to discuss each other’s projects. Each group will hold a guided
discussion of project results, and nominates one project for class discussion in the next lecture. The
instructor will review nominated projects and choose two or three. No additional credit will be
given for this selection as they will be identified for controversy/discussion value as much as quality.

2
Class Date Methodological Topic Key Dates
Jan 28 L1 Course introduction
Jan 30 L2 Gene Expression Biomarkers; OCa Data
Feb 4 L3 Biomarker Modeling and Evaluation
Feb 6 L4 Decision Analysis
Feb 11 L5 Bayesian Analysis
Feb 13 L6 MCMC Homework 1 Due on Feb 13th
Feb 18 L7 Time-to-event Data
Feb 20 L8 Last minute Q&A with AW Project 1 Due
Feb 25 L9 Comparing Criteria for Biomarker Discovery
Feb 27 L10 False Discovery Rates
Mar 3 L11 Project 1 Discussion
Mar 5 L12 Empirical Bayes FDR Homework 2 Due Mar 5th
Mar 10 L13 Biomarker Validation
Mar 12 L14 Biomarker Meta-analysis Take-home midterm exam on Mar 11,12,13
SPRING BREAK
Mar 24 L15 MCMC for Hierarchical Models, Binary
Mar 26 L16 MCMC for Hierarchical Models, Continuous
Mar 31 L17 Joint versus Conditional Models
Apr 2 L18 Prediction of Future Studies Homework 3, Due Apr 4
Apr 7 L19 Top Scoring Pairs
Apr 9 L20 Clipping vs LDA vs TSP vs Logistic Project 2 Due
Apr 14 L21 Project 2 Discussion
Apr 16 Cross-Study Validation Homework 4 Due
Apr 21 Class Presentations for Final Projects
Apr 23 Class Presentations for Final Projects
Apr 28 Class Presentations for Final Projects
Final Project Due May 9th

Table 2: Tentative Weekly Outline.

The final project will be due sometimes after the end of classes. Students will make a class presen-
tation of the final project’s results. We will set aside three lecture slots at the end of the semester
for this. The final project will incorporate any feedback, and a discussion of how it was addressed.
Additional more specific instructions will be given when the final project is posted.

Exam: The course will have one take-home midterm exam during the semester. The midterm will
test methodological concepts from the first half of the course. Students will be required to take it
online in a three-hour session within a 24-hour window.

3
Grades: Course grades will be determined by the following components, with the weights shown:

Homework assignments 20% (5% each)


Midterm Exam 20%
Course Projects 20% (10% Each)
Class Participation 10%
Final Project including Presentation 30%

Class participation will be evaluated based on contributions to the peer-review process and class
discussions.

Computing: The computer package for this course is R, which runs on Windows, MacOS, and
Linux systems, and can be used natively or within the Rstudio environment. R is free and available
online through www.r-project.org. The free version of Rstudio is at
https://www.rstudio.com/products/rstudio/download/ R is straightforward to learn, but is
sufficiently powerful and versatile to be useful for many projects that you might carry out after
this course. Reference guides on R and on the specific R packages required by the projects will be
placed on the course web site. No prior knowledge of R is needed, general concepts and examples
will be provided, but a fair amount of independence is expected in figuring out details. The online
resources are very helpful.

Documentation of your analyses will be through the R markdown language within RStudio.

Course Materials There is no single textbook for the course. Materials will include general
references on computing and modeling, such as,

W. N. Venables, D. M. Smith and the R Core Team. An introduction to R. free download


G. James, D. Witten, T. Hastie, R Tibshirani. An Introduction to Statistical Learning, with
Applications in R. free download
G. Parmigiani. Modeling in Medical Decision Making. Wiley 2002.

as well as articles or book chapters specific to the individual topics. All course materials will be
free and available through links on the course website.

I will make every effort to make lectures available in real time for tablet note-taking.

Logistics: A detailed schedule, including due dates for homework and projects, is in Table 2.

All course documents, including homeworks, supplementary material, links to data and software,
will be available on the course web site on canvas. Students will receive course announcements
through canvas e-mail.

Homework and projects will be submitted through canvas.

Disability accommodations for timed exams. Students needing academic adjustments or


accommodations because of a documented disability must present their Faculty Letter from the
Accessible Education Office (AEO) and speak with the instructor by the end of the third week

4
of the term. Failure to do so may result in us being unable to respond in a timely manner. All
discussions will remain confidential.

Collaboration policy statement: University policies against plagiarism will be strictly enforced.
You are encouraged to discuss problem sets with your classmates, but each student must write up
and code solutions separately. Be sure that you have worked through each problem yourself and
that the answers you submit are the results of your own efforts. You also may not share or view
another student’s computer code, submit output from another student’s computer session, or allow
another student to view your code or output. A good rule of thumb: if a fellow student asks if you
would like to discuss a homework problem, we encourage you to say “yes”; if a fellow student asks
to see your answer to a homework problem or R code, the answer is “no.”

You might also like