
Lecture 1

Data Science for Molecular Engineering


BIEN/CENG3300
Self-introduction
• What is your name/how do you prefer we call you?
• Where are you from?
• What is your major? Which year?
• Why do you take this course?

• Name one problem you would like to solve with data science
In-class rules
• Active participation is highly encouraged; no pressure though.
• Interrupt and speak up if you have questions (no need to raise
hands)
What is data science?

Simulation: Input Data + Program → Output Data

Data Science: Input Data + Output Data → Model
Example – chemical engineering

If you add reactants to an ideal reactor and you know the
mechanism and kinetics, you can predict the conversion
at any given time; this is not a data science approach.

You run many reactions to measure the conversion under
different conditions, then fit a model to the data to learn
how the conditions relate to conversion (which factors affect
conversion); this is a data science approach.
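The contrast above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a first-order reaction in an ideal batch reactor, a hypothetical rate constant, and synthetic "measurements" generated with random noise in place of real experiments. In the first half the kinetics are known and the conversion is computed directly; in the second half only the data are used and the rate constant is recovered by a fit.

```python
import math
import random


def conversion_first_order(k, t):
    """Conversion X(t) = 1 - exp(-k*t), assuming first-order kinetics
    in an ideal batch reactor (an assumption made for this sketch)."""
    return 1.0 - math.exp(-k * t)


# --- Simulation: the "program" (mechanism and kinetics) is known ---
k_known = 0.30  # rate constant in 1/min, hypothetical value
predicted = conversion_first_order(k_known, 3.0)

# --- Data science: only (time, conversion) measurements are available ---
rng = random.Random(0)  # fixed seed so the sketch is reproducible
times = [0.5 * i for i in range(1, 11)]  # 0.5 to 5.0 min
# Synthetic measurements: true curve plus Gaussian noise (stand-in for experiments)
measured = [conversion_first_order(k_known, t) + rng.gauss(0.0, 0.01) for t in times]

# Linearize ln(1 - X) = -k*t and fit a line through the origin by least squares.
y = [math.log(1.0 - x) for x in measured]
k_fit = -sum(t * yi for t, yi in zip(times, y)) / sum(t * t for t in times)

print(f"known-kinetics prediction at t = 3 min: X = {predicted:.3f}")
print(f"rate constant recovered from data: k = {k_fit:.3f} 1/min")
```

Note that the fit still assumes a functional form (first order). A more general data-driven model would make weaker assumptions and typically need more data.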
Example – biology

You know all the gene expression pathways, and use this
knowledge to predict a phenotype based on a gene sequence;
this is not a data science approach.

You have a lot of gene sequence and phenotype data, and you
develop a model based on the data to predict which gene(s)
control a certain phenotype (and how?); this is a data science
approach.
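A toy version of the data-driven case can be sketched as follows. Everything here is hypothetical: the gene names, the number of samples, and the synthetic phenotype, which is generated so that one gene drives it. The sketch ranks genes by the absolute Pearson correlation between variant presence and phenotype. Real studies would use far more careful statistics, and correlation alone says nothing about the "how".

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


rng = random.Random(1)  # fixed seed for reproducibility
gene_names = ["geneA", "geneB", "geneC", "geneD", "geneE"]  # hypothetical names
causal_gene = "geneC"  # assumption used only to generate the synthetic data
n_samples = 200

# genotypes[g][i] is 1 if sample i carries a variant of gene g, else 0
genotypes = {g: [rng.randint(0, 1) for _ in range(n_samples)] for g in gene_names}

# Synthetic phenotype: driven by the causal gene plus measurement noise
phenotype = [
    2.0 * genotypes[causal_gene][i] + rng.gauss(0.0, 0.5) for i in range(n_samples)
]

# Score each gene by how strongly it is associated with the phenotype
scores = {g: abs(pearson(genotypes[g], phenotype)) for g in gene_names}
ranking = sorted(gene_names, key=lambda g: scores[g], reverse=True)

for g in ranking:
    print(f"{g}: |r| = {scores[g]:.2f}")
```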
Quiz
• Describe how you would tackle the following problem using a
data science approach

Design an absorbent material to separate ethylene and propylene
So why do we need data science in molecular engineering?
Because we often do not have complete knowledge of the
underlying “program”!!
Data Science in Molecular Engineering

How will you address this problem?


Some more benchmarking results
10 expert chemists
80 reactions with varying popularity

1. Is this a fair comparison?
2. Does this mean human chemists can be replaced?

Chem. Sci., 2019, 10, 370-377


Other chemistry examples

ASKCOS (MIT)
RXN4Chemistry (IBM)
……
Other examples in engineering
Other examples in life sciences

Antibody discovery
What this course is about…
Intended learning outcomes:
• 1. Identify problems that can be formulated as a data science problem
• 2. Process different types of data to be ready for model training
• 3. Understand the principles of supervised learning methods
• 4. Perform model training, validation and testing
• 5. Clearly interpret model predictions and present model results
• 6. Know the applications of data science methods to problems in
molecular science
Assessment
• Homework 20%
• Final Exam 40%
• Course project presentation 25%
• Literature critique 15%
Homework rules
• Collaboration is okay, but you must acknowledge your collaborators.
• Please submit on time. Late submissions must be requested
24 hours in advance with justification; otherwise, points will be
deducted.
Course project
• Choose a data science problem (preferably in molecular
engineering)
• Define the problem
• Collect data
• Choose/develop models
• Train and evaluate models
• Analyze and present the results/findings
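The project steps above can be sketched end to end on a toy problem. This is a minimal illustration, not a template for the actual project: the data are synthetic (a hypothetical descriptor x and property y with a linear relationship plus noise), the model is a simple least-squares line, and the 80/20 train/test split is an arbitrary choice.

```python
import math
import random

rng = random.Random(42)  # fixed seed for reproducibility

# Collect data: synthetic stand-in for a real dataset (hypothetical relationship)
data = []
for _ in range(100):
    x = rng.uniform(0.0, 10.0)             # e.g. a molecular descriptor
    y = 1.5 * x + 2.0 + rng.gauss(0.0, 0.5)  # e.g. a measured property
    data.append((x, y))

# Split into training and test sets (80/20)
rng.shuffle(data)
n_train = int(0.8 * len(data))
train, test = data[:n_train], data[n_train:]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(points, slope, intercept):
    """Root-mean-square error of the line on the given points."""
    sq_err = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    return math.sqrt(sq_err / len(points))


# Train on the training set only, then evaluate on held-out data
slope, intercept = fit_line(train)
train_rmse = rmse(train, slope, intercept)
test_rmse = rmse(test, slope, intercept)

print(f"fitted model: y = {slope:.2f} * x + {intercept:.2f}")
print(f"train RMSE = {train_rmse:.2f}, test RMSE = {test_rmse:.2f}")
```

Evaluating on data the model has not seen during fitting is what makes the reported error a fair estimate of how the model would perform on new samples.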
Literature Critique
• Numerous papers applying machine learning to chemistry- and
biology-related problems are published every day
• Not all applications of machine learning methods are
appropriate, even when they are published!
• Pick a paper and analyze its limitations and what could be
improved (an example will be provided in a future lecture)
Course Format
• Concepts followed by interactive coding sessions
• Bring laptops to Friday lectures

• Tutorials
• Will be arranged as needed, based on the progress of the course
• TA will be available to answer questions about homework and practice exercises
Weekly schedule
Part 0. Introduction
Week 1 Real-world applications of data science in physical, chemical and life sciences

Part 1. Math Methods for Data Science


Week 2 Basic concepts in linear algebra, calculus and numerical optimization
Week 3 Review of statistics and probability

Part 2. Data Processing


Week 4 Introduction to Python and pandas for processing tabular data
Week 5 Cheminformatics toolboxes – the example of RDKit
Week 6 Results visualization – molecules, chemical reactions, trees, graphs

Part 3. Supervised Learning

Week 7 Introduction to common supervised learning problems in molecular engineering
Week 8 Linear regression - molecular property predictions and feature importance
Week 9 Molecular feature engineering and nonlinear regression
Week 10 Introduction to deep learning and neural networks

Part 4. Miscellaneous Topics


Week 11 Introduction to unsupervised learning and reinforcement learning
Week 12 Guest lecture on real-world applications of data science
Week 13 Course project presentation, Literature critique presentation
