CENG3300 Lecture 1

Lecture 1
Data Science for Molecular Engineering

BIEN/CENG3300
Self-introduction
• What is your name/how do you prefer we call you?
• Where are you from?
• What is your major? Which year?
• Why do you take this course?
• Name one problem you would like to solve with data science
In-class rules
• Active participation is highly encouraged; no pressure though.
• Interrupt and speak up if you have questions (no need to raise
hands)
What is data science?
Simulation: Input Data + Program → Output Data
Data Science: Input Data + Output Data → Model
Example – chemical engineering
You add reactants to an ideal reactor and you know the mechanism and kinetics, so you can predict the conversion at any given time; this is not a data science approach.
You run many reactions to measure the conversion under different conditions, then fit a model to the data that predicts how the conditions relate to conversion (which factors affect conversion); this is a data science approach.
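The two approaches above can be contrasted in a minimal Python sketch. It assumes a first-order, irreversible reaction in an isothermal, constant-volume batch reactor, which the slides do not specify; the rate constant, the sampling times and the "measurements" are made-up illustrative values. Note that the data-driven half still borrows the first-order functional form and learns only the rate constant from data; a fully empirical model would have to learn the form as well.

```python
import math


def conversion_from_mechanism(k, t):
    """Known kinetics (assumed first order): X(t) = 1 - exp(-k * t)."""
    return 1.0 - math.exp(-k * t)


def fit_rate_constant(times, conversions):
    """Least-squares slope through the origin for -ln(1 - X) = k * t."""
    y = [-math.log(1.0 - x) for x in conversions]
    return sum(t * yi for t, yi in zip(times, y)) / sum(t * t for t in times)


# Not data science: k is assumed known (hypothetical value, in 1/min),
# so the conversion at any time follows directly from the model.
k_known = 0.10
predicted = conversion_from_mechanism(k_known, 10.0)

# Data science: k is treated as unknown and estimated from measurements.
# The measurements are synthetic: the model above plus small made-up errors.
times = [2.0, 5.0, 10.0, 20.0, 30.0]
errors = [0.01, -0.01, 0.005, -0.005, 0.0]
measured = [conversion_from_mechanism(0.10, t) + e for t, e in zip(times, errors)]
k_fitted = fit_rate_constant(times, measured)

print(f"predicted conversion at t = 10 min: {predicted:.3f}")
print(f"rate constant recovered from data: {k_fitted:.3f} 1/min")
```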
Example – biology
You know all the gene expression pathways, and use this
knowledge to predict a phenotype based on a gene sequence;
this is not a data science approach.
You have a lot of gene sequence and phenotype data, and you
develop a model based on the data to predict which gene(s)
control a certain phenotype (and how?); this is a data science
approach.
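As a rough illustration of the data-driven side, the sketch below ranks genes by how strongly a variant is associated with a phenotype. Everything in it is hypothetical: the gene names, the eight samples and the scoring rule (difference in mean phenotype between carriers and non-carriers) are chosen only to keep the example small. An association found this way suggests candidates; it does not by itself show that a gene controls the phenotype.

```python
from statistics import mean

# Toy, made-up data: (variant present per gene, observed phenotype 0/1).
samples = [
    ({"geneA": 1, "geneB": 0, "geneC": 1}, 1),
    ({"geneA": 1, "geneB": 1, "geneC": 0}, 1),
    ({"geneA": 1, "geneB": 0, "geneC": 0}, 1),
    ({"geneA": 1, "geneB": 1, "geneC": 1}, 1),
    ({"geneA": 0, "geneB": 0, "geneC": 1}, 0),
    ({"geneA": 0, "geneB": 1, "geneC": 0}, 0),
    ({"geneA": 0, "geneB": 0, "geneC": 0}, 0),
    ({"geneA": 0, "geneB": 1, "geneC": 1}, 0),
]


def association_score(gene):
    """Absolute difference in mean phenotype: carriers vs non-carriers."""
    carriers = [pheno for genes, pheno in samples if genes[gene] == 1]
    non_carriers = [pheno for genes, pheno in samples if genes[gene] == 0]
    return abs(mean(carriers) - mean(non_carriers))


gene_names = ["geneA", "geneB", "geneC"]
scores = {gene: association_score(gene) for gene in gene_names}

# Genes sorted from strongest to weakest association with the phenotype.
ranked = sorted(gene_names, key=lambda gene: scores[gene], reverse=True)
print(ranked, scores)
```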
Quiz
• Describe how you would tackle the following problem using a
data science approach
Design an adsorbent material to separate ethylene and propylene
So why do we need data science in
molecular engineering?
Because we often do not have complete knowledge of the
underlying “program”!!
Data Science in Molecular Engineering
How will you address this problem?
Some more benchmarking results
10 expert chemists
80 reactions with varying popularity
1. Is this a fair comparison?
2. Does this mean human chemists can be replaced?
Chem. Sci., 2019, 10, 370-377

Other chemistry examples
ASKCOS (MIT)
RXN4Chemistry (IBM)
……
Other examples in engineering
Other examples in life sciences
Antibody discovery
What this course is about…
Intended learning outcomes:
• 1. Identify problems that can be formulated as a data science problem
• 2. Process different types of data to be ready for model training
• 3. Understand the principles of supervised learning methods
• 4. Perform model training, validation and testing
• 5. Clearly interpret model predictions and present model results
• 6. Know how data science methods are applied to problems in molecular science
Assessment
• Homework 20%
• Final Exam 40%
• Course project presentation 25%
• Literature critique 15%
Homework rules
• Collaboration is okay, but it must be acknowledged
• Please submit on time. Late submissions must be requested
24 hours in advance with justification; otherwise points will be
deducted.
Course project
• Choose a data science problem (preferably in molecular
engineering)
• Define the problem
• Collect data
• Choose/develop models
• Train and evaluate models
• Analyze and present the results/findings
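The "choose, train and evaluate" steps above can be sketched in a few lines of plain Python. This is only an assumed workflow for illustration, not the required project template: the data are synthetic, the 60/20/20 split and the two candidate models (a constant baseline and a straight line) are arbitrary choices, and mean squared error is used as the metric.

```python
import random
from statistics import mean

# Synthetic data, made up for illustration: y is roughly 2x + 1.
data = [(float(x), 2.0 * x + 1.0 + 0.1 * ((x % 3) - 1)) for x in range(20)]

# Shuffle once with a fixed seed, then split 60/20/20.
rng = random.Random(0)
rng.shuffle(data)
train, valid, test = data[:12], data[12:16], data[16:]


def fit_constant(rows):
    """Baseline model: always predict the mean training target."""
    c = mean(y for _, y in rows)
    return lambda x: c


def fit_line(rows):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar = mean(x for x, _ in rows)
    y_bar = mean(y for _, y in rows)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in rows)
             / sum((x - x_bar) ** 2 for x, _ in rows))
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def mse(model, rows):
    """Mean squared error of a model on a set of (x, y) rows."""
    return mean((model(x) - y) ** 2 for x, y in rows)


# Train every candidate on the training set only.
candidates = {"constant": fit_constant(train), "linear": fit_line(train)}

# Use the validation set to choose between candidates.
best_name = min(candidates, key=lambda name: mse(candidates[name], valid))

# Touch the test set once, at the end, to report the chosen model's error.
test_error = mse(candidates[best_name], test)
print(best_name, test_error)
```

Keeping the test set out of model selection is the point of the three-way split: the validation error guides the choice, and the test error is the number to present.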
Literature Critique
• Numerous publications are out every day on machine learning
applied to chemistry/biology related problems
• Not all of the applications of machine learning methods are
appropriate – even when they are published!
• Pick a paper and analyze its limitations and what could be
improved (an example will be provided in a future lecture)
Course Format
• Concepts followed by interactive coding sessions
• Bring laptops to Friday lectures
• Tutorials
• Will be arranged as needed, based on the progress of the course
• The TA will be available to answer questions about homework and practice exercises
Weekly schedule
Part 0. Introduction
Week 1 Real-world applications of data science in physical, chemical and life sciences
Part 1. Math Methods for Data Science

Week 2 Basic concepts in linear algebra, calculus and numerical optimization
Week 3 Review of statistics and probability
Part 2. Data Processing

Week 4 Introduction to Python and Pandas for Processing Tabulated Data
Week 5 Cheminformatics toolboxes – the example of RDKit
Week 6 Results visualization – molecules, chemical reactions, trees, graphs
Part 4. Supervised Learning

Week 7 Introduction to common supervised learning problems in molecular engineering
Week 8 Linear regression - molecular property predictions and feature importance
Week 9 Molecular feature engineering and nonlinear regression
Week 10 Introduction to deep learning and neural networks
Part 5. Miscellaneous Topics

Week 11 Introduction to unsupervised learning and reinforcement learning
Week 12 Guest lecture on real world applications of data science
Week 13 Course project presentation, Literature critique presentation
