
Appendix 4

Abdul Wahab Ziaullah, Sanjay Chawla, Fadwa El Mellouhi


Automated Materials Science Workflows

1 Introduction
The field of materials science targeting the development of enhanced materials is vast, and it often requires an exhaustive manual search for materials with the desired optimal properties via Density Functional Theory (DFT) calculations or physical experiments. Our aim is to develop automated computational materials discovery workflows that exploit the capabilities of machine learning. Machine learning enables a principled exploration of an exhaustive search space: data are gathered by sampling, a knowledge model is updated, and that model is then used to better target subsequent data gathering until sufficient information has been obtained. The final knowledge model developed in this way is then used to extract optimal parameters, such as the materials composition that maximizes stability. Here, the knowledge model is our approximation of the actual underlying model, which is not known in advance and/or could be very expensive to obtain. We therefore build and update the knowledge model by sequentially taking samples from physical experiments, DFT calculations, or existing datasets. This sequential updating of the knowledge model and selection of the next best experiment or DFT computation is guided by an optimization criterion, in our case Bayesian optimization.
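As a concrete illustration of this loop, the following minimal sketch uses a Gaussian Process surrogate from scikit-learn with an Expected Improvement criterion; the objective f, its one-dimensional domain, and the sampling budgets are cheap placeholders standing in for a real experiment or DFT calculation, not part of our actual workflow.

    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    def f(x):
        # stand-in for an expensive experiment or DFT calculation
        return np.sin(3.0 * x) + 0.5 * x ** 2

    rng = np.random.default_rng(0)
    X = rng.uniform(-4.0, 4.0, size=(5, 1))            # initial samples
    y = np.array([f(xi[0]) for xi in X])

    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    grid = np.linspace(-4.0, 4.0, 500).reshape(-1, 1)  # candidate next samples

    for _ in range(20):                                # sequential update loop
        gp.fit(X, y)                                   # refresh the knowledge model
        mu, sigma = gp.predict(grid, return_std=True)
        best = y.min()
        z = (best - mu) / np.maximum(sigma, 1e-9)
        ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)  # Expected Improvement
        x_next = grid[np.argmax(ei)]                   # most informative next sample
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next[0]))

    print("best sample found:", X[np.argmin(y)][0], y.min())

In the actual workflow the call to f corresponds to launching a DFT calculation or physical experiment, and the loop body is executed by the workflow engine rather than a plain for-loop.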
An example of such a case is tuning the composition of a perovskite by modelling a knowledge function F(x), whose domain x represents the possible configurations of the perovskite molecule and whose output (prediction) is the corresponding enthalpy of mixing, used to judge whether the desired intrinsic stability is achieved. Because such knowledge models are formulated empirically, owing to the nature of machine learning, an optimization procedure is used to obtain a model with the desired prediction accuracy. The optimization simultaneously yields better knowledge models as more data are gathered and enables determining the global optimum of the knowledge model.
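Written compactly (the notation below is ours, introduced only for illustration), the loop selects the next sample with an acquisition function applied to the current knowledge model, and the final model is queried for the optimum:

\[
x_{n+1} = \arg\max_{x \in \mathcal{X}} \alpha\left(x \mid F_{n}\right),
\qquad
x^{*} \approx \arg\min_{x \in \mathcal{X}} F_{N}(x),
\]

where \(\mathcal{X}\) is the search domain (for example, candidate compositions), \(F_{n}\) is the knowledge model fitted to the first \(n\) samples, \(\alpha\) is the acquisition function, and \(N\) is the total sampling budget; the arg min becomes an arg max when the target property is to be maximized.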

2 Background
We begin by exploring the machine learning algorithms that are commonly used for our optimal materials discovery objective.

[Figure 1 shows a tree of ML methods and parameters: surrogate modelling (Gaussian Process with RBF/Matern kernels, Random Forest, Bayesian neural network, neural networks, ensemble methods; Stuke et al. 2021), optimization strategies (Bayesian optimization, particle swarm optimization, grid search, random search; Gonzalez et al. 2015), acquisition functions (Expected Improvement, Upper Confidence Bound, Mutual Information, ϵ-greedy), K-parallel evaluations (Kriging Believer), dimensionality reduction (PCA, t-SNE), and domain knowledge used to encourage or restrict the search space (empirical vs. analytical functions).]

Figure 1: Selected Methods and Parameters for Benchmarking using Test Functions

These include knowledge models, also known as surrogate functions, such as the Gaussian Process (GP) and Random Forest (RF), together with optimization algorithms such as Bayesian optimization, which improve the predictive capability of the surrogate functions by acquiring informative data through experiments or surveys. In computational materials science, obtaining data can be expensive or time consuming, for example when running DFT calculations; optimization therefore allows us to learn from few yet informative data points. We previously presented a use case in which machine learning was employed to obtain maximum stability of the perovskite structure by acquiring key data from highly targeted DFT computations [3]. Moreover, an objective of this research is also to explore better methods that can be targeted at reducing computation time. The tree of potential approaches we consider is presented in Figure 1.

The highlighted branches of the tree are the methods and parameters that we plan to use in our benchmarking, to determine which are best suited to our use case. Our optimization workflow is built using the Rocketsled library [1], which runs within the FireWorks framework [2], enabling execution on parallel computing platforms such as high-performance computing (HPC) clusters. The parallel capability comes from FireWorks' internal use of a MongoDB database, which keeps track of the data and of the computations running on the various clusters and seamlessly integrates them into a single executable workflow.
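To illustrate how FireWorks stitches individual steps into a database-tracked workflow, a minimal sketch is given below; it is a generic two-step "hello world", not our actual Rocketsled-driven optimization, and it assumes a MongoDB instance reachable with the default LaunchPad settings.

    from fireworks import Firework, LaunchPad, ScriptTask, Workflow
    from fireworks.core.rocket_launcher import rapidfire

    lp = LaunchPad()  # connects to the MongoDB instance that tracks all jobs

    # one Firework per step; here each step only runs a shell command
    fw1 = Firework(ScriptTask.from_str('echo "evaluate candidate (DFT or experiment)"'),
                   name="evaluate")
    fw2 = Firework(ScriptTask.from_str('echo "update knowledge model"'),
                   name="update")

    wf = Workflow([fw1, fw2], {fw1: [fw2]})  # "update" runs only after "evaluate"
    lp.add_wf(wf)                            # register the workflow in the database
    rapidfire(lp)                            # launch locally; on HPC, launchers on each cluster pull jobs from the same database

In our workflow, the evaluation step is replaced by the expensive objective and the update step by Rocketsled's optimization task, so that surrogate fitting and the suggestion of the next sample are handled inside the same FireWorks pipeline.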

3 Methodology
To ensure a consistent analysis of our research objective, we begin by performing benchmark experiments on the existing methods and parameters presented in Figure 1. This will also enable us to compare the new methodologies and algorithms that we develop over the course of the research project. To this end, we develop a Bayesian optimization workflow and use Gaussian Process and Random Forest surrogate functions for knowledge modelling. Bayesian optimization uses an acquisition function to determine the next experiment or function evaluation to be conducted. We set the optimization objective to finding the minima of the well-known analytical test functions presented in Table 1. Since one criterion of single-objective materials science experiments is to determine the material composition that provides optimal characteristics, such as an optimal band gap in perovskites, learning these known test functions and comparing the located minima against the true ones helps us evaluate the performance of the selected methods. The input domain of all test functions was chosen as (-4, 4), as the functions are well defined in that range and possess at least one minimum there. For each function we run multiple evaluations and report the computation time and root mean square error (RMSE). We compare the computation time of the Gaussian Process with that of the Random Forest under two acquisition functions for BO, namely Expected Improvement (EI) and Lower Confidence Bound (LCB). We also compare the two acquisition functions side by side in terms of the samples they suggest.
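For reference, the standard forms of these two acquisition functions for a minimization problem, written in terms of the surrogate's posterior mean \(\mu(x)\), standard deviation \(\sigma(x)\), and the best value \(f_{\min}\) observed so far (these are the textbook definitions, not expressions taken from the Rocketsled source; \(\kappa\) is an exploration parameter):

\[
\mathrm{EI}(x) = \left(f_{\min} - \mu(x)\right)\Phi(z) + \sigma(x)\,\phi(z),
\qquad
z = \frac{f_{\min} - \mu(x)}{\sigma(x)},
\qquad
\mathrm{LCB}(x) = \mu(x) - \kappa\,\sigma(x),
\]

where \(\Phi\) and \(\phi\) are the standard normal CDF and PDF; the next sample is the point that maximizes EI or, for LCB, minimizes the lower confidence bound.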

Table 1: Analytical Test Functions for Benchmarking

Test Function    Formula                                                                                              Minima
Rosenbrock       $100(x_2 - x_1^2)^2 + (1 - x_1)^2$                                                                   (1, 1)
Egg Crate        $x_1^2 + x_2^2 + 25\left(\sin^2(x_1) + \sin^2(x_2)\right)$                                           (0, 0)
Branin           $\left(x_2 - \frac{5.1 x_1^2}{4\pi^2} + \frac{5 x_1}{\pi} - 6\right)^2 + 10\left(1 - \frac{1}{8\pi}\right)\cos(x_1) + 10$   $(\pi, 2.275)$
Camelback        $2x_1^2 - 1.05x_1^4 + \frac{x_1^6}{6} + x_1 x_2 + x_2^2$                                             (0, 0)
Ackley           $-20\exp\left(-0.2\sqrt{\tfrac{1}{2}(x_1^2 + x_2^2)}\right) - \exp\left(\tfrac{1}{2}(\cos(2\pi x_1) + \cos(2\pi x_2))\right) + 20 + e$   (0, 0)

All the functions were subjected to the same test conditions.
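For reproducibility, the five functions can be written directly in a few lines of NumPy; the snippet below is a straightforward transcription of Table 1, and the function names and the random sampling of the (-4, 4) domain are only illustrative.

    import numpy as np

    def rosenbrock(x1, x2):
        return 100.0 * (x2 - x1 ** 2) ** 2 + (1.0 - x1) ** 2                    # minimum at (1, 1)

    def egg_crate(x1, x2):
        return x1 ** 2 + x2 ** 2 + 25.0 * (np.sin(x1) ** 2 + np.sin(x2) ** 2)   # minimum at (0, 0)

    def branin(x1, x2):
        return ((x2 - 5.1 * x1 ** 2 / (4.0 * np.pi ** 2) + 5.0 * x1 / np.pi - 6.0) ** 2
                + 10.0 * (1.0 - 1.0 / (8.0 * np.pi)) * np.cos(x1) + 10.0)       # minimum at (pi, 2.275)

    def camelback(x1, x2):
        return 2.0 * x1 ** 2 - 1.05 * x1 ** 4 + x1 ** 6 / 6.0 + x1 * x2 + x2 ** 2  # minimum at (0, 0)

    def ackley(x1, x2):
        return (-20.0 * np.exp(-0.2 * np.sqrt(0.5 * (x1 ** 2 + x2 ** 2)))
                - np.exp(0.5 * (np.cos(2.0 * np.pi * x1) + np.cos(2.0 * np.pi * x2)))
                + 20.0 + np.e)                                                   # minimum at (0, 0)

    # sample the benchmarking domain (-4, 4) used in this study
    x1, x2 = np.random.uniform(-4.0, 4.0, size=(2, 1000))
    for fn in (rosenbrock, egg_crate, branin, camelback, ackley):
        print(fn.__name__, float(fn(x1, x2).min()))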

4 Results and Discussion
We gathered the results by running the computations on a 28-core Intel Xeon E5-2690 v4 @ 2.6 GHz CPU with 130 GB of RAM. A single instance of the comparison of computation time and RMSE accuracy over 100 evaluations using GPs is presented in Table 2. From the table we can see that, although the test functions differ in nature, their computation times are more or less uniform. This also holds if the surrogate function is changed to Random Forest.

Table 2: Computation Time and Accuracy

Test Function         Computation time (s)    RMSE
Rosenbrock            14.628                  0.215
Egg Crate             14.539                  0.281
Branin                14.425                  0.790
Camelback             14.949                  0.424
Ackley                14.841                  0.679
Average               14.676                  0.478
Standard Deviation

All the functions were subjected to the same test conditions.

We compared the computation times of GP and RF over multiple evaluations and found that RF was several orders of magnitude slower than GP. The result is presented in Figure 2.

Figure 2: Computation Times for Random Forest Regressor (RFR) and Gaussian Process Regressor (GPR)

For the accuracy calculations, we ran several experiments with GP and RF along with the EI and LCB acquisition functions. The results are presented in Figure 3.

Figure 3: RMSE versus number of evaluations for Random Forest Regressor (RFR) and Gaussian Process Regressor (GPR) with the EI and LCB acquisition functions

From Figure 3 we notice a general trend: as the number of evaluations increases, the RMSE generally decreases. However, we also notice some anomalies, which could be attributed to the acquisition function's choice of the next sample. In general, using GP with EI as the acquisition function produced better results, and this combination will be carried forward for the study on actual materials science datasets.

References
[1] Alexander Dunn, Julien Brenneck, and Anubhav Jain. “Rocketsled: a soft-
ware library for optimizing high-throughput computational searches”. In:
Journal of Physics: Materials 2.3 (Apr. 2019).
[2] Anubhav Jain et al. “FireWorks: a dynamic workflow system designed for
high-throughput applications”. In: Concurrency and Computation: Practice
and Experience 27.17 (2015), pp. 5037–5059.
[3] Heesoo Park et al. “Design Principles of Large Cation Incorporation in
Halide Perovskites”. In: Molecules 26.20 (2021), p. 6184.
