1 Introduction
The field of materials science targeting the development of enhanced materials is enormous, often requiring an exhaustive manual search for materials with the desired optimum properties via Density Functional Theory (DFT) calculations or physical experiments. Our aim is to develop automated computational materials discovery workflows by utilizing the capabilities of machine learning. Machine learning enables a principled way of exploring an exhaustive search space: gathering data via sampling, updating a knowledge model, and using that model to better target the gathering of subsequent data until sufficient information has been obtained. The final knowledge model developed by such an approach is then used to extract optimal parameters, such as the materials composition that maximizes stability. Here, the knowledge model is our approximation of the actual underlying model, which is not known in advance and/or could be very expensive to obtain. We therefore build and update the knowledge model by sequentially taking samples via physical experiments, DFT calculations, or existing datasets. This sequential updating of the knowledge model and selection of the next best experiment or DFT computation is guided by an optimization criterion; in our case, Bayesian optimization.
An example of such a case would be tuning the composition of a perovskite by modelling a knowledge function F(x), whose domain x represents the possible range of the enthalpy of mixing, while the function outputs (predicts) possible configurations of the perovskite molecule that achieve the desired intrinsic stability. As such knowledge models are formulated empirically, due to the nature of machine learning, some optimization is required to achieve a model with the desired prediction accuracy. This optimization simultaneously yields better knowledge models as more data is gathered and enables determining the global optima of the knowledge model.
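To make the sequential loop concrete, the following is a minimal runnable sketch. The quadratic experiment function, the polynomial knowledge model, and the distance-based exploration bonus are all illustrative assumptions, standing in for a DFT calculation or physical experiment, a learned surrogate, and a proper acquisition criterion, respectively.

```python
import numpy as np

# Hypothetical expensive "experiment": a cheap stand-in for a DFT run
# or a lab measurement, so the loop is runnable here.
def experiment(x):
    return (x - 1.0) ** 2  # illustrative property to minimise

grid = np.linspace(-4.0, 4.0, 81)   # candidate inputs (compositions)
xs = [-3.0, 0.0, 3.0]               # initial samples
ys = [experiment(x) for x in xs]

for _ in range(10):
    # Update the knowledge model from all data gathered so far
    # (a polynomial fit standing in for a learned surrogate).
    model = np.poly1d(np.polyfit(xs, ys, deg=2))

    # Choose the next sample: favour points the model predicts to be
    # good, plus a small bonus for points far from existing samples.
    dist = np.min(np.abs(grid[:, None] - np.asarray(xs)[None, :]), axis=1)
    score = -model(grid) + 0.5 * dist
    x_next = float(grid[int(np.argmax(score))])

    xs.append(x_next)
    ys.append(experiment(x_next))

# Extract the optimum found by the knowledge model so far.
x_best = xs[int(np.argmin(ys))]
```

After a handful of iterations the sampled points cluster around the true minimiser (x = 1 for this stand-in function), which is the behaviour the workflow relies on when the experiment is genuinely expensive.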
2 Background
We begin by exploring the machine learning algorithms commonly utilized for our optimum materials discovery objective, such as knowledge models (a.k.a. surrogate functions) like the Gaussian Process (GP) and Random Forest
[Figure 1 (diagram): a map of the selected ML methods and parameters. Surrogate modelling: Gaussian Process (RBF, Gaussian, and Matérn kernels), Random Forest, Bayesian neural network, ensemble methods, neural networks (Stuke et al. 2021). Optimization strategies: Bayesian optimization (Expected Improvement), ϵ-greedy, particle swarm optimization. Knowledge optimization: empirical and analytical functions. Domain knowledge: encouraging or restricting the search space; t-SNE.]
Figure 1: Selected Methods and Parameters for Benchmarking using Test Functions
brary [2], enabling it to be executed on parallel computation platforms such as high-performance computing (HPC) clusters. The parallel capability comes from FireWorks' internal usage of a MongoDB database, which keeps track of the data and the various computations across clusters and seamlessly integrates them into a single executable workflow.
3 Methodology
To ensure consistent analysis of our research objective, we begin by performing benchmark experiments on the existing methods and parameters presented in Figure 1. This will also enable us to compare the new methodologies/algorithms that we develop over the course of the research project. To this end, we develop a Bayesian optimization (BO) workflow and use Gaussian Process and Random Forest surrogate functions for knowledge modelling. Bayesian optimization uses an acquisition function to determine the next experiment or function evaluation to be conducted. We establish the optimization objective of finding the minima of the well-known analytical test functions presented in Table 1. As one of the criteria in single-objective materials science experiments is to determine the composition of a material that provides an optimum characteristic, such as an optimum band gap in perovskites, simultaneously learning known test functions and comparing their minima will help us evaluate the performance of the selected method. The input domain of all test functions was selected as (-4, 4), as the functions are well defined in that range and possess at least one minimum. For each function we run multiple evaluations and report the computation time and root mean square error (RMSE). We compare the computation time of the Gaussian Process with that of the Random Forest using two acquisition functions for BO, namely Expected Improvement (EI) and Lower Confidence Bound (LCB). We also compare the two acquisition functions side by side for their suggestive properties.
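A minimal, self-contained sketch of this workflow, assuming a zero-mean GP surrogate with an RBF kernel and grid-based acquisition maximization; the test function below is an illustrative stand-in with a single global minimum in (-4, 4), not one of the Table 1 functions.

```python
import numpy as np
from math import erf, sqrt, pi

def rbf(a, b, ls=1.0):
    # Squared-exponential (RBF) kernel; k(x, x) = 1 by construction.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(x_tr, y_tr, x_te, jitter=1e-6):
    # Standard GP regression equations with a zero prior mean.
    K = rbf(x_tr, x_tr) + jitter * np.eye(len(x_tr))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_tr))
    Ks = rbf(x_tr, x_te)
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.clip(1.0 - np.sum(v * v, axis=0), 1e-12, None)
    return mu, np.sqrt(var)

phi = lambda z: np.exp(-0.5 * z ** 2) / sqrt(2.0 * pi)          # normal pdf
Phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0))))  # normal cdf

def expected_improvement(mu, sigma, y_best):
    # EI for minimisation: E[max(y_best - Y, 0)].
    z = (y_best - mu) / sigma
    return (y_best - mu) * Phi(z) + sigma * phi(z)

def lower_confidence_bound(mu, sigma, kappa=2.0):
    # LCB for minimisation: suggest the point minimising mu - kappa*sigma.
    return mu - kappa * sigma

def f(x):
    # Illustrative analytical test function (stand-in for Table 1).
    return x ** 2 + np.sin(3.0 * x)

grid = np.linspace(-4.0, 4.0, 401)
x_tr = np.array([-3.5, -1.0, 2.0, 3.5])   # initial evaluations
y_tr = f(x_tr)

for _ in range(15):
    mu, sigma = gp_posterior(x_tr, y_tr, grid)
    acq = expected_improvement(mu, sigma, y_tr.min())
    # With LCB, the suggestion would instead be
    # grid[np.argmin(lower_confidence_bound(mu, sigma))].
    x_next = float(grid[int(np.argmax(acq))])
    x_tr = np.append(x_tr, x_next)
    y_tr = np.append(y_tr, f(x_next))

x_min = float(x_tr[int(np.argmin(y_tr))])  # best minimiser found
```

In the actual benchmark, the GP surrogate is swapped for a Random Forest regressor and the EI acquisition for LCB to produce the side-by-side comparisons reported below.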
3
4 Results and Discussion
We gather the results by running the computations on a 28-core Intel Xeon E5-2690 v4 @ 2.6 GHz CPU with 130 GB RAM. A single instance of the comparison of computation time and RMSE accuracy over 100 evaluations using GPs is presented in Table 2. From the table we can determine that although the various test functions differ in nature, their computation times are more or less uniform. This also holds if the surrogate function is changed to Random Forest.
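The shape of such a comparison can be sketched with a small timing harness; the `benchmark` helper below and its polynomial stand-in surrogate are hypothetical illustrations, not the code behind Table 2 or Figure 2, where GPR and RFR are the fitted models.

```python
import time
import numpy as np

def benchmark(f, fit_predict, n_eval=100, seed=0):
    # Times n_eval fit/predict cycles of a surrogate on one test
    # function; reports wall-clock time and RMSE on the training inputs.
    rng = np.random.default_rng(seed)
    x = rng.uniform(-4.0, 4.0, size=50)
    y = f(x)
    t0 = time.perf_counter()
    for _ in range(n_eval):
        pred = fit_predict(x, y, x)
    elapsed = time.perf_counter() - t0
    rmse = float(np.sqrt(np.mean((pred - y) ** 2)))
    return elapsed, rmse

# Stand-in surrogate: a cubic polynomial fit (GPR/RFR in the actual
# benchmark).
def poly_surrogate(x_tr, y_tr, x_te):
    return np.poly1d(np.polyfit(x_tr, y_tr, deg=3))(x_te)

elapsed, rmse = benchmark(lambda x: x ** 2, poly_surrogate, n_eval=10)
```

Running the same harness over each test function, with only `f` changing, is what makes the per-function computation times directly comparable.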
Figure 2: Computation Times for Random Forest Regressor (RFR) and Gaussian Process Regressor (GPR)
with EI and LCB acquisition functions. The result is presented in Figure 3.
Figure 3: Computation Times for Random Forest Regressor (RFR) and Gaussian Process Regressor (GPR) with EI and LCB acquisition functions
References
[1] Alexander Dunn, Julien Brenneck, and Anubhav Jain. “Rocketsled: a soft-
ware library for optimizing high-throughput computational searches”. In:
Journal of Physics: Materials 2.3 (Apr. 2019).
[2] Anubhav Jain et al. “FireWorks: a dynamic workflow system designed for
high-throughput applications”. In: Concurrency and Computation: Practice
and Experience 27.17 (2015), pp. 5037–5059.
[3] Heesoo Park et al. “Design Principles of Large Cation Incorporation in
Halide Perovskites”. In: Molecules 26.20 (2021), p. 6184.