
International Journal of Fatigue 113 (2018) 389–400


An online tool for predicting fatigue strength of steel alloys based on ensemble data mining☆

Ankit Agrawal⁎, Alok Choudhary
Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60201, United States

A R T I C L E   I N F O

Keywords:
Materials informatics
Supervised learning
Ensemble learning
Fatigue strength
Online tool

A B S T R A C T

Fatigue strength is one of the most important mechanical properties of steel. Here we describe the development and deployment of data-driven ensemble predictive models for fatigue strength of a given steel alloy represented by its composition and processing information. The forward models for PSPP relationships (predicting property of a material given its composition and processing parameters) are built using over 400 experimental observations from the Japan National Institute of Materials Science (NIMS) steel fatigue dataset. Forty modeling techniques, including ensemble modeling, were explored to identify the set of best performing models for different attribute sets. Data-driven feature selection techniques were also used to find a small non-redundant subset of attributes, and the processing/composition parameters most influential to fatigue strength were identified to inform future design efforts. The developed predictive models are deployed in a user-friendly online web-tool available at http://info.eecs.northwestern.edu/SteelFatigueStrengthPredictor.

1. Introduction

The field of materials science and engineering involves conducting experiments and simulations to understand the science of materials in order to discover and engineer new materials with superior properties. A close look at the developments in the field of materials science and engineering over the centuries reveals that, like in any other field of science, three distinct stages of development can be identified here: empirical/experimental, theoretical, and computational/simulation-based. Over the last few years, the data generated by such experiments and simulations has grown exponentially, making it amenable to knowledge extraction via data-driven techniques, thereby heralding the arrival of the fourth paradigm of science [2], which is data-driven science, unifying the first three paradigms of experiment, theory, and simulation. In the field of materials science, this has led to the emergence of the new field called materials informatics [3–5], which has been very successful in recent years in deciphering the processing-structure-property-performance (PSPP) relationships in materials science [6–26].

In June 2011, the US government launched the Materials Genome Initiative (MGI) [27] to realize the vision of development of advanced materials necessary for economic security and human well-being. In particular, the Materials Genome Initiative "will enable discovery, development, manufacturing, and deployment of advanced materials at least twice as fast as possible today, at a fraction of the cost". MGI describes a Materials Innovation Infrastructure encompassing advanced computational, experimental, and data informatics tools. The Materials Genome Initiative Strategic Plan released in 2014 [28] also identifies data analytics as one of the key objectives as part of integrating experiments, computation, and theory, in order to realize the vision of MGI. It is worth noting that even though we are currently in the age of "big data", as far as the field of materials science is concerned, we are still far from it, since open, accessible data has been rather limited. However, recent MGI-supported efforts [28–30] and other similar efforts around the world are promoting the availability and accessibility of digital data in materials science.

It is in the spirit and pursuit of the above-described vision and approach of MGI that we present in this article an online data informatics tool to predict the fatigue strength of a given steel alloy, which is a crucial property to know, given the high cost and time of fatigue testing and the potentially disastrous consequences of fatigue failures. It is the most important information required for design and failure analysis of mechanical components.


☆ A conference version of this paper appeared as a short demonstration paper in the Proceedings of 25th ACM International Conference on Information and Knowledge Management
(CIKM), 2016, pp. 2497–2500 [1]. The current article significantly expands on the conference paper by presenting a comprehensive description of the methodologies, detailed comparison
results, and scientific insights.

⁎ Corresponding author.
E-mail address: ankitag@eecs.northwestern.edu (A. Agrawal).

https://doi.org/10.1016/j.ijfatigue.2018.04.017
Received 7 April 2017; Received in revised form 26 March 2018; Accepted 14 April 2018
Available online 26 April 2018
0142-1123/ © 2018 Elsevier Ltd. All rights reserved.
Fatigue is estimated to account for over 90% of all mechanical failures of structural components [31], and hence, fatigue strength prediction is of critical importance. The predictive models deployed in the tool are a result of the application of supervised learning techniques on an experimental fatigue dataset from the Japan National Institute of Materials Science MatNavi database [32], which is freely accessible online. This dataset has been previously used to build similar predictive models for fatigue strength [33,34], but the resulting models had not been deployed. The primary contributions of this work are as follows:

• Comparison of 40 supervised modeling configurations on the NIMS steel fatigue dataset, including ensemble modeling techniques. Prior studies [33,34] did not explore the advanced ensemble modeling techniques used in this work; the maximum number of models explored previously was 12 in [34].
• More accurate predictive models than prior works on the same data. The R² and MAE values from the new models were found to be significantly better (statistically) than the best models in [34] at p = 0.05. Visual inspection of scatter plots also reveals regions where the new models perform significantly better.
• Additional modeling experiments using only composition and only processing attributes to evaluate their predictive potential.
• Use of data-driven feature selection techniques to identify a reduced set of non-redundant attributes, and subsequent modeling experiments on those to obtain predictive models that use fewer input features while still having satisfactory predictive accuracy.
• Identification of the processing and composition parameters most influential to fatigue strength.
• Deployment of the most accurate "forward" models identified as a result of the above analysis in a web-tool.

The web-tool presented here is expected to be a useful resource for the materials science and engineering community to make fast and accurate predictions of this crucial property of steel, which can in turn aid in discovering better steels. The rest of the article is organized as follows: Section 2 presents the data mining workflow employed in this study. The data analytics experiments and results are presented in Section 3, and the online steel fatigue strength predictor deploying the predictive models in Section 4. We conclude the article with some future directions in Section 5.

2. Methods

The overall data-driven process is depicted as a block diagram in Fig. 1. We now describe the data and the various stages of the workflow.

2.1. Data

The fatigue dataset for steel from the Japan National Institute of Materials Science (NIMS) MatNavi [32] was used in this work, which is publicly available. This is one of the largest databases in the world with details on composition, mill product (upstream) features, and subsequent processing (heat treatment) parameters. It consists of carbon and low-alloy steels, carburizing steels, and spring steels. Apart from composition and processing details, it also has data on mechanical properties of steels, in particular rotating bending fatigue strength at 10^7 cycles at room temperature conditions. Fatigue strength is the highest stress that a material can withstand for a given number of cycles without breaking, and is thus an extremely critical property of steel for industrial use.

The features in the NIMS dataset can be categorized into the following:

• Chemical composition – %C, %Si, %Mn, %P, %S, %Ni, %Cr, %Cu, %Mo (all in wt%)
• Upstream processing details – ingot size, reduction ratio, non-metallic inclusions
• Heat treatment conditions – temperature, time and other process conditions for normalizing, through-hardening, carburizing, quenching and tempering processes
• Mechanical property – fatigue strength (MPa).

2.2. Preprocessing

We have used the data from [34], and summarize their preprocessing here. The raw dataset from NIMS consisted of multiple grades of steel, and in some records, some of the heat treatment processing steps did not exist. This is because different specimens can be subjected to different processing routes where some processing steps may not have occurred. In order to make a coherent database, all the key processes in the data (normalization, through hardening, carburization, quenching, tempering) were included. For the cases where a given process did not take place, the corresponding time variable was set to zero and the corresponding temperature was set to the austenitization temperature or the average of the rest of the data where the process exists. This preprocessed data was also made publicly available as supplementary data accompanying [34] and is the starting point of the current study.

The preprocessed data has 437 instances/rows, 25 features/columns (composition and processing parameters), and one target property (fatigue strength). The details of the 25 attributes are given in Table 1.

2.3. Feature selection

We used the correlation feature selection (CFS) method for feature ranking. CFS is used to identify a subset of features highly correlated with the class variable and weakly correlated amongst themselves [35]. CFS was used in conjunction with a best-first search to find a subset S with the best average merit, which is given by:

\mathrm{Merit}_S = \frac{n \cdot r_{fo}}{\sqrt{n + n(n-1) \cdot r_{ff}}}

where n is the number of features in S, r_fo is the average value of the feature-outcome correlations, and r_ff is the average value of all feature-feature correlations.
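To make the merit score concrete, the following is a minimal Python sketch of its computation; the pandas DataFrame `df`, the column names, and the use of absolute correlations are illustrative assumptions, and this is not the WEKA CFS implementation used in the study.

```python
import numpy as np
import pandas as pd

def cfs_merit(df: pd.DataFrame, features, target="Fatigue"):
    """CFS merit of a feature subset S:
    Merit_S = n * r_fo / sqrt(n + n*(n-1)*r_ff),
    where r_fo is the mean (absolute) feature-outcome correlation and
    r_ff is the mean (absolute) feature-feature correlation within S."""
    n = len(features)
    r_fo = np.mean([abs(df[f].corr(df[target])) for f in features])
    if n == 1:
        return r_fo
    r_ff = np.mean([abs(df[f1].corr(df[f2]))
                    for i, f1 in enumerate(features)
                    for f2 in features[i + 1:]])
    return n * r_fo / np.sqrt(n + n * (n - 1) * r_ff)

# Example (hypothetical attribute subset drawn from Table 1):
# print(cfs_merit(df, ["C", "Si", "Cr", "THT", "THQCr", "Tt"]))
```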

Fig. 1. The data mining workflow used in this work. Rotating bending fatigue testing data was
obtained from the publicly available NIMS
MatNavi database. It was preprocessed as de-
scribed in [34]. This preprocessed data was
made available as supplementary data accom-
panying [34], and is used as a starting point for
this study. Feature selection was used to identify
a small non-redundant subset of attributes to be
used in the online tool, so that users do not have
to enter too many values. Supervised learning
techniques were then used to learn predictive
models for fatigue strength. The models are
evaluated using standard validation techniques,
and the most accurate models are deployed in an
online user-friendly web-tool that can predict the fatigue strength of arbitrary compositions and processing parameters.
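As a rough illustration of the preprocessing rule summarized in Section 2.2, the sketch below fills in a skipped heat-treatment step; the column names follow Table 1, but the example records and the choice of fill temperature are hypothetical, and the actual preprocessed data is the supplementary file accompanying [34].

```python
import pandas as pd

# Hypothetical raw records; the actual preprocessed data is the
# supplementary file accompanying Ref. [34].
raw = pd.DataFrame({
    "CT": [930.0, None],   # Carburization Temperature (degC)
    "Ct": [5.4, None],     # Carburization Time (h)
    "TT": [180.0, 550.0],  # Tempering Temperature (degC)
    "Tt": [2.0, 1.0],      # Tempering Time (h)
})

def fill_skipped_process(df, temp_col, time_col, fill_temperature):
    """Where a processing step did not occur: set its time to zero and its
    temperature to a representative value (e.g. the austenitization
    temperature, or the mean over records where the step did occur)."""
    out = df.copy()
    skipped = out[time_col].isna()
    out.loc[skipped, time_col] = 0.0
    out.loc[skipped, temp_col] = fill_temperature
    return out

clean = fill_skipped_process(raw, "CT", "Ct",
                             fill_temperature=raw["CT"].mean())
print(clean)
```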


Table 1
NIMS data attributes.

Abbreviation    Details
C               % Carbon
Si              % Silicon
Mn              % Manganese
P               % Phosphorus
S               % Sulphur
Ni              % Nickel
Cr              % Chromium
Cu              % Copper
Mo              % Molybdenum
NT              Normalizing Temperature
THT             Through Hardening Temperature
THt             Through Hardening Time
THQCr           Cooling Rate for Through Hardening
CT              Carburization Temperature
Ct              Carburization Time
DT              Diffusion Temperature
Dt              Diffusion Time
QmT             Quenching Media Temperature (for Carburization)
TT              Tempering Temperature
Tt              Tempering Time
TCr             Cooling Rate for Tempering
RedRatio        Reduction Ratio (Ingot to Bar)
dA              Area Proportion of Inclusions Deformed by Plastic Work
dB              Area Proportion of Inclusions Occurring in Discontinuous Array
dC              Area Proportion of Isolated Inclusions
Fatigue         Rotating Bending Fatigue Strength (10^7 Cycles)

2.4. Predictive modeling

We used 40 regression schemes in this study, including both direct application of regression techniques and construction of their ensembles using various ensembling techniques. Here we describe these techniques very briefly:

1. Linear regression: Linear regression is probably the oldest and most widely used predictive model, which commonly represents a regression that is linear in the unknown parameters used in the fit. The most common form of linear regression is least squares fitting [36]. Least squares fitting of lines and polynomials are both forms of linear regression.
2. Nearest-neighbor (IBk): Also known as an instance-based model, it uses normalized Euclidean distance to find the training instance closest to the given test instance, and predicts the same target value as this training instance [37]. If multiple instances have the same (smallest) distance to the test instance, the first one found is used. It eliminates the need for building models and supports adding new instances to the training database dynamically.
3. Nearest-neighbor (KStar): This is another type of nearest-neighbor model that uses an entropy-based distance function instead of Euclidean distance.
4. Artificial neural networks: ANNs are networks of interconnected artificial neurons, and are commonly used for non-linear statistical data modeling to model complex relationships between inputs and outputs. The network includes a hidden layer of multiple artificial neurons connected to the inputs and outputs with different edge weights. The internal edge weights are learnt during the training process using techniques like back propagation. A multilayer perceptron (MLP) for regression with one hidden layer was used in this work. Several good descriptions of neural networks are available [38,39]. These models form the basis of the deep learning approaches that are becoming very popular nowadays.
5. Gaussian process: A Gaussian Process generates data located throughout some domain such that any finite subset of the range follows a multivariate Gaussian distribution, and uses that distribution to make predictions [40].
6. Support vector machines: SVMs are based on the Structural Risk Minimization (SRM) principle from statistical learning theory. A detailed description of SVMs and SRM is available in [41]. Here we use SVMs for regression [42]. In their basic form, SVMs attempt to perform modeling by constructing hyperplanes in a multi-dimensional space that separate the instances according to the target variable. They support both classification and regression tasks and can handle multiple continuous and nominal variables.
7. Decision table: A decision table typically constructs rules involving different combinations of attributes, which are selected using an attribute selection search method [43].
8. Decision stump: A decision stump [44] is a weak tree-based machine learning model consisting of a single-level decision tree with a categorical or numeric class label. Decision stumps are usually used in ensemble machine learning techniques.
9. M5 model trees: M5 Model Trees [45] are a reconstruction of Quinlan's M5 algorithm [46] for inducing trees of regression models, which combines a conventional decision tree with the option of linear regression functions at the nodes. It tries to partition the training data using a decision tree induction algorithm by trying to minimize the intra-subset variation in the class values down each branch, followed by back pruning and smoothing, which substantially increases prediction performance. It also uses the techniques used in CART [47] to effectively deal with enumerated attributes and missing values.
10. Random tree: A Random Tree is a decision tree model that considers a randomly chosen subset of attributes at each node. The number of attributes chosen is, in general, significantly less than the total number of attributes. Random trees are usually used as building blocks for random forests [48], which, in general, has been found to improve prediction performance.
11. Reduced error pruning tree: Commonly known as REPTree [44], it is an implementation of a fast decision tree learner, which builds a decision/regression tree using information gain/variance and prunes it using reduced-error pruning to avoid over-fitting. Part of the training data is withheld from decision tree construction as a pruning set and is subsequently used for pruning. At each internal node in the tree, an error rate is identified by propagating the errors upwards from the leaf nodes. This is compared to the error rate if that internal node were replaced by a leaf node with the average value of the target attribute in that node. If this results in a reduction of error, then the subtree below the node can be pruned, and the node with the highest scope of reducing error is pruned.
12. Random forest: The Random Forest [48] model consists of multiple decision trees. In that sense, it is an ensemble of random trees. The final prediction for an instance in a Random Forest is given by the average of the predictions from the individual trees. In many cases, it is known to produce robust and accurate predictions, along with the ability to handle a very large number of input variables, while also being relatively robust to over-fitting.
13. Additive regression: This is a meta learner that enhances the performance of a regression base classifier. Each iteration fits a model to the residuals left by the classifier on the previous iteration [49]. The predictions of each of the learners are added together to get the overall prediction.
14. Bagging: Bagging [50] is an ensemble learning algorithm to improve the stability of classification and regression algorithms by reducing variance. Bagging is usually applied to decision tree models to boost their performance. It involves generating a number of new training sets (called bootstrap modules) from the original set by sampling uniformly with replacement. The bootstrap modules are then used to generate models whose predictions are averaged to


generate the final prediction. Bagging has been shown to work better with decision trees than with linear models.
15. Random Committee: This is a technique for building an ensemble of randomizable base models. Each base model is built using a different random seed but uses the exact same data. The final prediction is a simple average of the individual predictions.
16. Random subspace: The Random Subspace ensembling technique [51] constructs a decision tree based model consisting of multiple trees, which are constructed systematically by pseudo-randomly selecting subsets of features, trying to achieve a balance between overfitting and achieving maximum accuracy. It maintains the highest accuracy on training data and improves on generalization accuracy as it grows in complexity.
17. Rotation Forest: Rotation forest [52] is a method for generating model ensembles based on feature extraction, which can work with both classification and regression base learners. Training data for the base modeling technique is created by applying Principal Component Analysis (PCA) [53] to K subsets of the feature set, followed by K axis rotations to form the new features for the base learner, to simultaneously encourage individual accuracy and diversity within the ensemble.
18. Voting: Voting is a popular ensemble technique for combining multiple classifiers. It has been shown that ensemble classifiers using voting may outperform the individual classifiers in certain cases [54]. In this case, we combine multiple classifiers by using the average of the predictions generated by each model, although one can combine the predictions in other ways, such as taking the maximum, minimum, median, etc.

Forty different modeling configurations were obtained using the above techniques as follows. We used techniques #1 to #12 above directly on the training data to get predictive models. The five ensembling techniques #13 to #17 work in conjunction with a base modeling technique. Theoretically, we could use all 12 as base models, but we excluded the two nearest-neighbor models, Gaussian process, SVM, decision table, and random forest while ensembling, for one or more of the following reasons: large model size, large training/testing time, low accuracy, or already being an ensemble model. Of the five ensemble methods, random committee can only work with randomizable base models, i.e. ones that use a random seed to build a model. Only three of the remaining direct modeling techniques fulfilled that criterion: multilayer perceptron, random tree, and reduced error pruning trees. Further, we identified the set of best performing models from the above analysis whose performance was not statistically distinguishable at p = 0.05, and built another ensemble voting model (#18) that averages the predictions from the best performing models to generate the final prediction. Thus, we explored a total of 12 + 6 × 4 + 3 + 1 = 40 different configurations of modeling techniques.

2.5. Evaluation

A model's predictive performance is assessed by how closely it can predict the experimental fatigue strength (which is the ground truth in this case). The metrics used for this purpose include the coefficient of correlation (R), explained variance (R²), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Relative Absolute Error (RAE), and Root Relative Squared Error (RRSE). The formulae of these evaluation criteria are as follows:

R = \frac{\sum_{i=1}^{N} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2} \, \sqrt{\sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2}}    (1)

\mathrm{MAE} = \frac{1}{N} \sum_{N} |y - \hat{y}|    (2)

\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{N} (y - \hat{y})^2}    (3)

\mathrm{RAE} = \frac{\sum_{N} |y - \hat{y}|}{\sum_{N} |y - \bar{y}|}    (4)

\mathrm{RRSE} = \sqrt{\frac{\sum_{N} (y - \hat{y})^2}{\sum_{N} (y - \bar{y})^2}}    (5)

where y denotes the actual fatigue strength (MPa), \hat{y} denotes the predicted fatigue strength (MPa), \bar{y} denotes the average fatigue strength across the dataset, and N is the number of instances in the dataset.

A 10-fold cross-validation setting was used to evaluate all the models, which randomly divides the dataset into 10 parts, uses 9 parts as the training set and 1 part as the test set, and repeats the process 10 times with different test sets before aggregating the results together. Therefore, each labeled instance in the steel fatigue strength prediction database is tested exactly once by a model that did not see it while training. Further, the entire process was repeated 10 times to aid in statistical significance testing.

3. Data analytics experiments and results

In this section we first present the comparison results of different modeling techniques, followed by the modeling results using only composition and only processing attributes. Subsequently, we discuss the results of feature selection and modeling on the reduced set of attributes, and finally the analysis of the most influential features for fatigue strength.

3.1. Comparison of various modeling configurations on entire dataset

Table 2 presents the comparison of the different modeling techniques used in this work with respect to the different metrics mentioned earlier. All results are based on 10 independent runs of 10-fold cross-validation to facilitate statistical significance testing. In addition, the training and testing times for each model, and the model size, are also listed. Since we performed 10-fold cross-validation, the training time is on 90% of the data, the testing time is on 10% of the data, and the model size also corresponds to the model built on 90% of the data (all averaged across the 100 runs). WEKA software [55] version 3.7.13 was used for all analytics with default parameters, unless otherwise stated. These results were obtained by using the entire set of 25 input attributes. Table 2 is sorted by the MAE metric, and the performance numbers that are not statistically distinguishable at p = 0.05 are boldfaced. The top four models from Table 2 are subsequently combined using the Voting modeling scheme to obtain the final model (R² = 0.9819, MAE = 17.67 MPa, RMSE = 25.08 MPa), whose performance was found to be significantly better than that of all four constituent models, as well as better than the modeling techniques used previously in [34], at p = 0.05. Visual inspection of the scatter plots (Fig. 2) also reveals that the current model is able to make more accurate predictions for carbon and low-alloy steels, where the best model in [34] had failed.


Table 2
Comparison of different techniques with 10-fold cross-validation setting using all 25 attributes (table sorted by MAE, best accuracy numbers boldfaced that were
statistically not distinguishable at p = 0.05, modeling techniques used in final Voting model also boldfaced).
Modeling Scheme    R    R²    MAE (MPa)    RMSE (MPa)    RAE (%)    RRSE (%)    TrainTime (s)    TestTime (s)    ModelSize (bytes)

RotationForest_M5 0.9900 0.9801 18.74 26.50 14.76 14.44 1.1058 0.0094 716,215
RotationForest_MLP 0.9894 0.9789 18.97 27.00 15.00 14.76 3.4866 0.0102 664,851
Bagging_MLP 0.9895 0.9791 18.97 27.03 14.99 14.78 3.0486 0.0009 99,700
AdditiveRegression_M5 0.9897 0.9795 19.05 26.66 15.01 14.54 0.3996 0.0003 44,210
Bagging_M5 0.9890 0.9781 19.36 27.96 15.23 15.21 0.8039 0.0008 264,058
M5 ModelTrees 0.9893 0.9787 19.64 27.46 15.45 14.94 0.0885 0.0001 19,684
NeuralNetworks (MLP) 0.9881 0.9763 19.89 28.41 15.72 15.56 0.3652 0.0002 13,616
RandomCommittee_MLP 0.9877 0.9756 20.37 29.14 16.05 15.84 3.0811 0.0010 99,424
AdditiveRegression_MLP 0.9851 0.9704 20.94 32.27 16.49 17.56 3.5265 0.0009 99,476
RandomCommittee_REPTree 0.9874 0.9750 21.39 29.48 16.86 16.09 0.0637 0.0003 83,087
Bagging_REPTree 0.9872 0.9746 21.44 29.90 16.88 16.29 0.0597 0.0002 103,283
RotationForest_REPTree 0.9871 0.9744 21.82 30.06 17.20 16.39 0.2026 0.0086 653,435
RotationForest_RandomTree 0.9866 0.9734 22.25 30.58 17.57 16.69 0.1698 0.0088 941,446
RandomForest 0.9875 0.9752 22.28 29.43 17.59 16.08 0.2594 0.0025 2,762,888
RandomCommittee_RandomTree 0.9858 0.9718 23.59 31.42 18.65 17.18 0.0416 0.0003 430,293
Bagging_RandomTree 0.9853 0.9708 23.96 31.78 18.94 17.38 0.0323 0.0003 275,199
SVM 0.9816 0.9635 24.34 36.65 19.12 19.90 0.3355 0.0001 110,816
RotationForest_LinearRegression 0.9834 0.9671 24.62 34.26 19.37 18.62 0.1698 0.0088 610,462
Bagging_LinearRegression 0.9832 0.9667 24.69 34.48 19.42 18.73 0.0341 0.0003 43,911
AdditiveRegression_REPTree 0.9797 0.9598 24.71 35.70 19.43 19.39 0.0127 0.0001 13,277
REPTree 0.9812 0.9628 24.87 35.17 19.59 19.18 0.0064 0.0001 10,449
AdditiveRegression_LinearRegression 0.9830 0.9663 24.99 34.64 19.66 18.84 0.0117 0.0001 12,111
LinearRegression 0.9830 0.9663 24.99 34.64 19.66 18.83 0.0031 0.0001 7598
RandomSubSpace_REPTree 0.9829 0.9661 26.48 34.18 20.93 18.70 0.0388 0.0003 89,933
RandomSubSpace_MLP 0.9814 0.9631 27.25 36.06 21.49 19.67 1.5278 0.0010 100,764
RandomSubSpace_M5 0.9821 0.9645 27.45 35.67 21.70 19.51 0.6407 0.0008 158,759
RandomSubSpace_RandomTree 0.9768 0.9541 31.19 38.65 24.72 21.22 0.0359 0.0005 349,143
AdditiveRegression_DecisionStump 0.9663 0.9337 34.42 47.87 27.12 26.13 0.0156 0.0001 4194
RandomTree 0.9708 0.9425 34.82 44.35 27.55 24.33 0.0041 0.0001 45,445
AdditiveRegression_RandomTree 0.9715 0.9438 34.85 43.68 27.60 23.96 0.0104 0.0001 76,641
GaussianProcess 0.9670 0.9351 34.86 48.11 27.35 26.13 0.0846 0.0740 732,187
RandomSubSpace_LinearRegression 0.9660 0.9332 36.41 49.40 28.66 26.92 0.0191 0.0005 54,937
DecisionTable 0.9445 0.8921 37.21 58.47 29.34 31.81 0.0511 0.0002 24,944
NearestNeighbor_Kstar 0.9610 0.9235 40.37 49.81 32.01 27.36 0.0001 0.1965 110,615
NearestNeighbor_Ibk 0.9539 0.9099 47.86 55.65 37.92 30.39 0.0001 0.0037 97,573
RotationForest_DecisionStump 0.8622 0.7434 70.67 91.81 55.60 50.15 0.1494 0.0085 572,252
RandomSubSpace_DecisionStump 0.8402 0.7059 73.13 97.49 57.55 53.31 0.0106 0.0002 24,493
DecisionStump 0.8402 0.7059 73.13 97.49 57.55 53.31 0.0016 0.0000 2298
Bagging_DecisionStump 0.8402 0.7059 73.16 97.55 57.57 53.34 0.0148 0.0000 4418

3.2. Modeling using only composition and only processing attributes

Since the NIMS dataset has two kinds of attributes (composition and processing), it would be interesting to know if there is any added value in using both compared to just one kind. We thus performed additional experiments with only composition attributes (9) and only processing attributes (16). The same setting of 10-fold cross-validation was used, with 10 runs for statistical significance testing. Instead of presenting two new tables like Table 2, here we only summarize the results on these subsets of the data. Additive regression with M5 model trees as the underlying regressor was found to be the most accurate model for the composition-only dataset (R² = 0.9308, MAE = 38.86 MPa, RMSE = 48.14 MPa), and was also significantly better than all other models, so the Voting scheme was not necessary here to combine multiple models. For the processing-only dataset, two models resulted in statistically indistinguishable performance: one was RandomForest and the other was RandomCommittee with REPTree as the base regressor. Combining these two with the Voting scheme gave the following accuracy numbers: R² = 0.9738, MAE = 21.63 MPa, RMSE = 30.19 MPa.

Clearly, neither composition attributes alone nor processing attributes alone performed as well as using both together, suggesting that they capture complementary information about materials and both significantly contribute to model accuracy.
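For reference, the accuracy numbers quoted in this section (and reported in Tables 2–5) follow Eqs. (1)–(5) of Section 2.5; a minimal NumPy sketch of those metrics is given below, independent of the WEKA implementation actually used in the study.

```python
import numpy as np

def regression_metrics(y, y_hat):
    """R, R^2, MAE, RMSE, RAE and RRSE as defined in Section 2.5."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    err = y - y_hat
    r = np.corrcoef(y, y_hat)[0, 1]                                  # Eq. (1)
    mae = np.mean(np.abs(err))                                       # Eq. (2), MPa
    rmse = np.sqrt(np.mean(err ** 2))                                # Eq. (3), MPa
    rae = np.sum(np.abs(err)) / np.sum(np.abs(y - y.mean()))         # Eq. (4)
    rrse = np.sqrt(np.sum(err ** 2) / np.sum((y - y.mean()) ** 2))   # Eq. (5)
    return {"R": r, "R2": r ** 2, "MAE": mae, "RMSE": rmse,
            "RAE%": 100 * rae, "RRSE%": 100 * rrse}

# Example with made-up fatigue strengths (MPa):
# print(regression_metrics([480, 520, 610], [472, 531, 598]))
```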

Fig. 2. Scatter plots comparing the best model from [34] and the final model from the current study based on Voting scheme. The new model can be seen to perform
significantly better in the low fatigue strength region of the plot where the old model had failed.
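The final models referenced in Fig. 2 and in the text are obtained with the Voting scheme, i.e. a plain average of the predictions of the best base regressors, evaluated with 10 repetitions of 10-fold cross-validation. The sketch below is only a rough scikit-learn analogue: the base regressors, their hyperparameters, and the synthetic stand-in data are assumptions, not the WEKA configuration used in this study.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, VotingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Stand-in for the preprocessed NIMS data (437 instances, 25 attributes);
# in practice X, y would come from the supplementary data of Ref. [34].
X, y = make_regression(n_samples=437, n_features=25, noise=10.0, random_state=0)

# Averaging ensemble ("Voting") of two illustrative base regressors.
voting = VotingRegressor([
    ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("bag_mlp", BaggingRegressor(MLPRegressor(hidden_layer_sizes=(25,), max_iter=2000),
                                 n_estimators=5, random_state=0)),
])

# 10 independent repetitions of 10-fold cross-validation, as in the study.
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(voting, X, y, cv=cv, scoring="neg_mean_absolute_error")
print("MAE: %.2f +/- %.2f" % (-scores.mean(), scores.std()))
```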


Table 3
Comparison of different techniques with 10-fold cross-validation setting using reduced subset of 9 non-redundant attributes (table sorted by MAE, best accuracy
numbers boldfaced that were statistically not distinguishable at p = 0.05, modeling techniques used in the final Voting model also boldfaced).
Modeling Scheme    R    R²    MAE (MPa)    RMSE (MPa)    RAE (%)    RRSE (%)    TrainTime (s)    TestTime (s)    ModelSize (bytes)

RandomCommittee_REPTree 0.9680 0.9370 37.36 45.74 29.57 25.04 0.0265 0.0001 46,157
RotationForest_MLP 0.9679 0.9368 37.86 46.18 29.98 25.29 1.1262 0.0036 257,569
RotationForest_REPTree 0.9673 0.9357 37.86 46.41 29.96 25.43 0.0671 0.0029 241,800
M5 ModelTrees 0.9666 0.9343 38.25 46.85 30.24 25.63 0.0547 0.0001 14,884
AdditiveRegression_M5 0.9666 0.9343 38.25 46.85 30.24 25.63 0.1678 0.0002 26,643
Bagging_MLP 0.9669 0.9349 38.44 46.92 30.42 25.68 1.0263 0.0005 70,459
Bagging_M5 0.9662 0.9335 38.52 47.17 30.46 25.82 0.4883 0.0005 115,962
RotationForest_M5 0.9667 0.9345 38.53 47.08 30.45 25.75 0.5809 0.0033 300,707
Bagging_REPTree 0.9661 0.9333 38.71 47.16 30.65 25.83 0.0251 0.0001 69,484
NeuralNetworks (MLP) 0.9659 0.9330 38.79 47.56 30.71 26.04 0.0997 0.0002 10,070
RandomCommittee_MLP 0.9648 0.9308 39.24 48.34 31.09 26.48 1.0321 0.0005 70,183
REPTree 0.9631 0.9276 39.26 48.83 31.02 26.68 0.0026 0.0001 6370
AdditiveRegression_REPTree 0.9610 0.9235 39.66 50.01 31.28 27.23 0.0048 0.0001 7454
RandomSubSpace_REPTree 0.9615 0.9245 40.16 50.44 31.70 27.56 0.0186 0.0002 71,785
AdditiveRegression_MLP 0.9622 0.9258 40.46 49.34 32.07 27.10 0.7933 0.0006 65,278
DecisionTable 0.9551 0.9122 41.17 53.09 32.59 29.13 0.0123 0.0002 18,154
RandomSubSpace_M5 0.9599 0.9214 42.53 52.77 33.53 28.85 0.4339 0.0009 130,728
SVM 0.9528 0.9078 43.85 56.24 34.57 30.66 0.0881 0.0001 58,672
RotationForest_LinearRegression 0.9540 0.9101 43.90 55.37 34.62 30.20 0.0526 0.0031 221,755
Bagging_LinearRegression 0.9537 0.9095 43.98 55.58 34.68 30.31 0.0091 0.0002 30,594
RandomSubSpace_MLP 0.9552 0.9124 44.07 55.21 34.72 30.15 0.6222 0.0008 79,472
LinearRegression 0.9534 0.9090 44.11 55.76 34.79 30.41 0.0009 0.0001 5638
AdditiveRegression_LinearRegression 0.9534 0.9090 44.11 55.76 34.79 30.41 0.0029 0.0002 8955
RandomForest 0.9551 0.9122 44.36 53.68 35.15 29.46 0.2147 0.0024 1,844,181
AdditiveRegression_DecisionStump 0.9533 0.9088 44.43 55.12 35.07 30.20 0.0069 0.0001 3503
Bagging_RandomTree 0.9527 0.9076 45.76 55.07 36.28 30.21 0.0280 0.0002 182,919
RandomSubSpace_RandomTree 0.9497 0.9019 46.96 56.83 37.23 31.20 0.0250 0.0006 214,037
NearestNeighbor_Kstar 0.9481 0.8989 47.16 57.46 37.40 31.57 0.0002 0.0843 51,433
NearestNeighbor_Ibk 0.9498 0.9021 47.24 56.94 37.47 31.26 0.0001 0.0020 45,971
RotationForest_RandomTree 0.9489 0.9004 47.26 57.19 37.49 31.42 0.0754 0.0031 398,752
RandomCommittee_RandomTree 0.9493 0.9012 47.31 57.11 37.53 31.37 0.0315 0.0003 205,941
AdditiveRegression_RandomTree 0.9487 0.9000 47.42 57.46 37.61 31.55 0.0107 0.0001 29,506
RandomTree 0.9474 0.8976 47.63 57.98 37.77 31.86 0.0031 0.0001 22,236
GaussianProcess 0.9145 0.8363 52.83 73.36 41.50 39.93 0.0804 0.0706 680,172
RandomSubSpace_LinearRegression 0.9266 0.8586 56.35 75.39 44.09 40.80 0.0087 0.0004 42,166
RotationForest_DecisionStump 0.8533 0.7281 71.49 94.09 56.22 51.40 0.0480 0.0027 198,090
DecisionStump 0.8402 0.7059 73.13 97.49 57.55 53.31 0.0006 0.0000 1607
Bagging_DecisionStump 0.8397 0.7051 73.21 97.63 57.62 53.39 0.0062 0.0001 3727
RandomSubSpace_DecisionStump 0.8115 0.6585 76.10 104.92 59.79 57.24 0.0059 0.0002 18,168

3.3. Feature selection and modeling in low dimensional space

It would also be interesting to identify a smaller non-redundant subset of attributes that are most influential in predicting fatigue strength. As it was confirmed by the previous analysis that composition and processing attributes capture complementary information and are both important for the model, we used the correlation feature selection (CFS) technique to identify subsets of both kinds of attributes. The application of the CFS technique to composition attributes identified a subset of six composition attributes: C, Si, P, Cr, Cu, and Mo. The same analysis on processing attributes identified a subset of three processing attributes: THT (through hardening temperature), THQCr (cooling rate for through hardening), and Tt (tempering time). We combined these six composition and three processing attributes to make a new dataset of nine attributes, and once again performed the regression modeling with various modeling schemes using the same settings (10 runs of 10-fold cross-validation) to obtain the best predictive model for this dataset. Table 3 presents the comparison results. The top three models were found to have statistically indistinguishable accuracy on all performance metrics, and were thus combined using the Voting scheme, resulting in the following accuracy numbers: R² = 0.9440, MAE = 36.41 MPa, RMSE = 44.14 MPa.

Table 4 lists the 10-fold cross-validation accuracy numbers of the final Voting models on different subsets of the NIMS database. Figs. 3 and 4 show the scatter plots and error histograms of the same. The full 25-parameter model has the highest accuracy, and the performance deteriorates as the number of parameters is reduced, which is along expected lines, since fewer input features means less information for the machine learning model to learn from. However, it is interesting to note that the CFS-reduced 9-parameter model performs slightly better than the composition-only 9-parameter model, thereby underscoring the importance of including both composition and processing information in the model, and the efficacy of the CFS technique in determining a more informative 9-parameter set for building machine learning models.

The final modeling techniques determined to be the most accurate for the four subsets were also tested on a holdout set (also known as the train-test split setting for testing). Here a 3:1 split was used, with 75% of the data randomly selected for training, while the remaining 25% was used for testing the models. Note that choosing a 9:1 split would have essentially corresponded to one out of the ten iterations of the 10-fold cross-validation. A different split ratio with a smaller training split was thus chosen in order to more realistically evaluate the expected accuracy of the models on unseen data. Table 5 presents the accuracy numbers of the final Voting models on the 25% holdout dataset. As expected, the accuracy on the holdout set is marginally lower than the cross-validation accuracy, primarily because of the smaller training dataset.

Table 4
Accuracy of final Voting models on different attribute sets with 10-fold cross-validation setting.
Dataset    #Attributes    R    R²    MAE (MPa)    RMSE (MPa)    RAE (%)    RRSE (%)

Entire dataset 25 0.9909 0.9819 17.67 25.08 13.79 13.43


Composition only 9 0.9648 0.9308 38.86 48.14 30.64 26.24
Processing only 16 0.9868 0.9738 21.63 30.19 16.87 16.18
Reduced set 9 0.9716 0.9440 36.41 44.14 28.40 23.65

3.4. Most influential features for fatigue strength

Recall that the CFS technique used earlier to find reduced feature subsets works based on correlation. Therefore, in order to dig deeper towards an understanding of which features are most influential for fatigue strength, we look at the correlation of individual features amongst themselves and with fatigue strength. While the CFS-based analysis presented earlier aimed at finding the minimal subset of features with good predictive power, here we look at the predictive potential of individual features to understand the ranking of features in terms of their influence on fatigue strength. Figs. 5 and 6 present the heat map of intra-feature correlation values and the features ranked by correlation with fatigue strength, respectively. The following observations can be made from these figures w.r.t. processing parameters, composition parameters, and the property/performance metric (fatigue strength):

• Correlation with processing parameters: Relatively higher correlation (corresponding to darker cells in the top left region of Fig. 5) is observed between fatigue strength and processing parameters and amongst processing parameters themselves. Some of it is expected since many processing parameters are inherently coupled together (e.g. carburization temperature and carburization time), while some of it could also be an artifact of the way the dataset was constructed w.r.t. processing parameters, as described in Section 2.2. Nonetheless, it reconfirms the well-known existence of cause-effect PSPP relationships in steels and materials in general, underscoring the critical dependence of materials property/performance on processing via (micro)structure. Although all processing parameters were highly correlated with fatigue strength, the most influential ones were found to be related to tempering, carburization, diffusion, through hardening, and normalization (in that order). In particular, tempering time, carburization temperature/time, diffusion temperature/time, and quenching media temperature were highly positively correlated with fatigue strength. Given the way the dataset was constructed, most of these reflect the fact that performing one or more of these processing steps enhances the fatigue strength of steels. Through hardening temperature/time, tempering temperature, and the cooling rates of tempering and through hardening were found to be negatively correlated with fatigue strength, suggesting that through hardening with rapid cooling adversely affects fatigue strength. Further, it also indicates that tempering at relatively low temperatures (160–200 °C) for a long duration of time with slow cooling can improve the fatigue strength of steels. Fig. 7 shows the scatter plots of fatigue strength with tempering-related attributes.

Fig. 3. Scatter plots of the final Voting models for the four attribute sets.


Fig. 4. Error histograms of the final Voting models for the four attribute sets.

Table 5
Accuracy of final Voting models on different attribute sets with 3:1 train:test split.
Dataset    #Attributes    R    R²    MAE (MPa)    RMSE (MPa)    RAE (%)    RRSE (%)

Entire dataset 25 0.9892 0.9785 21.00 32.28 14.38 14.98


Composition only 9 0.9672 0.9357 42.66 53.89 29.22 25.01
Processing only 16 0.9862 0.9726 25.89 35.90 17.73 16.67
Reduced set 9 0.9769 0.9543 37.09 46.61 25.40 21.63

Fig. 5. Intra-feature correlation heat map with positive and negative correlations in red and blue respectively. (For interpretation of the references to color in this
figure legend, the reader is referred to the web version of this article.)


Fig. 6. Features ranked in order of decreasing absolute correlation with fatigue strength.

Fig. 7. Scatter plots of fatigue strength vs. tempering temperature, tempering time, and tempering cooling rate. Steels that were not tempered were represented by tempering time = 0, tempering temperature = 30 °C, and tempering cooling rate = 0. These curves are consistent with known knowledge that tempering can significantly enhance the fatigue strength of steels. Moreover, they also suggest that the following tempering configurations are more favorable for improving fatigue strength: tempering at relatively lower temperatures, tempering for a longer duration of time, and slow cooling.

• Correlation with composition parameters: Relatively lower correlation is observed between fatigue strength and composition parameters, between composition and processing parameters, and amongst composition parameters themselves. This is expected, since the composition parameters are largely independent of processing parameters and of each other as well. They certainly influence the fatigue strength, and in fact, the presence of small quantities of these elements is expected to boost the mechanical properties of steel alloys. But since the dataset in this study is an experimental sparse dataset with varying degrees of representation of the parameter spaces of different features, the influence of each individual element is confounded with other parameters, and only weak-to-moderate positive correlation or weak negative correlation with fatigue strength is observed for most elements. The only exception found is carbon, for which a moderate negative correlation of −0.41 was observed, which was quite surprising. A deeper dive into the dataset revealed the reason for this anomaly, which resulted from ignoring the effect of carburization in the correlation calculation. In the NIMS steel dataset, carburization is recorded as part of the processing parameters but not reflected in the composition data, although it effectively increases the carbon content on the outer surface of the steel. The correlation between carbon and fatigue strength within non-carburized and carburized steels was found to be 0.42 and 0.38 respectively, confirming carbon to be one of the most influential composition parameters for fatigue strength. Fig. 8 shows the scatter plot of fatigue strength with carbon content as recorded in the NIMS database. In addition to carbon, other elements that were found to be significantly influential to fatigue strength include chromium, molybdenum, copper, and silicon.

Fig. 8. Scatter plots of fatigue strength vs. carbon content. Carburized steels are denoted by red dots, and uncarburized steels by blue dots. Clearly, carbon content is positively correlated with fatigue strength and carburization further enhances the fatigue strength.

Although the above-described data science experiments and results demonstrate the potential of data-driven techniques to develop accurate predictive models and derive actionable insights, it is important to remember that the dataset in this study is a relatively small dataset where the parameter space of each feature is only sparsely represented, and the actual property is dependent on numerous factors and phenomena, many of which may not be recorded in this dataset or might even be unknown to us so far. Therefore, while the data-driven insights and models derived in this work are expected to aid in the discovery and design of new steels, they are intended not to replace but to complement the expertise of materials design engineers and should be used with caution.

4. Deployment: steel fatigue strength predictor

Most data-driven predictive models are, in general, not simple equations as in the case of something like linear regression, and are more like black box models that are not directly interpretable. This is even more the case with the advanced ensembling techniques used in this study. Therefore, it is not straightforward to use these forward models in traditional spreadsheet software to get property predictions for a given material representation, and usually some code/scripts in an appropriate programming language (depending on how the models were created in the first place) are needed to use such models and make predictions.


Fig. 9. A screenshot of the deployed steel fatigue strength predictor.

Therefore, to make the forward predictive models readily accessible for use by the materials science and engineering community, we have created an online steel fatigue strength predictor that can take as input the values of the composition and processing attributes of steels, and generate predictions of fatigue strength for the given steel. In addition to the models on the full set of attributes, the tool also has the option of using the models on the reduced set of attributes, which, although less accurate, might be useful in cases when all the attribute values are not available. The final Voting models are deployed in this tool for both attribute sets. The primary advantage of such a tool is ready access to fast and accurate forward models of PSPP relationships without the need


to do costly experiments and simulations, which can help identify promising candidates for further exploration with simulations and/or experiments. Fig. 9 shows a screenshot of the steel fatigue strength predictor, and the tool is available online at http://info.eecs.northwestern.edu/SteelFatigueStrengthPredictor.

5. Conclusion and future work

In this materials informatics study, we compared 40 different modeling techniques for predicting fatigue strength of steel alloys and analyzed the most influential features for fatigue strength using data from a publicly available experimental database from NIMS. The most accurate models were deployed in an online web-tool called the steel fatigue strength predictor. The deployed tool is expected to be a useful resource for researchers and practitioners in the materials science and engineering community.

The presented workflow of data analytics and deployment of forward PSPP models can be readily applied to other experimental and computational materials science data. Future work includes making attempts to further improve the model accuracy and generalizability by using/deriving more relevant attributes (such as by using CALPHAD techniques) and/or using data-driven modeling techniques to build and deploy accurate models for other material properties. We also believe that the demonstrated ability to build such fast and accurate forward models could also help in the future in realizing the inverse models of discovery and design, wherein new steel alloys with high fatigue strength can be identified along with the processing routes to make such high strength steels, enabling data-driven design of advanced steels.

Acknowledgments

The authors are grateful to NIMS for making the raw data on steel fatigue strength publicly available, and also to the authors of Ref. [33] for preprocessing the raw NIMS data and making it available as supplementary data accompanying Ref. [34]. This work was performed under financial assistance award 70NANB14H012 from the U.S. Department of Commerce, National Institute of Standards and Technology, as part of the Center for Hierarchical Materials Design (CHiMaD). The authors also acknowledge partial support from AFOSR award FA9550-12-1-0458.

References

[1] Agrawal A, Choudhary A. A fatigue strength predictor for steels using ensemble data mining. In: Proceedings of 25th ACM international conference on information and knowledge management (CIKM) (Demo); 2016. p. 2497–500.
[2] Hey T, Tansley S, Tolle K. The fourth paradigm: data-intensive scientific discovery. Microsoft Research; 2009. <http://research.microsoft.com/en-us/collaboration/fourthparadigm/>.
[3] Kalidindi SR, Graef MD. Materials data science: current status and future outlook. Ann Rev Mater Res 2015;45(1):171–93. http://dx.doi.org/10.1146/annurev-matsci-070214-020844.
[4] Rajan K. Materials informatics: the materials gene and big data. Ann Rev Mater Res 2015;45(1):153–69. http://dx.doi.org/10.1146/annurev-matsci-070214-021132.
[5] Agrawal A, Choudhary A. Perspective: materials informatics and big data: realization of the fourth paradigm of science in materials science. APL Mater 2016;4(053208):1–10.
[6] Hautier G, Fischer CC, Jain A, Mueller T, Ceder G. Finding natures missing ternary oxide compounds using machine learning and density functional theory. Chem Mater 2010;22(12):3762–7.
[7] Gopalakrishnan K, Agrawal A, Ceylan H, Kim S, Choudhary A. Knowledge discovery and data mining in pavement inverse analysis. Transport 2013;28(1):1–10.
[8] Deshpande P, Gautham BP, Cecen A, Kalidindi S, Agrawal A, Choudhary A. Application of statistical and machine learning techniques for correlating properties to composition and manufacturing processes of steels. John Wiley & Sons, Inc.; 2013.
[9] Meredig B, Agrawal A, Kirklin S, Saal JE, Doak JW, Thompson A, et al. Combinatorial screening for new materials in unconstrained composition space with machine learning. Phys Rev B 2014;89(094104):1–7.
[10] Kusne AG, Gao T, Mehta A, Ke L, Nguyen MC, Ho K-M, et al. On-the-fly machine-learning for high-throughput experiments: search for rare-earth-free permanent magnets. Sci Rep 2014;4.
[11] Liu R, Yabansu YC, Agrawal A, Kalidindi SR, Choudhary AN. Machine learning approaches for elastic localization linkages in high-contrast composite materials. Integr Mater Manuf Innov 2015;4(13):1–17.
[12] Balachandran PV, Theiler J, Rondinelli JM, Lookman T. Materials prediction via classification learning. Sci Rep 2015;5.
[13] Liu R, Kumar A, Chen Z, Agrawal A, Sundararaghavan V, Choudhary A. A predictive machine learning approach for microstructure optimization and materials design. Nat Sci Rep 2015;5. (11551).
[14] Faber F, Lindmaa A, von Lilienfeld OA, Armiento R. Crystal structure representations for machine learning models of formation energies. Int J Quant Chem 2015.
[15] Liu R, Ward L, Wolverton C, Agrawal A, Liao W-K, Choudhary A. Deep learning for chemical compound stability prediction. In: Proceedings of ACM SIGKDD workshop on large-scale deep learning for data mining (DL-KDD); 2016. p. 1–7.
[16] Ward L, Agrawal A, Choudhary A, Wolverton C. A general-purpose machine learning framework for predicting properties of inorganic materials. npj Comput Mater 2016;2:16028.
[17] Liu R, Agrawal A, Liao W-K, Graef MD, Choudhary A. Materials discovery: understanding polycrystals from large-scale electron patterns. In: Proceedings of IEEE BigData workshop on advances in software and hardware for big data to knowledge discovery (ASH); 2016. p. 2261–9.
[18] Furmanchuk A, Agrawal A, Choudhary A. Predictive analytics for crystalline materials: bulk modulus. RSC Adv 2016;6(97):95246–51.
[19] Agrawal A, Meredig B, Wolverton C, Choudhary A. A formation energy predictor for crystalline materials using ensemble data mining. In: Proceedings of IEEE international conference on data mining (ICDM) (Demo); 2016. p. 1276–9.
[20] Ward L, Liu R, Krishna A, Hegde VI, Agrawal A, Choudhary A, et al. Including crystal structure attributes in machine learning models of formation energies via Voronoi tessellations. Phys Rev B 2017;96(2):024104.
[21] Liu R, Yabansu YC, Yang Z, Choudhary AN, Kalidindi SR, Agrawal A. Context aware machine learning approaches for modeling elastic localization in three-dimensional composite microstructures. Integr Mater Manuf Innov 2017:1–12.
[22] Gagorik AG, Savoie B, Jackson N, Agrawal A, Choudhary A, Ratner MA, et al. Improved scaling of molecular network calculations: the emergence of molecular domains. J Phys Chem Lett 2017;8(2):415–21.
[23] Gopalakrishnan K, Khaitan SK, Choudhary A, Agrawal A. Deep convolutional neural networks with transfer learning for computer vision-based data-driven pavement distress detection. Constr Build Mater 2017;157:322–30.
[24] Furmanchuk A, Saal JE, Doak JW, Olson GB, Choudhary A, Agrawal A. Prediction of seebeck coefficient for compounds without restriction to fixed stoichiometry: a machine learning approach. J Comput Chem 2018;39(4):191–202.
[25] Gopalakrishnan K, Gholami H, Vidyadharan A, Choudhary A, Agrawal A. Crack damage detection in unmanned aerial vehicle images of civil infrastructure using pre-trained deep learning model. Int J Traffic Transp Eng 2018;8:1.
[26] Paul A, Acar P, Liu R, Liao W-K, Choudhary A, Sundararaghavan V, et al. Data sampling schemes for microstructure design with vibrational tuning constraints. Am Inst Aeronaut Astronaut (AIAA) J 2018;56(3):1239–50.
[27] Materials Genome Initiative for Global Competitiveness, June 2011; OSTP 2011.
[28] Materials Genome Initiative Strategic Plan, National Science and Technology Council Committee on Technology Subcommittee on the Materials Genome Initiative; June 2014.
[29] Ward CH, Warren JA, Hanisch RJ. Making materials science and engineering data more valuable research products. Integr Mater Manuf Innov 2014;3(1):1–17.
[30] Materials science and engineering data challenge. <https://www.challenge.gov/challenge/materials-science-and-engineering-data-challenge/> [accessed: March 31, 2016].
[31] Dieter GE. Mechanical metallurgy. 3rd ed. Mc Graw-Hill Book Co.; 1986.
[32] National Institute of Materials Science. <http://smds.nims.go.jp/fatigue/index_en.html> [accessed: March 31, 2016].
[33] Gautham BP, Kumar R, Bothra S, Mohapatra G, Kulkarni N, Padmanabhan KA. More efficient ICME through materials informatics and process modeling. John Wiley & Sons, Inc.; 2011. p. 35–42.
[34] Agrawal A, Deshpande PD, Cecen A, Basavarsu GP, Choudhary AN, Kalidindi SR. Exploration of data science techniques to predict fatigue strength of steel from composition and processing parameters. Integr Mater Manuf Innov 2014;3(8):1–19.
[35] Hall M. Correlation-based feature selection for machine learning [Ph.D. thesis]. Citeseer; 1999.
[36] Weher E. Edwards, Allen, L.: An introduction to linear regression and correlation. (A series of books in psychology.) W.H. Freeman and Comp., San Francisco 1976. 213 S., Tafelanh., s 7.00. Biometrical J 1977;19(1):83–4.
[37] Aha DW, Kibler D. Instance-based learning algorithms. Mach Learn 1991;37–66.
[38] Bishop C. Neural networks for pattern recognition. Oxford: University Press; 1995.
[39] Fausett L. Fundamentals of neural networks. New York: Prentice Hall; 1994.
[40] Ebden M. Gaussian processes for regression: a quick introduction; 2008. <http://www.robots.ox.ac.uk/mebden/reports/GPtutorial.pdf> [accessed: March 30, 2016].
[41] Vapnik VN. The nature of statistical learning theory. Springer; 2000.
[42] Shevade S, Keerthi S, Bhattacharyya C, Murthy K. Improvements to the SMO algorithm for SVM regression. In: IEEE transactions on neural networks; 1999.
[43] Kohavi R. The power of decision tables. Proceedings of the 8th European conference on machine learning, ECML '95. London, UK: Springer-Verlag; 1995. p. 174–89.
[44] Witten I, Frank E. Data mining: practical machine learning tools and techniques. Morgan Kaufmann Pub; 2005.
[45] Wang Y, Witten I. Induction of model trees for predicting continuous classes. In: Proc European conference on machine learning poster papers, Prague, Czech Republic; 1997. p. 128–37.
[46] Quinlan JR. Learning with continuous classes. World Scientific; 1992. p. 343–8.
[47] Breiman L, Friedman J, Olshen R, Stone C. Classification and regression trees. Monterey, CA: Wadsworth and Brooks; 1984.
[48] Breiman L. Random forests. Mach Learn 2001;45(1):5–32.
[49] Friedman JH. Stochastic gradient boosting. Comput Stat Data Anal 1999;38:367–78.
[50] Breiman L. Bagging predictors. Mach Learn 1996;24(2):123–40.
[51] Ho T. The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 1998;20(8):832–44.
[52] Rodriguez J, Kuncheva L, Alonso C. Rotation forest: a new classifier ensemble method. IEEE Trans Pattern Anal Mach Intell 2006;28(10):1619–30. http://dx.doi.org/10.1109/TPAMI.2006.211.
[53] Jolliffe IT. Principal component analysis. 2nd ed. Springer; 2002.
[54] Kittler J, Hatef M, Duin RP, Matas J. On combining classifiers. IEEE Trans Pattern Anal Mach Intell 1998;20(3):226–39.
[55] Hall M, Frank E, et al. The WEKA data mining software: an update. SIGKDD Explor 2009;11(1).
