2018 - An Online Tool For Predicting Fatigue Strength of Steel Alloys Based On Ensemble Data Mining
ARTICLE INFO

Keywords: Materials informatics; Supervised learning; Ensemble learning; Fatigue strength; Online tool

ABSTRACT

Fatigue strength is one of the most important mechanical properties of steel. Here we describe the development and deployment of data-driven ensemble predictive models for the fatigue strength of a given steel alloy represented by its composition and processing information. The forward models for PSPP relationships (predicting the property of a material given its composition and processing parameters) are built using over 400 experimental observations from the Japan National Institute of Materials Science (NIMS) steel fatigue dataset. Forty modeling techniques, including ensemble modeling, were explored to identify the set of best performing models for different attribute sets. Data-driven feature selection techniques were also used to find a small non-redundant subset of attributes, and the processing/composition parameters most influential to fatigue strength were identified to inform future design efforts. The developed predictive models are deployed in a user-friendly online web-tool available at http://info.eecs.northwestern.edu/SteelFatigueStrengthPredictor.
1. Introduction

The field of materials science and engineering involves conducting experiments and simulations to understand the science of materials in order to discover and engineer new materials with superior properties. A close look at the developments in the field of materials science and engineering over the centuries reveals that, like in any other field of science, three distinct stages of development can be identified here: empirical/experimental, theoretical, and computational/simulation-based. Over the last few years, the data generated by such experiments and simulations has grown exponentially, making it amenable to knowledge extraction via data-driven techniques, thereby heralding the arrival of the fourth paradigm of science [2], which is data-driven science, unifying the first three paradigms of experiment, theory, and simulation. In the field of materials science, this has led to the emergence of the new field called materials informatics [3-5], which has been very successful in recent years in deciphering the processing-structure-property-performance (PSPP) relationships in materials science [6-26].

In June 2011, the US government launched the Materials Genome Initiative (MGI) [27] to realize the vision of development of advanced materials necessary for economic security and human well-being. In particular, the Materials Genome Initiative "will enable discovery, development, manufacturing, and deployment of advanced materials at least twice as fast as possible today, at a fraction of the cost". MGI describes a Materials Innovation Infrastructure encompassing advanced computational, experimental, and data informatics tools. The Materials Genome Initiative Strategic Plan released in 2014 [28] also identifies data analytics as one of the key objectives as part of integrating experiments, computation, and theory, in order to realize the vision of MGI. It is worth noting that even though we are currently in the age of "big data", as far as the field of materials science is concerned, we are still far from it, since open, accessible data has been rather limited. However, recent MGI-supported efforts [28-30] and other similar efforts around the world are promoting the availability and accessibility of digital data in materials science.

It is in the spirit and pursuit of the above-described vision and approach of MGI that we discuss and present in this article an online data informatics tool to predict the fatigue strength of a given steel alloy, which is a crucial property to know, given the high cost and time of fatigue testing and the potentially disastrous consequences of fatigue failures. It is the most important information required for design and failure analysis of mechanical components. Fatigue is estimated to account for over 90% of all mechanical failures of structural components
☆ A conference version of this paper appeared as a short demonstration paper in the Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM), 2016, pp. 2497-2500 [1]. The current article significantly expands on the conference paper by presenting a comprehensive description of the methodologies, detailed comparison results, and scientific insights.
⁎ Corresponding author.
E-mail address: ankitag@eecs.northwestern.edu (A. Agrawal).
https://doi.org/10.1016/j.ijfatigue.2018.04.017
Received 7 April 2017; Received in revised form 26 March 2018; Accepted 14 April 2018
Available online 26 April 2018
0142-1123/ © 2018 Elsevier Ltd. All rights reserved.
A. Agrawal, A. Choudhary International Journal of Fatigue 113 (2018) 389–400
[31], and hence, fatigue strength prediction is of critical importance. The predictive models deployed in the tool are a result of the application of supervised learning techniques on an experimental fatigue dataset from the Japan National Institute of Materials Science MatNavi database [32], which is freely accessible online. This dataset has been previously used to build similar predictive models for fatigue strength [33,34], but the resulting models had not been deployed. Following are the primary contributions of this work:

• Comparison of 40 supervised modeling configurations on the NIMS steel fatigue dataset, including ensemble modeling techniques. Prior studies [33,34] did not explore the advanced ensemble modeling techniques used in this work; the maximum number of models explored previously was 12 in [34].
• More accurate predictive models than prior works on the same data. The R2 and MAE values from the new models were found to be statistically significantly better than the best models in [34] at p = 0.05. Visual inspection of scatter plots also reveals regions where the new models perform significantly better.
• Additional modeling experiments using only composition and only processing attributes to evaluate their predictive potential.
• Use of data-driven feature selection techniques to identify a reduced set of non-redundant attributes, and subsequent modeling experiments on those to obtain predictive models that use fewer input features while still achieving satisfactory predictive accuracy.
• Identification of the processing and composition parameters most influential to fatigue strength.
• Deployment of the most accurate "forward" models identified as a result of the above analysis in a web-tool.

The web-tool presented here is expected to be a useful resource for the materials science and engineering community to make fast and accurate predictions of this crucial property of steel, which can in turn aid in discovering better steels. The rest of the article is organized as follows: Section 2 presents the data mining workflow employed in this study. The data analytics experiments and results are presented in Section 3, and the online steel fatigue strength predictor deploying the predictive models in Section 4. We conclude the article with some future directions in Section 5.

2. Data mining workflow

The overall data-driven process is depicted as a block diagram in Fig. 1. We now describe the data and the various stages of the workflow.

2.1. Data

The fatigue dataset for steel from the Japan National Institute of Materials Science (NIMS) MatNavi [32] was used in this work, and is publicly available. This is one of the largest databases in the world with details on composition, mill product (upstream) features and subsequent processing (heat treatment) parameters. It consists of carbon and low-alloy steels, carburizing steels and spring steels. Apart from composition and processing details, it also has data on mechanical properties of steels, in particular rotating bending fatigue strength at 10^7 cycles at room temperature conditions. Fatigue strength is the highest stress that a material can withstand for a given number of cycles without breaking, and is thus an extremely critical property of steel for industrial use.

The features in the NIMS dataset can be categorized into the following:

• Chemical composition – %C, %Si, %Mn, %P, %S, %Ni, %Cr, %Cu, %Mo (all in wt%)
• Upstream processing details – ingot size, reduction ratio, non-metallic inclusions
• Heat treatment conditions – temperature, time and other process conditions for normalizing, through-hardening, carburizing, quenching and tempering processes
• Mechanical property – fatigue strength (MPa).

2.2. Preprocessing

We have used the data from [34], and summarize their preprocessing here. The raw dataset from NIMS consisted of multiple grades of steel, and in some records, some of the heat treatment processing steps did not exist. This is because different specimens can be subjected to different processing routes where some processing steps may not have occurred. In order to make a coherent database, all the key processes in the data (normalization, through hardening, carburization, quenching, tempering) were included. For the cases where a given process did not take place, the corresponding time variable was set to zero and the corresponding temperature was set to the austenization temperature or the average of the rest of the data where the process exists. This preprocessed data was also made publicly available as supplementary data accompanying [34] and is the starting point of the current study.

The preprocessed data has 437 instances/rows, 25 features/columns (composition and processing parameters), and one target property (fatigue strength). The details of the 25 attributes are given in Table 1.

We used the correlation feature selection (CFS) method for feature ranking. CFS identifies a subset of features highly correlated with the class variable and weakly correlated amongst themselves [35]. CFS was used in conjunction with a best-first search to find a subset S with the best average merit, given by:

Merit_S = (n · r̄_cf) / √(n + n(n − 1) · r̄_ff)

where n is the number of features in S, r̄_cf is the average feature-class correlation, and r̄_ff is the average feature-feature inter-correlation.
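The CFS merit can be computed directly from pairwise Pearson correlations. The following is a minimal NumPy sketch of the merit formula (the data here is a synthetic stand-in, not the NIMS dataset, and the function is an illustration rather than the exact implementation of [35]):

```python
import numpy as np

def cfs_merit(X, y):
    """CFS merit of feature subset X (columns) for target y:
    Merit = n * mean|feature-class corr| /
            sqrt(n + n*(n-1) * mean|feature-feature corr|)."""
    n = X.shape[1]
    # average absolute feature-class correlation (r_cf)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n)])
    # average absolute pairwise feature-feature correlation (r_ff)
    if n > 1:
        pairs = [abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                 for i in range(n) for j in range(i + 1, n)]
        r_ff = np.mean(pairs)
    else:
        r_ff = 0.0
    return n * r_cf / np.sqrt(n + n * (n - 1) * r_ff)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)
print(cfs_merit(X, y))
```

A best-first search over subsets would then retain the subset S maximizing this merit, trading feature-class relevance against feature-feature redundancy.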
Table 2
Comparison of different techniques in the 10-fold cross-validation setting using all 25 attributes (table sorted by MAE; accuracy numbers statistically indistinguishable from the best at p = 0.05 are boldfaced, as are the modeling techniques used in the final Voting model).
Modeling Scheme R R2 MAE RMSE RAE RRSE TrainTime TestTime ModelSize
(MPa) (MPa) (%) (%) (s) (s) (bytes)
RotationForest_M5 0.9900 0.9801 18.74 26.50 14.76 14.44 1.1058 0.0094 716,215
RotationForest_MLP 0.9894 0.9789 18.97 27.00 15.00 14.76 3.4866 0.0102 664,851
Bagging_MLP 0.9895 0.9791 18.97 27.03 14.99 14.78 3.0486 0.0009 99,700
AdditiveRegression_M5 0.9897 0.9795 19.05 26.66 15.01 14.54 0.3996 0.0003 44,210
Bagging_M5 0.9890 0.9781 19.36 27.96 15.23 15.21 0.8039 0.0008 264,058
M5 ModelTrees 0.9893 0.9787 19.64 27.46 15.45 14.94 0.0885 0.0001 19,684
NeuralNetworks (MLP) 0.9881 0.9763 19.89 28.41 15.72 15.56 0.3652 0.0002 13,616
RandomCommittee_MLP 0.9877 0.9756 20.37 29.14 16.05 15.84 3.0811 0.0010 99,424
AdditiveRegression_MLP 0.9851 0.9704 20.94 32.27 16.49 17.56 3.5265 0.0009 99,476
RandomCommittee_REPTree 0.9874 0.9750 21.39 29.48 16.86 16.09 0.0637 0.0003 83,087
Bagging_REPTree 0.9872 0.9746 21.44 29.90 16.88 16.29 0.0597 0.0002 103,283
RotationForest_REPTree 0.9871 0.9744 21.82 30.06 17.20 16.39 0.2026 0.0086 653,435
RotationForest_RandomTree 0.9866 0.9734 22.25 30.58 17.57 16.69 0.1698 0.0088 941,446
RandomForest 0.9875 0.9752 22.28 29.43 17.59 16.08 0.2594 0.0025 2,762,888
RandomCommittee_RandomTree 0.9858 0.9718 23.59 31.42 18.65 17.18 0.0416 0.0003 430,293
Bagging_RandomTree 0.9853 0.9708 23.96 31.78 18.94 17.38 0.0323 0.0003 275,199
SVM 0.9816 0.9635 24.34 36.65 19.12 19.90 0.3355 0.0001 110,816
RotationForest_LinearRegression 0.9834 0.9671 24.62 34.26 19.37 18.62 0.1698 0.0088 610,462
Bagging_LinearRegression 0.9832 0.9667 24.69 34.48 19.42 18.73 0.0341 0.0003 43,911
AdditiveRegression_REPTree 0.9797 0.9598 24.71 35.70 19.43 19.39 0.0127 0.0001 13,277
REPTree 0.9812 0.9628 24.87 35.17 19.59 19.18 0.0064 0.0001 10,449
AdditiveRegression_LinearRegression 0.9830 0.9663 24.99 34.64 19.66 18.84 0.0117 0.0001 12,111
LinearRegression 0.9830 0.9663 24.99 34.64 19.66 18.83 0.0031 0.0001 7598
RandomSubSpace_REPTree 0.9829 0.9661 26.48 34.18 20.93 18.70 0.0388 0.0003 89,933
RandomSubSpace_MLP 0.9814 0.9631 27.25 36.06 21.49 19.67 1.5278 0.0010 100,764
RandomSubSpace_M5 0.9821 0.9645 27.45 35.67 21.70 19.51 0.6407 0.0008 158,759
RandomSubSpace_RandomTree 0.9768 0.9541 31.19 38.65 24.72 21.22 0.0359 0.0005 349,143
AdditiveRegression_DecisionStump 0.9663 0.9337 34.42 47.87 27.12 26.13 0.0156 0.0001 4194
RandomTree 0.9708 0.9425 34.82 44.35 27.55 24.33 0.0041 0.0001 45,445
AdditiveRegression_RandomTree 0.9715 0.9438 34.85 43.68 27.60 23.96 0.0104 0.0001 76,641
GaussianProcess 0.9670 0.9351 34.86 48.11 27.35 26.13 0.0846 0.0740 732,187
RandomSubSpace_LinearRegression 0.9660 0.9332 36.41 49.40 28.66 26.92 0.0191 0.0005 54,937
DecisionTable 0.9445 0.8921 37.21 58.47 29.34 31.81 0.0511 0.0002 24,944
NearestNeighbor_Kstar 0.9610 0.9235 40.37 49.81 32.01 27.36 0.0001 0.1965 110,615
NearestNeighbor_Ibk 0.9539 0.9099 47.86 55.65 37.92 30.39 0.0001 0.0037 97,573
RotationForest_DecisionStump 0.8622 0.7434 70.67 91.81 55.60 50.15 0.1494 0.0085 572,252
RandomSubSpace_DecisionStump 0.8402 0.7059 73.13 97.49 57.55 53.31 0.0106 0.0002 24,493
DecisionStump 0.8402 0.7059 73.13 97.49 57.55 53.31 0.0016 0.0000 2298
Bagging_DecisionStump 0.8402 0.7059 73.16 97.55 57.57 53.34 0.0148 0.0000 4418
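The error metrics reported in Table 2 follow standard definitions: MAE and RMSE in MPa, and relative errors (RAE, RRSE) expressed as percentages of the errors of a baseline that always predicts the mean, as reported by Weka. A minimal sketch of these definitions:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MAE, RMSE, plus relative absolute error (RAE) and root relative
    squared error (RRSE) against the predict-the-mean baseline, in %."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    base = y_true - y_true.mean()          # errors of predicting the mean
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    rae = 100 * np.sum(np.abs(err)) / np.sum(np.abs(base))
    rrse = 100 * np.sqrt(np.sum(err ** 2) / np.sum(base ** 2))
    return mae, rmse, rae, rrse

mae, rmse, rae, rrse = regression_metrics([100, 200, 300], [110, 190, 310])
print(mae, rmse, rae, rrse)  # -> 10.0 10.0 15.0 ~12.25
```

RAE and RRSE below 100% indicate a model that beats the mean predictor; the best models in Table 2 reach roughly 15%.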
(R2 = 0.9308, MAE = 38.86 MPa, RMSE = 48.14 MPa), and also significantly better than all other models, so the Voting scheme was not necessary here to combine multiple models. For the processing-only dataset, two models resulted in statistically indistinguishable performance. One was RandomForest and the second was RandomCommittee with REPTree as the base regressor. Combining these two with the Voting scheme gave the following accuracy numbers: R2 = 0.9738, MAE = 21.63 MPa, RMSE = 30.19 MPa.

Clearly, neither composition attributes alone nor processing attributes alone performed as well as using both together, suggesting that they capture complementary information about materials and significantly contribute to model accuracy.
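The Voting scheme used here averages the predictions of the statistically indistinguishable top models. The study was carried out in Weka; the sketch below uses scikit-learn's VotingRegressor as an analogue, with synthetic stand-in data and stand-in base learners (a random forest and a bagged-tree "committee"), not the paper's exact Weka configurations:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (BaggingRegressor, RandomForestRegressor,
                              VotingRegressor)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (the paper uses the NIMS steel fatigue dataset)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Average predictions of the two indistinguishable models, analogous to
# Weka's Vote meta-learner with the "average of probabilities" rule
vote = VotingRegressor([
    ("random_forest", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("committee", BaggingRegressor(DecisionTreeRegressor(random_state=0),
                                   n_estimators=10, random_state=0)),
])
scores = cross_val_score(vote, X, y, cv=10,
                         scoring="neg_mean_absolute_error")
mae = -scores.mean()
print(f"10-fold CV MAE: {mae:.2f}")
```

Averaging de-correlated base models in this way typically reduces variance relative to either model alone, which is why the combined model matches or beats its components.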
Fig. 2. Scatter plots comparing the best model from [34] and the final model from the current study based on Voting scheme. The new model can be seen to perform
significantly better in the low fatigue strength region of the plot where the old model had failed.
Table 3
Comparison of different techniques in the 10-fold cross-validation setting using the reduced subset of 9 non-redundant attributes (table sorted by MAE; accuracy numbers statistically indistinguishable from the best at p = 0.05 are boldfaced, as are the modeling techniques used in the final Voting model).
Modeling Scheme R R2 MAE RMSE RAE RRSE TrainTime TestTime ModelSize
(MPa) (MPa) (%) (%) (s) (s) (bytes)
RandomCommittee_REPTree 0.9680 0.9370 37.36 45.74 29.57 25.04 0.0265 0.0001 46,157
RotationForest_MLP 0.9679 0.9368 37.86 46.18 29.98 25.29 1.1262 0.0036 257,569
RotationForest_REPTree 0.9673 0.9357 37.86 46.41 29.96 25.43 0.0671 0.0029 241,800
M5 ModelTrees 0.9666 0.9343 38.25 46.85 30.24 25.63 0.0547 0.0001 14,884
AdditiveRegression_M5 0.9666 0.9343 38.25 46.85 30.24 25.63 0.1678 0.0002 26,643
Bagging_MLP 0.9669 0.9349 38.44 46.92 30.42 25.68 1.0263 0.0005 70,459
Bagging_M5 0.9662 0.9335 38.52 47.17 30.46 25.82 0.4883 0.0005 115,962
RotationForest_M5 0.9667 0.9345 38.53 47.08 30.45 25.75 0.5809 0.0033 300,707
Bagging_REPTree 0.9661 0.9333 38.71 47.16 30.65 25.83 0.0251 0.0001 69,484
NeuralNetworks (MLP) 0.9659 0.9330 38.79 47.56 30.71 26.04 0.0997 0.0002 10,070
RandomCommittee_MLP 0.9648 0.9308 39.24 48.34 31.09 26.48 1.0321 0.0005 70,183
REPTree 0.9631 0.9276 39.26 48.83 31.02 26.68 0.0026 0.0001 6370
AdditiveRegression_REPTree 0.9610 0.9235 39.66 50.01 31.28 27.23 0.0048 0.0001 7454
RandomSubSpace_REPTree 0.9615 0.9245 40.16 50.44 31.70 27.56 0.0186 0.0002 71,785
AdditiveRegression_MLP 0.9622 0.9258 40.46 49.34 32.07 27.10 0.7933 0.0006 65,278
DecisionTable 0.9551 0.9122 41.17 53.09 32.59 29.13 0.0123 0.0002 18,154
RandomSubSpace_M5 0.9599 0.9214 42.53 52.77 33.53 28.85 0.4339 0.0009 130,728
SVM 0.9528 0.9078 43.85 56.24 34.57 30.66 0.0881 0.0001 58,672
RotationForest_LinearRegression 0.9540 0.9101 43.90 55.37 34.62 30.20 0.0526 0.0031 221,755
Bagging_LinearRegression 0.9537 0.9095 43.98 55.58 34.68 30.31 0.0091 0.0002 30,594
RandomSubSpace_MLP 0.9552 0.9124 44.07 55.21 34.72 30.15 0.6222 0.0008 79,472
LinearRegression 0.9534 0.9090 44.11 55.76 34.79 30.41 0.0009 0.0001 5638
AdditiveRegression_LinearRegression 0.9534 0.9090 44.11 55.76 34.79 30.41 0.0029 0.0002 8955
RandomForest 0.9551 0.9122 44.36 53.68 35.15 29.46 0.2147 0.0024 1,844,181
AdditiveRegression_DecisionStump 0.9533 0.9088 44.43 55.12 35.07 30.20 0.0069 0.0001 3503
Bagging_RandomTree 0.9527 0.9076 45.76 55.07 36.28 30.21 0.0280 0.0002 182,919
RandomSubSpace_RandomTree 0.9497 0.9019 46.96 56.83 37.23 31.20 0.0250 0.0006 214,037
NearestNeighbor_Kstar 0.9481 0.8989 47.16 57.46 37.40 31.57 0.0002 0.0843 51,433
NearestNeighbor_Ibk 0.9498 0.9021 47.24 56.94 37.47 31.26 0.0001 0.0020 45,971
RotationForest_RandomTree 0.9489 0.9004 47.26 57.19 37.49 31.42 0.0754 0.0031 398,752
RandomCommittee_RandomTree 0.9493 0.9012 47.31 57.11 37.53 31.37 0.0315 0.0003 205,941
AdditiveRegression_RandomTree 0.9487 0.9000 47.42 57.46 37.61 31.55 0.0107 0.0001 29,506
RandomTree 0.9474 0.8976 47.63 57.98 37.77 31.86 0.0031 0.0001 22,236
GaussianProcess 0.9145 0.8363 52.83 73.36 41.50 39.93 0.0804 0.0706 680,172
RandomSubSpace_LinearRegression 0.9266 0.8586 56.35 75.39 44.09 40.80 0.0087 0.0004 42,166
RotationForest_DecisionStump 0.8533 0.7281 71.49 94.09 56.22 51.40 0.0480 0.0027 198,090
DecisionStump 0.8402 0.7059 73.13 97.49 57.55 53.31 0.0006 0.0000 1607
Bagging_DecisionStump 0.8397 0.7051 73.21 97.63 57.62 53.39 0.0062 0.0001 3727
RandomSubSpace_DecisionStump 0.8115 0.6585 76.10 104.92 59.79 57.24 0.0059 0.0002 18,168
3.3. Feature selection and modeling in low dimensional space

It would also be interesting to identify a smaller non-redundant subset of attributes that are most influential in predicting fatigue strength. As the previous analysis confirmed that composition and processing attributes capture complementary information and are both important for the model, we used the correlation feature selection (CFS) technique to identify subsets of both kinds of attributes. The application of the CFS technique to composition attributes identified a subset of six composition attributes: C, Si, P, Cr, Cu, and Mo. The same analysis on processing attributes identified a subset of three processing attributes: THT (through hardening temperature), THQCr (cooling rate for through hardening), and Tt (tempering time). We combined these six composition and three processing attributes to make a new dataset of nine attributes, and once again performed the regression modeling with various modeling schemes using the same settings (10 runs of 10-fold cross-validation) to obtain the best predictive model for this dataset. Table 3 presents the comparison results. The top three models were found to have statistically indistinguishable accuracy on all performance metrics, and were thus combined using the Voting scheme, resulting in the following accuracy numbers: R2 = 0.9440, MAE = 36.41 MPa, RMSE = 44.14 MPa.

Table 4 lists the 10-fold cross-validation accuracy numbers of the final Voting models on different subsets of the NIMS database. Figs. 3 and 4 show the scatter plots and error histograms of the same. The full 25-parameter model has the highest accuracy, and the performance deteriorates as the number of parameters is reduced, which is along expected lines, since fewer input features means less information for the machine learning model to learn from. However, it is interesting to note that the CFS-reduced 9-parameter model performs slightly better than the composition-only 9-parameter model, thereby underscoring the importance of including both composition and processing information in the model, and the efficacy of the CFS technique in determining a more informative 9-parameter set for building machine learning models.

The final modeling techniques determined to be the most accurate for the four subsets were also tested on a holdout set (also known as the train-test split setting). Here a 3:1 split was used, with 75% of the data randomly selected for training, while the remaining 25% was used for testing the models. Note that choosing a 9:1 split would have essentially corresponded to one out of the ten iterations of the 10-fold cross-validation. A different split ratio with a smaller training split was thus chosen in order to more realistically evaluate the expected accuracy of the models on unseen data. Table 5 presents the accuracy numbers of the final Voting models on the 25% holdout dataset. As expected, the accuracy on the holdout set is marginally lower than the cross-validation accuracy, primarily because of the smaller training dataset.

3.4. Most influential features for fatigue strength

Recall that the CFS technique used earlier to find reduced feature subsets works based on correlation. Therefore, in order to dig deeper
Table 4
Accuracy of final Voting models on different attribute sets with 10-fold cross-validation setting.
Dataset #Attributes R R2 MAE RMSE RAE RRSE
(MPa) (MPa) (%) (%)
towards understanding which features are most influential for fatigue strength, we look at the correlation of individual features amongst themselves and with fatigue strength. While the CFS-based analysis presented earlier aimed at finding the minimal subset of features with good predictive power, here we look at the predictive potential of individual features to understand the ranking of features in terms of their influence on fatigue strength. Figs. 5 and 6 present the heat map of intra-feature correlation values and the features ranked by correlation with fatigue strength, respectively. The following observations can be made from these figures w.r.t. processing parameters, composition parameters, and the property/performance metric (fatigue strength):

• Correlation with processing parameters: Relatively higher correlation (corresponding to darker cells in the top left region of Fig. 5) is observed between fatigue strength and processing parameters and amongst processing parameters themselves. Some of it is expected since many processing parameters are inherently coupled together (e.g. carburization temperature and carburization time), while some of it could also be an artifact of the way the dataset was constructed w.r.t. processing parameters, as described in Section 2.2. Nonetheless, it reconfirms the well-known existence of cause-effect PSPP relationships in steels and materials in general, underscoring the critical dependence of materials property/performance on processing via (micro)structure. Although all processing parameters were highly correlated with fatigue strength, the most influential ones were found to be related to tempering, carburization, diffusion, through hardening, and normalization (in that order). In particular, tempering time, carburization temperature/time, diffusion temperature/time, and quenching media temperature were highly positively correlated with fatigue strength. Given the way the dataset was constructed, most of these reflect the fact that performing one or more of these processing steps enhances the fatigue strength of steels. Through hardening temperature/time, tempering temperature, and cooling rates of tempering and through hardening were found to be negatively correlated with fatigue strength, suggesting that through hardening with rapid cooling adversely affects fatigue strength.
Fig. 3. Scatter plots of the final Voting models for the four attribute sets.
Fig. 4. Error histograms of the final Voting models for the four attribute sets.
Table 5
Accuracy of final Voting models on different attribute sets with 3:1 train:test split.
Dataset #Attributes R R2 MAE RMSE RAE RRSE
(MPa) (MPa) (%) (%)
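The 3:1 holdout evaluation reported in Table 5 can be sketched as follows. This uses scikit-learn as an analogue of the Weka setup, with synthetic stand-in data of the same shape as the preprocessed NIMS dataset (437 instances, 25 attributes) and a random forest as a stand-in final model:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in shaped like the preprocessed NIMS data (not the real data)
X, y = make_regression(n_samples=437, n_features=25, noise=15.0, random_state=0)

# 3:1 split: 75% of the data for training, the remaining 25% held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)
print(f"R2 = {r2_score(y_test, pred):.4f}, "
      f"MAE = {mean_absolute_error(y_test, pred):.2f}")
```

Because the model never sees the 25% holdout during training, this estimate is more pessimistic, and often more realistic, than 10-fold cross-validation on the same data, consistent with the marginally lower accuracies in Table 5.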
Fig. 5. Intra-feature correlation heat map with positive and negative correlations in red and blue respectively. (For interpretation of the references to color in this
figure legend, the reader is referred to the web version of this article.)
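The per-feature correlation ranking shown in Fig. 6 amounts to sorting features by the magnitude of their Pearson correlation with the target while keeping the sign. A minimal pandas sketch, using a toy table with hypothetical column names standing in for the NIMS attributes:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy stand-in for the NIMS table; column names are illustrative only
df = pd.DataFrame({
    "Tt":  rng.normal(size=100),   # tempering time
    "THT": rng.normal(size=100),   # through hardening temperature
    "C":   rng.normal(size=100),   # wt% carbon
})
# Construct a target positively driven by Tt and negatively by THT
df["Fatigue"] = 2.0 * df["Tt"] - 1.0 * df["THT"] + 0.1 * rng.normal(size=100)

# Pearson correlation of every feature with the target, ranked by magnitude
corr = df.corr()["Fatigue"].drop("Fatigue")
ranking = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranking)
```

The full intra-feature matrix behind Fig. 5 is simply `df.corr()`; the signed values preserve whether a parameter enhances or degrades fatigue strength.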
the values of the composition and processing attributes of steels, and generate predictions of fatigue strength for the given steel. In addition to the models on the full set of attributes, the tool also has the option of using the models on the reduced set of attributes, which although less accurate, might be useful in cases when all the attribute values are not available. The final Voting models are deployed in this tool for both attribute sets. The primary advantage of such a tool is ready access to fast and accurate forward models of PSPP relationships without the need
to do costly experiments and simulations, which can help identify promising candidates for further exploration with simulations and/or experiments. Fig. 9 shows the screenshot of the steel fatigue strength predictor, and the tool is available online at http://info.eecs.northwestern.edu/SteelFatigueStrengthPredictor.

5. Conclusion and future work

In this materials informatics study, we compared 40 different modeling techniques for predicting fatigue strength of steel alloys and analyzed the most influential features for fatigue strength using data from a publicly available experimental database from NIMS. The most accurate models were deployed in an online web-tool called the steel fatigue strength predictor. The deployed tool is expected to be a useful resource for researchers and practitioners in the materials science and engineering community.

The presented workflow of data analytics and deployment of forward PSPP models can be readily applied on other experimental and computational materials science data. Future work includes attempts to further improve model accuracy and generalizability by using/deriving more relevant attributes (such as by using CALPHAD techniques) and/or more advanced data-driven modeling techniques, and building and deploying accurate models for other material properties. We also believe that the demonstrated ability to build such fast and accurate forward models could also help in the future in realizing the inverse models of discovery and design, wherein new steel alloys with high fatigue strength can be identified along with the processing routes to make such high strength steels, enabling data-driven design of advanced steels.

Acknowledgments

The authors are grateful to NIMS for making the raw data on steel fatigue strength publicly available, and also to the authors of Ref. [33] for preprocessing the raw NIMS data and making it available as supplementary data accompanying Ref. [34]. This work was performed under financial assistance award 70NANB14H012 from the U.S. Department of Commerce, National Institute of Standards and Technology as part of the Center for Hierarchical Materials Design (CHiMaD). The authors also acknowledge partial support from AFOSR award FA9550-12-1-0458.

References

[1] Agrawal A, Choudhary A. A fatigue strength predictor for steels using ensemble data mining. In: Proceedings of 25th ACM international conference on information and knowledge management (CIKM) (Demo); 2016. p. 2497–500.
[2] Hey T, Tansley S, Tolle K. The fourth paradigm: data-intensive scientific discovery. Microsoft Research; 2009. <http://research.microsoft.com/en-us/collaboration/fourthparadigm/>.
[3] Kalidindi SR, Graef MD. Materials data science: current status and future outlook. Ann Rev Mater Res 2015;45(1):171–93. http://dx.doi.org/10.1146/annurev-matsci-070214-020844.
[4] Rajan K. Materials informatics: the materials gene and big data. Ann Rev Mater Res 2015;45(1):153–69. http://dx.doi.org/10.1146/annurev-matsci-070214-021132.
[5] Agrawal A, Choudhary A. Perspective: materials informatics and big data: realization of the fourth paradigm of science in materials science. APL Mater 2016;4(5):053208.
[10] … magnets. Sci Rep 2014;4.
[11] Liu R, Yabansu YC, Agrawal A, Kalidindi SR, Choudhary AN. Machine learning approaches for elastic localization linkages in high-contrast composite materials. Integr Mater Manuf Innov 2015;4(13):1–17.
[12] Balachandran PV, Theiler J, Rondinelli JM, Lookman T. Materials prediction via classification learning. Sci Rep 2015;5.
[13] Liu R, Kumar A, Chen Z, Agrawal A, Sundararaghavan V, Choudhary A. A predictive machine learning approach for microstructure optimization and materials design. Sci Rep 2015;5:11551.
[14] Faber F, Lindmaa A, von Lilienfeld OA, Armiento R. Crystal structure representations for machine learning models of formation energies. Int J Quant Chem 2015.
[15] Liu R, Ward L, Wolverton C, Agrawal A, Liao W-K, Choudhary A. Deep learning for chemical compound stability prediction. In: Proceedings of ACM SIGKDD workshop on large-scale deep learning for data mining (DL-KDD); 2016. p. 1–7.
[16] Ward L, Agrawal A, Choudhary A, Wolverton C. A general-purpose machine learning framework for predicting properties of inorganic materials. npj Comput Mater 2016;2:16028.
[17] Liu R, Agrawal A, Liao W-K, Graef MD, Choudhary A. Materials discovery: understanding polycrystals from large-scale electron patterns. In: Proceedings of IEEE BigData workshop on advances in software and hardware for big data to knowledge discovery (ASH); 2016. p. 2261–9.
[18] Furmanchuk A, Agrawal A, Choudhary A. Predictive analytics for crystalline materials: bulk modulus. RSC Adv 2016;6(97):95246–51.
[19] Agrawal A, Meredig B, Wolverton C, Choudhary A. A formation energy predictor for crystalline materials using ensemble data mining. In: Proceedings of IEEE international conference on data mining (ICDM) (Demo); 2016. p. 1276–9.
[20] Ward L, Liu R, Krishna A, Hegde VI, Agrawal A, Choudhary A, et al. Including crystal structure attributes in machine learning models of formation energies via Voronoi tessellations. Phys Rev B 2017;96(2):024104.
[21] Liu R, Yabansu YC, Yang Z, Choudhary AN, Kalidindi SR, Agrawal A. Context aware machine learning approaches for modeling elastic localization in three-dimensional composite microstructures. Integr Mater Manuf Innov 2017:1–12.
[22] Gagorik AG, Savoie B, Jackson N, Agrawal A, Choudhary A, Ratner MA, et al. Improved scaling of molecular network calculations: the emergence of molecular domains. J Phys Chem Lett 2017;8(2):415–21.
[23] Gopalakrishnan K, Khaitan SK, Choudhary A, Agrawal A. Deep convolutional neural networks with transfer learning for computer vision-based data-driven pavement distress detection. Constr Build Mater 2017;157:322–30.
[24] Furmanchuk A, Saal JE, Doak JW, Olson GB, Choudhary A, Agrawal A. Prediction of Seebeck coefficient for compounds without restriction to fixed stoichiometry: a machine learning approach. J Comput Chem 2018;39(4):191–202.
[25] Gopalakrishnan K, Gholami H, Vidyadharan A, Choudhary A, Agrawal A. Crack damage detection in unmanned aerial vehicle images of civil infrastructure using pre-trained deep learning model. Int J Traffic Transp Eng 2018;8:1.
[26] Paul A, Acar P, Liu R, Liao W-K, Choudhary A, Sundararaghavan V, et al. Data sampling schemes for microstructure design with vibrational tuning constraints. Am Inst Aeronaut Astronaut (AIAA) J 2018;56(3):1239–50.
[27] Materials Genome Initiative for Global Competitiveness, June 2011; OSTP 2011.
[28] Materials Genome Initiative Strategic Plan, National Science and Technology Council Committee on Technology Subcommittee on the Materials Genome Initiative; June 2014.
[29] Ward CH, Warren JA, Hanisch RJ. Making materials science and engineering data more valuable research products. Integr Mater Manuf Innov 2014;3(1):1–17.
[30] Materials science and engineering data challenge. <https://www.challenge.gov/challenge/materials-science-and-engineering-data-challenge/> [accessed: March 31, 2016].
[31] Dieter GE. Mechanical metallurgy. 3rd ed. McGraw-Hill Book Co.; 1986.
[32] National Institute of Materials Science. <http://smds.nims.go.jp/fatigue/index_en.html> [accessed: March 31, 2016].
[33] Gautham BP, Kumar R, Bothra S, Mohapatra G, Kulkarni N, Padmanabhan KA. More efficient ICME through materials informatics and process modeling. John Wiley & Sons, Inc.; 2011. p. 35–42.
[34] Agrawal A, Deshpande PD, Cecen A, Basavarsu GP, Choudhary AN, Kalidindi SR. Exploration of data science techniques to predict fatigue strength of steel from composition and processing parameters. Integr Mater Manuf Innov 2014;3(8):1–19.
[35] Hall M. Correlation-based feature selection for machine learning [Ph.D. thesis]. Citeseer; 1999.
[36] Weher E. Edwards, Allen L.: An introduction to linear regression and correlation. (A series of books in psychology.) W.H. Freeman and Comp., San Francisco 1976. 213 S. Biometrical J 1977;19(1):83–4.
[37] Aha DW, Kibler D. Instance-based learning algorithms. Mach Learn 1991;37–66.
2016;4(053208):1–10. [38] Bishop C. Neural networks for pattern recognition. Oxford: University Press; 1995.
[6] Hautier G, Fischer CC, Jain A, Mueller T, Ceder G. Finding natures missing ternary [39] Fausett L. Fundamentals of neural networks. New York: Prentice Hall; 1994.
oxide compounds using machine learning and density functional theory. Chem [40] Ebden M. Gaussian processes for regression: a quick introduction; 2008. < http://
Mater 2010;22(12):3762–7. www.robots.ox.ac.uk/mebden/reports/GPtutorial.pdf > [accessed: March 30,
[7] Gopalakrishnan K, Agrawal A, Ceylan H, Kim S, Choudhary A. Knowledge discovery 2016].
and data mining in pavement inverse analysis. Transport 2013;28(1):1–10. [41] Vapnik VN. The nature of statistical learning theory. Springer; 2000.
[8] Deshpande P, Gautham BP, Cecen A, Kalidindi S, Agrawal A, Choudhary A. [42] Shevade S, Keerthi S, Bhattacharyya C, Murthy K. Improvements to the SMO al-
Application of statistical and machine learning techniques for correlating properties gorithm for SVM regression. In: IEEE transactions on neural networks; 1999.
to composition and manufacturing processes of steels. John Wiley & Sons, Inc.; [43] Kohavi R. The power of decision tables. Proceedings of the 8th European conference
2013. on machine learning, ECML ’95. London, UK: Springer-Verlag; 1995. p. 174–89.
[9] Meredig B, Agrawal A, Kirklin S, Saal JE, Doak JW, Thompson A, et al. [44] Witten I, Frank E. Data mining: practical machine learning tools and techniques.
Combinatorial screening for new materials in unconstrained composition space with Morgan Kaufmann Pub; 2005.
machine learning. Phys Rev B 2014;89(094104):1–7. [45] Wang Y, Witten I. Induction of model trees for predicting continuous classes. In:
[10] Kusne AG, Gao T, Mehta A, Ke L, Nguyen MC, Ho K-M, et al. On-the-fly machine- Proc European conference on machine learning poster papers, Prague, Czech
learning for high-throughput experiments: search for rare-earth-free permanent Republic; 1997. p. 128–37.
A. Agrawal, A. Choudhary International Journal of Fatigue 113 (2018) 389–400
[46] Quinlan JR. Learning with continuous classes. World Scientific; 1992. p. 343–8.
[47] Breiman L, Friedman J, Olshen R, Stone C. Classification and regression trees. Monterey, CA: Wadsworth and Brooks; 1984.
[48] Breiman L. Random forests. Mach Learn 2001;45(1):5–32.
[49] Friedman JH. Stochastic gradient boosting. Comput Stat Data Anal 1999;38:367–78.
[50] Breiman L. Bagging predictors. Mach Learn 1996;24(2):123–40.
[51] Ho T. The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 1998;20(8):832–44.
[52] Rodriguez J, Kuncheva L, Alonso C. Rotation forest: a new classifier ensemble method. IEEE Trans Pattern Anal Mach Intell 2006;28(10):1619–30. http://dx.doi.org/10.1109/TPAMI.2006.211.
[53] Jolliffe IT. Principal component analysis. 2nd ed. Springer; 2002.
[54] Kittler J, Hatef M, Duin RP, Matas J. On combining classifiers. IEEE Trans Pattern Anal Mach Intell 1998;20(3):226–39.
[55] Hall M, Frank E, et al. The WEKA data mining software: an update. SIGKDD Explor 2009;11(1).