
Statistical Analysis of Inclusion Chemistry Distributions in Steels

M. Abdulsalam1, P. Kaushik2, B. Webler1


1 Carnegie Mellon University
5000 Forbes Ave, Wean Hall 3325, Pittsburgh, PA, USA 15213
Phone: (412) 638-3310
Email: mabdulsa@andrew.cmu.edu
2 ArcelorMittal Global R&D
3001 E Columbus Drive, East Chicago, IN, USA 46312
Phone: (219) 399-5423
Email: pallava.kaushik@arcelormittal.com

Keywords: Inclusion, Statistical Analysis, Clustering, EM algorithm, K-means

INTRODUCTION
Non-metallic inclusions play a major role in defining the properties and cleanliness of steel. They are an inevitable product of
the chemical reactions occurring during steel processing, and if not controlled they can drastically affect the properties of
steel, and in some cases be detrimental to in-use performance. The evolution and development of inclusions
throughout steel processing is an area of ongoing research [1]. Thus, understanding the types, chemistries, and size distributions
of non-metallic inclusions is vital for optimizing the properties and processing of steel. Numerous techniques have been
developed for detecting and analyzing inclusions, such as conventional optical microscopy, laser-induced spectrometry,
ultrasonic testing, X-ray computed tomography, and scanning electron microscopy (SEM), to name a few [2]. An integral part
of the analysis is the identification or classification of the inclusions, since certain types of inclusions can have adverse
effects on the steel’s properties. Various methods have been applied to identify the types of inclusions present within a steel
sample, using user-defined rules. Story et al. [3] utilized an automated steel cleanliness analysis tool (ASCAT) to examine the
chemistry, content, and size of inclusions, which in turn helped to improve the castability and properties of steel.
Statistical analysis has been widely used in all fields of research as it provides further insight and explores correlations
between data by combining the collection, analysis, interpretation, presentation, and organization of the data [4]. Generally,
the aim of statistical analysis is to identify patterns or trends in data sets that are otherwise difficult to recognize. Traditional
methods of sampling data and interpreting results have served for relatively small sample sizes, while for multi-dimensional
and extremely large data sets, today's powerful computers and advanced algorithms make statistical analysis a valuable
research tool. The output of inclusion analysis falls in this category of large, multivariate data sets. The variables in each
observation can correspond to the chemistry, dimensions, morphology, or location of each inclusion; therefore, examining
the correlations and trends between these variables can be very informative.
Nowadays, investigation of inclusions is widely carried out using scanning electron microscopes (SEM) equipped with
energy dispersive X-ray (EDX) spectrometers, using automated inclusion characterization (AIC). By scanning steel samples
in electron microscopes, automated analysis identifies inclusions within the steel matrix and determines their size,
morphology, and chemistry [5]. This study investigates ways to optimize data analysis and increase the amount of information
obtained from automated inclusion studies, by using statistical analyses to identify correlations in the multivariable data sets
generated by AIC. So far, the statistical techniques applied in this study have focused mainly on cluster analysis.
Not to be confused with the physical clustering of inclusions, cluster analysis, sometimes termed clustering, is an
automated method of arranging data into groups (referred to as clusters). Accordingly, the analysis referred to in this study
pertains to clusters of data, where a cluster comprises inclusions with similar chemistries, not a physical cluster of
inclusions. Clustering groups observations by similarity, such that observations in one cluster are related to each other, in one
way or another, more than to observations in other clusters. Clustering itself is not one specific algorithm that can be applied to
any set of data; rather, it is a general term for a family of methods that serve a specific task. There are numerous clustering models, each with its



own algorithm for defining and identifying a cluster; as a result, the different models vary significantly depending on the
input parameters and desired output [6]. Thus, a thorough understanding of the cluster models and the data sample is essential
for selecting the appropriate clustering algorithm.
Figure 1 below provides a visual illustration of a clustering algorithm applied to the "Old Faithful" data set, a sample
bivariate data set. The algorithm identified 3 clusters, as shown on the right, where the observations are color-coded according
to the cluster to which they belong.

Figure 1 Clustering of "Old Faithful" Data Set

EXPERIMENTAL PROCEDURE
The current aim of this study is to use the output of the automated analysis to determine and propose an optimum clustering
model that will identify the types of inclusions present in a steel sample.
The input data set for clustering consists of the inclusions in a sample along with the chemical composition of each inclusion:
the number of inclusions identified is the number of observations in the data set, and their chemistries are the variables
(dimensionality). The desired output is the number of clusters along with their chemistries, corresponding to the type of
inclusion each cluster represents, such as spinel (MgAl2O4), Al2O3, CaS, etc.
For the purpose of this study, data sets were limited to 5 dimensions, the Mg, Al, Ca, S and Mn contents of each inclusion,
obtained from the automated analysis.
Sample Preparation and SEM Analysis
All samples investigated in this study are from an industrial production facility: two ladle samples and one tundish sample,
all from the same heat. The lollipop samples were cut to less than 1 inch in length and mounted to fit the SEM; they were
then ground and polished in incremental steps to achieve a smooth, polished cross-section ready for the automated analysis.
The automated analysis was conducted in an FEI/Aspex Explorer SEM utilizing a back-scattered electron detector, equipped
with an EDX analyzer. The operating parameters were held constant for all samples to ensure consistency: the accelerating
voltage was 10 kV, the spot size was set to 40%, and the working distance to 16 - 17 mm. Based on previous literature [7],
automated inclusion analyses conducted at 10 kV are more precise because higher voltages, such as 20 kV, produce steel-matrix
distortion effects. A spot size of 40% is a good compromise between image resolution and analysis speed at an accelerating
voltage of 10 kV. The selected working distance is the optimal distance for EDX analysis specified by the SEM's
manufacturer, FEI [8].
The EDX analysis recorded the counts of each chemical element within the inclusions; these counts were converted to
estimated chemical compositions using the Merlet algorithm [9] in an Excel sheet. Thereafter, the results were filtered to
ensure that all the readings to be analyzed came from actual non-metallic inclusions, not from contaminants on the surface
of the samples or from erroneous readings. The filtering criteria excluded any reading where [Fe] plus [Mn] exceeded 85%,
any reading with more than 100 Si counts, and any reading with an area greater than 19.6 μm². The limit of 19.6 μm² was
specified because it corresponds to an equivalent circle diameter of 5 μm, the expected inclusion size [10]; therefore, any
feature with a diameter larger than 5 μm was omitted.
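As a rough illustration, the filtering step can be expressed in a few lines of R; a minimal sketch, assuming a data frame "features" with columns Fe, Mn, SiCounts, and Area (all hypothetical names for the fields exported by the automated analysis), with compositions in percent:

```r
# Minimal filtering sketch in base R. "features" and its column names are
# hypothetical placeholders for the fields exported by the automated analysis;
# compositions are assumed to be expressed in percent.
max_area <- pi * (5 / 2)^2  # 19.6 um^2, an equivalent circle diameter of 5 um

inclusions <- subset(features,
                     (Fe + Mn) <= 85 &   # drop matrix-dominated readings
                     SiCounts <= 100 &   # drop likely surface contaminants
                     Area <= max_area)   # drop features larger than 5 um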
Clustering Algorithms
Initially, k-means, a popular algorithm for clustering, was applied to the data. The k-means algorithm is a centroid-based
clustering technique, where each cluster is assigned a mean vector (centroid) [6]. It is an iterative technique where the number



of clusters (k) is pre-determined, and the initial centroids are either computed by randomly assigning observations to each
cluster and calculating their mean, or by selecting observations from the given data set and using them as the initial
centroids. Each observation is then assigned to the nearest centroid by Euclidean distance, so as to minimize
the "within-cluster sum of squares" (WSS), and based on this assignment the new centroids are calculated. These two steps are
repeated until convergence, when the centroids and observation assignments no longer change.
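A minimal sketch of this procedure, using R's built-in kmeans() function on the five-variable composition matrix described later (the object names are illustrative):

```r
# k-means sketch using R's built-in kmeans(); "inclusions" is the filtered
# data set and the five columns are the mole fractions used as variables.
comp <- as.matrix(inclusions[, c("Mg", "Al", "Ca", "S", "Mn")])

km <- kmeans(comp, centers = 3, nstart = 25)  # k = 3, with 25 random restarts

km$centers       # cluster means (centroids), i.e. representative chemistries
km$size          # number of inclusions per cluster
km$tot.withinss  # total within-cluster sum of squares (WSS)
```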
One drawback of the k-means algorithm is that the number of clusters, k, needs to be pre-determined. To overcome this issue,
the "elbow" plot can be utilized to validate the ideal number of clusters [11]; it is a plot of k vs. the within-cluster sum of
squares (equation 1).

$\mathrm{WSS} = \sum_{k=1}^{K} \sum_{n=1}^{N_k} \lVert x_{n,k} - \mu_k \rVert^2$   (1)

$N_k$ = number of observations in the kth cluster
$K$ = number of clusters
$x_{n,k}$ = nth observation of the kth cluster
$\mu_k$ = mean of the kth cluster

As k increases, the sum of squares within each cluster decreases; generally, the optimal number of clusters is where there is
a sharp change in the slope, at the "elbow" of the k vs. WSS plot.
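A sketch of the elbow plot in R, assuming the composition matrix "comp" from the previous sketch:

```r
# Elbow plot sketch: total WSS for k = 1..6, plotted against k. The "elbow"
# is the point where the slope of the curve changes sharply.
wss <- sapply(1:6, function(k) kmeans(comp, centers = k, nstart = 25)$tot.withinss)

plot(1:6, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Within-cluster sum of squares (WSS)")
```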
Another disadvantage is the tendency of k-means to produce equally sized clusters, which is not always realistic, especially
when considering non-metallic inclusions in steel. As a result, the next approach was to apply the expectation-maximization
(EM) algorithm. The features that distinguish the EM algorithm from k-means are that it allows clusters of varying size and
shape, and that it is a soft clustering technique. In soft clustering, each observation is assigned a probabilistic mixing
proportion for belonging to each cluster, which is beneficial when observations lie midway between clusters and useful for
assessing the quality of the classification. This contrasts with k-means, a hard clustering technique in which each observation
is assigned to exactly one cluster.
When the EM algorithm is applied, the data set is assumed to follow a Gaussian Mixture Model (GMM), where each cluster has
its own normal distribution and the data set's mixture distribution is the weighted sum of the cluster distributions. This is
illustrated in Figure 2, where the blue curves are the individual clusters' Gaussian distributions and the red curve is the GMM
of the whole data set.

Figure 2 EM Algorithm Assumption


Similar to the k-means algorithm, EM is an iterative technique involving an Expectation step (E step) and a Maximization
step (M step) [12]. Initialization assigns starting parameters, the means and covariances of each cluster, either randomly or by
other methods, such as using the k-means algorithm to initialize the EM algorithm. The E step then calculates the probability
that each observation belongs to each cluster, given the current parameters. The M step re-estimates the parameters using the
probabilities computed in the E step. The outcome is evaluated and checked for convergence; if convergence is not reached,
the E and M steps are iterated again.
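In practice, the E and M iterations are handled internally by the fitting routine; a minimal sketch with the mclust package used in this study (object names illustrative):

```r
# EM clustering sketch with the mclust package; Mclust() iterates the E and M
# steps internally until convergence.
library(mclust)

fit <- Mclust(comp, G = 4)  # fit a 4-component Gaussian mixture

fit$parameters$mean  # cluster means, one column per cluster
head(fit$z)          # soft assignments: probability of each observation per cluster
fit$classification   # hard assignment to the most probable cluster
```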
A significant advantage of the EM algorithm over k-means is that it can be applied to various models, where clusters
can have equal or varying sizes, shapes, or orientations [12]. To identify the best-fit model and optimal number of clusters, the
Bayesian Information Criterion (BIC) (equation 2) is computed for each model over a range of numbers of clusters.



$\mathrm{BIC} = 2\ln L - \mathrm{df}\cdot\ln(n)$   (2)

$L$ = likelihood function
df = degrees of freedom in the model
n = number of observations

The BIC is the maximized log likelihood penalized for model complexity, considering the models, observations, parameters,
data dimensions, and number of clusters [13]. Generally, the optimal model and number of clusters are given by the largest BIC
value computed. However, in some cases singularities occur and the log likelihood diverges to infinity; as a result, the BIC
value cannot be computed. Such instances arise when a cluster's mean converges on a single data point, or when the data set
has very few, scattered observations [12, 14].
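For an mclust fit, equation (2) can be checked directly from the reported log likelihood and degrees of freedom; a small sketch, noting that mclust reports the BIC on this larger-is-better scale:

```r
# Manual check of equation (2) against the BIC value reported by mclust.
bic_manual <- 2 * fit$loglik - fit$df * log(fit$n)
bic_manual   # should agree with fit$bic
```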
Sample Information
The samples investigated in this study are from an aluminum-killed heat with no calcium treatment; the heat was processed
to be clean, with a relatively low inclusion density due to the slightly higher Al content, and the steel's final sulfur content
was 10 ppm. Table 1 below lists the sample identifications and acquisition points. The average inclusion chemistries are
given in Figure 3.

Table 1 Heat sampling


Sample ID Sample acquisition
A Ladle sample, after Al addition
B Ladle sample, after C addition
C Tundish sample

Figure 3 Sample Average Inclusion Chemistry (stacked bars of the average Mg, Al, S, Ca, and Mn contents, 0 - 100%, for Samples A, B, and C)


To demonstrate the capabilities of the statistical models used, the number of variables selected for cluster analysis was set to
5, corresponding to the Mg, Al, S, Ca, and Mn contents of the inclusions. As stated earlier, the raw data obtained from the
EDX analysis were converted to chemical compositions using the Merlet algorithm, and thereafter to mole fractions. These
mole fractions were used as the input data for clustering, in the form of an n × 5 matrix, where n is the number of inclusions
observed in a sample and the 5 columns correspond to the variables mentioned above.
All the cluster analyses performed in this study were carried out in RStudio, an interactive interface for R, a free and open-
source programming language for statistical computing and graphics [15]. For the k-means algorithm, R's built-in function was
used, and for the EM algorithm the package "MCLUST" was installed [16]. In addition, all the ternary diagrams shown in this
study were also generated in RStudio, using the "GGTERN" package [17].
To visualize the high-dimensional data, multi-ternary diagrams were used. The Mg-Al-Ca ternary diagram was generated for
all samples; in all the diagrams the inclusions are colored according to their clusters. No plots were made to represent the
sulfur or manganese compositions, due to their low contents in all the samples and in the inclusions.
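A minimal sketch of a "Number Fraction" style plot with the ggtern package, coloring each inclusion by its EM cluster (data frame and aesthetic names are illustrative):

```r
# Ternary plot sketch with ggtern: each inclusion is a point on Mg-Al-Ca
# axes, colored by its cluster assignment.
library(ggtern)

df <- data.frame(comp, Cluster = factor(fit$classification))

ggtern(df, aes(x = Mg, y = Al, z = Ca, color = Cluster)) +
  geom_point() +
  labs(title = "Number Fraction")
```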



RESULTS AND DISCUSSION
K-means Clustering
The k-means algorithm was performed on the first sample, A, for a range of 1 to 6 clusters. To evaluate the optimal number
of clusters for each sample, the within-cluster sum of squares (WSS) was computed for each number of clusters to
generate the elbow plot (k vs. WSS), Figure 4. The tables below indicate the cluster means (centroids), along with the cluster
density in terms of number of inclusions and area of inclusions.
It is worth mentioning that since k-means is initialized with random parameters, re-applying the algorithm to the same
data set may generate slightly different results. However, this variation occurred only for large k values; when the number of
clusters was selected based on the elbow plot, the algorithm reproduced exactly the same cluster means and cluster
densities.
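If exact repeatability is required even for large k, fixing R's random seed before each run makes the random initialization deterministic; for example:

```r
set.seed(42)  # any fixed seed makes the random initialization repeatable
km <- kmeans(comp, centers = 6, nstart = 25)
```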
Sample A

Figure 4 Sample-A: Elbow plot, to identify optimal number of clusters for k-means
The elbow plot generated for sample A (Figure 4) does not show a distinct slope change that clearly specifies the best-fit
number of clusters; rather, there is a compromise between 2 and 3 clusters. This demonstrates the ambiguity of the k-means
algorithm in identifying the optimal number of clusters. Therefore, results were generated for both 2 and 3 clusters for
comparison. A summary of the cluster analysis is provided in Table 2. The color of each cluster in the table corresponds to
the colors on the ternary plots, Figure 5 and Figure 6.

Table 2 Sample-A: k-means clustering summary


Sample ID: Sample A          Scan Area: 11.52 mm²
No. of inclusions: 254       Inc. Density: 22 /mm²

No. of Clusters: 2
Cluster   Mg    Al    S     Ca    Mn    Inc/Clus     Avg Inc Area (μm²)   Inc Area/Clus      ppm
red 2     8%    85%   1%    1%    4%    132 (52%)    1.96                 258.2 μm² (38%)    22
blk 1     25%   58%   8%    6%    3%    122 (48%)    3.44                 419.2 μm² (62%)    36

No. of Clusters: 3
Cluster   Mg    Al    S     Ca    Mn    Inc/Clus     Avg Inc Area (μm²)   Inc Area/Clus      ppm
red 2     8%    86%   1%    1%    4%    126 (50%)    1.93                 242.6 μm² (36%)    21
blk 1     26%   64%   4%    3%    2%    106 (42%)    3.71                 393.4 μm² (58%)    34
grn 3     20%   34%   22%   20%   5%    22 (9%)      1.89                 41.5 μm² (6%)      4

Three ternary plots were generated for each sample: a "Number Fraction", an "Area Fraction", and a "Cluster Means" plot.
The number fraction plot simply represents each inclusion as a point on the ternary scales. On the area fraction plot,
inclusions are represented as circles where the size of each circle is relative to the area of the inclusion in μm2. The cluster
means diagram plots the cluster centroids as large filled circles, where the size of each circle corresponds to the inclusion area



within the cluster, relative to the sample’s total inclusion area. The cluster means plot helps to visualize the overall weight of
each cluster, which is particularly useful when comparing dense and scattered clusters.

Figure 5 Sample-A: Ternary Plots for 2 Clusters. Number Fraction - inclusions chemistry. Area Fraction – inclusion
chemistry relative to size. Cluster Means – cluster centroids with respect to their inclusion area

Figure 6 Sample A: Ternary Plots for 3 Clusters. Number Fraction - inclusions chemistry. Area Fraction – inclusion
chemistry relative to size. Cluster Means – cluster centroids with respect to their inclusion area
Based on the ternary plots, the sample is comprised mainly of alumina and spinel inclusions, with slight calcium pickup in
some inclusions. When 3 clusters are selected rather than 2, the black cluster (cluster no. 1) is broken down into 2 clusters,
forming a more defined cluster in the spinel region and a small cluster with higher Ca contents, shown in green.
EM Algorithm Clustering
Clustering using the EM algorithm was carried out on all 3 samples. The optimal number of clusters and the best-fit
model were identified by the BIC criterion described earlier. The BIC was computed for 14 different models, over a range of 1
to 9 clusters for each model. The 14 models represent different covariance structures of the data set; they include different
variations in distribution, shape, volume, and orientation of the clusters [12]. The optimal model with the best-fit number of
clusters is the one that generates the highest BIC value. As shown in Figure 7, the optimal number of clusters is 4, 4, and 5 for
samples A, B, and C, respectively. For all the samples, the best-fit model identified by the BIC technique was the VEV
model, representing clusters with an ellipsoidal distribution and same shape but varying in volume and orientation.
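A sketch of this model search with mclust, which evaluates all 14 covariance models over the requested range of cluster counts and reports the best by BIC:

```r
# BIC model-selection sketch: all 14 mclust covariance models, 1 to 9 clusters.
bic <- mclustBIC(comp, G = 1:9)

summary(bic)  # top (model, G) combinations by BIC; "VEV" = ellipsoidal,
              # equal shape, variable volume and orientation
plot(bic)     # one BIC curve per model, as in Figure 7
```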



Figure 7 BIC plots for all samples, to identify optimal number of clusters for EM
The same trend can be seen in the BIC plots of all samples: the BIC curve tends to flatten out, or become negatively
sloped, after 3 clusters. In some cases, such as sample A, the top BIC values computed are close to one another;
therefore, applying initial constraints on the algorithm to better define the data set could yield more accurate BIC values, and
in turn a more precise number of clusters.
Sample A
The output of EM clustering applied to sample A is shown in Table 3. Figure 8 is a visual summary of Table 3:
the figure on the left is simply a plot of the mean chemistry of each cluster, while the figure on the right displays the
average inclusion area of each cluster as a bar chart, the percentage of inclusions per cluster (Inc/Clus) as a blue line, and
the inclusion area percentage per cluster (Inc Area/Clus) as an orange line. The Mg-Al-Ca ternary plots for this sample are
shown in Figure 9.
As shown, the EM algorithm identified 4 clusters for this sample, with 2 main clusters, the black and green clusters,
identified as alumina and spinel inclusions, respectively. These 2 clusters comprise the majority of the inclusions,
approximately 79% of the inclusion population, which is expected since this sample was acquired after the Al addition. The blue
cluster encompasses complex spinel inclusions with slight calcium pickup, while the red cluster includes the few outlying
inclusions.

Table 3 Sample-A: EM Clustering Summary


Sample ID: Sample A          Scan Area: 11.52 mm²
No. of inclusions: 254       Inc. Density: 22 /mm²
No. of Clusters: 4           Model: Ellipsoidal, equal shape

Cluster   Mg    Al    S     Ca    Mn    Inc/Clus     Avg Inc Area (μm²)   Inc Area/Clus      ppm
blk 1     9%    86%   1%    1%    4%    125 (49%)    2.01                 251.5 μm² (37%)    22
grn 3     28%   65%   3%    3%    1%    76 (30%)     4.15                 315.7 μm² (47%)    27
blu 4     19%   56%   12%   9%    5%    45 (18%)     2.22                 100.1 μm² (15%)    9
red 2     19%   22%   27%   25%   7%    8 (3%)       1.26                 10.1 μm² (2%)      1

Figure 8 Sample-A: EM Cluster Summary



Figure 9 Sample-A: EM Clustering Ternary Plot. Number Fraction - inclusions chemistry. Area Fraction – inclusion
chemistry relative to size. Cluster Means – cluster centroids with respect to their inclusion area
Sample B
Sample B has a fairly similar distribution to sample A, with a lower inclusion density. Again, the EM algorithm identified 4 clusters,
corresponding to the same types of inclusions found in sample A. However, the table and figures below indicate a decrease in
alumina, with more inclusions transitioning to spinel.

Table 4 Sample B: EM Clustering Summary


Sample ID: Sample B          Scan Area: 14.12 mm²
No. of inclusions: 204       Inc. Density: 14 /mm²
No. of Clusters: 4           Model: Ellipsoidal, equal shape

Cluster   Mg    Al    S     Ca    Mn    Inc/Clus     Avg Inc Area (μm²)   Inc Area/Clus      ppm
blk 1     25%   54%   10%   9%    2%    65 (32%)     3.40                 220.8 μm² (39%)    16
blu 4     15%   78%   1%    1%    5%    62 (30%)     1.83                 113.2 μm² (20%)    8
red 2     29%   66%   2%    2%    1%    46 (23%)     4.18                 192.2 μm² (34%)    14
grn 3     19%   34%   26%   13%   8%    31 (15%)     1.48                 46.0 μm² (8%)      3

Figure 10 Sample B: EM Cluster Summary



Figure 11 Sample B: EM Clustering Ternary Plot. Number Fraction - inclusions chemistry. Area Fraction – inclusion
chemistry relative to size. Cluster Means – cluster centroids with respect to their inclusion area
Sample C
In the tundish sample, an additional cluster was identified, the blue cluster, with a low inclusion density. Although this cluster
has a similar inclusion chemistry to the majority of the population, its average inclusion area is relatively small, as indicated
in Table 5 and Figure 12. Moreover, the transition of inclusions back to alumina, shown by the cyan cluster, suggests
re-oxidation of inclusions in the tundish.

Table 5 Sample C: EM Clustering Summary


Sample ID: Sample C          Scan Area: 14.12 mm²
No. of inclusions: 149       Inc. Density: 11 /mm²
No. of Clusters: 5           Model: Ellipsoidal, equal shape

Cluster   Mg    Al    S     Ca    Mn    Inc/Clus     Avg Inc Area (μm²)   Inc Area/Clus      ppm
red 2     28%   65%   3%    2%    2%    50 (34%)     3.28                 164.0 μm² (42%)    12
cyn 5     11%   80%   2%    1%    6%    40 (27%)     2.54                 101.7 μm² (26%)    7
grn 3     24%   49%   14%   11%   3%    33 (22%)     2.94                 97.1 μm² (25%)     7
blk 1     7%    12%   31%   44%   6%    13 (9%)      1.06                 13.8 μm² (4%)      1
blu 4     13%   58%   15%   4%    10%   13 (9%)      0.95                 12.4 μm² (3%)      1

Figure 12 Sample C: EM Cluster Summary



Figure 13 Sample C: EM Clustering Ternary Plot. Number Fraction - inclusions chemistry. Area Fraction – inclusion
chemistry relative to size. Cluster Means – cluster centroids with respect to their inclusion area
Comparison of EM and K-means Clustering
To demonstrate the contrast between the clustering methods, both algorithms were applied to the first sample. The most obvious
difference was the ambiguity of selecting an optimal number of clusters in the k-means algorithm, since there was no distinct
change in slope of the elbow plot (Figure 4). Furthermore, overlapping clusters cannot be identified by k-means, as
opposed to the EM algorithm. Other drawbacks of k-means clustering are that it assumes clusters of the same size and
density [14], and that it is a form of hard clustering, where each observation is assigned to a specific cluster. By contrast, the
EM algorithm is a form of soft clustering: each observation is assigned a mixing proportion for belonging to each cluster,
clusters can vary in size, shape, or orientation, and with the use of the BIC a decisive selection of the number of clusters can
be made. Even though results produced by the BIC can deviate, applying initial constraints on the algorithm to better
define the data set could produce more accurate results in selecting the number of clusters.
Considering the fundamentals of both clustering techniques, the EM algorithm is more suitable for inclusion cluster analysis.
The main drawback faced was the occasional overfitting of clusters; therefore, the next step is to further refine the EM
algorithm and the BIC to ensure compatibility with inclusion analysis. Moreover, additional samples from different heats
need to be examined for clustering. The analysis needs to be applied to varying samples with well-defined inclusion
chemistries to ensure reliability and accuracy. In addition, examining and relating inclusion size to clustering can prove to be
effective and useful in defining cluster parameters.
Some of the issues to consider in refining the algorithm include the possibility of noise or outliers in the data set and how to
account for them in the model. As can be seen from the results presented in this study, some clusters contained only outlying
inclusions; therefore, identifying the outlying inclusions prior to clustering could result in a more precise definition of the
clusters. Since the optimal number of clusters is chosen based on the BIC, investigating other methods of evaluating the BIC
can also be helpful, such as using the maximum a posteriori (MAP) estimate instead of the maximum likelihood estimate to
calculate the BIC [12], or adjusting the default parameters in the MCLUST function to better define the BIC values. Moreover,
by considering the observations' mixing proportions per cluster, the quality of the cluster classifications can be examined,
where a high probability of an observation belonging to a specific cluster suggests a better classification.
The samples investigated in this study were relatively clean, with low inclusion densities, and the formation of spinel and
alumina inclusions can be clearly identified without clustering; nonetheless, the analysis was applied to demonstrate the
capabilities of the clustering techniques. Whether 3, 4, or 5 clusters are identified, clustering proved to be an effective tool for
summarizing a large data set into a few points that are representative of a sample's inclusion chemistry. The practicality of this
type of analysis would be most evident when applied to complex data sets, with more observations and variables, where the
formation of specific types of inclusions is not as apparent. Furthermore, detailed investigation of clusters would be
appropriate for analyzing samples of the same type but from different heats, and to assess differences between heats, bulk
inclusion clustering would be useful.

CONCLUSION
Clustering enabled the reduction of broad inclusion data sets to a handful of observations that can adequately define a
sample's chemistry. The final aim of this study is to determine an ideal clustering technique that can accurately identify the
types of inclusions within a steel sample, thus providing an effective tool in steel processing: a quick method to evaluate the
quality of the heat being processed.



The k-means and EM algorithms for clustering were investigated, and both provided reasonable results, in that
clusters in the spinel and alumina regions were clearly identified, as would be expected for this specific heat. However, both
techniques have their own flaws. The k-means algorithm has some limitations: it assumes clusters are of the same size and
shape, and overlapping clusters cannot be identified. In addition, although the elbow plot is utilized to verify the optimal
number of clusters, a conclusive number cannot always be established. The main issue raised with regard to the EM algorithm
was the overfitting of clusters in some cases.
Nonetheless, the EM algorithm is more applicable to inclusion cluster analysis, due to the flexibility of applying several
cluster models, a well-defined criterion for selecting the best-fit number of clusters, and its ability to identify
overlapping clusters. Therefore, the next phase of this study is to explore the EM algorithm more thoroughly in order to
optimize the clustering technique required for inclusion analysis, in addition to investigating the relation between clusters
and inclusion size.

ACKNOWLEDGEMENTS
We would like to thank the industrial members of the Center for Iron and Steelmaking Research for their continuous support
in this study, as well as their provision of samples.

REFERENCES

[1] S. Yang, J. Li, L. Zhang, K. Peaslee and Z. Wang, "Evolution of MgO·Al2O3 Based Inclusions in Alloy Steel During
Refining Process," Metallurgical and Mining Industry, vol. 2, no. 2, pp. 87-92, 2010.

[2] B. G. Bartosiaki, J. A. M. Pereira, W. V. Bielefeldt and A. C. F. Vilela, "Assessment of inclusion analysis via manual
and automated SEM and total oxygen content of steel," Journal of Materials Research and Technology, vol. 4, no. 3,
pp. 235-240, 2015.

[3] S. R. Story, G. E. Goldsmith, R. J. Fruehan, G. S. Casuccio, M. S. Potter and D. M. Williams, "Study of Casting Issues
using Rapid Inclusion Identification and Analysis," in AISTech, Cleveland, OH, 2006.

[4] A. B. Schmiedt, U. Kamps and E. Cramer, "Statistical Modeling of Non-Metallic Inclusions in Steels and Extreme
Value Analysis," Hochschulbibliothek der Rheinisch-Westfälischen Technischen Hochschule Aachen, 2013.

[5] P. C. Pistorius, A. Patadia and J. Lee, "Correction of Matrix Effects on Microanalysis of Calcium Aluminate
Inclusions".

[6] D. Binu, "Cluster analysis using optimization algorithms with newly designed objective functions," Expert Systems with
Applications, vol. 42, pp. 5848-5859, 2015.

[7] D. Tang and P. C. Pistorius, "Optimizing Speed and Quality of Automated Inclusion Analysis," in Iron and Steel
Technology, Cleveland, OH, 2015.

[8] FEI, "FEI/Aspex Explorer SEM - Standard Operating Procedure," 2013.

[9] C. Merlet, "An Accurate Computer Correction Program for Quantitative Electron Probe Microanalysis," Mikrochimica
Acta, no. 114-115, pp. 363-376, 1994.

[10] P. C. Pistorius and N. Verma, "Matrix Effects in the Energy Dispersive X-Ray Analysis of CaO-Al2O3-MgO Inclusions
in Steel," Microscopy and Microanalysis, vol. 17, no. 6, pp. 963-971, 2011.

[11] M. A. Peeples, "R script for K-means Cluster Analysis," 2011. [Online]. Available:
http://www.mattpeeples.net/kmeans.html. [Accessed 25 5 2016].

[12] C. Fraley and A. Raftery, "Bayesian Regularization for Normal Mixture Estimation and Model-Based Clustering,"
Journal of Classification, no. 24, pp. 155-181, 2007.

[13] C. Fraley, A. E. Raftery, T. B. Murphy and L. Scrucca, "MCLUST Version 4 for R: Normal Mixture Modeling for
Model-Based Clustering, Classification, and Density Estimation," University of Washington, Seattle, June 2012.



[14] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

[15] "The R Project for Statistical Computing," The R Foundation, [Online]. Available: http://www.r-project.org.

[16] W.-C. Chen, R. Maitra and V. Melnykov, "EM Algorithm for Model-Based Clustering of Finite Mixture Gaussian
Distribution," CRAN, 2015.

[17] N. Hamilton, "An extension to 'ggplot2', for the Creation of Ternary Diagrams," CRAN, 2016.

