Data Science in Heavy Industry and The Internet of Things
Abstract
Increasingly cheap and available sensors enable new applications of data science for heavy industries.
From locomotives to wind turbines to solar farms, data scientists close to these industries will be the
first to attempt to turn the data collected on these machines into consistent sources of economic value.
This article discusses an approach to framing industrial analytics problems and goes into detail on one
problem in equipment reliability, predictive maintenance. We discuss a host of challenges associated
with building and implementing analytics using equipment data, and we give recommendations on
how to surmount these challenges through careful data analysis, data collection, and communication.
We also discuss training and getting started on industrial analytics problems.
1. Introduction
The stereotypical environment for a data scientist is decidedly not heavy industrial. Sleek workplaces
furnished with beanbags may seem like a far cry from a factory, a mine, or other environments relying
on heavy equipment to do work. However, the Internet of Things (IoT)—a term referring to the
estimated billions of devices that can collect data with sensors and transmit that data—will change
this image for some practitioners. Heavy equipment can contain hundreds or thousands of sensors,
and, with the rise of IoT, the data collected by these sensors can be accumulated and analyzed to create
economic value.
Increased connectivity of heavy equipment, and, more generally, connectivity of any device with a
sensor, is driven by a number of factors. Decreasing costs of bandwidth, accessibility of Wi-Fi and
cellular networks, and robust cloud infrastructures are making sensor data collection, transmission,
storage, and analysis easier; see the study by Goldman Sachs (2014). This study estimated there were
around two billion connected devices in 2000 and a projected 28 billion connected devices by 2020.
Consumer products such as exercise bracelets and smart thermostats may be the most visible
examples of this phenomenon; however, this same study estimated the opportunity for IoT in the
industrial space alone to be $2 trillion in 2020. These estimates, of course, are based on assumptions
and data collected in 2014; therefore, some caution is warranted when interpreting these numbers.
Harvard Data Science Review Data Science in Heavy Industry and the Internet of Things
However, more recent estimates from Gartner (2018), IoT Analytics (2018), and Ericsson (2016) further indicate that the market for IoT is large and growing.
Given these new opportunities, traditional industrial companies, tech companies, and a host of
startups are competing for space in the industrial IoT market. To do this, many are relying on data
scientists to analyze, visualize, and create predictions from these new data streams. Uptake, the
company we work for, is one startup that focuses on equipment reliability and productivity. This
article focuses on our experience building data science solutions for industrial IoT applications. We
first present our approach to framing problems in industrial IoT. Next, we discuss predictive
maintenance, a method of using IoT data to improve maintenance practices. In particular, we use
predictive maintenance to highlight the challenges present in working with sensor data and describe
our approaches to overcoming these challenges. Finally, we discuss training for aspiring industrial
data scientists.
2. Framing Problems in Industrial IoT
To begin solving problems in industrial IoT, we encourage data scientists to start from the basics of a
company’s business. Our view is that it is critical for data scientists to understand the details of how a
company creates value and, more generally, the key performance indicators (KPIs) that companies
often measure themselves on. Data scientists then measure their performance by showing
improvement on appropriate KPIs.
For example, in the rail industry, failures per locomotive year (FLY) is a core KPI that gets tracked.
Mechanical failures not only result in expensive repairs, but the associated unplanned downtime can
be even more costly. Revenue lost due to unplanned downtime has been estimated at $160,000 per
locomotive per year, and it has been estimated that Class 1 railroads (those generating a minimum of
around $400 million in revenue per year) can realize an annual savings of $80 million if only 10% of
unplanned maintenance is converted to planned maintenance (Predikto, 2017). Reduced FLY lowers
both maintenance and unplanned downtime costs by catching failures before they get serious and
before they affect the overall operation of a rail network. Data scientists in this area can then be
confident they are creating value by focusing on reducing failures.
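The arithmetic behind these figures is easy to sanity-check. The sketch below reproduces the $80 million savings estimate from the cited downtime cost; only the $160,000-per-locomotive figure and the 10% conversion rate come from the sources above, and the fleet size is a hypothetical round number chosen for illustration, not a figure from Predikto (2017).

```python
# Reproducing the savings arithmetic from the cited estimates.
# The fleet size is a hypothetical round number, not a cited figure.

downtime_cost_per_locomotive = 160_000  # lost revenue per locomotive per year
fleet_size = 5_000                      # hypothetical fleet size
conversion_rate = 0.10                  # unplanned maintenance converted to planned

total_downtime_cost = downtime_cost_per_locomotive * fleet_size
annual_savings = total_downtime_cost * conversion_rate
print(f"Fleet downtime cost: ${total_downtime_cost:,} per year")
print(f"Savings at 10% conversion: ${annual_savings:,.0f} per year")
```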
Data scientists won’t necessarily be asked to tie their work to specific KPIs—a data scientist working in
a purely consultative capacity may simply need to solve a set of problems already defined by a
stakeholder—however, we believe there are a number of advantages to proactively defining problems
and solutions in this way. First, a data scientist’s work is clearly tied to a company’s mission and bottom
line. Second, focusing on business drivers can provide self-evident success criteria for the project and
can improve communication across all stakeholders. And third, issues of scale and solvability tend to
be surfaced earlier, potentially saving data scientists and others time and effort.
An example in electrical power transmission illustrates the third point. System Average Interruption
Duration Index (SAIDI) is a measure of power outages and severity (Institute of Electrical and
Electronics Engineers Standards Association [IEEE SA], 2012). However, outages and equipment
failures in this industry are frequently caused by squirrels (American Public Power Association
[APPA], 2017). The American Public Power Association has even written a tongue-in-cheek “Open
Letter to Squirrels” as a tribute to their ubiquity (APPA, 2019). It may be possible for data scientists to
estimate spatiotemporal averages of ‘squirrel-risk’ as an attempt to protect against squirrel-related
outage events, but of course a data scientist cannot, on any given day, predict whether such an event
will happen.
With the hype around both data science and the Internet of Things, data scientists will be under extra
pressure to create compelling solutions. We believe a concerted focus on the mechanics of how solving
a data science problem leads to business value will help ensure that the problems attempted are
realistically solvable and valuable.
3. Predictive Maintenance
Machines and components inevitably wear, degrade, and break. Per the examples in Section 2, this
can be costly and negatively affect key productivity measures. Companies using heavy equipment
create reliability strategies to deal with wear-out and breakages while simultaneously maintaining
productivity. Periodic oil changes, for example, are part of a reliability strategy. A decision to run a
piece of equipment to failure is a valid reliability strategy as well. Given that reliability is important
for companies across many different industries, we provide a deeper dive on predictive maintenance,
one of the more technical ways data scientists can help improve equipment reliability.
3.1 Background
The math behind predictive maintenance is referred to in some literature as Prognostics (Lee et al.,
2014; Lei et al., 2018; Roemer et al., 2006; Sikorska et al., 2011; Wheeler et al., 2009). Leading up to a
machine failure, signatures of the impending failure—for example, an increasing temperature or a
dropping pressure—can sometimes be captured by sensor data. The prognostic/failure prediction
model focuses on detecting these signatures as soon as possible. If a problem is caught early enough,
repairs may be as minor as tightening a bolt. The longer a potential failure goes undetected, the more expensive it typically is to repair.
Figure 1 is adapted from Blann (2013) and shows this rough phenomenon: a failing component’s performance and condition degrade as it approaches a total failure point. Along the way, the curve passes a couple of important points for predictive maintenance. Point S is the start of the failure, point P is the point where the failure is first observed in the data, and point F indicates a completely failed component. Notably, P is different from S, indicating that the actual onset of a failure may occur significantly earlier than it is detectable in the available data. The period between when a problem is detected (P) and when a component completely fails (F) is labeled the P-F interval. It is in this period that corrective actions can be taken to reduce overall costs. Different components will have different curves and thus different P-F intervals. For example, a single failing bolt on some machines may not be detectable before failure; in this case, point P sits on top of point F. Predictive maintenance is concerned with moving point P as early in time as possible.
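To make the S/P/F geometry concrete, here is a small simulation of a degradation curve with a detection threshold (point P) and a failure threshold (point F). The exponential decay model, the onset time, and both thresholds are our own illustrative assumptions, not values taken from Blann (2013).

```python
import math

# Toy degradation curve: condition falls from 1.0 (healthy) toward 0.0 (failed).
# Point P = first time a detection threshold is crossed; point F = functional
# failure. The decay model and thresholds are illustrative assumptions.

def condition(t, onset=20.0, rate=0.04):
    """Condition is perfect until the failure starts (point S), then decays."""
    return 1.0 if t < onset else math.exp(-rate * (t - onset))

detect_threshold = 0.8   # where a model first flags the problem (point P)
failure_threshold = 0.1  # where the component has functionally failed (point F)

times = [t * 0.5 for t in range(400)]  # 0 to 200 hours in half-hour steps
p = next(t for t in times if condition(t) < detect_threshold)
f = next(t for t in times if condition(t) < failure_threshold)
print(f"P = {p} h, F = {f} h, P-F interval = {f - p} h")
```

In this framing, improving a prognostic model means catching the crossing of `detect_threshold` sooner, which moves P earlier and widens the window for inexpensive corrective action.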
A1 Lack of complete, high-quality failure information is perhaps the most difficult problem to solve. Unlike sensors, which collect data automatically, documentation of failures and their fixes usually depends on the mechanics in a shop. Unsurprisingly, data quality for data science is not a priority for most mechanics. Some shops also still work on paper records, which adds another layer of complexity to getting the right data into the hands of data scientists.
A2 Many mobile machines rely on cellular or satellite connections to transmit data. For older nonmobile machines, sending data to the cloud often means retrofitting older hardware. In both cases, drops in data can occur and connectivity can be expensive. The resulting data can be spotty, out of order, and full of duplicates, and tradeoffs on what data to collect may be forced before a data scientist has seen a single sample. In addition, critical or erratic machine operation can itself cause issues for sensors, creating data gaps precisely during the critical periods when data is needed most.
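A minimal cleanup pass for the kind of telemetry described in A2 might look like the following sketch. The timestamps, values, and one-minute expected reporting interval are all invented for illustration.

```python
from datetime import datetime, timedelta

# (timestamp, value) pairs as they might arrive over a flaky link: late,
# duplicated, and with a multi-minute gap. All values are invented.
readings = [
    (datetime(2020, 1, 1, 0, 2), 71.0),
    (datetime(2020, 1, 1, 0, 0), 70.0),  # arrived late, out of order
    (datetime(2020, 1, 1, 0, 2), 71.0),  # exact duplicate
    (datetime(2020, 1, 1, 0, 9), 74.5),  # long gap before this reading
]

# Drop exact duplicates, then sort by timestamp.
deduped = sorted(set(readings), key=lambda r: r[0])

# Flag gaps longer than the expected reporting interval so downstream models
# see explicit missingness instead of silently interpolated values.
expected = timedelta(minutes=1)
gaps = [(a[0], b[0]) for a, b in zip(deduped, deduped[1:]) if b[0] - a[0] > expected]
print(f"{len(deduped)} readings after cleanup, {len(gaps)} gap(s) flagged")
```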
A3 Beyond connectivity, sensor configurations also cause headaches because not all sensors are installed in precisely the same way, even on the same types of machines. This causes modeling issues because some form of central calibration may be needed before a model can be applied confidently at the desired scale. In addition, not all components have sensors that can be used for predictive maintenance: many parts have no sensors at all, and others have sensors that do not serve predictive maintenance purposes.
B1 Replacements are not equivalent to failures. Planned maintenance, such as changing oil every 3,000 miles, results in replacements without failure. Part failures can also cause otherwise working parts to fail: a flat tire could cause a collision, which in turn forces other part replacements. The consequence is that even a perfect record of part replacements may not provide a consistent target to train against when building a machine learning model.
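A sketch of the labeling problem in B1: filtering work orders down to independent, unplanned failures before they become training targets. The record fields and reason codes below are hypothetical; real work-order data is rarely this tidy.

```python
# Hypothetical work-order records. Only "failure" rows should become positive
# labels: scheduled swaps and collateral damage are replacements, not failures.
work_orders = [
    {"part": "oil_filter", "reason": "scheduled"},   # planned maintenance
    {"part": "tire", "reason": "failure"},           # a true failure
    {"part": "axle", "reason": "collateral"},        # damaged by the tire failure
    {"part": "water_pump", "reason": "failure"},
]

positives = [w for w in work_orders if w["reason"] == "failure"]
print([w["part"] for w in positives])
```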
B2 High-value failures are rare. Machine prognostics mirror medical prognostics and survival analysis, where events are rare or censored (Ambler et al., 2012; Buyske et al., 2000). While this is good news for companies operating these machines, data scientists may find that even years of gathered data yield only a handful of failure examples to work with. In addition, the most complicated machines exhibit a wide range of failure types. This can mean that the value of preventing a single failure type is negligible, but value grows significantly as more failure types are prevented.
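The scale of the imbalance in B2 is worth working out explicitly. With assumed failure rates and fleet sizes (all invented for illustration), even a sizable monitored fleet yields very few positive examples relative to the number of feature rows:

```python
# All rates and sizes here are invented to illustrate the order of magnitude.
fleet_size = 200                     # machines monitored
failures_per_machine_year = 0.05     # one failure of this type per 20 machine-years
years_of_data = 3
samples_per_machine_year = 365 * 24  # one feature row per machine-hour

expected_failures = fleet_size * failures_per_machine_year * years_of_data
total_rows = fleet_size * samples_per_machine_year * years_of_data
print(f"~{expected_failures:.0f} failures in {total_rows:,} rows "
      f"({expected_failures / total_rows:.1e} positive rate)")
```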
B3 In contrast to rare failure data, sensor signal data can be enormous. Vibration sensors collect data many thousands of times per second, and even nonvibration sensors typically collect data once per second or more frequently. This puts a strain on computation during exploratory analysis, and in some cases practitioners will need to work with summaries of the underlying data rather than the raw data itself.
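One common response to B3 is to reduce a high-rate signal to windowed summary statistics. The sketch below collapses a synthetic vibration-like signal into per-second RMS and peak values; the sample rate and signal are invented for the example.

```python
import math

# Synthetic "vibration" signal: a 50 Hz sine sampled at 10 kHz for 3 seconds.
sample_rate = 10_000
seconds = 3
raw = [math.sin(2 * math.pi * 50 * i / sample_rate)
       for i in range(sample_rate * seconds)]

# Collapse each one-second window to a few summary statistics.
summaries = []
for s in range(seconds):
    window = raw[s * sample_rate:(s + 1) * sample_rate]
    rms = math.sqrt(sum(x * x for x in window) / len(window))  # RMS energy
    summaries.append({"second": s, "rms": round(rms, 4), "peak": round(max(window), 4)})

print(f"{len(raw):,} raw samples reduced to {len(summaries)} summary rows")
```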
B4 Many sources of data are highly dependent. In a statistical sense, all data coming from a single
piece of equipment are dependent. All data coming from groups of equipment in the same
geographic area are dependent. Even data generated by different pieces of equipment but with
the same operator will be dependent. Data dependencies affect and may limit modeling and
validation approaches.
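A practical consequence of B4 is that validation splits should be made by machine (or site, or operator) rather than by row. The pure-Python sketch below keeps all rows from a held-out machine together; libraries such as scikit-learn's GroupKFold implement the same idea.

```python
# Toy data set: 5 hourly rows from each of 4 machines. Row contents are
# placeholders; only the grouping structure matters here.
rows = [{"machine": m, "hour": h} for m in ("m1", "m2", "m3", "m4") for h in range(5)]

test_machines = {"m4"}  # hold out an entire machine
train = [r for r in rows if r["machine"] not in test_machines]
test = [r for r in rows if r["machine"] in test_machines]

# No machine contributes rows to both sides of the split.
assert not ({r["machine"] for r in train} & {r["machine"] for r in test})
print(f"{len(train)} train rows, {len(test)} test rows")
```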
B5 Machine context matters. Machines age, operate in hot and cold climates, go into tunnels and through mud, and otherwise work in very extreme conditions. Each of these modes of operation can change the signatures of the data coming off a machine.
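One simple way to handle the context effects in B5 is to standardize a sensor reading within its operating mode, so that readings taken under different conditions become comparable. The modes and values below are invented for the sketch.

```python
from statistics import mean, stdev

# (operating mode, temperature) readings; modes and values are invented.
readings = [("idle", 40.0), ("idle", 42.0), ("idle", 41.0),
            ("haul", 90.0), ("haul", 95.0), ("haul", 92.0)]

# Compute per-mode statistics, then standardize each reading within its mode.
by_mode = {}
for mode, value in readings:
    by_mode.setdefault(mode, []).append(value)
stats = {m: (mean(v), stdev(v)) for m, v in by_mode.items()}

zscores = [(m, round((v - stats[m][0]) / stats[m][1], 2)) for m, v in readings]
print(zscores)  # 90.0 under "haul" is now comparable to 40.0 under "idle"
```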
C1 If failure signatures are detected by a model, acting on a model prediction requires manual
work and logistics. For example, to replace a failing part on a machine, the right new part must
be available at the right maintenance shop. Inventory management presents tremendous
challenges on its own; see, for example, Williams and Tokar (2008) for an overview. Creating the
right prediction and delivering it in such a way that enables the right follow-up workflow can be a
challenge.
C2 Predictive maintenance problems can be “high-stakes” problems (Rudin, 2019). High dollar
amounts—and in some cases, human safety—are connected to actions both taken and not taken.
Consumers of predictions must be able to trust a prediction in order to confidently take the right
actions.
Successful approaches to predictive maintenance and prognostics will confront some of these issues
head-on and side-step others.
to enable model building and handling that data carefully so that the right conclusions from the data
can be drawn.
The final points associated with turning model predictions into real-world value (C1, C2) may be best
addressed through clear model interpretations and communication.
The value of clear, interpretable predictive maintenance predictions will be self-evident in many
contexts. When it is not, an ideal way to quantify value is to conduct an experiment (Taddy, 2019).
Understandably, industry stakeholders do not jump at the chance to have data scientists run
experiments involving their multimillion-dollar assets. Likewise, data scientists should not feel
completely free to conduct any experiment they like, since they will not bear the full cost of running
those experiments. We find that when experimentation is possible, impactful experiments depend on
trust built with stakeholders.
As one example, our team was able to run an A/B test to identify a large set of miscalibrated machines
for one customer. To start this experiment, we worked with the manager of these machines to identify
a subset to give a special calibration as a treatment. This subset was chosen to maximize
measurement capability and minimize potential impact on operations. The machines outside this
subset were left untreated. By tracking productivity of these machines posttreatment, we proved that
the special calibration improved output. Consequently, all machines were given the calibration,
creating a measurable bump in output for that population of machines. This was a great outcome for
both parties. We achieved this by building trust through quality communication and finding an
acceptably small but measurable way to get to our goal.
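A treated-versus-untreated comparison like the one in this example can be sketched with a permutation test on post-treatment productivity. The numbers below are invented for illustration and are not the customer data from our experiment.

```python
import random

# Post-treatment productivity (arbitrary units); all numbers are invented.
treated = [105.2, 108.1, 106.7, 109.3, 107.0]    # machines given the calibration
control = [101.4, 100.9, 102.2, 101.8, 100.5]    # machines left untreated

observed = sum(treated) / len(treated) - sum(control) / len(control)

# Permutation test: how often does a random relabeling of machines produce a
# mean difference at least as large as the one we observed?
random.seed(0)
pooled = treated + control
n_perm = 10_000
count = 0
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = sum(pooled[:5]) / 5 - sum(pooled[5:]) / 5
    if diff >= observed:
        count += 1
p_value = count / n_perm
print(f"mean lift = {observed:.2f}, one-sided p = {p_value:.4f}")
```

A permutation test is attractive in this setting because it makes no distributional assumptions, which matters when the treated subset is small and chosen for measurability rather than at random.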
Following our recommendations in Section 3, prospective data scientists will need an additional focus on interacting with experts. Industrial analytics isn’t the only field where interfacing with experts is
important. For example, data scientists working on medical applications may communicate directly
with medical doctors. However, given the data challenges described in Section 3, and especially for
predictive maintenance problems, it can be absolutely critical for data scientists to interface with an
industry expert. Importantly, they must be able to do this while maintaining overall control of how a
problem is being solved. To borrow a phrase from Meng (2018), data scientists should strive to be
“Proactive co-investigators/partners, not passive consultants” (p. 51). For the practicing data scientist,
this means bringing data and plots to conversations with specific research questions in mind.
Conversely, data scientists should avoid statements like: ‘the expert said X, so I did X’; or questions
like: ‘does the expert want Y in the model?’ Follow-up questions like ‘was X justified by the data and
our understanding of the problem?’ or, ‘does Y lead to any substantive improvements in the model?’
will help data scientists create better solutions.
Given the importance of interaction with subject matter experts, data scientists with additional
experience with machines and mechanics can significantly speed up model building as well. We have
seen many cases where just knowing relative locations of components on a machine has potentially
saved weeks of model-building time. To jumpstart this process, we have sent data scientists to formal
training events intended for mechanics and other heavy equipment analysts.
For those looking to get their hands on sample data, NASA (2014) hosts a number of data sets that track devices as they fail in either simulations or lab experiments. Turbofans (Saxena & Goebel, 2008), bearings (Lee et al., 2007), and batteries (Saha & Goebel, 2007) are some examples of data sets that are open to the public. These data sets are great for practicing cross-validation and experimenting with methods to find early patterns of failure in these devices. However, many of the data issues we mentioned in Section 3 may not be present in lab experiments. Practitioners getting started with these data sets should keep this in mind so that their methods do not become ineffective in real data scenarios.
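For run-to-failure data sets like the NASA turbofan set, a common first step is deriving remaining-useful-life (RUL) labels by counting backward from each unit's failure. The sketch below simulates two tiny units in-line rather than parsing the actual data files.

```python
# unit id -> number of cycles observed before failure (tiny invented units; the
# real turbofan files contain hundreds of cycles of multivariate sensor data).
units = {"unit_1": 5, "unit_2": 3}

labeled = []
for unit, n_cycles in units.items():
    for cycle in range(1, n_cycles + 1):
        # RUL at each cycle = cycles remaining until the final (failure) cycle.
        labeled.append({"unit": unit, "cycle": cycle, "rul": n_cycles - cycle})

print(labeled[0], "...", labeled[-1])
```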
Industry analysts have high hopes that IoT will bring transformations to many traditional industries.
Using IoT data to change how heavy equipment is operated and maintained is a part of this expected
transformation. While heavy equipment may not be many data scientists’ traditional area of
application, a passive approach to solving problems in this area may ultimately fall short of creating
transformation. Data scientists will be successful in helping realize this future if they play proactive
roles in defining the right problems, gathering the right data, and taking the lead in communication.
Disclosure
The authors are currently or were recently employed by Uptake, a company specializing in analytics for heavy industry.
References
Ambler, G., Seaman, S., & Omar, R. Z. (2012). An evaluation of penalised survival methods for
developing prognostic models with rare events. Statistics in Medicine, 31(11–12), 1150–1161.
https://doi.org/10.1002/sim.4371
American Public Power Association. (2017). Defending against outages: Squirrel tracker.
https://www.publicpower.org/blog/defending-against-outages-squirrel-tracker
Berinato, S. (2019, January–February). Data science and the art of persuasion. Harvard Business Review.
Berthold, M. R. (2019). What does it take to be a successful data scientist? Harvard Data Science Review,
1(2). https://doi.org/10.1162/99608f92.e0eaabfc
Blann, D. (2013). Maximizing the P-F interval through condition-based maintenance. Maintworld.
https://www.maintworld.com/Applications/Maximizing-the-P-F-Interval-Through-Condition-Based-
Maintenance
Breiman, L., & Cutler, A. (2005). Random forests–Classification description. Random Forests.
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Burman, P., Chow, E., & Nolan, D. (1994). A cross-validatory method for dependent
data. Biometrika, 81(2), 351–358. https://doi.org/10.2307/2336965
Buyske, S., Fagerstrom, R., & Ying, Z. (2000). A class of weighted log-rank tests for survival data when
the event is rare. Journal of the American Statistical Association, 95(449), 249–258.
https://doi.org/10.1080/01621459.2000.10473918
Draper, D. (2013). Bayesian model specification: Heuristics and examples. In P. Damien, P. Dellaportas,
N. G. Polson, & D. A. Stephens (Eds.), Bayesian theory and applications (pp. 409–431). Oxford University
Press.
Gartner. (2018). Gartner identifies top 10 strategic IoT technologies and trends [Press release].
https://www.gartner.com/en/newsroom/press-releases/2018-11-07-gartner-identifies-top-10-
strategic-iot-technologies-and-trends
Glickman, M., Brown, J., & Song, R. (2019). (A) Data in the life: Authorship attribution in Lennon-
McCartney songs. Harvard Data Science Review, 1(1). https://doi.org/10.1162/99608f92.130f856e
Goldman Sachs. (2014). The Internet of Things: Making sense of the next mega-trend.
https://www.goldmansachs.com/insights/pages/internet-of-things/iot-report.pdf
Hanley, T., Daecher, A., Cotteleer, M., & Sniderman, B. (2018). The Industry 4.0 paradox. Deloitte.
https://www2.deloitte.com/us/en/insights/focus/industry-4-0/challenges-on-path-to-digital-
transformation/summary.html
Institute of Electrical and Electronics Engineers Standards Association. (2012). 1366-2012-IEEE Guide for
Electric Power Distribution Reliability Indices. https://standards.ieee.org/standard/1366-2012.html
IoT Analytics. (2018). State of the IoT 2018: Number of IoT devices now at 7B—Market accelerating.
https://iot-analytics.com/state-of-the-iot-update-q1-q2-2018-number-of-iot-devices-now-7b/
Kozyrkov, C. (2019). What great data analysts do—and why every organization needs them. In Strategic
analytics: The insights you need from Harvard Business Review (Advance Edition). Harvard Business
Review Press.
Lee, J., Wu, F., Zhao, W., Ghaffari, M., Liao, L., & Siegel, D. (2014). Prognostics and health management
design for rotary machinery systems—Reviews, methodology and applications. Mechanical Systems and
Signal Processing, 42(1–2), 314–334. https://doi.org/10.1016/j.ymssp.2013.06.004
Lee, J., Qiu, H., Yu, G., Lin, J., & Rexnord Technical Services. (2007). Bearing data set. NASA Ames
Prognostics Data Repository, NASA Ames Research. http://ti.arc.nasa.gov/project/prognostic-data-
repository
Lei, Y., Li, N., Guo, L., Li, N., Yan, T., & Lin, J. (2018). Machinery health prognostics: A systematic
review from data acquisition to RUL prediction. Mechanical Systems and Signal Processing, 104, 799–834.
https://doi.org/10.1016/j.ymssp.2017.11.016
Meng, X.-L. (2018). Conducting highly principled data science: A statistician’s job and joy. Statistics &
Probability Letters, 136, 51–57. https://doi.org/10.1016/j.spl.2018.02.053
Roemer, M., Byington, C., & Kacprzynski, G. (2006). An overview of selected prognostic technologies
with application to engine health management. Proceedings of GT2006, ASME Turbo Expo 2006: Power
for Land, Sea, and Air. https://doi.org/10.1115/GT2006-90677
Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use
interpretable models instead. Nature Machine Intelligence, 1(5), 206–215.
https://doi.org/10.1038/s42256-019-0048-x
Saha, B., & Goebel, K. (2007). Battery data set. NASA Ames Prognostics Data Repository, NASA Ames
Research Center. http://ti.arc.nasa.gov/project/prognostic-data-repository
Saxena, A., & Goebel, K. (2008). Turbofan engine degradation simulation data set. NASA Ames
Prognostics Data Repository, NASA Ames Research Center. http://ti.arc.nasa.gov/project/prognostic-
data-repository
Sikorska, J., Hodkiewicz, M., & Ma, L. (2011). Prognostic modelling options for remaining useful life
estimation by industry. Mechanical Systems and Signal Processing, 25(5), 1803–1836.
https://doi.org/10.1016/j.ymssp.2010.11.018
Taddy, M. (2019). Business data science: Combining machine learning and economics to optimize, automate,
and accelerate business decisions. McGraw-Hill Education.
Wheeler, K. R., Kurtoglu, T., & Poll, S. D. (2009, August 30–September 2). A survey of health
management user objectives related to diagnostic and prognostic metrics. In ASME 2009 International
Design Engineering Technical Conferences and Computers and Information in Engineering Conference. San
Diego. (pp. 1287–1298). American Society of Mechanical Engineers Digital Collection.
https://doi.org/10.1115/DETC2009-87073
Williams, B., & Tokar, T. (2008). A review of inventory management research in major logistics
journals: Themes and future directions. The International Journal of Logistics Management, 19(2), 212–
232. https://doi.org/10.1108/09574090810895960
This article is © 2020 by Michael Horrell, Larry Reynolds, and Adam McElhinney. The article is licensed
under a Creative Commons Attribution (CC BY 4.0) International license
(https://creativecommons.org/licenses/by/4.0/legalcode), except where otherwise indicated with respect to
particular material included in the article. The article should be attributed to the authors identified above.