
Big Data Analytics

In this special section:
Meet the Authors
The Four Vs
What Is It?
Success Stories
Getting Started
Challenges and Future Research
Meet the Authors

RICHARD D. BRAATZ, PhD, is the Edwin R. Gilliland Professor of Chemical Engineering at the Massachusetts Institute of Technology (MIT), where he does research in systems and control theory and its application to chemical and biological systems (Email: braatz@mit.edu). He was the Millennium Chair and Professor at the Univ. of Illinois at Urbana-Champaign and a visiting scholar at Harvard Univ. before moving to MIT. He has consulted or collaborated with more than 20 companies including IBM, DuPont, Novartis, and United Technologies Corp. Honors include the Donald P. Eckman Award from the American Automatic Control Council, the Curtis W. McGraw Research Award from the Engineering Research Council, the IEEE Control Systems Society Transition to Practice Award, and the CAST AIChE Computing in Chemical Engineering Award. His 400+ publications include the textbook, Fault Detection and Diagnosis in Industrial Systems. He received a BS in chemical engineering from Oregon State Univ. and an MS and PhD in chemical engineering from the California Institute of Technology. He is a Fellow of the International Federation of Automatic Control, IEEE, and the American Association for the Advancement of Science.

LEO H. CHIANG, PhD, is Senior Technical Manager at the Dow Chemical Co. in Freeport, TX (Email: hchiang@dow.com). He is the leader of Dow's Chemometrics and Manufacturing Analytics departments and is responsible for partnering with academia to develop and transfer emerging data analytics technologies for Dow. He has developed and implemented several systems techniques to solve complex manufacturing problems, resulting in 11 Manufacturing Technology Center Awards. In 2010, he received the Vernon A. Stenger Award, which is the highest individual honor for analytical sciences research and development at Dow. He has authored 25 peer-reviewed papers, 33 conference presentations, and 2 books published by Springer Verlag. His textbook, Fault Detection and Diagnosis in Industrial Systems, is available in English and Chinese and has received over 1,300 citations according to Google Scholar. He received a BS from the Univ. of Wisconsin–Madison, and an MS and PhD from the Univ. of Illinois at Urbana-Champaign, all in chemical engineering. He is an active member of AIChE.

LLOYD F. COLEGROVE, PhD, is the Director of Data Services and the Director of Fundamental Problem Solving in the Analytical Technology Center at the Dow Chemical Co. (Email: lfcolegrove@dow.com). He began his Dow career in R&D in polymer science and quickly moved to improvement of analytical methodology in plant labs. While in a role as business quality leader for four business units, he embarked on his big data journey in manufacturing before the term big data came into use. He established the first applied statistics group in Dow manufacturing and developed the vision that is taking the company from merely collecting data to actively using data. Colegrove has more than 29 years of experience in chemical research and manufacturing. He holds a BS in chemistry and a PhD in chemical physics from Texas A&M Univ.

SALVADOR GARCÍA MUÑOZ, PhD, is the team leader of the modeling and simulation department at Eli Lilly and Co. (Email: sal.garcia@lilly.com). His current responsibilities span the drug substance and drug product areas, with particular focus on the use of computer-aided engineering in regulatory documents and the transfer of modeling technology to manufacturing. He started his career at Aspen Technology, where he spent four years working as a consultant for the polymer, petrochemical, and fine chemical manufacturing industries. After receiving his PhD, he joined Pfizer, where he spent nine years as a modeling and simulation scientist actively participating in the development of new medicines and the improvement of commercial manufacturing operations using model-based control tools. His research interests include the theory and application of multivariate statistical methods, optimization, advanced thermodynamics, and advanced control. He has received multiple awards, including the Pfizer Achievement Award (2009), the AIChE Food, Pharmaceutical, and Bioengineering Div. Award (2010), and the Pfizer Manufacturing Mission Award (2011). He received a BSc in chemical engineering and an MSc in chemical and computer systems engineering, both from the Instituto Tecnológico y de Estudios Superiores de Monterrey (ITESM), in Mexico, and a PhD in chemical engineering from McMaster Univ., in Canada. He is an active member of AIChE and collaborates with the American Association of Pharmaceutical Scientists.

CHAITANYA KHARE is an EMI Development Engineer and has been leading the design and implementation of an EMI effort, redefining how Dow aggregates, visualizes, interprets, and utilizes its plant data in real time. His work has brought about an overall diagnostic dashboard tool that is currently being shared by the technology center and the operations and R&D groups across the enterprise to monitor plant health in real time. Khare's work was recognized by two prestigious awards — the 2015 Golden Mousetrap Award, and the 2015 Manufacturing Leadership Award (Frost and Sullivan). Khare obtained a diploma in chemical engineering from Mumbai Univ., India, a BE in petrochemical engineering from Pune Univ., India, and an MS in chemical engineering from Twente Univ., The Netherlands. He spent more than seven years in hydrocarbons research at Dow before moving into the manufacturing analytics group in the Analytical Technology Center in 2014. Khare is certified as a Black Belt in Six Sigma.

JOHN F. MACGREGOR, PhD, is President and CEO of ProSensus Inc. (Email: john.macgregor@prosensus.ca), a company that provides specialized engineering consulting and state-of-the-art software for the analysis and interpretation of big data from the process industries, and for the online monitoring, control, and optimization of processes based on developed models. He is a Distinguished University Professor Emeritus at McMaster Univ., Canada, where he spent 36 years in the Chemical Engineering Dept. and in the McMaster Advanced Control Consortium (MACC), and where his research group developed many advanced big data methods in collaboration with the large international sponsor companies of MACC. He has received many awards, including the Shewhart Medal from the American Society for Quality; the Herman Wold Medal from the Swedish Chemical Society; the Century of Achievement Award and the R. S. Jane Award from the Canadian Society for Chemical Engineering; the Computing and Systems Technology Award from AIChE; the Nordic Process Control Award from the Scandinavian Control Society; and the Guido Stella Award from the World Batch Forum. He received a BEng from McMaster Univ. in Canada, and an MS in statistics and chemical engineering and a PhD in statistics, both from the Univ. of Wisconsin–Madison. He is a Fellow of the Royal Society of Canada, the Canadian Academy of Engineering, and the American Statistical Association.

MARCO S. REIS, PhD, is a professor in the department of chemical engineering at the Univ. of Coimbra, Portugal (Email: marco@eq.uc.pt), where he is responsible for the Process Systems Engineering (PSE) research group. He currently serves as president of the European Network for Business and Industrial Statistics (ENBIS) and of Associação Para o Desenvolvimento da Engenharia Química (PRODEQ). He lectures on process systems engineering, quality technology and management, management and entrepreneurship, and process improvement. His research interests are centered on the field of process systems engineering (system identification, fault detection and diagnosis, control, and optimization), statistical process control of complex large-scale processes, data-driven multiscale modeling, chemometrics, design of experiments, and industrial statistics. Other areas of interest include multivariate image analysis, systems biology, and process improvement through initiatives such as Six Sigma and lean manufacturing. He has published about 60 articles in international journals or book series, four book chapters, and two books, and authored or coauthored 100+ presentations in international congresses. He received a Licentiate degree in chemical engineering and a PhD in chemical engineering, both from the Univ. of Coimbra, Portugal.

MARY BETH SEASHOLTZ, PhD, is the Technology Leader for the Data Services Capability in the Analytical Technology Center at the Dow Chemical Co. (Email: mseasholtz@dow.com). She began her career applying chemometrics to process analyzers. Now, her primary responsibility is to drive the technology needed to use data to make money. For this, she focuses on several areas, including statistics and chemometrics, as well as software platforms. In 2015, she and her team were awarded the 2015 Golden Mousetrap Award in Design Tools Hardware & Software: Analysis & Calculation Software from Design News. They also were awarded the 2015 Manufacturing Leadership Award: Big Data and Advanced Analytics Leadership from Manufacturing Leadership Community (Frost and Sullivan). She has more than 25 years of experience in the field of chemometrics. She received a BS in chemistry and mathematics from Lebanon Valley College, and an MS in applied mathematics and a PhD in analytical chemistry, both from the Univ. of Washington.

DAVID WHITE is a senior analyst at the ARC Advisory Group (Email: DWhite@ARCweb.com), where he is responsible for research into analytics and big data. He uses his 20 years of experience from many industries to research the art and science of getting the right information, to the right people, at the right time. Choosing an appropriate analytics solution is vital. But, many other factors are also crucial for an analytics project to create lasting business value. With this in mind, his research has two goals: to help technology buyers get the most value from their investments in analytics; and to help suppliers shape their analytics product and marketing strategies. Immediately before joining ARC, he researched business analytics for the Aberdeen Group, serving clients such as SAP, IBM, Qliktech, and Tableau. Before that, he worked in marketing roles for companies such as Oracle, Cognos, Dimensional Insight, and Progress Software. He received a BS in computer science from the Univ. of Hertfordshire and an MBA from Cranfield Univ., both in the UK.



BIG DATA
The Four Vs
Big data is a big topic with a lot of potential. Before realizing this potential, however, we need to get on the same page about what big data is, how it can be analyzed, and what we can do with it.

The term big data is somewhat misleading, as it is not only the size (volume) of the data set that makes it big data. Size is just one aspect, and it describes the sheer amount of data available. A study conducted by Peter Lyman and Hal R. Varian of the Univ. of California, Berkeley, estimated that the amount of new data stored each year grew by about 30%/yr between 1999 and 2002, reaching 5 trillion gigabytes. Ninety-two percent of the new data was stored on magnetic media, mostly on hard disks. For reference, 5 trillion gigabytes is equivalent to the data stored in 37,000 libraries the size of the Library of Congress, which houses 17 million books. And, according to IBM, the amount of data created each day is expected to grow to 43 trillion gigabytes by 2020, from about 2.3 trillion gigabytes of data per day in 2005. In the chemical process industries (CPI), data are coming from many sources, including employees, customers, vendors, manufacturing plants, and laboratories.

In addition to volume, big data is characterized by three other Vs — velocity, variety, and veracity. Velocity refers to the rate at which data are coming into your organization. Data are now streaming continuously into servers in real time. IBM puts this in context — the New York Stock Exchange captures 1,000 gigabytes of trade information during each trading session. Furthermore, according to Intel, every minute, 100,000 tweets, 700,000 Facebook posts, and 100 million emails are sent.

All data are not equal. Variety describes the heterogeneity of data being generated. One distinction is whether the data are structured or unstructured. Structured data include digital data from online sensors and monitoring devices, while unstructured data are not as neat, such as customer feedback in the form of paragraphs of text in an email. Realizing the benefits of big data will require the simultaneous analysis and processing of many different forms of data, from market research information, to online sensor measurements, to images and spectrographs.

The fourth V, veracity, refers to the quality of data and uncertainty in the data. Not all data are meaningful. Data quality depends on the way the data are collected (bias issues may emerge that are very difficult to detect), on whether the data are updated or no longer make sense (due to time-varying changes in the system), and on the signal-to-noise ratio (measurement uncertainty), among other factors.

But the potential of big data is not merely the collection of data. It's the thoughtful collection and analysis of the data combined with domain knowledge to answer complex questions. By acting on the answers to these questions, CPI companies will be able to improve operations and increase profits. Big data analytics is a more appropriate term to emphasize the potential of big data.

AIChE recognizes the importance of big data and has organized topical conferences on big data analytics at its meetings, including at the upcoming Spring Meeting being held in Houston, TX, April 10–14.

The articles in this special section explore the topic of big data analytics and its potential for the CPI.

In the first article, David White introduces big data. "A common misconception is that big data is a thing," White writes. "A more accurate metaphor is that big data enables a journey toward more-informed business and operational decisions." White discusses this journey, emphasizing the need for a new approach to analytics that eliminates delays and latency. He concludes with recommendations to help you as you embark on the big data journey.

Salvador García Muñoz and John MacGregor provide several examples of big data success stories in the second article. The examples include the analysis and interpretation of historical data and troubleshooting process problems; optimizing processes and product performance; monitoring and controlling processes; and integrating data from multivariate online analyzers and imaging sensors. Because these examples involve the use of latent variable methods, the authors briefly discuss such analytics and why they are suitable in the era of big data.

Once you see big data's potential, how do you get started? In the third article, Lloyd Colegrove, Mary Beth Seasholtz, and Chaitanya Khare answer this question. The first steps involve identifying a place to get started, a project where big data analytics will pay off, and then selecting a software package appropriate for that project. "Once you have found an analytics opportunity and decided on a data analytics software package, the truly hard work starts — convincing your organization to move forward and then taking those first steps," the authors write. Drawing on their experience at Dow Chemical, they describe a strategy that has worked for them. They then talk about moving beyond the initial success and using big data on more than just a few small projects.

Looking to the future, Marco Reis, Richard Braatz, and Leo Chiang identify challenges and research directions aimed at realizing the potential of big data. The fourth article explores some of the challenges related to the four Vs and some potential areas of research that could address these challenges. "Big data creates new possibilities to drive operational and business performance to higher levels," they write. "However, gaining access to such potential is far from trivial. New strategies, processes, mindsets, and skills that are not yet in place are necessary." Pointing back to the previous articles, the authors end on a high note: "Big data offers new opportunities for managing our operations, improving processes at all levels, and even adapting the companies' business models. So the important question is: Can we afford not to enter the big data era?"

CEP extends a special thanks to Leo H. Chiang for serving as guest editor of this special section.


BIG DATA
What Is It?
David White
ARC Advisory Group

Big data can pave the way to greater business and operational insight. New approaches and technologies are poised to help you navigate your journey.

Much mystery surrounds big data. A 2014 survey by the ARC Advisory Group found that 38% of respondents did not understand what big data is or why they should care about it (1).

A common misconception is that big data is a thing. A more accurate metaphor is that big data enables a journey toward more-informed business and operational decisions. Most companies have already embarked upon a big-data journey — whether they realize it or not. For many industrial companies, big data manifests itself as data from the Industrial Internet of Things (IIoT). The IIoT connects intelligent physical entities, such as sensors, devices, machines, assets, and products, to each other, Internet services, and applications. The IIoT is built upon current and emerging technologies, such as mobile and intelligent devices, wired and wireless networks, cloud computing, analytics, and visualization tools.

IIoT data have surpassed what most industry observers had anticipated, enabling data to arrive faster, from more data sources, and in greater volume. This has profoundly impacted analytics. The classic architecture used for business intelligence, operational intelligence, and analytics is no longer adequate. New analytics technologies are necessary, and they will need to be placed closer to the data source to be effective. This article introduces new analytics technologies and related supporting technologies, and presents some early big-data success stories.

What is big data?
ARC's survey results are, in many ways, not surprising. Many industrial organizations do not understand big data. A definition of big data established 15 years ago (2) stresses three Vs: volume (the amount of data managed), velocity (the rate of incoming data), and variety (the type of data). Sometimes veracity (the quality and accuracy of the incoming data) is also included in the definition.

Think about your current experience in terms of the original three Vs (volume, velocity, and variety) to determine whether your company is already on the big data journey.
Volume: Is the amount of data you manage growing at an accelerating rate?
Velocity: Is the rate at which data are generated accelerating? Are the people consuming the information gleaned from the data demanding more-timely insight?
Variety: Are the number and types of data sources you use for analytics growing rapidly?

If you answered yes to any of these questions, you are already dealing with big data; if you had two or more yes answers, you are further along the journey. A common thread through each question is growth (demand). It is not necessarily difficult to manage a large volume of data, but it is stressful to manage a body of data that is growing rapidly year after year.

A big-data journey must strike a balance between data supply (data management) and information demand (managers asking for different information or more-timely updates). The aim of any big data project must be to add business value — by enabling cost reductions, productivity gains, or revenue increases. Many older big data projects never reached the point where they were adding value. For example, it is not unusual to find projects based on plant historians that accumulate data for years without business or operational managers taking full advantage of the data through analytics.

The big data journey
Figure 1 presents a traditional business intelligence (BI)/analytics infrastructure of operational systems and a data warehouse. The operational systems may include enterprise resource planning (ERP) systems, which help manage financials, supply chain, manufacturing, operations, reporting, and human resources; manufacturing execution systems (MES), which track and document the transformation of raw materials to finished goods; supply chain management software; and financial and accounting software.

Figure 1. In a traditional business intelligence (BI) architecture, transactions and operational data are fed to operational systems — ERP, MES, supply chain management (SCM), and financials — that organize and manage the data, which are then stored in a data warehouse. The green arrows represent points where latency is introduced into the system.

Delays and latency (green arrows) are built into this traditional system. It often takes time for all transactions to be entered into the operational systems. For example, handwritten maintenance records need to be transcribed into digital records so they can be analyzed and integrated with other data sources. There is usually a delay between capturing data in operational systems and copying the data to the data warehouse that supports reporting, dashboards, and analytics. As a consequence, managers do not see the current state of operations or business through BI/analytics. Instead, they see the state of the business the last time the data warehouse was refreshed, rendering the information a day, week, or even month out of date.

Delays in business insight — operational equipment effectiveness, ontime shipments, overtime expenses, etc. — can increase costs or cause opportunities to be missed. At a minimum, this is an inconvenience, but in the future such delays could be catastrophic. Some data generated by the IIoT are time-critical, and demand immediate action to maintain product quality or avoid costly equipment failure. New approaches to analytics that minimize or eliminate data latency (i.e., eliminate the green arrows in Figure 1) are gaining attention.

Use one database
One approach to eliminating data latency is to merge the operational databases with the data warehouse, creating a single database that can support recording transactions and analytics. This eliminates the need for periodic batch transfers to keep the data warehouse up-to-date.

This idea has been proposed before using a traditional relational database management system (RDBMS), but in most cases that proved to be impractical. RDBMSs first came to commercial prominence in the 1980s, and their role was to record simple transactional information, such as bank account withdrawals, telephone call details, or sales orders. The need for analytics arose later. The database structure required to support high-performance analytics was often very different from the structure required to support high-performance transaction processing.

Consider designing a car. It is possible to build a car that comfortably exceeds 200 miles per hour, and it is also relatively easy to design a car that achieves 50 miles to the gallon. However, creating a car that does both simultaneously is beyond current mainstream automotive technology. Similarly, running transactional applications and analytics using the same database not only increases the workload, but also makes the workload more complex to manage. The performance of the two workloads is difficult to optimize to the satisfaction of all users.

When the transactional applications and BI are separate, both can be optimized. However, maintaining multiple databases involves higher costs and, more importantly, introduces delays of days or weeks.

A novel in-memory database architecture (3) can support both transactional and analytics workloads in a single database. Americas Styrenics benefited from this approach after a divestiture forced an IT overhaul in just a year. For an enterprise application such as SAP, this time constraint was clearly a challenge.

The solution was to adopt SAP's Business Suite on the HANA enterprise cloud. Because the entire database is stored in memory rather than on a hard disk, reading and writing data is much faster (microseconds vs. milliseconds). Analytics are optimized because the data are stored in columns instead of rows. This scheme enables both transactional and analytics workloads to be served successfully from a single database.
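The row-versus-column tradeoff is easy to demonstrate outside of any commercial product. The sketch below (plain Python and NumPy, not HANA code) aggregates one field over a million synthetic order records stored both ways; the columnar scan only has to touch one contiguous array, while the row scan must walk every whole record:

    import time
    import numpy as np

    N = 1_000_000

    # Row store: one dict per record, roughly how an OLTP row is accessed.
    rows = [{"order_id": i, "qty": i % 7, "price": 9.99} for i in range(N)]

    # Column store: the same field held as one contiguous array.
    qty_column = np.array([r["qty"] for r in rows])

    t0 = time.perf_counter()
    total_rows = sum(r["qty"] for r in rows)   # must touch every whole record
    t1 = time.perf_counter()
    total_cols = int(qty_column.sum())         # touches one packed column
    t2 = time.perf_counter()

    assert total_rows == total_cols
    print(f"row scan:    {t1 - t0:.4f} s")
    print(f"column scan: {t2 - t1:.4f} s")

On typical hardware the column scan is one to two orders of magnitude faster, which is the same locality argument that favors column-oriented analytics databases.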
Get closer to the data source
Another way to reduce latency is to place analytics closer to data generation. This can be particularly valuable when it is critical to maintain quality and uptime, such as in production monitoring applications. In an industrial setting, delays in assembling and accessing information can cause serious issues, like expensive machine failures.

Analytics that are closer to the data source are able to intercept data and perform time-critical analyses almost immediately after the data are generated. The analysis occurs before the data are written to long-term data storage.

This is possible, in part, because processing and communication technologies have become smaller, cheaper, and yet more powerful. Microprocessors can be embedded in industrial devices, such as pumps or turbines, and software, such as predictive analytics algorithms, can execute on the device itself. Time-critical functions, such as device optimization or failure alerts, can be performed using analytics on the device. For other functions that are not time-critical, such as historical trend analyses, data are still aggregated in traditional data storage.

Schwering & Hasse Elektrodraht utilizes complex event processing (event stream processing) technology to manage the quality of the copper magnet wire it produces. Copper magnet wire, which is coated with a thin layer of insulation, is a critical component of many electrical products, such as transformers and motors. It is made to fine tolerances, and because it is embedded within other components, quality is critical. Failures can trigger expensive product recalls.

To ensure product quality, the manufacturing process is monitored continuously via about 20,000 measurements per second across the factory. These measurements come from about 20 different sources and include, for example, oven temperatures, material feed rates, and cooling fan speeds. The quality of the wire insulation is also physically checked at 25-mm intervals along the wire to ensure voltage isolation. Measurements are fed into Software AG's Apama Streaming Analytics, which monitors the production process in real time from start to finish.

This monitoring scheme has changed the way the factory operates. In the past, if a wire did not meet quality requirements, the entire spool of wire had to be scrapped. Real-time production monitoring reduces scrap and provides more information about the production process and quality of each spool of wire.
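Apama is a commercial event-processing platform, but the underlying pattern — evaluate every measurement against its limits the moment it arrives, before it reaches long-term storage — can be sketched in a few lines of generic Python. The tag names and limits below are hypothetical:

    import random

    LIMITS = {"oven_temp_C": (180.0, 220.0), "feed_rate_kg_h": (45.0, 55.0)}

    def sensor_stream(n=20):
        # Stand-in for a real-time measurement feed.
        for _ in range(n):
            tag = random.choice(list(LIMITS))
            lo, hi = LIMITS[tag]
            yield tag, random.uniform(lo - 5, hi + 5)

    def archive(tag, value):
        pass  # append to the historian / long-term storage

    def monitor(stream):
        # Inspect each event the moment it arrives, before archiving it.
        for tag, value in stream:
            lo, hi = LIMITS[tag]
            if not lo <= value <= hi:
                print(f"ALERT: {tag} = {value:.1f} outside [{lo}, {hi}]")
            archive(tag, value)

    monitor(sensor_stream())

The key design point is the ordering: the check runs on the live event stream, so an alert does not have to wait for the next batch load into a warehouse.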
Predictive analytics, used in applications such as predictive maintenance, are also more effective when moved closer to the data source. For example, GE uses predictive analytics to monitor the performance of over 1,500 of its turbines and generators in 58 countries. Each turbine typically has more than 100 physical sensors and over 300 virtual sensors that monitor factors such as temperature, pressure, and vibration. Data are routed over the Internet to a central monitoring location (more than 40 terabytes of data have been transferred so far). At the data center, predictive analytics algorithms check current and recent operating performance against models of more than 150 failure scenarios to detect signs of impending failure.

Routing data into central storage makes efficient use of highly skilled workers, allowing specialist technicians to monitor all of the turbine and generator installations. This single-database method also helps to ensure that predictive models are continually refined and improved, and enables best practices to be shared with all customers simultaneously rather than on an incremental basis during system upgrades. GE estimates that this approach collectively saved its customers $70 million last year.

SAS has applied similar technology in deepwater oil fields. Electrical submersible pumps (ESPs) are used to increase oil well production, but unexpected failure of an ESP is costly, causing hundreds of millions of dollars in lost production as well as requiring $20 million to replace the pump. The predictive maintenance application developed using SAS draws on data stored in historians and other sources to monitor the performance of thousands of ESPs and detect abnormal operation. Operators are able to gain a three-month lead time on potential failures and reduce downtime to six hours.

The cloud
The cloud will be key to IIoT data management and analytics. The IIoT is accelerating the growth of big data, generating an unprecedented volume of data. Although analytics are vital in extracting value from IIoT data, without strong data management capabilities, high-quality analytics are impossible.

Raw disk storage, like other electronic technologies, follows Moore's Law. In just 20 years, the price of raw disk storage has dropped from $950 per gigabyte in 1994 to $0.03 per gigabyte in 2014 (Figure 2).

Figure 2. The cost of raw disk storage has fallen dramatically over the last two decades. If the per-gigabyte cost in 1994 is represented by the height of the Empire State Building, the cost in 2014 would be comparable to the length of an almond. Source: Adapted from (4).

Although storage has become cheaper, salaries of IT staff have kept pace with inflation. As a result, the cost of storage is not tied to the technology, but rather to the people charged with supporting the technology. As demand for storage and processing grows, the choice is to pay people to procure, commission, and maintain the infrastructure, or outsource this work to the cloud and deploy skilled IT staff to more important tasks.
Another argument favoring the cloud is that projects can be initiated sooner and completed faster. Companies will be able to undertake more projects and accelerate the rate of innovation. On-demand, cloud-based data warehouses are available from many software and service providers. Users pay for a monthly subscription, hourly usage, or a combination of data storage and query volume. Setup is faster than the traditional approach of procuring and commissioning hardware and software for an on-premise corporate data warehouse.

Amazon Redshift provides a high-performance data warehouse that is scalable on-demand (up to 1.6 petabytes). It would take an organization months to procure, install, and commission that amount of disk storage onsite, but setting up the cloud-based service requires only 20 minutes. The cloud service also offers flexibility. An organization can rapidly set up a data warehouse, use it intensively for a short period of time, and then discontinue it.

This ability to set up, scale, and tear down a data warehouse on-demand shatters conventional data warehouse economics. A wide range of analytics projects that were considered not viable before can now be executed for a few hundred dollars.
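As an illustration of how little ceremony an on-demand warehouse involves, the sketch below provisions and later tears down a small Redshift cluster with boto3, the AWS SDK for Python. The identifiers, node type, and node count are placeholders, and available options change over time, so treat this as a pattern rather than a recipe:

    import boto3

    redshift = boto3.client("redshift", region_name="us-east-1")

    # Provision a small on-demand warehouse (this runs against a real AWS
    # account and incurs charges; all names and sizes are placeholders).
    redshift.create_cluster(
        ClusterIdentifier="analytics-sandbox",
        NodeType="dc1.large",
        NumberOfNodes=2,
        MasterUsername="admin",
        MasterUserPassword="REPLACE_WITH_A_SECRET",
        DBName="warehouse",
    )

    # ...load data and run the analysis...

    # Tear the cluster down when the project ends to stop paying for it.
    redshift.delete_cluster(
        ClusterIdentifier="analytics-sandbox",
        SkipFinalClusterSnapshot=True,
    )

The economic point is in the last call: because teardown is one API request, it is practical to run a warehouse for days rather than years.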
NoSQL databases
RDBMSs have long been the dominant tool for organizing and managing enterprise data, and they continue to dominate. However, an alternative class of databases, collectively known as nonrelational or NoSQL (e.g., MongoDB, Cassandra, Redis, and HBase), have gained popularity because they meet the changing demands of data management and the growth of very-large-scale applications.

RDBMSs are from a bygone era when a database server typically ran in a single box that contained a central processing unit (CPU), memory, and a hard disk. Scalability was restricted to scaling within the box. The processor could be updated to run faster or the single-CPU card could be upgraded to a dual-CPU card, but only within the confines of the box. That is known as vertical scalability, and it places fundamental limits on how much the data volume, throughput, and response rate can be scaled.

NoSQL databases are designed out-of-the-box to deliver performance, scalability, and high availability across distributed computing environments. To gain these advantages over RDBMS, however, many NoSQL databases take a more relaxed approach to consistency. The nodes in the database eventually all have a consistent view of the data, which is often referred to as eventual consistency. This approach pushes more responsibility for data integrity and consistency onto the application logic. Whereas an RDBMS usually centralizes responsibility for data integrity in the database engine, NoSQL engines shift some of that responsibility to application programmers. This tradeoff is worth making for some applications, but not for all.
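A toy model makes eventual consistency concrete. The sketch below is a deliberate simplification of no particular NoSQL engine: a write lands on one replica immediately and reaches the second only when replication runs, so a read from the lagging replica can be stale in between:

    from collections import deque

    primary, secondary = {}, {}
    replication_log = deque()        # writes waiting to propagate

    def write(key, value):
        primary[key] = value                   # acknowledged immediately
        replication_log.append((key, value))   # propagated later, not now

    def replicate():
        # A real engine runs this asynchronously; here it runs on demand.
        while replication_log:
            key, value = replication_log.popleft()
            secondary[key] = value

    write("reactor_setpoint", 451)
    print(primary.get("reactor_setpoint"))    # 451 -- current
    print(secondary.get("reactor_setpoint"))  # None -- stale read
    replicate()                                # consistency arrives eventually
    print(secondary.get("reactor_setpoint"))  # 451

Application code that reads from the lagging replica must tolerate the stale answer — that is the responsibility shifted onto the programmer.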

Recommendations
For most industrial enterprises, big data will manifest through the IIoT and will require a different approach to analytics. It will be difficult to extract maximum value from high volumes of fast-moving data with traditional data architecture. I recommend the following actions as you embark on your big data journey.

• Start small and focus on a real business problem. Many devices can be connected to the IIoT in a short period. However, it is best to only connect the things that are associated with business problems. The fastest way to impede a big data project is to expend resources with no measurable end result.

• Pursue a project that promises quick and easy value. Do not go out of your way to find a particularly nasty problem to solve. Your first big data project will be challenging, so pick an easy project. Use data you already have, instead of data that require new sensors. Find a project that requires only one or two new technologies. The project should have a relatively short time frame (months, not years).

• Assemble a multidisciplinary team. Any successful IT project requires a blend of technical expertise and domain-specific business or operational expertise. Treating an IIoT project as purely a technical problem will yield technically correct but useless insights and recommendations. The IT team may need to learn and implement new technologies, but will require operational and business insight to ensure there is value in implementation.

• Measure return on investment for future projects. Before you start your first big data project, ensure that you have a process in place to measure the return on investment. If you cannot demonstrate tangible value for your first project, there likely will not be a second project. Make sure you understand the objectives as agreed upon with business leadership, and document progress toward those objectives. CEP

Literature Cited
1. ARC Advisory Group, "What's Driving Industrial Investment in BI and Analytics?," www.arcweb.com/strategy-reports/Lists/Posts/Post.aspx?List=e497b184-6a3a-4283-97bf-ae7b2f7ef41f&ID=1665&Web=a157b1d0-c84d-440a-a7da-9b99faeb14cc (Sept. 4, 2014).
2. Laney, D., "3D Data Management: Controlling Data Volume, Velocity, and Variety," META Group, Inc. (Feb. 6, 2001).
3. ARC Advisory Group, "SAP HANA: The Real-Time Database as Change Agent?," www.arcweb.com/strategy-reports/Lists/Posts/Post.aspx?ID=1656 (Aug. 22, 2014).
4. Komorowski, M., "A History of Storage Costs (update)," www.mkomo.com/cost-per-gigabyte-update (Mar. 9, 2014).

BIG DATA
Success Stories in the Process Industries

Salvador García Muñoz
Eli Lilly and Co.
John F. MacGregor
ProSensus, Inc.

Big data holds much potential for optimizing and improving processes. See how it has already been used in a range of industries, from pharmaceuticals to pulp and paper.

Big data in the process industries has many of the characteristics represented by the four Vs — volume, variety, veracity, and velocity. However, process data can be distinguished from big data in other industries by the complexity of the questions we are trying to answer with process data. Not only do we want to find and interpret patterns in the data and use them for predictive purposes, but we also want to extract meaningful relationships that can be used to improve and optimize a process.

Process data are also often characterized by the presence of large numbers of variables from different sources, something that is generally much more difficult to handle than just large numbers of observations. Because of the multisource nature of process data, engineers conducting a process investigation must work closely with the IT department that provides the necessary infrastructure to put these data sets together in a contextually correct way.

This article presents several success stories from different industries where big data has been used to answer complex questions. Because most of these studies involve the use of latent variable (LV) methods such as principal component analysis (PCA) (1) and projection to latent structures (PLS) (2, 3), the article first provides a brief overview of those methods and explains the reasons such methods are particularly suitable for big data analysis.

Latent variable methods
Historical process data generally consist of measurements of many highly correlated variables (often hundreds to thousands), but the true statistical rank of the process, i.e., the number of underlying significant dimensions in which the process is actually moving, is often very small (about two to ten). This situation arises because only a few dominant events are driving the process under normal operations (e.g., raw material variations, environmental effects). In addition, more sophisticated online analyzers such as spectrometers and imaging systems are being used to generate large numbers of highly correlated measurements on each sample, which also require lower-rank models.

Latent variable methods are uniquely suited for the analysis and interpretation of such data because they are based on the critical assumption that the data sets are of low statistical rank. They provide low-dimension latent variable models that capture the lower-rank spaces of the process variable (X) and the response (Y) data without over-fitting the data. This low-dimensional space is defined by a small number of statistically significant latent variables (t1, t2, …), which are linear combinations of the measured variables. Such variables can be used to construct simple score and loading plots, which provide a way to visualize and interpret the data.

The scores can be thought of as scaled weighted averages of the original variables, using the loadings as the weights for calculating the weighted averages. A score plot is a graph of the data in the latent variable space. The loadings are the coefficients that reveal the groups of original variables that belong to the same latent variable, with one loading vector (W*) for each latent variable. A loading plot provides a graphical representation of the clustering of variables, revealing the identified correlations among them.

The uniqueness of latent variable models is that they simultaneously model the low dimensional X and Y spaces, whereas classical regression methods assume that there is independent variation in all X and Y variables (which is referred to as full rank). Latent variable models show the relationships between combinations of variables and changes in operating conditions — thereby allowing us to gain insight and optimize processes based on such historical data.
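As a concrete sketch (with synthetic data standing in for real process measurements), the snippet below fits a two-component PCA model with scikit-learn and verifies that the scores are exactly linear combinations of the mean-centered variables, with the loadings supplying the weights:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)

    # Synthetic process data: 500 observations of 30 correlated variables
    # driven by only two underlying events (i.e., low statistical rank).
    t_true = rng.normal(size=(500, 2))
    X = t_true @ rng.normal(size=(2, 30)) + 0.1 * rng.normal(size=(500, 30))

    Xc = X - X.mean(axis=0)            # mean-center before PCA
    pca = PCA(n_components=2).fit(Xc)

    scores = pca.transform(Xc)         # t1, t2 -- coordinates for a score plot
    loadings = pca.components_.T       # one loading vector per latent variable

    # The scores are weighted averages of the centered original variables,
    # with the loadings as the weights:
    assert np.allclose(scores, Xc @ loadings)
    print(pca.explained_variance_ratio_)  # two components capture most variance

Plotting the two columns of scores against each other gives the score plot, and plotting the two columns of loadings gives the companion loading plot described above.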
The remainder of the article presents several industrial applications of big data for:
• the analysis and interpretation of historical data and troubleshooting process problems
• optimizing processes and product performance
• monitoring and controlling processes
• integrating data from multivariate online analyzers and imaging sensors.

Learning from process data
A data set containing about 200,000 measurements was collected from a batch process for drying an agrochemical material — the final step in the manufacturing process. The unit is used to evaporate and collect the solvent contained in the initial charge and to dry the product to a target residual solvent level.

The objective was to determine the operating conditions responsible for the overall low yields when off-specification product is rejected. The problem is highly complex because it requires the analysis of 11 initial raw material conditions, 10 time trajectories of process variables (trends in the evolution of process variables), and the impact of the process variables on 11 physical properties of the final product.

The available data were arranged in three blocks:
• Block X comprised the time trajectories measured through the batch, which were characterized by milestone events (e.g., slope, total time for stage of operation)
• Block Z contained measurements of the chemistry of the incoming materials
• Block Y consisted of the 11 physical properties of the final product.

A multiblock PLS (MBPLS) model was fitted to the three data blocks. The results were used to construct score plots (Figure 1), which show the batch-to-batch product quality variation, and the companion loading plots (Figure 2), which show the regressor variables (in X and Z) that were most highly correlated with such variability.

Contrary to the initial hypothesis that the chemistry variables (Z) were responsible for the off-spec product, the analysis isolated the time-varying process variables as a plausible cause for the product quality differences (Figure 1, red) (4). This was determined by observing the direction in which the product quality changes (arrow in Figure 1) and identifying the variables that line up in this direction of change (Figure 2). Variables z1–z11 line up in a direction that is close to perpendicular to the direction of quality change.

Figure 1. A score plot of two latent variables shows lots clustered by product quality. Source: (4).

Figure 2. A companion loading plot reveals the process parameters that were aligned with the direction of change in the score plot. Source: (4).
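The full multiblock PLS analysis in (4) is beyond a short example, but the X-to-Y latent variable regression at its core can be sketched with scikit-learn. Everything below is synthetic — 80 hypothetical batches with 25 trajectory/raw-material features and 11 final-product properties, all driven by three shared latent events:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(1)

    # Synthetic stand-ins for the batch feature block (X) and the block of
    # final-product physical properties (Y).
    t = rng.normal(size=(80, 3))
    X = t @ rng.normal(size=(3, 25)) + 0.1 * rng.normal(size=(80, 25))
    Y = t @ rng.normal(size=(3, 11)) + 0.1 * rng.normal(size=(80, 11))

    pls = PLSRegression(n_components=3).fit(X, Y)

    scores = pls.x_scores_      # batch positions (t1, t2, t3) for score plots
    w_star = pls.x_rotations_   # W*: ties each latent variable to variables

    Y_hat = pls.predict(X)      # predicted final-product properties
    rmse = float(np.sqrt(((Y - Y_hat) ** 2).mean()))
    print(f"fit RMSE: {rmse:.3f}")

In a real study the rows of scores would be colored by product quality, as in Figure 1, and w_star would be plotted to identify which variables move together, as in Figure 2.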

Copyright © 2016 American Institute of Chemical Engineers (AIChE) CEP  March 2016  www.aiche.org/cep  37
Special Section: Big Data Analytics

Optimizing process operations
The manufacture of formulated products (such as pharmaceutical tablets) generates a complex data set that extends beyond process conditions to also include information about the raw materials used in the manufacture of each lot of final product, and the physical properties of the raw materials. This case study can be represented by multiple blocks of data: the final quality of the product of interest (Y), the weighted average for the physical properties of the raw materials used in each lot (RXI), and the process and environmental conditions at which each lot was manufactured (Z). These blocks of data were used to build a MBPLS model that was later embedded within a mixed-integer nonlinear programming (MINLP) optimization framework. The physical properties of the lots of material available in inventory are represented by data block XA and the properties of the lots of material used to manufacture the final product are represented by data block X.

The objective for the optimization routine was to determine the materials available in inventory that should be combined and the ratios (r) of those that should be blended to obtain the best next lot of finished product. The square of the difference between the predicted and the target quality of the product was used to choose the lots and blending ratios.

The underlying calculations reduce the problem to the score space, where the differences in quality — in this case tablet dissolution — correspond to different locations on the score plot (Figure 3). The MINLP optimization routine identified the candidate materials available in inventory that should be blended together to make the final product so that the score for the next lot lands in the score space corresponding to the desired quality (i.e., target dissolution). Implementing this optimization routine in real time significantly improved the quality of the product produced in this manufacturing process (Figure 4).

Figure 3. The dissolution speed of a pharmaceutical tablet is identified on a score plot of the latent variables. Source: (5).

Figure 4. A control chart of the degree of dissolution of a pharmaceutical tablet reveals the onset of quality problems. Quality problems are reduced by the implementation of a best-next-lot solution, then eliminated by the best-next-campaign approach. Source: (6).

Selecting the materials from inventory to be used in manufacturing a product is not as simple as choosing those that will produce the best lot of product. If you choose materials aiming to produce the best next lot, you will inevitably consume the best materials very fast; this may be acceptable for a low-volume product. For high-volume products, however, using this same calculation will lead to an undesired situation where the best materials have been depleted and the less-desirable raw materials are left. In this case, it is better to perform the optimization routine for the best next campaign (a series of lots), which will account for the fact that more than one acceptable lot of product is being manufactured. The optimization calculation in this latter case will then balance the use of inventory and enable a better management of desirable vs. less-desirable raw materials for the entire campaign of manufactured product.

The MINLP objective function must be tailored to the material management needs for the given product so that it adequately considers operational constraints, such as the maximum number of lots of the same material to blend (5, 6).
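The published approach embeds the latent variable model in a formal MINLP solver (5, 6); the sketch below captures only the spirit of the best-next-lot calculation with a brute-force search over hypothetical inventory lots, assuming blend scores mix linearly with the blend ratio:

    import itertools
    import numpy as np

    # Hypothetical latent variable scores for the lots now in inventory,
    # projected into the same two-dimensional space as the quality target.
    inventory = {
        "lot_A": np.array([0.8, -0.2]),
        "lot_B": np.array([-0.5, 0.6]),
        "lot_C": np.array([0.1, 0.9]),
        "lot_D": np.array([-0.9, -0.4]),
    }
    target = np.array([0.0, 0.1])  # score-space location of target dissolution

    best = None
    for (n1, s1), (n2, s2) in itertools.combinations(inventory.items(), 2):
        for r in np.linspace(0.0, 1.0, 101):   # blend ratio of the first lot
            blended = r * s1 + (1.0 - r) * s2  # linear-mixing assumption
            cost = float(((blended - target) ** 2).sum())
            if best is None or cost < best[0]:
                best = (cost, n1, n2, r)

    cost, n1, n2, r = best
    print(f"blend {r:.2f} {n1} + {1.0 - r:.2f} {n2}; squared error {cost:.4f}")

The real formulation replaces the exhaustive loop with an MINLP solver, maps raw material properties to scores through the MBPLS model, and adds the operational constraints discussed above.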
Monitoring processes
Perhaps the most well-known application of principal components analysis in the chemical process industries (CPI) is its use as a monitoring tool, enabling true multivariate statistical process control (MSPC) (7, 8). In this example, a PCA model was used to describe the normal variability in the operation of a closed spray drying system in a pharmaceutical manufacturing process (9).

Figure 5. A closed-loop spray drying system in a pharmaceutical manufacturing facility is being monitored by the measurement of 16 variables that a PCA model projects into two principal components. Source: (9).

The system (Figure 5) includes measurements of 16 process variables, which can be projected by a PCA model into two principal components (t1 and t2), each of which describes a different source of variability in the process. A score plot that updates in real time can then be used as a graphical tool to determine when the process is exhibiting abnormal behavior. This is illustrated in Figure 6, where the red dots indicate the current state of the process, which is clearly outside of the normal operating conditions (gray markers).

Figure 6. A score plot of the two principal components describing the closed-loop spray drying system (Figure 5) shows that the process is operating under abnormal conditions. Source: (9).

It is important to emphasize that this model could be used to effectively monitor product quality without the need to add online sensors to measure product properties. Building an effective monitoring system requires a good data set that is representative of the normal operating conditions of the process.
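A minimal version of such a monitor can be sketched as follows — not the actual model from (9), just the common pattern: fit PCA on normal operating data, compute Hotelling's T² for each new observation, and alert when it exceeds a control limit (an empirical percentile is used here for brevity; rigorous limits are derived from an F-distribution):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(2)

    # "Normal operation" training set: 16 correlated process variables
    # driven by two underlying sources of variability.
    t = rng.normal(size=(1000, 2))
    X_normal = t @ rng.normal(size=(2, 16)) + 0.1 * rng.normal(size=(1000, 16))

    mu = X_normal.mean(axis=0)
    pca = PCA(n_components=2).fit(X_normal - mu)
    score_var = pca.explained_variance_       # variance of each score

    def hotelling_t2(x):
        # T^2 = sum of (score_i^2 / variance_i) over retained components.
        scores = pca.transform((x - mu).reshape(1, -1))[0]
        return float((scores**2 / score_var).sum())

    # Empirical 99th-percentile control limit from normal operating data.
    limit = np.percentile([hotelling_t2(x) for x in X_normal], 99)

    x_new = X_normal[0] + 5.0                 # simulated process upset
    if hotelling_t2(x_new) > limit:
        print("ALERT: process outside the normal operating region")

Plotting each new observation's scores on the (t1, t2) plane, with the limit drawn as an ellipse, reproduces the kind of real-time score plot shown in Figure 6.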
Control of batch processes
Multivariate PLS models built from process data that relate the initial conditions of the batch (Z), the time-varying process trajectories (X), and the final quality attributes (Y) (10) provide an effective way to control product quality and productivity of batch processes. Those models can be used online to collect evolving data of any new batch (first the initial data in Z and then the evolving data in X), which are then used to update the predictions of final product quality (Y) at every time interval during the batch process. At certain critical decision points (usually each batch has one or two), a multivariate optimization routine is run to identify control actions that will drive the final quality into a desired target region and maximize productivity while respecting all operating constraints (11–13).

Figure 7 displays one quality attribute of a high-value food product before and after this advanced process control method was implemented over many thousands of batches. The process control method reduced the root-mean-square deviation from the target for all final product quality attributes by 50–70% and increased batch productivity by 20%.

Figure 7. Advanced control eliminated the variation in the final product quality attribute of a food product. Source: (9).

Analyzing information from advanced analyzers and imaging sensors
The use of more-sophisticated online analyzers (e.g., online spectrometers) and image-based sensors for online process monitoring is becoming more prevalent in the CPI. With that comes the need for more powerful methods to handle and extract information from the large and diverse data blocks acquired from such sophisticated online monitors. Latent variable methods provide an effective approach (14).

Consider a soft sensor (i.e., virtual sensor software that processes several measurements together) application for predicting the quality of product exiting a lime kiln at a pulp and paper mill. Real-time measurements on many process variables were combined with images from a color camera capturing the combustion region of the kiln. The information extracted from the combustion zone images and data from the process data blocks were combined using the online multivariate model to assess combustion stability and make 2-hr-ahead predictions of the exit lime quality.

Concluding remarks
Contextually correct historical data is a critical asset that a corporation can take advantage of to expedite assertive decisions (3). A potential pitfall in the analysis of big data is assuming that the data will contain information just because there is an abundance of data. Data contain information if they are organized in a contextually correct manner; the practitioner should not underestimate the effort and investment necessary to organize data such that information can be extracted from them.

Multivariate latent variable methods are effective tools for extracting information from big data. These methods reduce the size and complexity of the problem to simple and manageable diagnostics and plots that are accessible to all consumers of the information, from the process designers and line engineers to the operations personnel. CEP
Literature Cited
1. Jackson, E., "A User's Guide to Principal Components," 1st ed., John Wiley and Sons, Hoboken, NJ (1991).
2. Höskuldsson, A., "PLS Regression Methods," Journal of Chemometrics, 2 (3), pp. 211–228 (June 1988).
3. Wold, S., et al., "PLS — Partial Least-Squares Projection to Latent Structures," in Kubiny, H., ed., "3D-QSAR in Drug Design," ESCOM Science Publishers, Leiden, The Netherlands, pp. 523–550 (1993).
4. García Muñoz, S., et al., "Troubleshooting of an Industrial Batch Process Using Multivariate Methods," Industrial and Engineering Chemistry Research, 42 (15), pp. 3592–3601 (2003).
5. García Muñoz, S., and J. A. Mercado, "Optimal Selection of Raw Materials for Pharmaceutical Drug Product Design and Manufacture Using Mixed Integer Non-Linear Programming and Multivariate Latent Variable Regression Models," Industrial and Engineering Chemistry Research, 52 (17), pp. 5934–5942 (2013).
6. García Muñoz, S., et al., "A Computer Aided Optimal Inventory Selection System for Continuous Quality Improvement in Drug Product Manufacture," Computers and Chemical Engineering, 60, pp. 396–402 (Jan. 10, 2014).
7. MacGregor, J. F., and T. Kourti, "Statistical Process Control of Multivariable Processes," Control Engineering Practice, 3 (3), pp. 403–414 (1995).
8. Kourti, T., and J. F. MacGregor, "Recent Developments in Multivariate SPC Methods for Monitoring and Diagnosing Process and Product Performance," Journal of Quality Technology, 28 (4), pp. 409–428 (1996).
9. García Muñoz, S., and D. Settell, "Application of Multivariate Latent Variable Modeling to Pilot-Scale Spray Drying Monitoring and Fault Detection: Monitoring with Fundamental Knowledge," Computers and Chemical Engineering, 33 (12), pp. 2106–2110 (2009).
10. Kourti, T., et al., "Analysis, Monitoring and Fault Diagnosis of Batch Processes Using Multiblock and Multiway PLS," Journal of Process Control, 5, pp. 277–284 (1995).
11. Yabuki, Y., and J. F. MacGregor, "Product Quality Control in Semibatch Reactors Using Midcourse Correction Policies," Industrial and Engineering Chemistry Research, 36, pp. 1268–1275 (1997).
12. Yabuki, Y., et al., "An Industrial Experience with Product Quality Control in Semi-Batch Processes," Computers and Chemical Engineering, 24, pp. 585–590 (2000).
13. Flores-Cerrillo, J., and J. F. MacGregor, "Within-Batch and Batch-to-Batch Inferential Adaptive Control of Semi-Batch Reactors," Industrial and Engineering Chemistry Research, 42, pp. 3334–3345 (2003).
14. Yu, H., et al., "Digital Imaging for Online Monitoring and Control of Industrial Snack Food Processes," Industrial and Engineering Chemistry Research, 42 (13), pp. 3036–3044 (2003).


BIG DATA
Getting Started on the Journey

Lloyd F. Colegrove
Mary Beth Seasholtz
Chaitanya Khare
The Dow Chemical Co.

This article discusses some experiences and challenges in establishing an enterprise manufacturing intelligence (EMI) platform at a major chemical manufacturing company, and recommends steps you can take to convince your management to harness big data.

It's often said that a journey of a thousand miles starts with one step. This holds true for any effort to harness the power of the data in plant historians, laboratory information management systems (LIMS), and online analyzers.
The previous two articles in this special section discuss what big data is and why engineers should care about data. This article instructs you on how to tackle the vexing problem of convincing your organization to start the journey into big data. It discusses how we helped the Dow Chemical Co. move from merely collecting data to actively harnessing the data. We offer our experiences, reveal some of the challenges we faced and successfully met, and explain how we overcame these obstacles.

The what and the why
In many sectors, big data means using large sets of unstructured data to predict buying patterns or market trends. These data can also be used to define triggers — also called the what — that alert companies to engage in an activity. What refers to an event that signals an action is required. For example, data analysis indicates Internet shoppers are interested in a particular product. This trend might prompt you to meet with suppliers, restock, or increase production. The what allows companies to react appropriately in their supply chain and market context. Because of the fickleness of trends in business, the what that matters today might not matter tomorrow. In this example, you are not necessarily interested in why a particular product's sales pattern is behaving this way, you just want to respond to the market trend. In this big data journey, the why is often irrelevant or ignored.
The chemical industry cannot afford to ignore the why that accompanies the what. The what might be the plant trending outside of normal operating conditions; the why is the reason for that trend. The what might be that product quality does not meet customer requirements; the why is the reason for that as determined from the data.
In the chemical industry, engineers typically should not respond until they know the why. Chemistry and chemical engineering principles do not trend up or down based on consumer whim or stock market variation. Ignoring the why can lead to operational (i.e., reliability and productivity) and safety peril. So, given our focus on both the what and the why, how do the ideas of big data translate to the chemical manufacturing industry?
At a chemical plant that has taken advantage of big data, the right people will instantly receive alerts to let them know when continuous processes are trending outside of normal or batch processes are not progressing properly. These alerts will be accompanied by tools for making improvements — that is, a method for efficient diagnosis of the problem, a collaborative environment for discussion about what to do, and a way to store the experience to learn from it.
Typically, both the alerting system and the tools for making improvements are built into a single platform, referred to as enterprise manufacturing intelligence (EMI).
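The alerting behavior just described can be sketched very simply. The following is a minimal illustration, not a real EMI platform: it assumes a window of normal-operation data for a single tag and flags a new reading that falls outside limits derived from that window (the tag name and the 3-sigma rule are assumptions for the example):

import pandas as pd

def check_for_alert(noc_history, latest, tag):
    # Limits derived from a window of normal operating data (3-sigma rule).
    mean, sigma = noc_history.mean(), noc_history.std()
    lo, hi = mean - 3 * sigma, mean + 3 * sigma
    if not lo <= latest <= hi:
        # A real EMI platform would route this to the right people and attach
        # context: the recent trend, related tags, and past incident notes.
        return f"ALERT: {tag} = {latest:.2f} outside [{lo:.2f}, {hi:.2f}]"
    return None

noc = pd.Series([70.1, 69.8, 70.3, 70.0, 69.9, 70.2] * 20)  # placeholder data
print(check_for_alert(noc, latest=72.5, tag="reactor_temp"))

What distinguishes an EMI platform from this ten-line check is everything wrapped around it: automated collation from many sources, context, collaboration, and a memory of past events.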


EMI can mean different things to different people; in this instance, we mean a platform that encompasses the automated sampling of data from data sources, as well as collation, affinitization, and analysis of those data — all in real time. We are not adding new measures or systems to generate more data. We are using the data we already collect and store to achieve ever-increasing value.
In the program that we established at Dow, we aimed to achieve "total data domination." This entailed mastering all digital and alphanumeric data that related to a given operation, and correlating and displaying the data to be meaningful for every user.

Where to find big data opportunities
To determine where better data tools are needed, listen to the way that data are discussed on a daily basis. For example, at the beginning of our process, we often heard, "I don't trust the lab data." In attempting to ensure accuracy in the analytical measurement — and raise the trust within our operations — we discovered two things:
• The analytical measurement was typically performing its duties as prescribed and expected.
• The recipients of the data (e.g., process operations, plant engineering, and improvement and quality release functions, among others) were unable to use the data properly.
The recipients did not understand the effect of natural variation inherent in the manufacturing process, the analytical method, the sampling technique, etc. This lack of understanding was costing a lot of money, and frustrating plant custodians and customers alike. Data can bring great value if understood within the proper context, but the context was almost always missing.
Look for operations in continual crisis. Many big data opportunities are identified during post-mortem root-cause investigations. Other opportunities stem from known crisis situations, where the plant works feverishly to solve a recognized problem but the issue continues to escalate. Failing to identify a root cause may result in reduced production rates or even unexpected downtime.
In both of these cases, it's easy to demonstrate to a desperate audience that the issue could have been avoided if the data had been evaluated in a timely manner. After a process disruption, it's clear that use of data analytics can help the plant avoid a shutdown, reduce production delays, and/or deliver better value and service to its customers.
After the storm has passed. Once the root causes of a crisis are understood, plant personnel may think they have the situation under control. But, as time passes and engineering personnel change, commitment to maintaining the necessary changes can wane. This is especially true if the recommendations coming out of the root-cause investigations were contrary to local lore. If a plant is struggling with consistently employing the necessary operating discipline, an EMI platform can be employed to drive the continued use of the operating discipline.

What software to use
The term EMI platform generally refers to a system that can comprise one or more programs and approaches that first aggregate data and then allow for contextual analysis, visualization, and access to this information and knowledge by a broad spectrum of users. The platform supports problem identification (the what) and problem resolution (the why).
Before we chose an EMI platform that was the right fit for our needs, our early attempts to use data in a more-efficient manner were often conducted on an electronic spreadsheet (i.e., Excel). However, this method of recording the data only tracked and conveyed recent plant data (i.e., only the last few plant problems). While spreadsheets are acceptable for personal use, they are insufficient for institutional deployment as an EMI platform. More and bigger spreadsheets are not the way to deploy an enterprise data analytics platform, nor can you necessarily count on the appropriate application of numerical analysis in self-programmed analytics.
Once you have made the commitment to an enterprise approach, you need to select a software package. Take a few for a test drive. We sought out many vendors, chose three packages to test, and then went back to each vendor to see if they could withstand our critique and give us the technology we could use. Only one vendor at that time was willing to work with us, and they became our first analytics partner — a partnership that has grown over the past 13 years.
During the evaluation process, you must consider whether the tool will fit in your plant's workflow. Beware of the hammer-and-nail problem as applied to software — i.e., once you have a hammer, everything starts to look like a nail.
Carefully consider whether you want a tailored solution or a general solution. A tailored solution will exactly meet your workflow needs, but tailored solutions are harder to upgrade and maintain — and thus are often more resource-intensive. A general solution is easy to upgrade and maintain and requires fewer resources to support (an important benefit, especially in a large organization); however, a general solution might not exactly fit your needs.
Because it was less resource-intensive, we chose the general solution — with one caveat. We worked with the vendor to improve their product, and along with the vendor's other clients, we slowly molded the original product to be generally favorable to our workflows. At times, the original designers' intentions for their tools did not match our usage criteria, and we were able to communicate this to our partner to resolve the gap. This makes codevelopment a win-win for everyone. The platform itself is not the competitive advantage — the way we implement it becomes our competitive advantage.
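Returning to the "I don't trust the lab data" example above: much of that distrust dissolves once measurement variation is separated from process variation. Here is a minimal sketch of that decomposition, assuming replicate lab analyses are available for each process sample (all numbers are illustrative):

import numpy as np

# rows = process samples, columns = replicate lab analyses of the same sample
replicates = np.array([
    [5.1, 5.2, 5.0],
    [5.8, 5.7, 5.9],
    [4.9, 5.0, 4.8],
    [5.5, 5.6, 5.4],
])

# Within-sample scatter estimates the analytical (measurement) variance; the
# variance of the sample means contains the process variance plus a residual
# share of the measurement variance.
n_reps = replicates.shape[1]
var_measurement = replicates.var(axis=1, ddof=1).mean()
var_process = max(replicates.mean(axis=1).var(ddof=1) - var_measurement / n_reps, 0.0)

print(f"analytical variance: {var_measurement:.4f}")
print(f"process variance:    {var_process:.4f}")
# If the analytical variance is small relative to the process variance, the
# lab method is doing its job; the remaining scatter is the process talking.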

The first steps of the big data journey
Once you have found an analytics opportunity and decided on a data analytics software package, the truly hard work starts — convincing your organization to move forward and then actually taking those first steps (Figure 1).

▲ Figure 1. The first steps of the big data journey typically garner some pushback. Place a metaphorical baited hook in front of skeptics by demonstrating the power of improved analytics. The steps, from first to last: Form a Team (bring together a diverse team consisting of plant engineers, quality control personnel, IT experts, etc.); Start a Pilot Program (set up a pilot program within the lab or plant); Listen for Success Stories (they point to shifts in culture); Counter Pushback (expect some resistance, but be prepared to counter it).

Form a team. First, gain the cooperation of a variety of plant engineers, lab and/or quality control personnel, perhaps a statistician or chemometrician, an IT infrastructure expert, and a dedicated coordinator.
You can choose to rely, somewhat, on the software vendor and consulting firms if the full complement of expertise is not available at your company. An outside consultant can serve as an excellent check on your organization's understanding of statistics if you do not have your own internal experts.
If yours is a small company with very few operations, having such a robust team may not be necessary. However, larger companies need to dedicate resources starting from the beginning of this journey, because a lack of IT infrastructure and support can derail the best of intentions.
It may be best to bring the plant quality and process control experts onboard first, because improved data analysis will only make their jobs easier. But anyone you come across who is committed to using data in a more transformative way is a good potential partner.
The goal of bringing together a diverse team is to successfully deploy the initial applications, while at the same time striving toward a complete EMI platform with supporting documentation, training, support, IT infrastructure, etc. Such a team is helpful for a sustained engagement along the big data journey.
Start with a pilot program. Next, set up a pilot program within the lab or plant. This environment should contain one or more data sources that are already collecting data. The pilot program involves collating the data within the prospective EMI platform so that the team can experiment and explore how improved analytics can better explain measurement capability and variation. Once the connection between the data and some plant events is well understood, the team can then learn how monitoring the data can prevent problems. This activity marries context to data.
Your starting point may depend on how knowledgeable your company, or clients, are. Across our vast company, we noticed that many users (regardless of role) needed training in basic statistics because they did not have sufficient understanding of the science of numerical variation. You cannot create context for the data until the science of variation (statistics) is understood.
The lightbulb goes on. During the development and early use of the EMI tool, look for success stories that point to subtle shifts in the culture. A well-timed phone call (e.g., "I saw that impurity X is increasing; let's talk about what to do to reduce it"), additional samples taken to more fully characterize a feed stream, an interruption avoided because the impurity never exceeded the specification, or an operational improvement identified are all signs of the desired culture change.
We often have to show our clients how their results are changing because of our tools and approaches. The success stories are told slightly differently, and depend on what part of the organization you are working with.
After the small success stories are recognized for what they mean, it is time to expand the pilot within the business or plant. This is where the EMI platform will get its first test with deployments beyond the original pilot. Engineers will see familiar tools appearing all around the business, which will generate discussion.
The pushback. As we expanded beyond our initial pilot project, we tried to convince others across the company that there is value in data. In any software introduction there is likely to be resistance. "It is too expensive," "It's not my way," and "Why do I need that?" are just three of the responses you may hear. If you have already attempted to change the analytics culture in your company, you have heard these and more. To address the skepticism, we tried a new approach that can be compared to fishing.
What exactly do fish have to do with analytics, you ask. If we place a hook and bait in front of a potential user, we may get them to bite. We tempt them by telling success stories from other areas of the company that demonstrate the power of improved analytics.
We guide, we explain, we mentor, we hand-hold, all to provide support as they take the first tentative steps toward better data usage. If we are successful in this fishing expedition, the mere mention that we might remove the tool will elicit protests.
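One concrete piece of the pilot described above is collating historian and LIMS data onto a common timeline, so that context travels with the numbers. Here is a minimal sketch using pandas, matching each lab result to the most recent historian readings; the column names, tags, and one-hour tolerance are illustrative assumptions:

import pandas as pd

historian = pd.DataFrame({
    "time": pd.date_range("2016-03-01", periods=6, freq="15min"),
    "reactor_temp": [70.1, 70.4, 71.0, 70.8, 70.2, 70.5],
})
lims = pd.DataFrame({
    "time": pd.to_datetime(["2016-03-01 00:40", "2016-03-01 01:10"]),
    "purity": [99.2, 98.7],
})

# Attach the most recent process conditions to each lab result, so that a
# quality value is never viewed without its process context.
context = pd.merge_asof(
    lims.sort_values("time"),
    historian.sort_values("time"),
    on="time",
    direction="backward",
    tolerance=pd.Timedelta("1h"),
)
print(context)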


Moving beyond initial success
At this point, our small team had several successes and the acceptance of data analytics started to spread across the company. We even developed our own training modules for new adopters, which included a nearly math-less treatise on how to use statistics and understand variation. These courses proved popular to a workforce starved for a better understanding of how to explore their data but unable or unwilling to go back to school for a degree in statistics.
With these successes, our diverse team began to establish an EMI solution and expand our analytics approach. Remember that we are not talking about adding new measures or systems to generate more data. We are simply using the data we already collect and store to achieve ever-increasing value.
As we incrementally expanded the purpose and function of the EMI, we began to recognize discrete uses for our approach based on the kinds of decisions that need to be made (Figure 2):
• transactional (control room)
• tactical (engineering and daily to weekly control)
• operational (across multiple plants and/or unit operations)
• strategic (across an entire business).

▲ Figure 2. The data collected and analyzed in the EMI platform can be used to make four types of decisions. Each type of decision has a different timescale and scope. Transactional decisions are the most frequent but smallest in scope, whereas strategic decisions are the least frequent but largest in scope: transactional (change one variable at a time; hourly), tactical (course corrections; daily to monthly), operational (decisions at the plant level, optimizing across multiple unit operations; quarterly), and strategic (large-scale capital decisions across the business; annually).

The data analytics needs of each of these levels are very different. However, the platform and the underlying IT infrastructure (e.g., collating of data, contextual analysis, visualization, and propagation across the enterprise) are not unique — they are the same for all levels of operation.
If you are thinking about implementing an EMI platform, keep in mind this advice:
Implementation must have flexibility. No matter how you envision an implementation, be aware of the quirks of a global manufacturing infrastructure and the cultures represented both locally and across the enterprise. Potential users can cite many reasons to deflect the opportunity you bring, so you need to be more flexible to overcome these fears and misgivings.
We have devised strategies with the sole purpose of getting the analytics approaches in front of users, regardless of their misgivings. The personnel who both lead and deliver the big data analytics tools must have as much flexibility as the platform itself. It is imperative for long-term success that the users become comfortable with their data and trust the message the data are delivering.
Perfection is a myth. There is no such thing as the perfect tool or perfect approach for harnessing big data. For example, we adopted an attitude that if we gain around 80% of what we seek, we should move on in the process.
Forgive your data. Accept the data as they come to you. Do not get into the argument of prescribing how data should be structured. As you develop a system that harnesses big data, keep in mind that it is not your job to change, modify, or replace existing databases.
It should not matter what type of system your data are stored in, nor how antiquated the data storage is — a tool must be able to access all sources of data in some manner without requiring intermediate programming to make the connection to the data.
Look for all data that are available — even data disconnected from the lab or plant historians, such as readings from remote online analyzers.
If you have "bad data," that is, you think your measurement systems are lacking capability or you think you are not analyzing the right place at the right time, you may think that your big data journey is over before it ever begins. This is not true! It will become readily apparent in the initial analysis if your data are poor, which will reveal the first, most significant, improvement opportunity to the operation. It can also point to the measurements that need to be improved in capability (precision or accuracy) or perhaps frequency, etc.

Engage your audience
Analytics must be tailored to the process and personnel at which they are aimed.
If you applied an analytics solution to an operating environment without an understanding of that environment, the analytics solution would be doomed to marginalization or outright failure. It would be destined for marginalization, because ultimately the tools and approaches would become the purview of a select few. It could be an outright failure if the tool never rises above personal-use status, such as an electronic spreadsheet or personal calculator.
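Returning to the decision levels of Figure 2: the four levels consume the same stored data at different granularities. Here is a minimal sketch of how one tag can be rolled up into transactional, tactical, and strategic views (the tag and frequencies are illustrative, and frequency aliases vary slightly across pandas versions):

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 60 * 24 * 365  # one year of minute data
minute_data = pd.Series(
    70 + rng.normal(scale=0.5, size=n),
    index=pd.date_range("2016-01-01", periods=n, freq="min"),
    name="reactor_temp",
)

views = {
    "transactional (hourly)": minute_data.resample("h").mean(),
    "tactical (daily)": minute_data.resample("D").mean(),
    "strategic (quarterly)": minute_data.resample("QE").mean(),  # "Q" on older pandas
}
for level, series in views.items():
    print(f"{level}: {len(series)} aggregated points")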

In order for true progress to be made, the team members and staff have to be open to viewing and interacting with their data in a way they are not accustomed to, and the analytics team needs to be willing to assemble the analytics in a way that the users can relate to.
Early in the design of our analytics approach, we committed to aligning our tools and the underlying approach to the workflow that our first internal customers desired. We did not sub-optimize the tool by tailoring the programming; rather, we ensured that the tool was used in a manner that was familiar to and digestible by a given operational team. Figure 3 illustrates the higher-order process that was developed in collaboration with the operations team.

▲ Figure 3. When designing an analytics approach to harness big data, you can use data to trigger conversations that enable timely actions to avoid potential plant problems. The process forms a loop: data, calculations, predictive models, and "big data" feed real-time tracking and a notification dashboard; an alert triggers discussion between technical and onsite staff; existing enterprise knowledge is consulted; actions are agreed on; the plant makes changes; and the learning is integrated back into the enterprise knowledge base.

Our team sought a tool that alerted us to problems and opportunities and triggered discussion among operations and subject matter experts, hence providing an environment that embraces both the what and the why. The tool would also prompt users by displaying internally vetted and documented (i.e., company-proprietary) knowledge and wisdom, which could then be harnessed to develop appropriate action plans. Finally, we wanted to develop a way for users to encode new knowledge and wisdom back into the tool so that it may be used again at an appropriate time — this step has not yet come to fruition and is ongoing.

Closing thoughts
Our big data approach seeks to actively turn data into information (through the application of statistics), translate that information into knowledge (via contextualization of data), and then convert that knowledge into wisdom (e.g., maintaining optimal operation 24/7), while avoiding any surprises.
The translation from knowledge to wisdom will take longer than the translation from information to knowledge. We have implemented a cutting-edge approach to gain information from our data, and we have built an enterprise approach that can be applied at all levels (transactional to strategic). The next stop on our big data journey is to bring vetted information to bear automatically in real time, when the user needs this pre-contextualized information. It will take time to build, connect, and systematize a process for incorporating new information, past information, and even valuable (but nonvalidated) local operational lore, but the payoff will be worth it.
We have made great strides — away from a culture of just collecting data but not using them to their fullest potential, to a culture of realizing that potential. In the future, we hope to completely automate the process of analyzing the data, so that the system can spot problems even before control limits and rules are violated. When that occurs, proper control of the plant will become second nature to process engineers and operators. New staff will have information and tools immediately available to them, which will shorten learning induction times and lead to safer, more reliable plant operations and practices.
We promised at the beginning of this article to tell you how to convince your organization to tackle data analytics. The best advice we can give is to move slowly and deliberately. Start small, look for small data opportunities within your plants and operations, and build momentum with success stories. Starting the process on a scale that is too grand will only elicit blank stares and possibly overwhelm your staff and operational team. At Dow, our big data journey started within the lab environment, and steadily grew, until the advent of the first EMI platform that met our needs.
In our quest for total data domination, we are already improving reliability, reducing costs, increasing value, and providing safer operations. These efforts will provide our customers with a level of product consistency and reliability that will become the new standard in the CPI. CEP

Acknowledgments
The authors wish to acknowledge the editorial contributions of Jim Petrusich, Vice President, Northwest Analytics, Inc.

Additional Resources
Colegrove, L., "Data Initiative Improves Insights," Chemical Processing, www.chemicalprocessing.com/articles/2015/data-initiative-improves-insights/ (Mar. 12, 2015).
Neil, S., "Big Data Dilemma — Finding the Hidden Value," Automation World, www.automationworld.com/industrial-internet-things/big-data-dilemma-finding-hidden-value/ (June 28, 2015).


BIG DATA
Challenges and Future Research Directions

Marco S. Reis, Univ. of Coimbra, Portugal
Richard D. Braatz, Massachusetts Institute of Technology
Leo H. Chiang, The Dow Chemical Co.

The big data movement is creating opportunities for the chemical process industries to improve their operations. Challenges, however, lie ahead.

The big data movement is gaining momentum, with companies increasingly receptive to engaging in big data projects. Their expectations are that, with massive data and distributed computing, they will be able to answer all of their questions — from questions related to plant operations to those on market demand. With answers in hand, companies hope to pave new and innovative paths toward process improvements and economic growth.
An article in Wired magazine, "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete" (1), describes a new era in which abundant data and mathematics will replace theory. Massive data is making the hypothesize-model-test approach to science obsolete, the article states. In the past, scientists had to rely on sample testing and statistical analysis to understand a process. Today, computer scientists have access to the entire population and therefore do not need statistical tools or theoretical models. Why is theory needed if the entire "real thing" is now within reach?
Although big data is at the center of many success stories, unexpected failures can occur when a blind trust is placed in the sheer amount of data available — highlighting the importance of theory and fundamental understanding.
A classic example of such failures is actually quite dated. In 1936, the renowned magazine Literary Digest conducted an extensive survey before the presidential election between Franklin D. Roosevelt and Alfred Landon, who was then governor of Kansas. The magazine sent out 10 million postcards — considered a massive amount of data at that time — to gain insight into the voting tendencies of the populace. The Digest collected data from 2.4 million voters, and after triple-checking and verifying the data, forecast a Landon victory over Roosevelt by a margin of 57% to 43%. The final result, however, was a landslide victory by Roosevelt of 61% versus Landon's 37% (the remaining votes were for a third candidate). Based on a much smaller sample of approximately 3,000 interviews, George Gallup correctly predicted a clear victory for Roosevelt.
Literary Digest learned the hard way that, when it comes to data, size is not the only thing that matters. Statistical theory shows that sample size affects sample error, and the error was indeed much lower in the Digest poll. But sample bias must also be considered — and this is especially critical in election polls. (The Digest sample was taken from lists of automobile registrations and telephone directories, creating a strong selection bias toward middle- and upper-class voters.)
Another example that demonstrates the danger of putting excessive confidence in the analysis of big data sets regards the mathematical models for predicting loan defaults developed by Lehman Brothers. Based on a very large database of historical data on past defaults, Lehman Brothers developed, and tested for several years, models for forecasting the probability of companies defaulting on their loans.

Yet those models built over such an extensive database were not able to predict the largest bankruptcy in history — Lehman Brothers' own.
These cases illustrate two common flaws that undermine big data analysis:
• the sample, no matter how big, may not accurately reflect the actual target population or process
• the population/process evolves in time (i.e., it is nonstationary) and data collected over the years may not accurately reflect the current situation to which analytics are applied.
These two cases and other well-known blunders show that domain knowledge is, of course, needed to handle real problems even when massive data are available. Industrial big data can benefit from past experiences, but challenges lie ahead.
Like any new, promising field, big data must be viewed in terms of its capabilities as well as its limitations. Some of these limitations are merely challenges that can be addressed — enabling companies to make the most out of new opportunities created by data, technology, and analytics (Figure 1).

▲ Figure 1. The big data movement stems from the availability of data, high-power computer technology, and analytics to handle data characterized by the four Vs — volume, variety, veracity, and velocity. The figure depicts big data at the intersection of data, technology, and analytics.

This article outlines ten critical challenges regarding big data in industrial contexts that need to be addressed, and suggests some emerging research paths related to them. The challenges are discussed in terms of the four Vs that define the context of big data: volume, variety, veracity, and velocity.

Volume challenges
Big data is, first of all, about handling massive amounts of data. However, in industrial processes, the first thing to realize is that not all data are created equal. Several challenges arise from this point.
Meaningful data. Most industrial big data projects rely on happenstance data, i.e., data passively collected from processes operating under normal operating conditions most of the time. Thus, a large amount of data is indeed available, but those data span a relatively narrow range of operating conditions encountered during regular production situations.
Data sets collected under those circumstances may be suitable for process monitoring and fault detection activities (2), which rely on a good description of the normal operating conditions (NOC) as a reference to detect any assignable or significant deviation from such behavior. However, their value is limited for predictive activities, and even more so for control and optimization tasks. Prediction can only be carried out under the same conditions found in the data used to construct the models. As a corollary, only when all the NOC correlations linking the input variables are respected can the model be used for prediction.
For process control and optimization activities, the process description must capture the actual influence of each manipulated input variable on the process outputs. Its construction requires experimentation — i.e., the active collection of process data via a design of experiments (DOE) program for process optimization or via system identification (SI) experiments for process control.
Future research is needed to determine ways to use DOE in the context of big data to complement the information already available and increase the data's value for predictive, control, and optimization activities. This will likely require methods to selectively remove data with very little informative value. The presence of such data is not only unnecessary for developing models, but also detrimental, as their presence induces a bias in the models toward highly sampled regions of the operational space. The modern theory of optimal DOE may provide a suitable framework to begin addressing this challenge.
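The corollary stated under "Meaningful data" (a model fit to NOC data should only be used where the NOC correlations are respected) can be enforced with a simple screen on new inputs. Here is a minimal sketch using a PCA-based Hotelling T² statistic with an empirical limit; the data and the 99th-percentile cutoff are illustrative assumptions:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# NOC data: two strongly correlated inputs, as happenstance plant data often are.
x1 = rng.normal(size=2000)
X_noc = np.column_stack([x1, 0.9 * x1 + 0.1 * rng.normal(size=2000)])

pca = PCA(n_components=2).fit(X_noc)

def hotelling_t2(X):
    scores = pca.transform(X)
    return (scores**2 / pca.explained_variance_).sum(axis=1)

limit = np.percentile(hotelling_t2(X_noc), 99)  # empirical 99% control limit

# A query point that breaks the correlation, although each coordinate is
# individually unremarkable; the model should not be trusted here.
x_new = np.array([[1.0, -1.0]])
t2 = hotelling_t2(x_new)[0]
print(f"T2 = {t2:.1f}, limit = {limit:.1f}, inside NOC: {t2 <= limit}")

The point (1.0, -1.0) is unexceptional variable by variable, yet it violates the correlation structure of normal operation, which is exactly the kind of extrapolation a happenstance-data model cannot support.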


Information-poor data sets. Despite the sheer volume of industrial data, the relevant or interesting information may happen on only a few, dispersed occasions. Examples include batches with abnormally excellent quality or runs that experience several types of process upsets.
Current data mining and knowledge discovery tools (3, 4) can handle very large volumes of data that are rich in information. Such tools include methodologies such as partial least-squares regression, least-absolute-shrinkage and selection operator (LASSO) regression, and ensemble methods (e.g., random forests and gradient boosting), among others. However, by design, those methods are not suited to analyze information-poor data sets, in which the interesting information is rare and scattered. And, traditional data visualization tools — which are recommended for any data analysis activity, especially to identify potentially interesting outlying data points — may not always be effective when applied to big data. For example, creating a classical plot from big data might produce what looks like a black cloud of data points that is not useful.
An engineer who is not able to rely on visualization might be tempted to perform some sort of massive statistical testing to pinpoint abnormal situations or to extract potentially interesting correlations, only to find a very large number of such situations (or correlations). That is a consequence of the extreme power of the tests, induced by the massive number of observations used. The significant events detected may not (and most of the time will not) have any practical relevance because of their small impact.
The situation can be even worse when an engineer cleans the data using an algorithm that automatically removes outlying observations from data sets prior to analysis. Such algorithms often incorporate standard rules of an empirical nature that eliminate the data embedded with the rare gems of information.
Future research should focus on the development of analytical methods applicable to information-poor data, including visualization tools that can condense large amounts of data while being sensitive to abnormal observations, and sound ways of detecting outlying (but interesting) observations (and variable associations), namely by incorporating the available domain knowledge.

Variety challenges
Big data is also characterized by its complexity. The complexity of industrial data can arise from different sources, and is usually related to the variety of objects to be analyzed. Different challenges arise depending on the origin of the complexity.
Multiple data structures. In addition to the usual scalar quantities (temperature, pressure, and flow measurements), data collected in modern industrial settings also include other data structures arranged as higher-order tensors, such as one-way arrays (e.g., spectra, chromatograms, nuclear magnetic resonance [NMR] spectra, particle-size distribution curves), two-way arrays (e.g., data obtained from analytical techniques such as gas chromatography with mass spectrometry [GC-MS] and high-performance liquid chromatography with diode array detection [HPLC-DAD]), and three-way and higher-order arrays (e.g., hyperspectral images, color videos, hyphenated instruments). These data structures are examples of profiles (5), abstractly defined as any data array, indexed by time and/or space, that characterizes a product or process.
Future research should focus on developing analytical platforms that can effectively incorporate and fuse all of these heterogeneous sources of information found in industrial processes, for instance, through the development of more flexible multiblock methodologies. Such methodologies incorporate the natural block-wise structure of data, where each block may carry information about distinct aspects of the problem and present a characteristic structure and dimensionality.
Heterogeneous data. Variety does not originate only from the presence of different data structures to be handled simultaneously. Another source of variety is the presence of data in the same data set that were collected when the process underwent meaningful changes, including in its structure (e.g., new equipment was added, procedures were changed). By not taking such changes into account during the analysis of the entire data set, you may fall into the trap of mixing apples with oranges — an issue that also raises concerns of data quality, which is discussed in the veracity section of this article. Overlooking heterogeneity in time is detrimental for analytical tasks such as process monitoring and quality prediction.
A future research path to address this challenge is developing methods to detect and handle these issues, as well as to deal with the time-varying nature of processes, namely through evolutionary and adaptive schemes (6). Such schemes can adapt to complex and/or changing conditions by continuously seeking the optimal operational settings or by periodically retuning the models (through re-estimation or recursive updating approaches).
Multiple data-management systems. Data are also collected from a variety of sources across the company's value chain, from raw materials, plant operations, and quality laboratories, to the commercial marketplace. Each stage usually has its own data-management system, and each records data in a different way. Future efforts should be directed toward the development of integrated platforms that link all of the different sources of data in the value chain. Market data, in particular, have not been included in conventional models used in the chemical process industries (CPI). Data-driven methods — which incorporate the time-delayed structure of the processes and use different types of data aggregation — should be developed to make this integration effective.
A priori knowledge. Some knowledge about the main sources of variety affecting a massive data set is usually available. However, making use of it in conventional industrial analytics is not straightforward. Big data methods tend to be of a black box type, lacking the flexibility to incorporate a priori knowledge about the processes under analysis. Incorporating information about the structure of the processes in data-driven analysis is an important research path for the future, especially in the fields of fault diagnosis and predictive modeling (7–9).
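The massive-testing trap described at the start of this section is easy to reproduce: with enough observations, a practically negligible difference becomes "statistically significant." A short simulated illustration:

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 1_000_000
a = rng.normal(loc=100.00, scale=5.0, size=n)  # e.g., one production period
b = rng.normal(loc=100.05, scale=5.0, size=n)  # a 0.05-unit shift: 1% of sigma

t, p = stats.ttest_ind(a, b)
print(f"p-value = {p:.1e}")                    # tiny, hence "significant"
print(f"observed shift = {b.mean() - a.mean():.3f} on a scale of ~100 +/- 5")
# The test correctly reports that the means differ; it says nothing about
# whether a 0.05-unit shift matters. Practical relevance is a domain call.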

Fault diagnosis requires information about the causal structure of the systems, which conventional data-driven monitoring methods cannot provide. Predictive modeling also requires this type of knowledge, in particular for process control and optimization applications. Bayesian approaches (10, 11) and data transformation based on network inference, together with hybrid gray-box modeling frameworks, are potential ways to introduce a priori knowledge into data-driven modeling.

Veracity challenges
A major concern in the analysis of massive data sets has to do with the quality of data, i.e., their veracity. As previously mentioned, quantity does not imply quality. On the contrary, quantity creates more opportunities for problems to occur. To make matters worse, the detection of bad observations in massive data sets through visualization techniques is more challenging, and automatic-cleaning algorithms cannot be relied on either. Data quality also depends on the way the data are collected (bias issues may emerge that are very difficult to detect), on whether the information is updated or no longer makes sense (due to time-varying changes in the system), and on the signal-to-noise ratio (measurement uncertainty), among other factors.
Uncertainty data. In addition to the collected data, information associated with uncertainty is also available. Measurement uncertainty is defined as a parameter associated with the result of a measurement that characterizes the dispersion of the values that could reasonably be attributed to the quantity to be measured (12). Combining uncertainty data with the raw measurements can improve data analysis, empirical modeling, and subsequent decision-making (13, 14). Specification of measurement uncertainty in big data contexts and developing methods that take advantage of knowledge about uncertainty should be explored in more depth.
Unstructured variability. Process improvement activities require a careful assessment of the multiple sources of variability of the process, which are typically modeled using suitable mathematical equations (ranging from first-principles models to purely data-driven approaches). The analysis should involve both the deterministic backbone of the process behavior, as well as the unstructured aspects of the process arising from stochastic sources of variability, including disturbances, sample randomness, measurement noise, operators' variation, and machine drifting. Jumping into the analysis of massive data sets while overlooking the main sources of unstructured variability is ill-advised, and is contrary to a reliable statistical engineering approach to addressing process improvement activities.
The sources of variability are actually the core of many improvement activities, in particular those aimed at reducing process variation and increasing product quality and consistency. Big data cannot replace the need to understand how data are acquired and the underlying mechanisms that generate variability, and statistical engineering principles should be brought to the analysis of big data sets in the future (15).

Velocity challenges
In big data scenarios, large quantities of data are collected at high speed. This creates several challenges in the implementation of online collection techniques and in defining the appropriate granularity to adopt for data analysis.
Data with a high time resolution. The high speed at which data are collected in modern chemical plants produces information with very fine time granularity, i.e., the data have, by default, a high time resolution (on the order of minutes, or even seconds). This default is a conditioning factor for all the subsequent stages of data analysis, as the usual practice is to avoid throwing out potentially valuable data. Consequently, the analysis is prone to producing over-parameterized models.
It is important to select the most effective resolution (16) for your particular data analysis. A default resolution selected by a third party with no knowledge of your specific data will probably not be appropriate. Future research should develop sound ways for selecting the proper resolution, including the possibility of using multiple time resolutions (17) that take into account the variables' dynamic and noise features.
Adaptive fault detection and diagnosis. The high speed of data collection provides the potential for fast detection and diagnosis of faults, failures, and other abnormal conditions. Many effective methods for fault detection and identification of associated variables are available, including techniques that account for dynamics (18–21).
A limitation of the standard data-based fault diagnosis methods is that they rely on historical data that were collected, analyzed, and labeled during past abnormal conditions (22, 23). One way around this requirement is to incorporate causal information from the process flowsheet (24). Drawing on ideas from the machine learning community (25), a more effective solution could be to treat fault diagnosis as an online learning problem. Adaptive learning methods could generate fault diagnosis systems that become increasingly effective over time, with the objective of moving toward prognostics (i.e., the early prediction of future operational problems) instead of learning about abnormal conditions after a catastrophic incident.


Final thoughts
Big data creates new possibilities to drive operational and business performance to higher levels. However, gaining access to such potential is far from trivial. New strategies, processes, mindsets, and skills that are not yet in place are necessary. In addition, challenges emerge when big data problems are considered in industrial contexts. This article has summarized ten such challenges to be addressed in the future — to make this journey an insightful learning experience and a successful business opportunity for companies. We also believe the dominating ideas and premises of big data need to evolve and mature.
As we have discussed, big data by itself will not answer all of your questions. Processes evolve over time, under quite restrictive operating conditions, and data just reflect this reality. We cannot expect data to tell us more than the information contained in the data. But big data and domain knowledge can be used synergistically to move forward and answer important questions, to design better experiments, or to determine additional sensors needed to address those questions.
Big data offers new opportunities for managing our operations, improving processes at all levels, and even adapting the companies' business models. So the important question is: Can we afford not to enter the big data era? CEP

Literature Cited
1. Anderson, C., "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete," Wired, www.wired.com/2008/06/pb-theory/ (June 23, 2008).
2. Chiang, L. H., et al., "Fault Detection and Diagnosis in Industrial Systems," Springer-Verlag, London (2001).
3. Han, J., and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann, San Francisco, CA (2001).
4. Wang, X. Z., "Data Mining and Knowledge Discovery for Process Monitoring and Control," Springer-Verlag, London (1999).
5. Reis, M. S., and P. M. Saraiva, "Prediction of Profiles in the Process Industries," Industrial and Engineering Chemistry Research, 51 (11), pp. 4254–4266 (Feb. 27, 2012).
6. Rato, T. J., et al., "A Systematic Comparison of PCA-Based Statistical Process Monitoring Methods for High-Dimensional, Time-Dependent Processes," AIChE Journal, 62 (1), pp. 127–142 (Jan. 2016).
7. Reis, M. S., et al., "Challenges in the Specification and Integration of Measurement Uncertainty in the Development of Data-Driven Models for the Chemical Processing Industry," Industrial and Engineering Chemistry Research, 54 (37), pp. 9159–9177 (Aug. 31, 2015).
8. Reis, M. S., and P. M. Saraiva, "Integration of Data Uncertainty in Linear Regression and Process Optimization," AIChE Journal, 51 (11), pp. 3007–3019 (Nov. 2005).
9. Chiang, L. H., and R. D. Braatz, "Process Monitoring Using the Causal Map and Multivariate Statistics: Fault Detection and Identification," Chemometrics and Intelligent Laboratory Systems, 65 (2), pp. 159–178 (Feb. 28, 2003).
10. Bakshi, B. R., et al., "Multiscale Bayesian Rectification of Data from Linear Steady-State and Dynamic Systems without Accurate Models," Industrial and Engineering Chemistry Research, 40 (1), pp. 261–274 (Dec. 6, 2000).
11. Yu, J., and M. M. Rashid, "A Novel Dynamic Bayesian Network-Based Networked Process Monitoring Approach for Fault Detection, Propagation, Identification, and Root Cause Diagnosis," AIChE Journal, 59 (7), pp. 2348–2365 (July 2013).
12. Joint Committee for Guides in Metrology, "Evaluation of Measurement Data — Guide to the Expression of Uncertainty in Measurement," JCGM 100:2008, JCGM, Paris, France, p. 134 (Sept. 2008).
13. Reis, M. S., et al., "Challenges in the Specification and Integration of Measurement Uncertainty in the Development of Data-Driven Models for the Chemical Processing Industry," Industrial and Engineering Chemistry Research, 54 (37), pp. 9159–9177 (Aug. 31, 2015).
14. Reis, M. S., and P. M. Saraiva, "Integration of Data Uncertainty in Linear Regression and Process Optimization," AIChE Journal, 51 (11), pp. 3007–3019 (Nov. 2005).
15. Hoerl, R., and R. D. Snee, "Statistical Thinking: Improving Business Performance," Duxbury Press, Pacific Grove, CA (2001).
16. Reis, M. S., and P. M. Saraiva, "Generalized Multiresolution Decomposition Frameworks for the Analysis of Industrial Data with Uncertainty and Missing Values," Industrial and Engineering Chemistry Research, 45 (18), pp. 6330–6338 (Aug. 9, 2006).
17. Reis, M. S., and P. M. Saraiva, "Multiscale Statistical Process Control with Multiresolution Data," AIChE Journal, 52 (6), pp. 2107–2119 (June 2006).
18. Russell, E. L., et al., "Fault Detection in Industrial Processes Using Canonical Variate Analysis and Dynamic Principal Component Analysis," Chemometrics and Intelligent Laboratory Systems, 51, pp. 81–93 (2000).
19. Zhu, X., and R. D. Braatz, "Two-Dimensional Contribution Map for Fault Detection," IEEE Control Systems, 34 (5), pp. 72–77 (Oct. 2014).
20. Jiang, B., et al., "Canonical Variate Analysis-Based Contributions for Fault Identification," Journal of Process Control, 26, pp. 17–25 (Feb. 2015).
21. Jiang, B., et al., "Canonical Variate Analysis-Based Monitoring of Process Correlation Structure Using Causal Feature Representation," Journal of Process Control, 32, pp. 109–116 (Aug. 2015).
22. Chiang, L. H., et al., "Fault Diagnosis in Chemical Processes Using Fisher Discriminant Analysis, Discriminant Partial Least Squares and Principal Component Analysis," Chemometrics and Intelligent Laboratory Systems, 50, pp. 240–252 (2000).
23. Jiang, B., et al., "A Combined Canonical Variate Analysis and Fisher Discriminant Analysis (CVA-FDA) Approach for Fault Diagnosis," Computers and Chemical Engineering, 77, pp. 1–9 (June 9, 2015).
24. Chiang, L. H., et al., "Diagnosis of Multiple and Unknown Faults Using the Causal Map and Multivariate Statistics," Journal of Process Control, 28, pp. 27–39 (April 2015).
25. Severson, K., et al., "Perspectives on Process Monitoring of Industrial Systems," in Proceedings of the 9th IFAC Symposium on Fault Detection, Supervision, and Safety for Technical Processes, Paris, France (Sept. 2–4, 2015).

