2016 International Conference on Computational Science and Computational Intelligence

BIG DATA FOR INDUSTRY 4.0: A CONCEPTUAL FRAMEWORK

Mert Onuralp Gökalp, Kerem Kayabay, Mehmet Ali Akyol, P. Erhan Eren, Altan Koçyiğit
Informatics Institute, Middle East Technical University, Ankara, Turkey
E-mail: {gmert@metu.edu.tr, kayabay@metu.edu.tr, aliakyol@metu.edu.tr, ereren@metu.edu.tr, kocyigit@metu.edu.tr}
Abstract— Exponential growth in data volume originating from Internet of Things sources and information services drives the industry to develop new models and distributed tools to handle big data. In order to achieve strategic advantages, effective use of these tools and integration of their results into business processes are critical for enterprises. While there is an abundance of tools available in the market, they are underutilized by organizations due to their complexities. Deployment and usage of big data analysis tools require technical expertise which most organizations do not yet possess. Recently, the trend in the IT industry is towards developing prebuilt libraries and dataflow based programming models to abstract users from the low-level complexities of these tools. After briefly analyzing trends in the literature and industry, this paper presents a conceptual framework which offers a higher level of abstraction to increase the adoption of big data techniques as part of the Industry 4.0 vision in future enterprises.

Keywords: Industry 4.0; big data; data flow based programming languages; machine learning; data mining.

I. INTRODUCTION

Businesses need to process data into timely and valuable information for their decision making and process optimization activities. Today's competitive business environment forces enterprises to process high speed data and integrate valuable information into production processes. For example, the concept of Industry 4.0 is expected to change production in the near future. In this concept, machines in a smart manufacturing plant interact with their environments. Ordinary machines transform into context-aware, conscious, and self-learning devices. This transformation gives these devices the capability to process real-time data to self-diagnose and prevent potential disruptions in the production process. Furthermore, when such a machine is assigned a task, it can self-calibrate and prioritize between tasks to optimize production quality or efficiency.

As approaches like Industry 4.0 gain popularity, the characteristics of the data to be analyzed change. Some processes require high speed data whose value diminishes over time. Heterogeneous IoT devices and sensors produce unstandardized and unstructured data. The IT industry continuously comes up with new models which can use distributed architectures to process data more quickly and efficiently. However, available analysis methods are insufficient for using high speed data flowing from various sources due to their low-level complexities and shortcomings [1].

The utilization and installation of existing big data analytics platforms require significant expertise and know-how in the data science and IT domains because of their complex infrastructures and programming models. This may hinder the adoption of big data technologies in the Industry 4.0 domain. Hence, a programming model for big data platforms which provides higher level abstractions is necessary from the perspective of widespread user adoption.

The latest trends in the big data domain are moving towards providing a level of abstraction over popular data processing platforms [2]. Apache Beam [3] implements its dataflow programming model on multiple runners like Apache Spark [4] and Apache Flink [5]. Apache SAMOA [6] enables programmers to apply machine learning algorithms to data streams. Applications developed with SAMOA can be executed on Apache Storm [7][8], Apache S4 [9], and Apache Samza [10]. In this paper, after briefly discussing trends in the data science literature and industry, a visual and dataflow based architectural framework is proposed to abstract programmers away from the complexities of underlying data processing platforms. This can enable enterprises to easily incorporate data mining and machine learning techniques into their business process monitoring and improvement activities.
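To make this level of abstraction concrete, the sketch below shows a minimal Apache Beam pipeline written with the Python SDK. The sensor readings and the 75.0 threshold are invented for illustration; the point is that the same pipeline code can run on different runners (for example the local DirectRunner, Spark, or Flink), selected through pipeline options rather than by rewriting the program.

```python
# Minimal Apache Beam pipeline; the readings and threshold are invented.
import apache_beam as beam

with beam.Pipeline() as pipeline:  # DirectRunner unless options say otherwise
    (
        pipeline
        | "Create readings" >> beam.Create([("m1", 71.2), ("m2", 90.1), ("m2", 88.4)])
        | "Keep hot readings" >> beam.Filter(lambda kv: kv[1] > 75.0)
        | "Mean per machine" >> beam.combiners.Mean.PerKey()
        | "Emit" >> beam.Map(print)
    )
```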

II. RELATED WORKS

The data flow based programming model, which aims to facilitate the development and orchestration of services, is a commonly used approach in data analysis frameworks. In particular, IoT scenarios require the coordination of computing resources across the network. Traditional programming tools are typically more complex than visual and data flow based programming tools, as they require developers to learn new protocols and APIs, create data processing components, and link them together [11]. In the data flow based programming model, applications are modelled as directed graphs of 'black box' nodes that exchange data along connected arcs. Hence, developers do not need to know the internal details of the building blocks comprising the application.
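As a toy illustration of this model, not tied to any specific framework discussed here, the snippet below wires a three-node "graph" out of Python generators: each node is a black box that only sees the records arriving on its input arc.

```python
# Toy data flow graph: source -> double -> sink. Each node is a black box
# exchanging data along arcs (here, Python iterators).
def source():
    yield from [3, 1, 4, 1, 5]   # pretend sensor readings

def double(stream):
    for value in stream:
        yield 2 * value          # transformation node

def sink(stream):
    for value in stream:
        print(value)             # terminal node

sink(double(source()))           # wiring the arcs
```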
Several data flow based programming models have been proposed in the literature. WoTKit and Node-Red, two notable frameworks of this kind, allow users to model their applications via browser based visual editors. After receiving real time updates from sensors and other external sources, the WoTKit processor [11] can be used to process input data and respond to changes in the environment. Node-Red [12] is another important service; it is implemented on Node.js, and users can implement their applications in JavaScript. In both the Node-Red and WoTKit application development environments, complex applications can be modelled as directed graphs by dragging and dropping programming blocks onto a flow canvas. ClickScript [13] is yet another data flow based programming service for modelling home automation applications visually on a graphical user interface. The visual and data flow based programming model enables users to receive data from external sources, including social media and IoT devices, and forward data to external systems such as Twitter and e-mail.

The term data analysis refers to the utilization of business intelligence and analytics technologies. This corresponds to applying statistical and data mining techniques in organizations to produce additional business value [14]. There are various open source and commercial tools for machine learning and data mining applications which have been developed to execute popular algorithms such as classification, clustering, and anomaly detection. Some of these applications support distributed processing across computing nodes to handle big data use cases. There are also tools which provide a visual programming model to users who do not have any know-how in data analytics.

Orange [15] is a notable visual data mining and machine learning tool. In the Orange platform, each visual component represents a data analytics algorithm, and these components communicate with each other through data channels. KNIME [16], KEPLER [17] and RapidMiner [18] are other open source data mining tools which support visual programming environments where data mining algorithms are provided as readily available programming elements. Each operator has data input and output ports which can be dragged and dropped to link elements, and the linked elements form a directed graph. These tools are designed for batch data processing. They scale poorly because their performance is limited by a single server's processing capabilities.

MOA [19] and ADAMS [20] are similar to KNIME, KEPLER and RapidMiner with regard to application design and execution. While the latter are batch processing oriented tools, MOA and ADAMS focus on data mining and machine learning algorithms that can be applied to stream data. However, none of these systems support distributed execution environments to handle big data effectively. Mahout [21] is a data mining and machine learning library for Hadoop. The Mahout library provides clustering, regression, classification and model analysis algorithms. Google, Amazon, Yahoo and Facebook utilize Mahout in their data mining and machine learning applications. Since Mahout is designed for batch processing applications, it does not support real time stream processing. Moreover, it does not provide a visual programming model.

III. CONCEPTUAL FRAMEWORK

In order to exploit the potential of big data technologies as part of Industry 4.0, the challenges which hinder the adoption of such technologies should be tackled first. These challenges include handling large amounts of unstructured data coming from IoT devices, expertise barriers, resource management, and delivery of results to appropriate channels. Hence, a framework which can facilitate the development and deployment of big data analytics is necessary.

In this section, we explain our conceptual framework architecture, which enables system engineers to model, develop and deploy their own big data use cases for Industry 4.0 applications, even when they have limited or no experience in big data analytics. We describe the functionalities of the major modules and how these modules can be integrated in a cloud environment. The system architecture is delineated in Figure 1. The framework architecture proposed in this paper is an extended version of the architecture proposed in our previous study [22], in which queries are defined as Groovy scripts. Accordingly, the framework utilizes a data flow based visual programming model to facilitate flexible application development.

[Figure 1. Architectural Conceptual Framework]

The architecture of the proposed conceptual framework consists of the following modules: big data application design, pre-processing of input data streams, distributed infrastructure, and distribution of results.
The Big Data Application Design module allows system engineers to develop their own big data applications with a visual editor. Applications are represented as directed graphs where vertices represent data mining and machine learning algorithms as well as programming constructs, while edges represent data streams which correspond to intermediate results, as shown in Figure 2. The programming nodes consume and produce data in a common standard, so that data from various sources can be handled and nodes can be integrated with one another. Thus, the application logic can be built by simply connecting the programming nodes, without worrying about their internal details and interfaces; a sketch of this node interface is given below.

[Figure 2. Programming Model]
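The following sketch shows one plausible realization of such a programming node; the record schema (source, timestamp, payload keys) and the helper names are our own assumptions, not definitions from the paper.

```python
# Hypothetical programming-node interface over a common record standard.
from typing import Any, Callable, Dict, Iterable

Record = Dict[str, Any]  # assumed common standard: source, timestamp, payload

def make_node(fn: Callable[[Record], Record]):
    """Wrap an algorithm as a black-box node that maps records to records."""
    def node(stream: Iterable[Record]) -> Iterable[Record]:
        for record in stream:
            yield fn(record)
    return node

# Vertices are algorithms; edges are the streams passed between them.
to_kelvin = make_node(lambda r: {**r, "payload": r["payload"] + 273.15})
tag_source = make_node(lambda r: {**r, "source": r.get("source", "unknown")})

readings = [{"source": "m1", "timestamp": 0, "payload": 71.2}]
for out in to_kelvin(tag_source(iter(readings))):
    print(out)
```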
In this setting, a large number of Data Sources should be integrated into the platform to collect information regarding different aspects of a factory. Due to their heterogeneous nature, these data sources may generate data in disparate formats. Therefore, data variety is an important challenge that can hinder the adoption of big data analytics in the Industry 4.0 domain. Hence, the Preprocessing Input Data Streams module plays a central role in our framework: it converts data into a common format for further processing. This is based on data standardization, which defines a common standard for receiving structured, semi-structured and unstructured data from a variety of sources.
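A minimal sketch of such standardization is given below; the three input formats and the target record layout are assumptions made for illustration only.

```python
# Sketch: normalizing heterogeneous inputs onto one common record layout.
import csv
import io
import json
import time

def normalize(raw: str, fmt: str) -> dict:
    """Map structured (CSV), semi-structured (JSON) and unstructured
    (plain text) inputs onto the same record layout."""
    if fmt == "json":
        payload = json.loads(raw)
    elif fmt == "csv":
        header, row = list(csv.reader(io.StringIO(raw)))
        payload = dict(zip(header, row))
    else:  # unstructured text: keep it verbatim
        payload = {"text": raw}
    return {"received_at": time.time(), "format": fmt, "payload": payload}

print(normalize('{"machine": "m1", "temp": 71.2}', "json"))
print(normalize("machine,temp\nm2,90.1", "csv"))
```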
Deployed applications need fast and scalable infrastructures to handle big data use cases effectively. Therefore, big data platforms are established on a Distributed Infrastructure, and user defined applications are deployed automatically on it to handle the unique characteristics of big data. On the other hand, the requirements of big data applications vary according to the use case. For instance, a monitoring application needs to process stream data and produce results in real time, whereas a predictive analytics application needs to deal with bulk data to detect potential risks to production in the upcoming weeks or months. There is no "one-size-fits-all" big data solution; instead, each big data platform has its own advantages and disadvantages. Therefore, the proposed framework aims to support multiple big data platforms such as Storm, Spark and Flink. Hence, according to the specific characteristics of an application under design, one of the supported platforms can be chosen. Moreover, by considering the designed application logic and use cases, the framework itself can offer a suitable big data platform to run the application.
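One way such a recommendation could work is a simple rule of thumb over application characteristics; the mapping below is purely illustrative and is not prescribed by the framework.

```python
# Illustrative platform-suggestion heuristic; the rules are invented.
def suggest_platform(workload: str, latency: str) -> str:
    if workload == "batch":
        return "Spark"   # bulk/predictive analytics over historical data
    if latency == "sub-second":
        return "Storm"   # low-latency per-event stream processing
    return "Flink"       # general stream processing with state

print(suggest_platform(workload="stream", latency="sub-second"))  # Storm
```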
The results of the applications may be forwarded to interested parties in different forms. Each distribution channel is defined as a programming node in the visual editor, so users may select more than one distribution channel to deliver the results. In this way, certain problems in production may be forwarded to the right staff as notifications. The results can also be used as inputs to actuators and, hence, manufacturing processes can be controlled and even improved. It is also possible to deliver the results to external entities via web services for data visualization or monitoring purposes.
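The sketch below shows distribution channels as ordinary terminal nodes that fan out the same result stream; the channel names are hypothetical.

```python
# Illustrative fan-out to multiple distribution channels (names invented).
def notify_staff(result: str) -> None:
    print(f"[notification] {result}")

def drive_actuator(result: str) -> None:
    print(f"[actuator] {result}")

SELECTED_CHANNELS = [notify_staff, drive_actuator]

def distribute(results):
    for result in results:
        for channel in SELECTED_CHANNELS:  # one result, several channels
            channel(result)

distribute(["machine m2: temperature above threshold"])
```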
IV. CONCLUSION

There is an abundance of tools and application frameworks for processing big data, yet new tools continue to emerge, especially for stream data. These tools are commonly open sourced after being developed by Internet based companies, including Google, Twitter, LinkedIn, and Yahoo, according to their business requirements. The low-level complexities of data processing platforms make them suitable only for programmers who have knowledge and experience in data science. On the other hand, people who have expertise and deep knowledge only in a specific domain may not be able to use these tools. As a result, real time data coming from various sources cannot be integrated into the business processes of an enterprise.

Specialized for the big data domain, data flow based visual programming models can solve this problem by allowing programmers to iteratively develop new techniques which can utilize real time data. In organizations, people can quickly design and develop small programs to investigate whether there are efficiency or quality issues in production and service processes. We see this approach as an important step towards the Industry 4.0 vision.

In this paper, we propose a conceptual framework which can be utilized in a smart enterprise. Its main components are designed to abstract users away from low-level complexities like data standardization, platform specific development, resource management, protocols, and APIs. The framework handles the collection of data from IoT and Web based data sources, the implementation of big data analytics applications containing machine learning and data mining components, the translation of visually designed programs to platform specific ones, the management of jobs among processing units, and the delivery of results to people and services. From this perspective, the framework facilitates the integration of big data analytics with business processes by providing an end to end approach.
REFERENCES
[1] J. Lee, H. A. Kao, and S. Yang, “Service innovation and smart
analytics for Industry 4.0 and big data environment,” in Procedia
CIRP, 2014, vol. 16, pp. 3–8.
[2] K. Kayabay, M. O. Gökalp, M. A. Akyol, A. Koçyiğit, and P. E.
Eren, “Big Data for Future Enterprises: Current State and
Trends,” in 3rd International Management Information Systems
Conference, İzmir, 2016, pp. 298–307.
[3] “Apache Beam.” [Online]. Available:
http://beam.incubator.apache.org. [Accessed: 10-Nov-2016].
[4] “Apache Spark.” [Online]. Available: http://spark.apache.org.
[Accessed: 28-Oct-2016].
[5] “Apache Flink.” [Online]. Available: http://flink.apache.org.
[Accessed: 28-Oct-2016].
[6] “Apache Samoa.” [Online]. Available:
https://samoa.incubator.apache.org. [Accessed: 10-Nov-2016].
[7] “Apache Storm.” [Online]. Available: http://storm.apache.org.
[Accessed: 28-Oct-2016].
[8] A. Toshniwal et al., “Storm@twitter,” Proc. 2014 ACM
SIGMOD Int. Conf. Manag. data - SIGMOD ’14, pp. 147–156,
2014.
[9] “Apache S4.” [Online]. Available:
http://incubator.apache.org/s4/. [Accessed: 10-Nov-2016].
[10] “Apache Samza.” [Online]. Available: http://samza.apache.org.
[Accessed: 10-Nov-2016].
[11] M. Blackstock and R. Lea, “WoTKit,” in Proceedings of the
Third International Workshop on the Web of Things - WOT ’12,
2012, pp. 1–6.
[12] “Node-Red.” [Online]. Available: https://nodered.org. [Accessed:
12-Nov-2016].
[13] S. Mayer, N. Inhelder, R. Verborgh, and R. Van de Walle,
“User-friendly configuration of smart environments,” in 2014
IEEE International Conference on Pervasive Computing and
Communication Workshops, PERCOM WORKSHOPS 2014,
2014, pp. 163–165.
[14] H. Chen, R. H. L. Chiang, and V. C. Storey, Business
Intelligence and Analytics: From Big Data to Big Impact, vol.
36, no. 4. 2012.
[15] J. Demšar, B. Zupan, G. Leban, and T. Curk, “Orange: From
Experimental Machine Learning to Interactive Data Mining,”
Knowl. Discov. Databases PKDD 2004, pp. 537–539, 2004.
[16] M. R. Berthold et al., “KNIME-the Konstanz information miner:
version 2.0 and beyond,” ACM SIGKDD Explor. Newsl., vol. 11,
no. 1, pp. 26–31, 2009.
[17] I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludascher, and S.
Mock, “Kepler: an extensible system for design and execution of
scientific workflows,” Sci. Stat. Database Manag. 2004.
Proceedings. 16th Int. Conf., vol. I, pp. 423–424, 2004.
[18] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler,
“YALE: Rapid prototyping for complex data mining tasks,”
Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., vol.
2006, pp. 935–940, 2006.
[19] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, “MOA
Massive Online Analysis,” J. Mach. Learn. Res., vol. 11, pp.
1601–1604, 2011.
[20] P. Reutemann and J. Vanschoren, “Scientific Workflow
Management with ADAMS,” Knowl. Discov. Databases, pp.
833–837, 2012.
[21] “Apache Mahout.” [Online]. Available:
https://mahout.apache.org. [Accessed: 12-Nov-2016].
