
Fusion Engineering and Design 129 (2018) 330–333


Conceptual design of new data integration and process system for KSTAR data scheduling

Taehyun Tak, Jaesic Hong, Kaprai Park, Woongryul Lee, Taegu Lee, Hyunsun Han, Giil Kwon, Jinseop Park
National Fusion Research Institute (NFRI), Daejeon, Republic of Korea

A R T I C L E  I N F O

Keywords:
KSTAR
Fusion
Dataflow
Automation
Many-task
Big Data

A B S T R A C T

The KSTAR control and data acquisition systems mainly use the data storage layer of MDSplus for diagnostic data and the channel archiver for EPICS-based control system data. In addition to these storage systems, KSTAR has various other types of data, such as user logs in a relational database (RDB) and various kinds of logs from the control system. A large scientific machine like KSTAR needs to implement various types of use cases for data scheduling and data analysis. The goal of the new data integration and process system is to realize KSTAR data scheduling on top of the Pulse Automation and Scheduling System (PASS) according to KSTAR events. The KSTAR Data Integration System (KDIS) is designed using Big Data software infrastructures and frameworks. The KDIS handles events that are synchronized with the KSTAR EPICS events and with other data sources such as REST APIs and logs, in order to integrate and process data from the different data sources and to visualize the data. In this paper, we explain the detailed design concept of the KDIS and demonstrate a data scheduling use case with this system.

1. Introduction

Systemically, a large scientific machine such as KSTAR [1] is implemented with a complex system design including heterogeneous hardware and software platforms. In other words, integrating the control and data systems is one of our major missions for the KSTAR operation and experiment. In order to integrate the systems, KSTAR adopted EPICS [2] as its main integrated control framework and a channel archiver for its storage. For the diagnostic systems, we developed a standard framework [3] for the sequential archiving operation of digitized data with MDSplus [4]. In addition, various types of file formats, file systems, and relational database mechanisms were used for their own purposes according to each type of data. The data generated by the control and diagnostic systems can be stored, analyzed, or refined in various use cases. Since the first operation of KSTAR in 2008, its control and data storage systems have been stabilized and matured to perform experiments for the KSTAR mission.

On the other hand, the amount and variety of KSTAR data have been increasing every year as the system is upgraded, and the complexity of the data and its use cases is increasing as well. Consequently, a need has arisen for a specific system that can efficiently implement complex data use cases such as large-scale dataflow processing, various algorithm processing, and visualization.

As the first step of our challenge toward the next generation of the data processing environment in KSTAR, we considered a new data orchestration scheme and prototyped a new data integration and process system named the KSTAR Data Integration System (KDIS). This system supports a wide range of KSTAR pulse operation automation and experimental result processing with offline data processing (for non-real-time or soft real-time work). The focus is therefore on building a new framework that uses Big Data processing methods rather than legacy processes such as single-process applications with storage. The KDIS is an operation-related data-cycle ecosystem synchronized with KSTAR events (data), built on a distributed computing system including general-purpose Big Data open-source frameworks and various libraries. In this paper, we describe the development and demo results of the KDIS.


Corresponding author.
E-mail address: thtak@nfri.re.kr (T. Tak).

https://doi.org/10.1016/j.fusengdes.2018.01.015
Received 22 June 2017; Received in revised form 22 December 2017; Accepted 3 January 2018
Available online 17 January 2018

Fig. 1. The framework design of KDIS and its interconnection among KSTAR systems.

2. KSTAR data integration system

The main purpose of the KDIS is to establish a data orchestration system with a dataflow that covers the storage, processing, and servicing of the data generated and processed by KSTAR.

For the various KSTAR data-intensive applications and data scheduling use cases, an appropriate architecture design is necessary and is our top priority. The Lambda architecture design paradigm [5] covers computing arbitrary functions on arbitrary data by dividing the data-related mechanisms into three layers: the batch layer, the serving layer, and the speed layer. The KDIS is inspired by the Lambda architecture design pattern in the way it handles arbitrary data use cases by dividing the mechanisms into layers, and it redefines the role of each layer according to the KSTAR dataflow.

To simplify the functionalities, the KDIS is defined as three frameworks by architectural design: a stream processing framework for applications implementing stream processing algorithms such as event processing and reactive programming, a many-task processing framework for tasks implementing data analysis logic with batch data processing, and a service framework supporting data views for users. The designed architecture and the interconnection between KSTAR data and the KDIS are shown in Fig. 1. KSTAR data interfaces are connected to the KDIS via a network with various interface libraries such as the EPICS client, the MDSplus client, memory DB, RDB, message interfaces, and so on. Applications implemented on the stream processing framework can communicate with KSTAR systems such as the pulse automation system, the data storages, and the streams by polling and pushing mechanisms. On this conceptual system, we developed the task scheduler as one of the main features for the deployment of jobs. This scheduler interfaces with the EPICS data stream for event processing according to scheduling algorithms and launches logic as tasks in the many-task framework depending on the conditions. Processed data are provided via the data service framework with visualization methods.

To implement the functionalities of the framework architecture, the KDIS is configured with Big Data open-source solutions for each framework. Tables 1 and 2 list the major hardware and software components of the KDIS. The KDIS manages the resources of the clustered systems using the Hadoop YARN scheduler [6], which manages system resources for all applications running on the KDIS. Each open-source solution offers convenient mechanisms and features such as managing and monitoring applications, in-memory computing, convenient development, and well-structured interfaces. The KDIS thus takes advantage of well-developed Big Data open-source solutions. The frameworks are introduced in detail in the remainder of this section.

Table 1
The KDIS system hardware configuration – clustered servers (4EA).

Component    Specification
CPU          Intel(R) Xeon(R) CPU E5-2640 v4, 2.4 GHz × 2
RAM          128 GB
Storage      500 GB SSD × 2

Table 2
The KDIS system software configuration.

Component                 Specification
OS                        CentOS 7.2
Resource scheduler        Hadoop YARN
Stream processing F/W     Spring Cloud Data Flow (SCDF)
Many-task F/W             Apache Spark
Web application server    Apache Tomcat (Spring MVC + Hibernate)
Data storages             HDFS, Geode, HBase, PostgreSQL
Library environment       Many types of open-source or internally developed libraries, used by the various applications

2.1. Stream processing framework

The stream processing framework is mainly composed of a structure to deal with both polling and pushing stream processing applications by publishing, subscribing to, and storing data. The stream processing applications are developed with the open-source framework Spring Cloud Data Flow [7], a toolkit for building data integration and real-time data processing pipelines. Applications are developed as micro-services and connected to each other with pipelines supported by the framework. The micro-service applications can take the role of source, processor, or sink. This framework allows applications to support various data-processing use cases, from import and export to event streaming and predictive analytics. The framework also supports the KDIS runtime environment for stream applications with strong usability features such as a stream-design flow diagram GUI, deployment on Hadoop, and program code minimization through abstracted network programming APIs.

On this conceptual system, launching tasks with a task-scheduling algorithm was mainly developed for batch automation to analyze data by KSTAR sequence. This function is accomplished in connection with a KSTAR data source such as an EPICS process variable (PV). Through this functionality of the stream applications, KSTAR data are scheduled by KSTAR operation events.
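To illustrate how such a micro-service can be structured, the listing below is a minimal sketch of a Spring Cloud Stream processor in Java in the source|processor|sink style described above. The ShotEvent payload, the PV-update text format, and the class name are hypothetical placeholders for illustration; they are not the actual KDIS implementation.

// Minimal sketch of a stream-processing micro-service (Spring Cloud Stream 1.x-era API).
// It consumes a raw PV update from an upstream source and emits a normalized event
// that a downstream sink (e.g. a task launcher) can consume.
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.annotation.StreamListener;
import org.springframework.cloud.stream.messaging.Processor;
import org.springframework.messaging.handler.annotation.SendTo;

@SpringBootApplication
@EnableBinding(Processor.class)
public class ShotEventProcessorApp {

    // Hypothetical event payload carried on the pipeline.
    public static class ShotEvent {
        public long shotNumber;
        public String sequenceState;   // e.g. "PRE_SHOT", "POST_SHOT"
    }

    // Receives a raw update assumed to look like "18500:POST_SHOT" and converts it
    // into a structured event for the next stage of the pipeline.
    @StreamListener(Processor.INPUT)
    @SendTo(Processor.OUTPUT)
    public ShotEvent toEvent(String pvUpdate) {
        String[] parts = pvUpdate.split(":");
        ShotEvent event = new ShotEvent();
        event.shotNumber = Long.parseLong(parts[0]);
        event.sequenceState = parts[1];
        return event;
    }

    public static void main(String[] args) {
        SpringApplication.run(ShotEventProcessorApp.class, args);
    }
}

In this style, each stage of a pipeline stays small and replaceable, which is what allows the same framework to cover use cases from import/export to event streaming.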


Fig. 2. Implemented applications on the KDIS frameworks.

Fig. 3. Visualization of pulse-based processed data by tasks on KDIS Web service.

2.2. Many-task framework

Many-task computing [8] is a computational paradigm that emphasizes using many computing resources over short periods of time to accomplish many computational tasks. We adopted this concept in our framework to handle many data tasks.

Normally, the data generated by the KSTAR system are processed on a specific system depending on the user's requirements. Scientists and engineers at KSTAR have used their own dedicated data workflows, running single applications or applications on parallel frameworks. To develop data applications, a developer has to consider the full application composition, such as the automation interface layer, the library layer, and the software framework. With a cluster-based Big Data solution that saves such effort, data tasks can be created and deployed efficiently in terms of productivity.

Apache Spark was designed to support fast iterative data analyses of large datasets with many cutting-edge libraries such as machine learning and a variety of analytical methods. It supports parallel tasks, tracking of computational lineage, and optimization of data processing with directed acyclic graphs (DAGs).

Therefore, the KDIS adopts Apache Spark [9] as its many-task framework, together with all the necessary analysis and data interface libraries such as EPICS, MDSplus, HDF5, and RDB interfaces, and various types of analysis libraries. This framework handles a set of batch-style applications and provides advantages such as simplified hardware resource management, ready-made data-related methods, and faster calculation for many-task applications building new analysis functionalities. Thanks to these advantages, the KDIS many-task framework enables scientists to take advantage of the benefits arising from the innovation of Big Data analytics.
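To give a feel for what a batch-style data task can look like on this framework, the following is a minimal Apache Spark sketch in Java that aggregates per-shot statistics from archived signal samples. The input path, the column names ("shot", "signal", "value"), and the output location are hypothetical placeholders; a real KDIS task would read through the MDSplus or channel-archiver interface libraries mentioned above.

// Minimal sketch of a many-task-style batch job on Apache Spark (Java API).
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.max;
import static org.apache.spark.sql.functions.min;

public class ShotSummaryTask {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("shot-summary-task")   // launched per KSTAR event by the scheduler
                .getOrCreate();

        // Hypothetical archived samples: one row per (shot, signal, time, value).
        Dataset<Row> samples = spark.read().parquet("hdfs:///kdis/demo/samples");

        // Per-shot, per-signal summary written back to storage for later serving.
        Dataset<Row> summary = samples.groupBy("shot", "signal")
                .agg(min("value").alias("min"),
                     max("value").alias("max"),
                     avg("value").alias("mean"));

        summary.write().mode("overwrite").parquet("hdfs:///kdis/demo/shot_summary");

        spark.stop();
    }
}

Because the cluster, resource scheduling, and data distribution are handled by YARN and Spark, a task of this kind reduces to the analysis logic itself.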


In the conceptual design, each task is launched by a specific KSTAR event scheduled with the task-scheduling algorithm. Launched tasks process and move data to storage according to their own purpose. The demo tasks are introduced in Section 3.

2.3. Data service framework and interface

We defined the functionality of the data service framework as supporting data access methods for the user. In the conceptual design, the framework is composed of two layers: a data-storing scheme and a data Web service.

The RDB schema was developed for storing well-defined data and for managing the Web service. NoSQL storage on HDFS [10] and a memory DB were configured for the huge number of logs.

To support the visualization methods, we chose a Web service for common user access. The KDIS Web service provides an integrated visualization interface for processed data via Web components. The Web service is composed of Apache Tomcat [11], the Spring frameworks, a persistence layer, and libraries for views such as table grids, plots, and charts.
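As a sketch of what the "well-defined data" side of the data-storing scheme can look like on the Spring MVC + Hibernate stack, the following JPA entity and Spring Data repository model a per-pulse summary record. The table name, field names, and query method are hypothetical illustrations, not the published KDIS schema.

// Sketch of a persistence-layer class for well-defined data on the RDB side
// (Hibernate/JPA + Spring Data). Table and column names are hypothetical.
import java.time.LocalDateTime;
import java.util.List;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;
import org.springframework.data.jpa.repository.JpaRepository;

@Entity
@Table(name = "shot_summary")            // hypothetical table name
public class ShotSummary {
    @Id
    private long shotNumber;             // one summary row per KSTAR pulse
    private LocalDateTime shotTime;
    private double plasmaCurrentMax;     // example summary quantities
    private double pulseLengthSec;

    protected ShotSummary() {}           // required by JPA

    public long getShotNumber() { return shotNumber; }
    public double getPlasmaCurrentMax() { return plasmaCurrentMax; }
    public double getPulseLengthSec() { return pulseLengthSec; }
}

// Repository used by the Web service layer to search pulses by parameters.
interface ShotSummaryRepository extends JpaRepository<ShotSummary, Long> {
    List<ShotSummary> findByShotNumberBetween(long from, long to);
}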
3. Implementation and result

The KDIS was developed as a conceptual design to evaluate its potential usability by applying a Big Data ecosystem to KSTAR data and its use cases. We therefore developed demo applications using the stream, task, and visualization frameworks, and we introduce these demo applications for data orchestration with the KSTAR operation. As a conceptual system, we targeted automated offline data processes based on batch applications scheduled by operational events. Through this work, we determined that effective use cases can be developed with the KDIS. Fig. 2 shows the developed dataflow demo applications on the frameworks. Detailed functionalities are introduced in the following subsections.

3.1. Stream processing applications

As shown in Fig. 2, two types of dataflow applications with pipelines were developed. Each module of a pipeline is developed as a micro-service. The developed applications are as follows:

• S1 – KSTAR Pulse Automation System event with task-scheduling logic: In order to schedule data tasks, we developed a task scheduler. It manages EPICS PV events and executes data tasks in the many-task framework (a sketch of this scheduling logic is given after this list). Pipeline S1 is composed of two processes:
  ○ A PASS [12] event-synchronized process, which interfaces with and processes EPICS PV data.
  ○ A task-launching process driven by the scheduling algorithm, which interfaces with the deployment of the scheduled Spark tasks.
• S2 – Central Control System log with memory database archiving: Pipeline S2 is used to test archiving of the central control system log for a real-time viewer. It stores filtered data in a memory database.
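The plain-Java sketch below illustrates the kind of condition-to-task mapping the S1 scheduling process performs. The PASS sequence states, the task names, the state-to-task mapping, and the TaskLauncher interface are hypothetical stand-ins for the actual KDIS components.

// Plain-Java sketch of the S1 scheduling logic: map a PASS sequence state received
// from an EPICS PV to the data tasks that should be launched on the many-task framework.
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TaskSchedulingLogic {

    // Abstraction over the deployment of a Spark task; the actual launch
    // mechanism used in the KDIS is not spelled out here.
    public interface TaskLauncher {
        void launch(String taskName, long shotNumber);
    }

    // Hypothetical mapping of PASS sequence states to the demo tasks of Section 3.2.
    private static final Map<String, List<String>> TASKS_BY_STATE = new HashMap<>();
    static {
        TASKS_BY_STATE.put("PRE_SHOT",
                Arrays.asList("T3_operation_snapshot", "T4_timing_snapshot", "T5_timing_automation"));
        TASKS_BY_STATE.put("POST_SHOT",
                Arrays.asList("T1_shot_summary", "T2_fault_log_collection", "T6_startup_image"));
    }

    private final TaskLauncher launcher;

    public TaskSchedulingLogic(TaskLauncher launcher) {
        this.launcher = launcher;
    }

    // Called by the S1 pipeline whenever the PASS sequence PV changes.
    public void onSequenceEvent(String sequenceState, long shotNumber) {
        for (String task : TASKS_BY_STATE.getOrDefault(sequenceState, Collections.<String>emptyList())) {
            launcher.launch(task, shotNumber);
        }
    }
}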
3.2. Task applications

The tasks scheduled from stream processing pipeline S1 are synchronized to a specific KSTAR system event, such as a certain shot stage operated by PASS, or to the completion of the digitizer's archiving operation. Fig. 3 shows the result of the processed data through the Web service.

The developed applications are as follows:

• T1 – Shot summary: Processing data for each shot to make a summary of the pulse by retrieving data from MDSplus and the channel archiver. The summary data are archived as a table in the database and as plot images on the file system.
• T2 – Fault log collection: Analyzing collected interlock system logs and finding root causes.
• T3, T4 – Operation parameter and timing parameter snapshots: Collecting the operation parameters for each pulse by streaming snapshots from the EPICS PVs.
• T5 – Timing system automation: Applying a prearranged template timing setting to the KSTAR timing system.
• T6 – Startup image visualization: Extracting a specific time window of camera images during the plasma operation (for startup debugging).

3.3. Visualization on web service

For an integrated view of each pulse experiment of KSTAR, we developed the KDIS data visualization service on the Web. It shows the results processed by each task application and supports functionalities such as searching pulses by parameters and displaying various results, as shown in Fig. 3.
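A minimal sketch of how such a Web service can expose pulse search to the visualization front end on the Spring MVC stack is shown below. The URL path, request parameters, and the PulseSummaryQuery/PulseSummaryView types are hypothetical and kept self-contained; they do not describe the actual KDIS endpoints.

// Sketch of a Spring MVC endpoint supporting "searching pulses by parameters"
// for the visualization front end.
import java.util.List;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class PulseSearchController {

    // Simple view model returned to the Web components (hypothetical fields).
    public static class PulseSummaryView {
        public long shotNumber;
        public double plasmaCurrentMax;
        public double pulseLengthSec;
    }

    // Abstraction over the persistence layer (e.g. a JPA repository as sketched
    // in Section 2.3); kept as an interface so this sketch stays self-contained.
    public interface PulseSummaryQuery {
        List<PulseSummaryView> findByShotRange(long from, long to);
    }

    private final PulseSummaryQuery query;

    public PulseSearchController(PulseSummaryQuery query) {
        this.query = query;
    }

    // e.g. GET /api/pulses?from=18000&to=18100 returns summaries for a shot range,
    // which the Web components render as table grids and plots.
    @GetMapping("/api/pulses")
    public List<PulseSummaryView> search(@RequestParam("from") long from,
                                         @RequestParam("to") long to) {
        return query.findByShotRange(from, to);
    }
}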

4. Conclusion

The system was designed as a data integration system for KSTAR that performs scheduled processes on stream and batch data according to KSTAR events, together with a user interface service, and it includes hardware and software infrastructures, applications, and libraries. The conceptual design and development of the data integration system were completed and commissioned in the 10th campaign of KSTAR in 2017 under the name of KDIS. Through the prototyped system, we secured an efficient data orchestration environment that is suitable for the KSTAR dataflow. Through the demo applications on the frameworks, we demonstrated the possibility of increasing the efficiency and convenience of data processing by applying Big-Data-related technologies to KSTAR.

The scope of this paper covers the design and its demonstration; for now, the development of the KDIS is one step in building the infrastructure. The development of various algorithms, use cases, performance measurements, visualization methods, and intelligent data mining studies is put forward as future work.

Acknowledgments

This work was supported by the Korean Ministry of Science and ICT under the KSTAR project. We would also like to thank all members of the KSTAR control research team.

References

[1] G.S. Lee, et al., Design and construction of the KSTAR tokamak, Nucl. Fusion 41 (2001).
[2] Experimental Physics and Industrial Control System, http://www.aps.anl.gov/epics [Accessed 10 May 2017].
[3] Woongryol Lee, Mikyung Park, Taegu Lee, Sangil Lee, Sangwon Yun, Jinseop Park, et al., Design and implementation of a standard framework for KSTAR control system, Fusion Eng. Des. 84 (2009) 867–874.
[4] MDSplus, http://www.mdsplus.org/ [Accessed 10 May 2017].
[5] M.H. Bijnens, Lambda Architecture, http://lambda-architecture.net/ [Accessed 5 Apr 2016].
[6] Vinod Kumar Vavilapalli, et al., Apache Hadoop YARN: yet another resource negotiator, Proceedings of the 4th Annual Symposium on Cloud Computing, ACM, 2013, p. 5.
[7] Spring Cloud Data Flow, http://cloud.spring.io/spring-cloud-dataflow [Accessed 10 May 2017].
[8] Ioan Raicu, Many-Task Computing: Bridging the Gap Between High-throughput Computing and High-performance Computing, The University of Chicago, 2009.
[9] Matei Zaharia, et al., Spark: cluster computing with working sets, HotCloud (2010).
[10] Dhruba Borthakur, HDFS architecture guide, Hadoop Apache Project (2008) 53.
[11] Apache Tomcat, http://tomcat.apache.org/ [Accessed 10 May 2017].
[12] Woongryol Lee, et al., Conceptual design and implementation of pulse automation and scheduling system for KSTAR, Fusion Eng. Des. 96 (2015) 830–834.

