Conceptual design of new data integration and process system for KSTAR data scheduling
Taehyun Tak⁎, Jaesic Hong, Kaprai Park, Woongryul Lee, Taegu Lee, Hyunsun Han, Giil Kwon, Jinseop Park
National Fusion Research Institute (NFRI), Daejeon, Republic of Korea
Keywords: KSTAR, Fusion, Dataflow, Automation, Many-task, Big Data

Abstract

The KSTAR control and data acquisition systems mainly use the MDSPlus data storage layer for diagnostic data and a channel archiver for EPICS-based control system data. In addition to these storage systems, KSTAR has various types of data such as user logs from a Relational Database (RDB) and various types of logs from the control system. A large scientific machine like KSTAR needs to implement various types of use cases for data scheduling and data analysis. The goal of the new data integration and process system is to design KSTAR data scheduling on top of the Pulse Automation and Scheduling System (PASS) according to KSTAR events. The KSTAR Data Integration System (KDIS) is designed using Big Data software infrastructures and frameworks. The KDIS handles events that are synchronized with the KSTAR EPICS events and other data sources, such as the REST API and logs, in order to integrate and process data from different data sources and to visualize data. In this paper, we explain the detailed design concept of KDIS and demonstrate a data scheduling use case with this system.
1. Introduction

Systemically, a large scientific machine such as KSTAR [1] is implemented with a complex system design including heterogeneous hardware and software platforms. In other words, integrating the control and data systems is one of our major missions for KSTAR operation and experiments. In order to integrate the systems, KSTAR adopted EPICS [2] as its main integrated control framework and a channel archiver for its storage. For diagnostic systems, we developed a standard framework [3] for the sequential archiving operation of digitized data with MDSPlus [4]. In addition, various types of file formats, file systems, and relational database mechanisms are used, each for its own purpose according to the type of data. The data generated by the control and diagnostic systems can be stored, analyzed, or refined in various use cases. Since the first operation of KSTAR in 2008, its control and data storage systems have been stabilized and matured to perform experiments for the KSTAR mission.

On the other hand, the amount and variety of KSTAR data have been increasing every year as the system is upgraded. Therefore, the complexity of the data and of its use cases is also increasing. As a result, a need has arisen for a specific system that can efficiently develop complex data use cases such as processing large amounts of dataflow, running various algorithms, and visualization.

As the first step of our challenge for the next generation of data processing environment in KSTAR, we considered a new data orchestration scheme and prototyped a new data integration and process system named the KSTAR Data Integration System (KDIS). This system supports a wide range of KSTAR pulse operation automation and experimental result processing with offline data processing (for non-real-time or soft real-time work). Therefore, the focus is on building a new framework that uses Big Data processing methods rather than legacy processes such as a single-process application with storage. The KDIS is an operation-related data-cycle ecosystem synchronized with KSTAR events (data), with a distributed computing system including general-purpose Big Data open-source frameworks and various libraries. In this paper, we describe the development and demo results of the KDIS.

2. KSTAR data integration system

The main purpose of the KDIS is to establish a data orchestration system with a dataflow that covers the storage, processing, and servicing of generated and processed data from KSTAR.

For various KSTAR data-intensive applications and data scheduling use cases, an appropriate architecture design is necessary and is our top priority. The Lambda architecture design paradigm [5] covers computing arbitrary functions on arbitrary data use cases by dividing the
⁎ Corresponding author.
E-mail address: thtak@nfri.re.kr (T. Tak).
https://doi.org/10.1016/j.fusengdes.2018.01.015
Received 22 June 2017; Received in revised form 22 December 2017; Accepted 3 January 2018
Available online 17 January 2018
0920-3796/ © 2018 Elsevier B.V. All rights reserved.
T. Tak et al. Fusion Engineering and Design 129 (2018) 330–333
Fig. 1. The framework design of KDIS and its interconnection among KSTAR systems.
Table 1
The KDIS System Hardware Configuration – Clustered Servers (4EA).

Component    Specification
CPU          Intel (R) Xeon® CPU E5-2640 v4 2.4 GHz × 2
RAM          128 GB
Storage      500 GB SSD × 2

Table 2
The KDIS System Software Configuration.

Component                 Specification
OS                        CentOS 7.2
Resource Scheduler        Hadoop YARN
Stream Processing F/W     Spring Cloud Dataflow (SCDF)
Many-Task F/W             Apache Spark
Web Application Server    Apache Tomcat (Spring MVC + Hibernate)
Data Storages             HDFS, Geode, HBase, PostgreSQL
Library Environment       Many types of open-source or internally developed libraries are used by types of applications

of jobs. This scheduler interfaces with the EPICS data stream for event processing according to scheduling algorithms. It launches logic on tasks in the many-task framework depending on the condition. Processed data is provided via a data service framework with a visualization method.

To implement the functionalities of the framework architecture, the KDIS is configured with a Big Data open-source solution for each framework. Tables 1 and 2 list the major hardware and software components of the KDIS. The KDIS manages the resources of the clustered systems using the Hadoop YARN scheduler [6]. The scheduler manages system resources for all applications running on the KDIS. Each open-source solution has convenient mechanisms and features such as managing and monitoring applications, supporting in-memory computing, supporting convenient development, and supporting interfaces with a well-structured architecture. The KDIS takes advantage of these well-developed Big Data open-source solutions. The frameworks are introduced in detail in this chapter.

2.1. Stream processing framework
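As a minimal sketch of the event-driven launching described in this chapter (a scheduler follows the event stream and launches tasks when a condition holds), the Python example below maps events to tasks. It is an illustration under stated assumptions, not the KDIS implementation: the `Event` and `Scheduler` classes, the event name `SHOT_END`, the `fault` flag, and the two task functions are all hypothetical. A real deployment would presumably receive events through an EPICS client and submit jobs to the many-task framework instead of calling local functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    """Simplified operation event; the fields are assumptions for illustration."""
    name: str            # hypothetical event name, e.g. "SHOT_END"
    shot: int            # pulse (shot) number
    fault: bool = False  # hypothetical flag: an interlock fault occurred


Condition = Callable[[Event], bool]
Task = Callable[[Event], str]


@dataclass
class Scheduler:
    """Launches registered tasks when an event arrives and its condition holds."""
    rules: Dict[str, List[Tuple[Condition, Task]]] = field(default_factory=dict)

    def register(self, event_name: str, task: Task,
                 condition: Condition = lambda event: True) -> None:
        # Several tasks may be attached to the same event name.
        self.rules.setdefault(event_name, []).append((condition, task))

    def dispatch(self, event: Event) -> List[str]:
        # Run every task registered for this event whose condition is met.
        return [task(event)
                for condition, task in self.rules.get(event.name, [])
                if condition(event)]


def shot_summary(event: Event) -> str:
    # Stand-in for submitting a batch job to the many-task framework.
    return f"shot_summary:{event.shot}"


def fault_log_collection(event: Event) -> str:
    # Stand-in for a job that is only useful after a fault.
    return f"fault_log_collection:{event.shot}"


def build_scheduler() -> Scheduler:
    scheduler = Scheduler()
    scheduler.register("SHOT_END", shot_summary)
    scheduler.register("SHOT_END", fault_log_collection,
                       condition=lambda event: event.fault)
    return scheduler


if __name__ == "__main__":
    scheduler = build_scheduler()
    print(scheduler.dispatch(Event("SHOT_END", shot=100)))
    print(scheduler.dispatch(Event("SHOT_END", shot=101, fault=True)))
```

Keeping the condition separate from the task lets the same event trigger different sets of tasks without changing the task code, which is one simple way to express "launching depending on the condition".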
KSTAR operation events.

2.2. Many-task framework

Many-task computing [8] is a computational paradigm that emphasizes using many computing resources over short periods of time to accomplish many computational tasks. We adopted this concept in our framework to handle many data tasks.

Normally, the data generated by the KSTAR system is processed on a specific system depending on the user's requirements. Scientists and engineers of KSTAR have used their own dedicated data workflows, running either single applications or applications with parallel frameworks. In order to develop data applications, a developer has to consider the full application composition, such as the automation interface layer, library layer, and software framework. With a cluster-based Big Data solution that saves such effort, data tasks can be created and deployed more productively.

Apache Spark was designed to support fast iterative data analyses of large datasets with many cutting-edge libraries such as machine learning and a variety of analytical methods. It supports parallel tasks, tracking of computational lineage, and optimization of data with Directed Acyclic Graphs (DAGs).

Therefore, the KDIS uses Apache Spark [9] as its many-task framework, together with all necessary analysis and data interface libraries such as EPICS, MDSPlus, HDF5, RDB interfaces, and various types of analysis libraries. This framework handles a set of batch-style applications and provides advantages such as simplifying hardware resources, supplying data-related methods, and supporting faster calculation for many-task applications to build new analysis functionalities. Thanks to these advantages, the KDIS many-task framework can enable scientists to take
advantage of benefits arising from the innovation of Big Data analytics.

In the conceptual design, each task is launched by a specific KSTAR event scheduled with a task-scheduling algorithm. Launched tasks process and move data to storage according to their own purpose. The demo tasks are introduced in chapter 3.

2.3. Data service framework and interface

We defined the functionality of the data service framework as providing data access methods to the user. In the conceptual design, the framework is composed of two layers: a data-storing scheme and a data Web service.

The RDB schema was developed for storing well-defined data and managing the Web service. NoSQL on the HDFS [10] and an in-memory DB were configured for large volumes of logs.

In order to support the visualization method, we chose a Web service for common user access. The KDIS Web service provides an integrated visualization interface to processed data via Web components. The Web service is composed of Apache Tomcat [11], Spring frameworks, a persistence layer, and libraries for displaying grids of tables, plots, and charts.

3. Implementation and result

The KDIS was developed as a concept design to evaluate its potential usability by applying a Big Data ecosystem to KSTAR data and its use cases. Therefore, we developed demo applications using the stream, task, and visualization frameworks. We introduce demo applications for data orchestration with the KSTAR operation. As a conceptual system, we targeted automated offline data processes based on batch applications scheduled by operational events. Through this work, we determined that effective use cases could be developed with the KDIS. Fig. 2 shows the developed dataflow demo applications with the frameworks. Detailed functionalities are introduced in the subchapters.

3.1. Stream processing applications

• T1 – Shot Summary: Processing data for each shot in order to make a summary of the pulse by retrieving data from MDSPlus and the channel archiver. Archiving summary data as a table in the database and plot images on the file system.
• T2 – Fault log collection: Analyzing collected interlock system logs and finding root causes.
• T3, T4 – Operation parameter and timing parameter snapshot: Collecting operation parameters for each pulse. Stream snapshot from the EPICS PV.
• T5 – Timing system automation: Applying prearranged template timing settings to the timing system in KSTAR.
• T6 – Startup image visualization: Extracting a specific time window of camera images during the plasma operation (for startup debugging).

3.3. Visualization on web service

For an integrated view of each pulse experiment of KSTAR, we developed the KDIS data visualization service on the Web. It shows the results processed by each task application. It supports functionalities such as searching pulses by parameters and displaying various results, as shown in Fig. 3.

4. Conclusion

The system was designed as a data integration system for KSTAR, which performs scheduled processes on stream and batch data according to KSTAR events with a user interface service, including hardware and software infrastructures, applications, and libraries. The conceptual design and development of the data integration system was completed and commissioned in the 10th campaign of KSTAR in 2017 under the name of KDIS. Through the prototyped system, we secured an efficient data orchestration environment that is suitable for the KSTAR dataflow. Through demo applications with the frameworks, we demonstrated the possibility of increasing the efficiency and convenience of data processing by applying Big-Data-related technologies to KSTAR.

The scope of this paper includes the design and its demonstration. For now, the development of KDIS is one step in building the infrastructural levels. The development of various algorithms, use cases, performance measurements, visualization methods, and intelligent data mining studies is put forward as future work.

(2008) 53.
[11] Project, A. T. (n.d.). Apache Tomcat®. Retrieved from http://tomcat.apache.org/ [Accessed 10 May, 2017].
[12] Woongryol Lee, et al., Conceptual design and implementation of pulse automation and scheduling system for KSTAR, Fusion Eng. Des. 96 (2015) 830–834.