
Future Generation Computer Systems 142 (2023) 328–339


Nextflow in Bioinformatics: Executors Performance Comparison Using Genomics Data

Viktória Spišaková a,b,∗, Lukáš Hejtmánek a, Jakub Hynšt c

a Institute of Computer Science, Masaryk University, Brno, Czech Republic
b Faculty of Informatics, Masaryk University, Brno, Czech Republic
c Center of Molecular Medicine, Central European Institute of Technology, Masaryk University, Brno, Czech Republic

∗ Corresponding author at: Institute of Computer Science, Masaryk University, Brno, Czech Republic. E-mail address: spisakova@ics.muni.cz (V. Spišaková).

Article history: Received 30 June 2022; Received in revised form 3 November 2022; Accepted 14 January 2023; Available online 16 January 2023.

Keywords: Kubernetes; HPC; Cloud; Performance comparison; Genomics; Nextflow; Big data

Abstract

Processing big data is a computationally demanding task which has usually been fulfilled by HPC batch systems. These complex systems pose a challenge to scientists due to their cumbersome nature and changing environment. Scientists often lack deeper informatics understanding, and experiment reproducibility is increasingly becoming a hard requirement for research validity. A new computational paradigm — containers — is meant to contain all dependencies and persist state, which helps reproducibility. Containers have gained a lot of popularity in the informatics community, but the HPC community remains skeptical and doubts that container platforms are appropriate for demanding tasks or that such infrastructure can reach significant performance. In this paper, we observe the performance of various infrastructure types (HPC, Kubernetes, local) on the Sarek Nextflow bioinformatics workflow with real-life genomics data of various sizes. We analyze the obtained workload trace and discuss the pros and cons of the utilized infrastructures. We also show that some approaches perform better in terms of available resources while others are more suitable for diversified workflows. Based on the results, we provide recommendations for life science groups which plan to analyze data at large scale.

© 2023 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
https://doi.org/10.1016/j.future.2023.01.009

1. Introduction

For the last few decades, academic high performance computing (HPC) has relied heavily on batch processing and job scheduling systems such as Slurm, Condor, or OpenPBS. Those systems are stable and popular among IT-skilled scientists. However, the user experience they offer is not sufficient for users from less technically oriented disciplines such as life sciences. Moreover, such users often lack the skills to develop their own software tools; they usually depend on third-party (even unmaintained) software packages instead. Computing reproducibility, i.e. the ability to repeat computations with identical results, is increasingly important for those user communities, and at the same time, very difficult to achieve under these conditions.

Tools such as Conda or Docker intend to address both user experience and reproducibility, and their popularity is rising. While Conda is in essence a quite limited package manager, Docker and other container platforms have the potential to become future generation computing systems. Some container platforms such as Kubernetes were not designed with HPC in mind; they are intended for microservices and web applications. In addition, there are reports from various scientific user groups that they do not consider container platforms suitable for HPC workloads due to their high failure rates.

In this paper, we present a comparison of a traditional HPC system based on OpenPBS and a Kubernetes based container platform. The systems are compared from both performance and user perspectives. We study the possible overhead of containers against the overhead of complex scheduling of traditional batch systems. In addition, or orthogonally, we compare the performance abilities of a single local machine with a larger infrastructure under both computing approaches. While such a comparison may seem ridiculous at first glance, there is a plethora of researchers using a simple local computing environment who will necessarily need to transition to larger infrastructures in the near future simply to cover current processing demands. From the user perspective, we study usability and stability of both approaches, i.e. how difficult it is to run the computation and whether it is prone to fail.

We aimed at providing results based on real data instead of synthetic experiments, together with sufficiently big data, as such experiments are less affected by short anomalies and disruptions. We chose the computation, workflow management system (WMS) and data used in the 1000 Czech genome project (A-C-G-T, https://www.acgt.cz/en/) that uses the Whole-Genome Sequencing (WGS) genomics method.


It is a robust Next-Generation Sequencing (NGS) technique with the potential to expand knowledge about diseases and to improve diagnostics and personalized medicine [1,2]. This technique has become one of the most widely used applications in genomics research and is providing tremendous quantities of sequencing data [3]. The project supports the European joint declaration to analyze and provide access to one million human genomes by 2022 [4].

The WMS used in this project is Nextflow (https://www.nextflow.io) [5]. Several comparative studies [6,7] indicate that Nextflow is one of the best WMS that can be utilized; the assessments in Table 1 of [6] mark Nextflow as the highest ranked. Nextflow is just a workflow manager; the computation is done by the Sarek pipeline (https://github.com/nf-core/sarek). The pipeline contains gold standard tools such as BWA-mem [8] and the Genome Analysis Toolkit (GATK) [9] that are commonly employed in many bioinformatics workflows. We regard this task as a typical representative of genomics workflows: the computation consists of several serialized steps, some of which contain a number of parallel tasks with varying degrees of parallelism. Results of this study can therefore be generalized to various computation workflows in life sciences.

2. Nextflow Sarek pipeline

We used the freely available Sarek pipeline (version 2.7.2) for WGS data analysis in the A-C-G-T project and for the execution comparison. The Sarek pipeline is deposited in the nf-core pipeline repository at https://nf-co.re/sarek/2.7. It is written in the Nextflow language and it comes with Docker containers and Conda environments, which make it easy to deploy on local computers or cluster environments. Sarek is portable across computational platforms and well documented.

The A-C-G-T project's primary goal is to detect germline variants and their frequency in the Czech population, so we used, in this project and for this study, the Sarek germline variant detection pipeline. It comprises sequencing read mapping (alignment) to a genomic reference using the BWA-mem algorithm [8]. Consequent steps, including read deduplication, base quality recalibration and HaplotypeCaller genotyping, are done using GATK software. The Manta software [10] is used to detect structural DNA sequence variants. All types of variants are annotated to inspect their effect on phenotype using VEP [11]. Other state-of-the-art genotyping software, such as Strelka2 [12] or Mutect2 [13], is offered to complement the routine GATK best practice scheme (https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-). However, these are not tested in this study.

From the execution point of view, Sarek is comprised of a set of named processes that are executed for each sample from the dataset. For one sample, some processes run as a single task and some are parallelized across regions¹ of the genome. The actual Sarek pipeline execution happens as follows:

1. The execution timeline starts with the creation of three small tasks — CreateIntervalBeds, Output_documentation, get_software_versions — that are negligible in the size of resource requests and time duration.
2. After the initial processes, two larger tasks — FastQCFQ² and MapReads — are executed. The MapReads process runs for each data sample and its parallelism is limited to 32 CPU cores, thus it represents a considerable portion of the overall pipeline run time even on a large scale infrastructure.
3. After MapReads processes finish for all samples, two processes — BamQC and MarkDuplicates — are executed. The latter must be completed before a multitude of BaseRecalibrator processes are spawned.
4. BaseRecalibrator processes are characterized by low resource requirements, comparatively short run time and high quantity. All BaseRecalibrator processes must finish before the GatherBQSRReports "barrier" task.
5. After GatherBQSRReports, ApplyBQSR processes are executed. They are very similar to BaseRecalibrator processes in terms of resource needs, run time and volume.
6. A MergeBamRecal task again serves as a barrier that waits for completion of all ApplyBQSR processes.
7. Several different processes run after MergeBamRecal as three independent computation paths:
   (a) The first path comprises a single process — BamQC.
   (b) The second path comprises HaplotypeCaller and GenotypeGVCFs that run in parallel with the same number of instances as BaseRecalibrator. The parallelism of both processes is coupled, meaning that the GenotypeGVCFs process for a particular region can run after the HaplotypeCaller process for the same region.
   (c) The third path comprises MantaSingle and VEP Manta processes.
8. The rest of the pipeline processes, e.g. further statistical analyses, run without any unexpected requirements and in reasonable time.

In the following sections, we denote pipeline processes as short or long, depending on their runtime. The processes falling into each category are:

1. short — CreateIntervalBeds, Output_documentation, ApplyBQSR, get_software_versions, BaseRecalibrator, MantaSingle Manta, GatherBQSRReports, HaplotypeCaller, GenotypeGVCFs, Vcftools Manta, Vcftools, VEP Manta, CompressVCFvep Manta, SamtoolsStats, VEP, ConcatVCF, BcftoolsStats, BcftoolsStats Manta, MultiQC, CompressVCFvep
2. long — FastQCFQ, MapReads, BamQC, MergeBamRecal

3. Sequencing dataset and genomics reference

In this paper, we used two kinds of sequencing datasets:

1. A small dataset to simulate computation suitable even for a local or desktop computer. It contains only a single sample.
2. A large dataset that cannot be sensibly computed on a local or desktop computer. It consists of 51 samples, which are comparable in size.

In case of the small dataset, we used the freely available WGS dataset of the son (HG002) from the Ashkenazi Jewish trio, provided by the Genome in a Bottle (GIAB) consortium (https://www.nist.gov/programs-projects/genome-bottle), with 30x coverage (85 GB for forward/reverse fastq gzip compressed files). This dataset contains only one pair of input files, thus effective parallelism is limited — a significant speed up is not expected on a large scale infrastructure.

In case of the large dataset, we used in-house WGS data (2.9 TB for forward/reverse fastq files), which was sequenced as a part of the A-C-G-T project. This dataset consists of 51 pairs of input fastq files (one pair per individual), so that MapReads and other processes can run in parallel and can significantly utilize a large infrastructure.

¹ There are 133 regions in the human reference genome.
² Process names with the QC suffix denote quality control steps that are optional. However, it is strongly advised to perform these quality checks.
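The per-region scatter and barrier structure described in Section 2 maps directly onto Nextflow's channel model. The following is a minimal, illustrative DSL2-style sketch of that scatter/gather pattern; it is not the actual Sarek source, the process bodies are placeholders, and the sample and region names are hypothetical.

    // Illustrative DSL2 sketch of the scatter/gather pattern from Section 2.
    // Process names mirror Sarek's, but the commands are placeholders.
    nextflow.enable.dsl = 2

    process BASERECALIBRATOR {
        cpus 1                                   // low per-region resource request
        input:
        tuple val(sample), path(bam), val(region)
        output:
        tuple val(sample), path("${sample}.${region}.recal.table")
        script:
        """
        # placeholder for: gatk BaseRecalibrator on one genome region
        touch ${sample}.${region}.recal.table
        """
    }

    process GATHERBQSRREPORTS {
        input:
        tuple val(sample), path(tables)
        output:
        tuple val(sample), path("${sample}.recal.table")
        script:
        """
        # placeholder for: gatk GatherBQSRReports merging all per-region tables
        cat ${tables} > ${sample}.recal.table
        """
    }

    workflow {
        // scatter: one BaseRecalibrator task per (sample, region) pair
        regions = Channel.of('region001', 'region002', 'region003')   // 133 regions in reality
        bams    = Channel.of( ['HG002', file('HG002.md.bam')] )
        perRegion = BASERECALIBRATOR(bams.combine(regions))

        // gather: groupTuple() collects all regions of a sample and acts as the barrier
        GATHERBQSRREPORTS(perRegion.groupTuple())
    }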

For the MapReads process, a human genome assembly sequence is required as a reference. We used the recent human GATK GRCh38 version from AWS iGenomes (https://ewels.github.io/AWS-iGenomes/) with the size of 3.1 GB.

4. Execution environments

We performed pipeline runs in three infrastructure types — local, grid, and containerized. Because both grid and containerized environments are provided to a large scientific user community as shared distributed computational environments, we also created their synthetic (dedicated) versions. The sizes and utilization of the shared (grid, containerized) infrastructures are not the same, but they yield realistic results from real life. However, such execution times and performance cannot be compared fairly, and dedicated infrastructures are the best way to conduct objective measurements without additional interference. Therefore, final results were obtained by analysis of runs in five environments — local, dedicated grid, shared grid, dedicated containerized, and shared containerized.

Local environment

A local machine equipped with 64 CPU cores Intel Xeon E7-4890, 197 GB RAM, and a local SSD with 6.6 TB capacity. The machine is virtualized on the VMWare platform and runs the Ubuntu 20.04 Linux operating system. It was chosen because similar resources can be acquired fairly easily by a user in a cloud or virtualization platform.

Shared grid environment

A shared heterogeneous HPC infrastructure from the Czech NREN — Metacentrum (https://metavo.metacentrum.cz/en/index.html) — which hosts 30,846 CPU cores of various types, with nodes' RAM ranging from 64 GB to 9.78 TB. The infrastructure runs the OpenPBS 19.0.0 HPC batch computing system. It must be noted that a user cannot use all these resources at once.

Dedicated grid environment

A dedicated cluster of 8 nodes, each with 128 hyperthreaded CPU cores AMD EPYC 7543, 512 GB RAM, 60 TB of local NVMe SSD, and two 10 Gbps network interfaces in active–active mode. The cluster runs Ubuntu 20.04 and the OpenPBS 19.0.0 HPC batch computing system. The PBS server itself was not committed solely to this environment; all nodes were connected to the shared PBS server but grouped into a queue reserved only for this experiment.

Shared containerized environment

A cluster of 20 nodes, each with 128 hyperthreaded CPU cores AMD EPYC 7543, 512 GB RAM, 6.8 TB of local NVMe SSD, and two 10 Gbps network interfaces in active–active mode. The cluster runs Ubuntu 20.04 and the Kubernetes 1.21.11 container orchestrator.

Dedicated containerized environment

A dedicated cluster of 8 nodes, each with 128 hyperthreaded CPU cores AMD EPYC 7543, 512 GB RAM, 60 TB of local NVMe SSD, and two 10 Gbps network interfaces in active–active mode. The cluster runs Ubuntu 22.04 and the Kubernetes 1.24.1 container orchestrator.

Storage

The same NFS storage system accommodating inputs, intermediate results, and final outputs was connected to both dedicated environments and to the shared containerized environment. The storage system consists of 4 server nodes, each with 32 hyperthreaded CPU cores AMD EPYC 7302P, 256 GB RAM, and a single 10 Gbps network interface. The servers are connected via fibre channel to an external all-flash disk array with a capacity of 600 TB. The storage array is organized into RAID 6 equivalent parity. The storage servers use Rocky Linux 8.6 and IBM Spectrum Scale 5.1.3 for data storage.

Various NFS storage systems with a total capacity of 15 PB were connected to the shared grid environment. Metacentrum grid facilities are spread across the whole Czech Republic, which can result in varying latency between computing nodes and NFS storage — the node with an assigned task and the physical data storage server can be a country's distance apart (depending on the particular nodes and storage).

Non-Uniform Memory Access (NUMA)

Both dedicated environments (grid and containerized) used a NUMA aware configuration. The shared grid environment has this feature enabled as well. The shared containerized environment does not employ NUMA.

5. Performance evaluation

We evaluate runs of both small and large datasets in all five execution environments. All variants were executed three times and we present average run times, standard deviations, and total runtimes. Since the Sarek pipeline is heterogeneous, process run times are truly diverse. Therefore, we have categorized process run time into two groups — short and long running.

In the following, we denote: Local for the local environment, PBS-Shared and PBS-Dedicated for the shared and dedicated grid environments respectively, and K8s-Shared and K8s-Dedicated for the shared and dedicated Kubernetes cluster infrastructures respectively.

5.1. Small dataset

Fig. 1 shows the number of processes per task which are executed. The number 133 corresponds to the number of distinct genome regions. Processes of the highlighted tasks (4) are executed in parallel across regions instead of the whole genome. Processes of parallelized tasks require 1 (BaseRecalibrator), 2 (ApplyBQSR and HaplotypeCaller), or 8 (GenotypeGVCFs) CPUs. In K8s-Shared, K8s-Dedicated and PBS-Dedicated, this amount of resources was available for all processes except GenotypeGVCFs. Therefore, the first three processes could run in parallel, depending on the scheduler's ability, but GenotypeGVCFs had to be fragmented. In case of PBS-Shared, most resources were occupied by jobs of other users, so CPU cores were not instantly available and processes had to wait in queues. Local was able to run half, a quarter, or a sixteenth of the processes in parallel, respectively, due to its limited resources and depending on the process CPU requests.

Figs. 2 and 3 present average run times for a single instance of all types of processes. The presented run times are the run times of the processes themselves, without any wait time. Short processes produce almost no difference among K8s-Dedicated, K8s-Shared, and PBS-Dedicated. PBS-Shared is significantly slower, which was caused by higher latency over the wide area network in the grid. Local does not possess a comparable CPU, so a slower run time is expected.
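Nextflow addresses the environments above through its executor abstraction, so the same Sarek pipeline can be submitted to the local machine, the OpenPBS queues, or the Kubernetes clusters by switching configuration profiles. The snippet below is only a minimal illustration of such a configuration; the queue, namespace, service account, and volume claim names are hypothetical and not the settings used in this study.

    // nextflow.config -- illustrative profiles for the three infrastructure types.
    profiles {
        local_machine {
            process.executor = 'local'
        }
        pbs {
            process.executor   = 'pbspro'           // OpenPBS/PBSPro batch executor
            process.queue      = 'acgt_dedicated'   // hypothetical reserved queue
            executor.queueSize = 200                 // cap on concurrently queued jobs (illustrative)
        }
        k8s {
            process.executor      = 'k8s'
            k8s.namespace         = 'sarek'          // hypothetical namespace
            k8s.serviceAccount    = 'nextflow-sa'    // hypothetical service account
            k8s.storageClaimName  = 'sarek-pvc'      // shared PVC for work dir and inputs
            k8s.storageMountPath  = '/workspace'
        }
    }

A run is then started with, for example, nextflow run nf-core/sarek -r 2.7.2 -profile k8s. A job cap such as executor.queueSize can express a queue limit similar to the 200-job limit mentioned later in Section 5.2, although we do not claim this is how the limit was configured here.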

Fig. 1. Number of Sarek processes for small dataset.

Fig. 2. Average duration of short Sarek processes for small dataset.

For long processes (Fig. 3), the differences are more significant, indicating that K8s-Shared is slightly slower. We have analyzed and confirmed that the slowdown is caused by Kubernetes not performing CPU pinning by default, therefore causing less effective memory access for some process kinds. On the contrary, PBSPro used in the grid environment is more NUMA aware and pins processes to CPU cores to prevent relocation among all cores. The importance of a NUMA aware configuration can be observed in K8s-Dedicated, where we used such a configuration and reached the same performance as in PBS-Dedicated.

In case of PBS-Shared, the heterogeneous infrastructure spanning the whole Czech Republic causes a significant standard deviation. Also, run times are longer because data was located on slower storage and had to be transferred over a longer distance. When comparing Local and PBS-Shared, run times are close to each other, but this does not suggest the grid environment should not be used. In fact, due to its high number of resources, it can run multiple datasets in parallel while the local machine is fully utilized by a single dataset.

For short processes (Fig. 4), such as ApplyBQSR, BaseRecalibrator, HaplotypeCaller, and GenotypeGVCFs, only a slight difference between K8s-Dedicated, K8s-Shared, and PBS-Dedicated run times was measured. Those cases are run in parallel, so any scheduler-caused delay is amortized. In case of PBS-Shared, all parallel processes — BaseRecalibrator, ApplyBQSR, HaplotypeCaller, and GenotypeGVCFs — are significantly slower compared to the other infrastructure types. The slowdown is caused by the large number of processes that need to wait for free resources, and as they are relatively short-lived (except HaplotypeCaller), the waiting is not amortized in the overall run time.

Fig. 3. Average duration of long Sarek processes for small dataset.

Fig. 4. Total duration of short Sarek processes for small dataset.

In Local, there is a significant slowdown of the HaplotypeCaller process because only a quarter of its instances can run in parallel due to the CPU limit and long run time. The slowdown of the single-instance processes — FastQCFQ and VEP — might be caused by the slower CPUs assigned in the shared grid and the local machine.

Long processes are present only in a few instances, so the total run times (Fig. 5) and their analysis are the same as for the average run times.

Fig. 6 shows the total run time for each infrastructure kind. K8s-Shared is slightly slower compared to PBS-Dedicated as a result of the MapReads and BamQC processes, which are memory intensive and have performance penalties if not pinned to CPU cores. As discussed above, the K8s-Dedicated installation confirms there are no performance problems if a NUMA aware configuration is used. Fig. 6 suggests that it might not be worth using the grid environment if an appropriate local computer is available — the total running times are almost the same, and the complexity of working with the grid environment from the user perspective might bring additional delay.

Fig. 7 shows the total wait time of all pipeline processes. The total wait time does not imply the pipeline run is delayed by such time; rather, many processes run in parallel, so they wait in parallel as well. Local has the lowest wait time as there is no scheduler that could delay the processes. The Kubernetes scheduler in K8s-Shared and K8s-Dedicated imposes a very low delay, while PBS-Shared generates significantly higher wait times due to the high volume of users who have been in the queue before or have higher priority.

5.2. Large dataset

Fig. 8 shows the number of processes per task that have to run. Because this dataset is composed of 51 samples, it has at least 51 processes for the majority of tasks, so they are able to run in parallel and consume more resources than generally available in K8s-Dedicated, K8s-Shared, and PBS-Dedicated.

In case of PBS-Shared, most resources were occupied by other tasks, so they were not instantly available and computations had to wait in queues despite being able to run completely in parallel.

Fig. 5. Total duration of long Sarek processes for small dataset.

Fig. 6. Total duration of Sarek pipeline for small dataset.

Fig. 7. Total wait time of Sarek pipeline processes for small dataset.


Fig. 8. Number of Sarek processes for large dataset.

Local was not able to accommodate this mass of tasks, therefore we used only a single sample and extrapolated the times to all 51 samples.

As the large dataset has a huge number of processes, Nextflow does not provide a detailed analysis including wait time, so we cannot provide a wait analysis as for the small dataset.

Figs. 9 and 10 present average run times (without wait time) for a single instance of all task types. Short processes produce almost no difference among K8s-Dedicated, K8s-Shared, and PBS-Dedicated except for MantaSingle Manta and VEP, where K8s-Dedicated and K8s-Shared are faster by about 18% and 8% respectively. This is caused by slower access to Sarek tools data from PBS-Dedicated, as it is not containerized and the tools are not stored on a local disk. For long processes (Fig. 10), there is about a 20% slowdown in K8s-Shared for the MapReads and BamQC processes, caused by the NUMA awareness of PBS-Dedicated and K8s-Dedicated. Similarly to the small dataset, Local performance is worse compared to the large infrastructure.

A correct analysis of the total pipeline duration on the large dataset for every process type is problematic. The fifty-one samples forming the whole dataset are independent of each other, so there are no guarantees that a certain process type runs at once for all the samples. However, we identified three process types that run as a uniform group, i.e., only a single process type runs at a given time, and two process types that run in parallel, so observing the total duration for these kinds is correct. The total duration of these processes is visualized in Fig. 11.

Both BaseRecalibrator and ApplyBQSR processes yield very different results in K8s-Dedicated, K8s-Shared and PBS-Dedicated. This is expected, as for each job there is some delay caused by the scheduler and batch system in the grid infrastructure. In addition, these kinds of processes have a very short average duration (Fig. 9). In case of the small dataset, the delay is hidden because all these processes run in parallel. However, in case of the large dataset, all these processes cannot run in parallel as the parallelism is limited to 200,³ thus the scheduler delay accumulates. In the case of MapReads, there is about a 20% slowdown per single process in K8s-Shared due to the already mentioned NUMA. In the case of K8s-Dedicated, the total duration of MapReads is even longer because the scheduler is unable to utilize all available CPU cores — 25 CPU cores in total were unused, compared to PBS-Dedicated and K8s-Shared where only 9 CPU cores in total⁴ were unused. The FastQCFQ process is not memory consuming, thus not affected by the NUMA configuration, and its performance is the same for all of K8s-Dedicated, K8s-Shared, and PBS-Dedicated.

For Local, the total time was extrapolated from a single sample to 51 samples. It can be assumed that for MapReads, three instances could run in parallel,⁵ so the total time would be 17 times longer. In the case of BaseRecalibrator and ApplyBQSR, the duration would be exactly 51 times longer.

Fig. 12 shows the total Sarek pipeline run time for all infrastructure kinds. K8s-Shared is slightly slower compared to PBS-Dedicated, despite the speed up for short-lived tasks. K8s-Dedicated is slightly faster compared to PBS-Dedicated because of the NUMA aware configuration and running short-lived processes efficiently. In case of Local, the total duration of a single sample is about 16 hours, and the duration of just MapReads is about 10 hours. In theory, three instances of the longest process MapReads could be accommodated on the local node. Finalizing just this process for 51 samples would be possible after 17 iterations, so the total time would be 10 h ∗ 17 = 170 hours, which is roughly 7 days. BaseRecalibrator, ApplyBQSR, and HaplotypeCaller have 51 times more instances, so the overall time for these three steps is 10 days. The duration of the pipeline run on Local would be at least 18 days (including the rest of the tasks, which represent the minority of time), which is 14 times slower compared to the large infrastructure. In reality, the three instances of MapReads ended after 7.5 days. It can be safely stated that for the large dataset, a large infrastructure is an absolute requirement.

In the case of PBS-Shared, the pipeline was not able to complete successfully without interruption due to execution failures.⁶

³ A limit set on the number of jobs in the queue.
⁴ Total meaning that a few cores were left unused on each cluster node, adding up to the total amount.
⁵ Derived from the size of the local machine's resources divided by the MapReads resource requests.
⁶ Caused by unimplemented robustness in Nextflow, e.g. missing retries when API calls time out.


Fig. 9. Average duration of short Sarek processes for large dataset.

Fig. 10. Average duration of long Sarek processes for large dataset.

However, we were able to complete the pipeline with the help of the Nextflow resume option, which allows continuing the computation from the last completed process. Resuming does not have any impact on the average process duration, as we included only completed processes in the analysis. However, it distorts the total duration of process groups as presented in Fig. 11 and the total duration in Fig. 12. The distortion is caused by Nextflow not terminating the whole pipeline immediately — it attempts to finish running processes. During this time, resources are not fully utilized, so the total duration is not precise. There were at most 5 resumes per pipeline run, so the distortion is insignificant.
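As an illustration, a failed or interrupted run is typically continued by re-issuing the same command with the -resume flag; the sample sheet path and profile below are placeholders rather than the exact invocation used in the study.

    # Re-run the pipeline; cached results of already completed processes are reused.
    nextflow run nf-core/sarek -r 2.7.2 \
        -profile docker \
        --input samples.tsv \
        -resume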
6. Discussion

As outlined in the introduction, we aim to compare the performance of the Kubernetes and OpenPBS platforms with a specific interest in the scheduling overhead of containers and traditional batch jobs. There is sufficient evidence to support the claim that, overall, the containerized and grid environments perform similarly. However, if we study the deviations in the time-to-completion between the two systems, we are able to provide a detailed comparison of shared and dedicated environments, the reasons for overhead, and ways for improvement.

Fig. 11. Total duration of selected Sarek processes for large dataset.

Fig. 12. Total duration of Sarek pipeline for large dataset.

In the shared environment, the Kubernetes infrastructure performed better overall. Moreover, extremely short tasks were slower in the grid than in the containerized environment, which we attribute to the container cache mechanism and additional IO overhead in the grid. Furthermore, we observed that memory-intensive processes (BamQC, MapReads, MarkDuplicates) significantly benefit from a NUMA aware system setting — the difference can reach up to 20%. NUMA is enabled by default in the shared grid and completely omitted in the shared containerized environment.

In the dedicated environment, we aimed for the most similar scheduler and system settings of both infrastructures to achieve fair comparability. The grid scheduler and system are better suited for running computational tasks, both short and long. Therefore, we looked for the relevant settings in the Kubernetes native scheduler to achieve the best HPC-like performance.

Firstly, we enabled CPU pinning, which is advantageous for memory-intensive tasks such as MapReads and is enabled in the grid by default. CPU pinning retains running processes on specific CPU cores and can thus significantly improve performance because the cache can be effectively reused. If the process is moved to another CPU core, the whole cache is discarded. Pinning is effective only for containers that are set to the Guaranteed quality of service — resource requests are equal to resource limits (https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/#create-a-pod-that-gets-assigned-a-qos-class-of-guaranteed).
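A minimal sketch of a task pod that receives the Guaranteed QoS class is shown below; the pod name, image tag, and resource sizes are illustrative and not the exact values used for MapReads in this study.

    # Requests equal to limits place the pod in the Guaranteed QoS class,
    # which makes it eligible for exclusive cores under the static CPU manager.
    apiVersion: v1
    kind: Pod
    metadata:
      name: mapreads-sample01          # hypothetical name for one MapReads task
    spec:
      containers:
      - name: mapreads
        image: nfcore/sarek:2.7.2
        resources:
          requests:
            cpu: "16"
            memory: 60Gi
          limits:
            cpu: "16"                  # requests == limits -> Guaranteed QoS
            memory: 60Gi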
Secondly, our infrastructure features 64 physical cores (128 hyperthreaded CPU cores) per cluster node. In the default configuration, Kubernetes views all CPU cores as physical and does not consider hyperthreading. If any significant threads are assigned to hyperthreaded sibling cores, their performance is heavily degraded. Hyperthreaded cores can be explicitly reserved in the configuration, so we reserved half of the cores (64–128). From the scheduler's point of view, the amount of cores is 64, but from the kernel's point of view, there are still 128 cores.

Thirdly, the existence of the MapReads process, its resource requirements and its successor task pointed to another important problem. MapReads requires 16 CPUs; in the dedicated PBS environment four MapReads tasks can run in parallel on one node, but only three can be placed on a node in the dedicated Kubernetes environment.

This oddity is explained by the Kubernetes scheduler, which reserves 0.5–2.5 CPU for the control plane. Consequently, the amount of node CPUs is actually less than 64 (not enough for four processes), so the whole computation takes longer because the tasks cannot be allocated until the next scheduling round. There are multiple possibilities for mitigating the issue, e.g., adding a bigger node to the cluster or setting the control plane container CPU request to 0. The second option is especially clever because the Kubernetes scheduler does not count actual usage (just projected requests), and so it seems 64 CPUs are still available. The actual assignment of the containers is performed by the kernel, which eventually puts the system container on hyperthreaded cores (they can be occupied by non-guaranteed containers). Altering the system container requests helped to utilize more CPUs, but another issue arose.
Fourth, the immediate successor of the MapReads task is BamQC and its predecessor is FastQCFQ, requesting 16 and 2 CPUs respectively. Running 51 FastQCFQ processes consumes 51 ∗ 2 CPU = 102 CPU out of the cluster's 512 CPU. Ideally, 25 MapReads processes (out of 51) consume the remaining 410 CPUs, and it is fundamental how the other 26 MapReads processes will be executed. The predecessor FastQCFQ processes last about 1–2 h and MapReads processes last 5–7 h, but in the meantime, new processes of the BamQC task start to appear in the queue. When the first eight FastQCFQ processes end, BamQC processes compete against the "older" MapReads processes for free resources. It can easily happen that many BamQC processes are executed before the remaining MapReads processes, which delays the execution time greatly. Similar races do not happen in the PBS environment because the PBS scheduler follows the FIFO principle in the queue, but in Kubernetes, workload starvation is always possible. Both kinds of processes request the same CPU amount, which means that both could run at the same time, so we could use the concept of PriorityClasses (https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/) and set a higher priority on MapReads processes. The priority class preemption policy must be set to Never.⁷ The containers will remain in the scheduling queue until sufficient resources are free, but they are still influenced by the scheduler back-off period — if the scheduler has enough free resources but the container is in the back-off, it is not polled, and other (even lower priority) containers can be scheduled sooner. Overall, setting a priority class does not help accommodate new processes because of the back-off period.

⁷ If preemption of lower priority containers was enabled, the preemption logic could lead to the eviction of one or more lower priority containers just to accommodate a higher priority container. The execution of the pipeline cannot lose any process, so real eviction is not a solution.
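For completeness, such a non-preempting priority class looks roughly as follows; the class name and value are hypothetical, and a pod would reference it via spec.priorityClassName.

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: mapreads-high              # hypothetical class name
    value: 1000000
    globalDefault: false
    preemptionPolicy: Never            # queue ahead of lower-priority pods, never evict them
    description: Higher scheduling priority for long-running MapReads tasks.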
Fifth, we set the CPU topology manager to the best-effort policy (https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/#policy-best-effort) to make CPU pinning the most effective. Pure CPU pinning pinned the CPUs, but they could be chosen from multiple NUMA nodes, which hampers the performance as their memory is not on the same NUMA node and is therefore slower to access. Best-effort topology management assigns CPUs from the same NUMA node to processes, which is beneficial for memory intensive tasks. PBS allocates CPUs from the same NUMA node by default.
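The node-level settings behind the first, second, and fifth adjustments can be expressed in the kubelet configuration roughly as below. This is a sketch under the assumption that the hyperthread siblings are CPUs 64–127; it is not claimed to be the exact configuration deployed in this study.

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    cpuManagerPolicy: static           # pin Guaranteed pods with integer CPU requests to exclusive cores
    topologyManagerPolicy: best-effort # prefer CPUs whose memory is on a single NUMA node
    reservedSystemCPUs: "64-127"       # hide the hyperthread siblings from the scheduler's allocatable count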
As a result of all the adjustments, the pipeline performed better (the results were computed faster) in the dedicated containerized environment than in the dedicated grid.

Based on the analysis results, we can say that an abundance of short-living tasks creates a large overhead for both the dedicated and shared grid environments. The administration cost of repeated pipelines is notably smaller in the containerized environment because it spawns a task in the form of a container with all the necessary scripts built inside. If the same pipeline is executed again, the container is already cached and starts immediately. Also, the Kubernetes scheduler is naive and processes are scheduled immediately as they come, whereas grid schedulers form scheduling iterations and run tasks once per iteration, which further delays short tasks. This can be seen in the results (Fig. 11), where the BaseRecalibrator total time is significantly better for K8s-Shared and K8s-Dedicated compared to PBS-Dedicated.

Apart from the pipeline composition, e.g. the amount of short-living tasks, the total execution time depends on the input data size, which further determines the most appropriate execution environment. Local, or single node computing, has its limits simply because the input data is too large to process in a reasonable time frame. When working with large datasets, it is absolutely fundamental to move the analyses to some type of large infrastructure. Fig. 12 shows no significant difference in total pipeline duration between the grid and containerized environments when a similar amount of resources is reserved.

In the context of the A-C-G-T project, the local environment obviously cannot accommodate the computations and a bigger infrastructure must be used. Kubernetes proved to be an optimal choice because it promotes both reliability and computational throughput. It must be stated that the grid environment was always occupied by other users, whereas the Kubernetes environment offered unoccupied resources of up to 500 CPU cores. However, we created a synthetic grid environment similar to the Kubernetes one to provide a fair infrastructure comparison.

To conclude, the experiments show that the Kubernetes container platform performs equally to the traditional HPC OpenPBS environment if appropriate performance tuning is employed. On top of that, the overhead inherent to the complex scheduling mechanism of HPC systems is not negligible, and it is most prominent when executing a larger amount of short-living tasks.

From the user perspective, we observed that pipeline runs are more stable in the containerized environment (with the latest Nextflow version; https://github.com/nextflow-io/nextflow/releases/tag/v22.05.0-edge), meaning that the whole pipeline executes in a single run, without resume. On the other hand, almost every pipeline run had to be resumed several times in the grid environment. The reason is that the grid environment is strongly heterogeneous and distributed, where occasional failures are expected, and while it is the WMS's task to deal with these failures, it is not always successful. Moreover, an abundance of tasks does not overload the scheduler in the Kubernetes environment but may overload the grid scheduler. OpenPBS offers a setting to lower the job submission rate to avoid scheduler overload, but this leads to a prolongation of the overall pipeline run time.

7. Conclusion and future work

In this paper, we have described the computational performance of the bioinformatics Nextflow pipeline Sarek in five computational environments. Importantly, we analyzed real WGS datasets from the 1000 Czech genome project (A-C-G-T), thus our results are directly applied in the analytical strategy of this ongoing project. Although we use a particular pipeline, the results are valid for various genomics labs which plan to analyze sequencing data, and also for future or other ongoing national initiatives within the 1M European genome project. The results are also valid for other WMS, e.g. Snakemake, because all WMS work in the same way — they manage complex workloads composed of smaller units and execute the units (or delegate the execution to another environment). It does not matter which WMS is used with an infrastructure, because the infrastructure will work based on its capabilities and scheduler configuration.

Our final results show that the choice of distributed infrastructure has no remarkable impact on the resulting data, but it does impact the computational time and user experience. For small datasets, a local computer can be used for WGS analyses and there are no significant benefits to using large infrastructures, which usually impose an additional complicated bootstrap. However, it is crucial to move to distributed grid or containerized infrastructures for large datasets like A-C-G-T. Furthermore, computing complex workflows is far more efficient and faster on a distributed infrastructure. Although this finding is not a breakthrough, it is surprising how many research groups still favor using solely local resources instead of shared, but larger, environments. Lastly, we have proved that high performance computing is possible on a container infrastructure and that it has performance similar to grid infrastructures.

We did not observe any huge difference between the containerized and grid infrastructures, especially if a user can reserve the same amount of resources in either infrastructure type. However, the containerized infrastructure has been more stable compared to the grid (pipelines did not have to be re-run due to failures), and in a dedicated NUMA aware environment it performs slightly better than the grid; for these reasons we see it as a next generation computing infrastructure even for the HPC environment. At the technical level:

1. HPC batch systems such as PBSPro have problems with running huge numbers of short-lived processes due to the complex scheduler and the overhead of starting and stopping processes.
2. Containerized systems (Kubernetes) are not NUMA aware by default, which can impose a slowdown of up to 20%.
3. Approximating the Kubernetes scheduler configuration to the PBS configuration makes Kubernetes more effective than PBS.

There are basically two ways for future development — improve existing solutions or find other ways. According to our conclusions, exploring scheduling strategies in Kubernetes could lead to a performance boost and more efficient computations. The Kubernetes scheduler has low global knowledge of node reservation details; it offloads the scheduling to nodes. It is debatable whether increasing the global knowledge would result in better container allocation. Moreover, the Kubernetes scheduler does not perform any forward resource reservations and functions almost as LIFO (because of the scheduling back-off). This is a diametrically different approach to scheduling than in the grid environment, and it has not been researched enough yet. We believe that if the native scheduling mechanisms were modified to support more FIFO-like polling, the performance would be significantly ahead of the grid scheduler. Other ways could also include utilizing custom scheduling strategies, e.g. Volcano for Kubernetes (https://volcano.sh/en/).

In case of grid systems, additional middleware could improve the performance and mitigate the problems with huge numbers of short-lived jobs. Some suggest [14] that moving towards containers and cloud is the right direction for pipelines and data analyses. The best features could be joined in a hybrid approach where demanding tasks would be spawned into the grid environment possessing larger resources while smaller tasks would stay in the containerized environment. We would like to test this approach, specifically with Nextflow. The idea is fairly new, but we are working on a prototype solution on our infrastructure.

Lastly, it is very important to educate users on platform appropriateness for their use cases. Users need to be informed properly about the financial costs of using an infrastructure as well as the "personal" cost — switching platforms brings a steep learning curve. Becoming familiar with a new platform is a huge commitment, and without assurance that the effort will pay off, users are not keen on investing their time in shifting to, or at least learning, novel technologies [15].

CRediT authorship contribution statement

Viktória Spišaková: Software, Validation, Investigation, Resources, Writing – original draft, Writing – review & editing, Supervision, Project administration, Writing – review & editing of resubmission. Lukáš Hejtmánek: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Visualization, Writing – review & editing of resubmission. Jakub Hynšt: Resources, Data curation, Writing – original draft, Writing – review & editing, Reading of resubmission.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

One dataset is freely available online (link in the document); the second dataset is confidential.

Acknowledgments

This work has been supported by the European Regional Development Fund Project "A-C-G-T" (No. CZ.02.1.01/0.0/0.0/16_026/0008448). Computational resources were supplied by the project "e-Infrastruktura CZ" (e-INFRA CZ LM2018140) supported by the Ministry of Education, Youth and Sports of the Czech Republic.

References

[1] H. Nakagawa, M. Fujita, Whole genome sequencing analysis for cancer genomics and precision medicine, Cancer Sci. 109 (3) (2018) 513–522, http://dx.doi.org/10.1111/cas.13505.
[2] E. Turro, W.J. Astle, K. Megy, S. Gräf, D. Greene, O. Shamardina, H.L. Allen, A. Sanchis-Juan, M. Frontini, C. Thys, J. Stephens, R. Mapeta, O.S. Burren, K. Downes, M. Haimel, S. Tuna, S.V.V. Deevi, T.J. Aitman, D.L. Bennett, P. Calleja, K. Carss, M.J. Caulfield, P.F. Chinnery, P.H. Dixon, D.P. Gale, R. James, A. Koziell, M.A. Laffan, A.P. Levine, E.R. Maher, H.S. Markus, J. Morales, N.W. Morrell, A.D. Mumford, E. Ormondroyd, S. Rankin, A. Rendon, S. Richardson, I. Roberts, N.B.A. Roy, M.A. Saleem, K.G.C. Smith, H. Stark, R.Y.Y. Tan, A.C. Themistocleous, A.J. Thrasher, H. Watkins, A.R. Webster, M.R. Wilkins, C. Williamson, J. Whitworth, S. Humphray, D.R. Bentley, NIHR BioResource for the 100,000 Genomes Project, N. Kingston, N. Walker, J.R. Bradley, S. Ashford, C.J. Penkett, K. Freson, K.E. Stirrups, F.L. Raymond, W.H. Ouwehand, Whole-genome sequencing of patients with rare diseases in a national health system, Nature 583 (7814) (2020) 96–102, http://dx.doi.org/10.1038/s41586-020-2434-2.
[3] S.T. Park, J. Kim, Trends in next-generation sequencing and a new era for whole genome sequencing, Int. Neurourol. J. 20 (Suppl 2) (2016) S76–83.
[4] G. Saunders, M. Baudis, R. Becker, S. Beltran, C. Béroud, E. Birney, C. Brooksbank, S. Brunak, M. Van den Bulcke, R. Drysdale, S. Capella-Gutierrez, P. Flicek, F. Florindi, P. Goodhand, I. Gut, J. Heringa, P. Holub, J. Hooyberghs, N. Juty, T.M. Keane, J.O. Korbel, I. Lappalainen, B. Leskosek, G. Matthijs, M.T. Mayrhofer, A. Metspalu, A. Navarro, S. Newhouse, T. Nyrönen, A. Page, B. Persson, A. Palotie, H. Parkinson, J. Rambla, D. Salgado, E. Steinfelder, M.A. Swertz, A. Valencia, S. Varma, N. Blomberg, S. Scollen, Leveraging European infrastructures to access 1 million human genomes by 2022, Nature Rev. Genet. 20 (11) (2019) 693–701, http://dx.doi.org/10.1038/s41576-019-0156-9.
[5] P. Di Tommaso, M. Chatzou, E.W. Floden, P.P. Barja, E. Palumbo, C. Notredame, Nextflow enables reproducible computational workflows, Nature Biotechnol. 35 (4) (2017) 316–319, http://dx.doi.org/10.1038/nbt.3820.
[6] L. Wratten, A. Wilm, J. Göke, Reproducible, scalable, and shareable analysis pipelines with bioinformatics workflow managers, Nature Methods 18 (10) (2021) 1161–1168, http://dx.doi.org/10.1038/s41592-021-01254-9.

[7] M. Jackson, K. Kavoussanakis, E.W.J. Wallace, Using prototyping to choose a bioinformatics workflow management system, PLoS Comput. Biol. 17 (2) (2021) 1–13, http://dx.doi.org/10.1371/journal.pcbi.1008622.
[8] H. Li, R. Durbin, Fast and accurate short read alignment with Burrows–Wheeler transform, Bioinformatics 25 (14) (2009) 1754–1760.
[9] G.v.d. Auwera, B.D. O'Connor, Genomics in the Cloud: Using Docker, GATK, and WDL in Terra, first ed., O'Reilly Media, Sebastopol, CA, 2020, OCLC: on1148137471.
[10] X. Chen, O. Schulz-Trieglaff, R. Shaw, B. Barnes, F. Schlesinger, M. Källberg, A.J. Cox, S. Kruglyak, C.T. Saunders, Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications, Bioinformatics 32 (8) (2016) 1220–1222, http://dx.doi.org/10.1093/bioinformatics/btv710.
[11] W. McLaren, L. Gil, S.E. Hunt, H.S. Riat, G.R.S. Ritchie, A. Thormann, P. Flicek, F. Cunningham, The Ensembl variant effect predictor, Genome Biol. 17 (1) (2016) 122, http://dx.doi.org/10.1186/s13059-016-0974-4.
[12] S. Kim, K. Scheffler, A.L. Halpern, M.A. Bekritsky, E. Noh, M. Källberg, X. Chen, Y. Kim, D. Beyter, P. Krusche, C.T. Saunders, Strelka2: fast and accurate calling of germline and somatic variants, Nature Methods 15 (8) (2018) 591–594, http://dx.doi.org/10.1038/s41592-018-0051-x.
[13] D. Benjamin, T. Sato, K. Cibulskis, G. Getz, C. Stewart, L. Lichtenstein, Calling somatic SNVs and indels with Mutect2, Bioinformatics (2019), http://dx.doi.org/10.1101/861054 (preprint).
[14] F. Abbondanza, Is cloud computing the answer to genomics' big data problem? 2021, https://www.labiotech.eu/in-depth/cloud-genomics-big-data-problem/ (Accessed 21 June 2022).
[15] N.C. Sheffield, V.R. Bonazzi, P.E. Bourne, T. Burdett, T. Clark, R.L. Grossman, O. Spjuth, A.D. Yates, From biomedical cloud platforms to microservices: next steps in FAIR data and analysis, Sci. Data 9 (1) (2022) 553, http://dx.doi.org/10.1038/s41597-022-01619-5.

Viktória Spišaková was born in Slovakia in 1999. She received a master's degree in Software Systems Management in 2021 at the Faculty of Informatics at Masaryk University (Brno, Czech Republic) and is now a doctoral student at the same faculty. She is currently working as an IT specialist for computational infrastructures at the Institute of Computer Science at Masaryk University.

Lukáš Hejtmánek received his Ph.D. degree in Computer Science from Masaryk University, Brno, Czech Republic. He works as an IT architect at Masaryk University in the CERIT-SC project and is also a storage specialist at CESNET. His main IT interest is to improve the architecture of HPC systems and to adopt container infrastructure into the academic HPC environment.

Jakub Hynšt was born in Brno, Czech Republic in 1990. He is married, with one child. He received a master's degree in Biochemistry in 2015 (Masaryk University) and a Ph.D. in Molecular oncology and tumor biology (Masaryk University) in 2021. He is currently working as a bioinformatician at CEITEC MU.
