9, SEPTEMBER 2021
Abstract—Currently, with the booming growth of cloud computing, workloads from broad ranges of functions and demands are
crammed into a single physical machine. This places considerable pressure on the underlying operating system to evolve,
especially the memory subsystem. Even enhancing large pages with main memory compression is not straightforward, because of
rigid rules that the state-of-the-art manager, Buddy System, has imposed since the beginning of its design. To relieve these
problems and provide a broader design space for system designers, we propose Zweilous, a clean-slate physical memory management
framework. It is self-contained and highly decoupled, and thus can co-exist with the vanilla memory manager. Separate, self-contained
metadata/functions allow flexible extension with little modification to current frameworks. To show that it is easy to add enhanced
functions that accelerate the evolution of the memory management subsystem, we implement Hzmem, a redesigned large page memory
manager enhanced with main memory compression. Our method achieves competitive performance compared
with native and virtualized large page support, increases effective memory size, and has less impact on other parts of the operating
system.
0018-9340 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: National University Fast. Downloaded on November 23,2022 at 10:54:26 UTC from IEEE Xplore. Restrictions apply.
LI ET AL.: ZWEILOUS: A DECOUPLED AND FLEXIBLE MEMORY MANAGEMENT FRAMEWORK 1351
The key techniques for Zweilous are as follows: a) well-defined placeholders of self-contained, customizable metadata/functions that are necessary for a full-featured memory manager, and b) well-partitioned underlying physical memories that are managed by specific managers.

Through these, Zweilous can run simultaneously with the vanilla memory manager, leaving critical data such as kernel data under the vanilla memory manager and reducing instability when developers deploy experimental implementations of new ideas. It gets rid of rigid assumptions through customizable metadata/functions and hides many non-trivial, general details of developing physical memory managers. It is the first-ever memory management framework in a commodity OS with three benefits:

• Decoupled and flexible extension, co-existing with the vanilla manager and free from the “burden of history”. Zweilous gives decoupled and flexible methodological support for implementing a memory manager from a clean slate, using customizable, self-contained metadata/functions and management that are separated from the vanilla memory manager. It therefore enables multiple memory managers to co-exist cooperatively in one operating system, which limits the instability introduced by a newly designed memory manager.

• Non-intrusive ways to implement new ideas with little modification. The framework remodels the current operating system memory infrastructure in less intrusive ways. Zweilous enables efficient development of new ambitious/aggressive physical memory managers that manage the resource on their own for higher performance and/or richer features with little effort.

• More programmability for ambitious developers. The framework is based on a general understanding of memory management development rather than precise definitions. To help developers easily introduce and modify memory management services, and to focus on new ways of managing memory resources without taking care of details, the framework hides many non-trivial general details and provides programmability at two levels. First, the framework can provide a userspace-application-compatible API at the application level. Second, the framework can manage memory for well-defined partitions of the physical memory at the level of pages. Therefore, ambitious developers need only pay attention to the design of metadata/functions, such as page descriptors and page fault handlers, and fill in the placeholders for the targeted memory allocator.

On the other hand, both industry and academia have witnessed a growing movement towards cloud computing in recent years. In response, circumstances and scenarios have evolved to the point where intricate applications are very likely to deplete the entire memory of a single physical machine. Traditional databases, NoSQL stores, and large router machines fall into such categories of memory-hungry workloads, where a large memory footprint fails to bring about good temporal locality [4], [10], [11], [12], [13].

Two existing techniques are already used for the aforementioned workloads: a) large pages, also called hugepages, to increase TLB reach and thus reduce page faults and TLB misses for big memories; b) main memory compression to increase effective memory size so that a single machine can accommodate more workloads, thus reducing costs. Each is used frequently on its own in these use cases. However, the labor cost of applying them together is prohibitive because of some obsolete assumptions. For example, a single base page size is assumed in the current page reclaiming subsystem on which the vanilla main memory compression framework is based. All of this hinders the enhancement of large pages with main memory compression.

Motivated by the above, we have implemented a new physical memory manager called Hzmem for memory-compressible hugepages. It takes advantage of Zweilous to show how easy it is to add enhanced functions that accelerate the pace of evolution of the memory subsystem. Hzmem is another new data and control path of physical memory management running in parallel with the regular path. Hzmem includes a self-contained hugepage allocator that uses no utilities provided by Buddy System. Hzmem has competitive performance compared with the vanilla large page support in benchmarks of frequent and heavy hugepage allocations. Also, we believe that Hzmem is the first physical memory manager that inherently allows main memory compression over large pages. Hzmem can increase effective memory size and has little impact on the rest of the operating system.

This research makes the following contributions:

• We analyze the implementation and execution characteristics of current physical memory management. We find the root cause of why its design is rigid and not open to extension. Therefore, we pioneer a new clean-slate idea of managing memory alongside, but independently of, the current manager. It opens up another direction of memory subsystem evolution in which system designers and researchers can experiment with new ideas or algorithms with less effort, which was difficult before.

• We propose a new physical memory management framework abstraction named Zweilous, which is self-contained and decoupled from the current memory framework. To the best of our knowledge, this is the first effort ever to design a physical memory framework that is independent of and co-exists with the unmodified one.

• We design and implement new large page support with a main memory compression feature, named Hzmem, which enhances large page support with main memory compression at moderate effort.

• We present a thorough evaluation of Zweilous and Hzmem, showing that the framework and its specific application can achieve competitive performance while increasing effective memory and having little impact on other subsystems.

The remainder of the paper is structured as follows: section 3 outlines the high-level design and the architecture of Zweilous, section 4 describes the implementation of Hzmem, section 5 presents the evaluation of Zweilous and Hzmem, and section 6 concludes.
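The introduction's statement that large pages increase TLB reach can be made concrete with a short calculation. The following C sketch is illustrative only: the helper names are not from the paper, and the TLB entry count used in the note after the block is a hypothetical figure chosen for round numbers. Only the 4 KB base page and 2 MB large page sizes come from the text.

```c
#include <assert.h>
#include <stdint.h>

#define BASE_PAGE_SIZE (4ULL * 1024)         /* 4 KB base page */
#define HUGE_PAGE_SIZE (2ULL * 1024 * 1024)  /* 2 MB large page */

/* TLB reach: bytes of address space covered when every entry is in use. */
static uint64_t tlb_reach(uint64_t entries, uint64_t page_size)
{
    /* Page sizes are powers of two. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return entries * page_size;
}

/* Number of base pages covered by one large page. */
static uint64_t base_pages_per_huge_page(void)
{
    return HUGE_PAGE_SIZE / BASE_PAGE_SIZE;
}
```

Assuming a hypothetical TLB with 1024 entries, `tlb_reach(1024, BASE_PAGE_SIZE)` is 4 MB while `tlb_reach(1024, HUGE_PAGE_SIZE)` is 2 GB, a factor of 512, which is also the number of base pages in one large page.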
1352 IEEE TRANSACTIONS ON COMPUTERS, VOL. 70, NO. 9, SEPTEMBER 2021
2 RELATED WORK

Although virtual memory is an active research area, our work on a new memory management framework over commercial operating systems is, to the best of our knowledge, the first such effort.

2.1 Memory Management Architecture

Recently, many ambitious research efforts [2], [4], [5], [6] have reviewed or improved the memory subsystem.

Some studies give an overview of the evolution of Linux memory subsystems. Huang et al. [2] conduct a quantitative survey of virtual memory’s development process over five years (2009-2015). Insights from this work help system designers and developers track hot functions and build more reliable and efficient memory management systems. This painstaking study gives us lessons and starting points for diving into this research area.

Clements et al. [5] propose a new design called RadixVM that makes mmap, munmap and page faults scale perfectly on non-overlapping memory regions. The drawback of RadixVM is the intensive labor, entangled with other subsystems, that it would require if applied to a commodity operating system. Therefore, only an implementation on a Unix-like research kernel is presented. This study again shows that modifying the memory subsystem of commercial operating systems calls for second thoughts, and that avoiding this risk by using Zweilous is valuable.

Park et al. [7] propose a new physical memory manager called the lazy iBuddy system. They eliminate the overheads of splitting and coalescing by managing pages individually, and reduce lock contention by exploiting a fine-grained locking mechanism. However, this work has to replace the vanilla memory manager, while our work can co-exist with Buddy System. Considering the complexity of a memory manager, our work is more reliable.

2.2 Huge Page Support

Large page support on commercial operating systems was pioneered at the beginning of this century. Navarro et al. [8] pioneer the official large page support in FreeBSD. They introduce techniques of reservation-based allocation and fragmentation control on the vanilla memory manager, while Hzmem is built upon the new memory framework Zweilous, which is independent of the vanilla base page allocator. Kwon et al. [9] propose Ingens, a framework for transparent huge pages that tracks the utilization and access frequency of memory pages. Panwar et al. [14] present a study of the impact of fragmentation on huge pages in the Linux kernel and an efficient memory manager called Illuminator that makes cost-effective allocations. The problems these works try to resolve, such as fragmentation of base pages, are the focus of our work.

2.3 Main Memory Compression

Most research on main memory compression focuses on hardware modification. Ekman et al. [15] propose a main memory compression scheme that reduces access costs by exploiting a highly efficient structure for locating data and a hierarchical memory layout to vary the compressibility with fewer overheads. Pekhimenko et al. [16] propose Linearly Compressed Pages (LCP), which compresses all the cache lines within the same page to the same size using special hardware. Unlike these works, Hzmem does not need any hardware modifications.

Tuduce et al. [17] propose a main memory compression solution that can resize the compressed data automatically. This work gives us much insight into how to manage compressed data in software, and it inspired our implementation of compression management.

2.4 Limitation of the Previous Works

Previous works improve or enhance the memory management subsystem by introducing heavy software or hardware modifications. Although the main ideas are simple, such works require a lot of domain-specific knowledge to implement. This paper does not focus on the traditional method of memory subsystem improvement; rather, it focuses on a general-purpose memory management framework abstraction that facilitates the implementation of new ideas.

3 DESIGN

In this section, we describe the design and architecture of Zweilous. We focus on the motivation for the redesign and the choices made during the design process. To clarify the problems we give examples from Linux, but the idea is not specific to Linux.

3.1 Motivation

The rapid development of cloud computing, and thus the advent of emerging demands, has a tremendous impact on the evolution of software and hardware infrastructure. For example, non-volatile memory (NVM) technologies, which are receiving increasing attention in academia and industry, blur the line between main memory and persistent storage [18], [19], [20], [21], while big memory workloads of hundreds of GBs call for much more elaborate and adaptive overcommitment functions in the underlying memory management. These new situations bring big challenges for the current memory management, whose obsolete design dates back decades.

Our efforts to build Zweilous are first motivated by improving memory-hungry workloads without good temporal locality. In this section, we discuss two existing techniques: large pages and main memory compression. We discuss the current challenges in combining them, and briefly mention how we address these challenges. Finally, we show some more examples of why a framework is needed so that memory subsystems of the operating system can evolve independently and quickly.

3.1.1 Large Pages

Most modern commodity operating systems support large pages. For example, Linux has supported large pages since around 2003 [22]. Large pages are called hugepages in Linux. Unlike other operating systems, Linux first supported large pages explicitly, through the hugetlbfs filesystem. Developers need to map sections of memory from files in hugetlbfs. The process first reads or writes those
sections and thus triggers a page fault. While handling page faults, the virtual memory mapping of hugetlbfs makes the operating system associate the virtual memory with one free large page instead of one free base page. However, the free large page is not allocated directly from the physical memory manager (i.e., Buddy System) but from large page memory pools reserved in advance by administrators. A large page that the pool maintains is in fact a run of contiguous base pages reserved from Buddy System. Thus, the memory pool acts as a “broker” or “middle person” between the process demanding large pages and Buddy System of the physical memory manager.

Why can processes not use large pages directly from the physical memory manager? It is a burden of history. The legacy physical memory manager had been running without trouble long before operating systems started to support large pages. From the very beginning, the design of the physical memory manager in Linux has been based on the simplifying assumption that the page size is constant (i.e., the 4 KB base page in Linux). Meanwhile, many enhanced features, such as swapping or slab, were added to it not long afterwards. They all exclusively handle base pages using the page descriptors for base pages, which reinforces the assumption and makes the memory management system rigid and complicated.

Therefore, Linux kernel developers have to take a lengthy detour and implement large page support on top of the vanilla memory manager at the expense of overheads and maintainability problems. In Linux, one large page is deemed to be 512 contiguous base pages in an aligned 2 MB memory region, and it consumes the same number of page descriptors as those base pages. In Linux, the page descriptor is the key metadata called struct page, a data structure per base page holding meta-information used by the physical memory manager.

3.1.2 Main Memory Compression

A compressed-memory system reserves some sections of memory to hold pages in compressed form, making the effective memory larger instead of resorting to disk reads/writes for swapping. Main memory compression schemes such as zRAM [23], zswap [24] and zcache [25] are supported in Linux, but their implementation leverages the page reclaiming subsystem. Page reclaiming is centered around Buddy System and expects a page size of 4 KB. Therefore, hugepages of legacy memory management are detached from page reclaiming for practical reasons.

Two constraints make it difficult to reuse the current infrastructure to reach our goal: 1) The current page reclaiming subsystem expects a single base page size exclusively, not multiple page sizes. It is not trivial to change this, and even some simple key assumptions that Linux has held for a long time have to be challenged. 2) Current large page support is not involved in the page reclaiming subsystem. Furthermore, the fact that large pages in Linux have no backing storage devices makes swapping and writing back to persistent storage devices impossible.

One solution is to modify the base code of page reclaiming, or even the page descriptors (the metadata), to change this situation. As one of the most used pieces of metadata in the kernel, the page descriptor has swollen considerably. So many other subsystems use it that a single-byte modification has a large impact on other parts of the operating system. The task is time-consuming and laborious because of the complex invariants of memory systems.

We choose another solution: self-contained, customizable metadata/functions are used to eliminate these constraints. The implementation has as much freedom as possible, without the impact on other subsystems that the first solution would entail. Developers need only pay attention to the specific parts of the system they are changing, not to other parts.

3.1.3 Slow Evolutions of Memory Management

The memory subsystem is a mature, core kernel subsystem. Over decades, it has gained many more functions and become more inextricably interconnected with other subsystems. Its codebase has grown large and complex. Therefore, recent development has mainly focused on bug fixes, code maintenance, and optimizations rather than new features [2]. To get a general understanding of how the memory subsystem compares with other subsystems in terms of recently added features, we survey the important features of the Linux kernel from version 4.0 to 5.5, a time range from April 2015 to January 2020. The sources are the “LinuxChanges” pages on the site kernelnewbies.org, which introduce kernel changes whenever a major revision of the kernel is released. We consider the titles in the “prominent features” or “coolest features” sections after checking their descriptions. We observe that 18 are associated with memory subsystems, which accounts for only 7.9 percent of the 228 important features in total. Memory management changes are very few. For example, 4 of them are related to system calls such as mmap and madvise. They are deemed important because some in-kernel services are exposed to user space, which has a greater impact on others. Therefore, if we raise the standard, the percentage of new and important features recently added to memory subsystems would be even lower.

We find that large, radical memory management changes often take many years to find their way into the mainline, such as userspace page fault handling [26], and that some incomplete functions take a couple of years to complete, such as using persistent memory as RAM [27]. Although our early motivation is contingent, these examples show that a framework should be built to hide non-trivial, general details and make the memory subsystem more extensible for developers. The programmability it achieves can help memory subsystems evolve more quickly. Developers can use the framework for these benefits:

• Zweilous aims to generalize. The components of our framework are as follows: a) well-defined placeholders of self-contained, customizable metadata/functions that are necessary for a full-featured memory manager, and b) well-partitioned underlying physical memories that are managed by to-be-implemented memory managers. It is easy for developers to start designing the very core of the memory manager without learning or referring to the details that memory managers have in common. Our work shows that a developer without any kernel development experience can start to work in 2 weeks.
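The two components named above, placeholders for metadata/functions and a partition of physical memory owned by one manager, can be pictured as a table of function pointers plus a region descriptor. The following C sketch is only an illustration under stated assumptions: every name in it (`zw_region`, `zw_page_desc`, `zw_mm_ops`, the `toy_*` functions) is hypothetical and is not taken from the Zweilous implementation, and the toy manager owns a single page purely to show the slots being filled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical partition of physical memory owned by exactly one manager. */
struct zw_region {
    uint64_t start;      /* physical start address in bytes */
    uint64_t size;       /* length in bytes */
    uint64_t page_size;  /* page size chosen by the manager, in bytes */
};

/* Hypothetical manager-private page descriptor; fields are up to the manager. */
struct zw_page_desc {
    uint64_t index;      /* page index, counted in units of page_size */
    uint32_t flags;      /* bit 0: page is allocated */
};

/* Hypothetical placeholder table that a new memory manager fills in. */
struct zw_mm_ops {
    int (*init)(const struct zw_region *r);
    struct zw_page_desc *(*alloc_page)(void);
    void (*free_page)(struct zw_page_desc *pd);
    int (*handle_fault)(uint64_t vaddr);
    size_t (*reclaim)(size_t nr_pages);
};

/* Framework-side check: a manager is usable only if every slot is filled. */
static int zw_ops_complete(const struct zw_mm_ops *ops)
{
    return ops && ops->init && ops->alloc_page && ops->free_page &&
           ops->handle_fault && ops->reclaim;
}

/* Number of manager-sized pages in a partition. */
static uint64_t zw_region_pages(const struct zw_region *r)
{
    assert(r != NULL && r->page_size != 0);
    return r->size / r->page_size;
}

/* A toy manager that owns a single page, only to show the slots being filled. */
static struct zw_page_desc toy_pd;

static int toy_init(const struct zw_region *r)
{
    toy_pd.index = r->start / r->page_size;
    toy_pd.flags = 0;
    return 0;
}

static struct zw_page_desc *toy_alloc(void)
{
    if (toy_pd.flags & 1u)
        return NULL;             /* the only page is already in use */
    toy_pd.flags |= 1u;
    return &toy_pd;
}

static void toy_free(struct zw_page_desc *pd) { pd->flags &= ~1u; }
static int toy_fault(uint64_t vaddr) { (void)vaddr; return 0; }
static size_t toy_reclaim(size_t nr_pages) { (void)nr_pages; return 0; }

static const struct zw_mm_ops toy_ops = {
    .init = toy_init, .alloc_page = toy_alloc, .free_page = toy_free,
    .handle_fault = toy_fault, .reclaim = toy_reclaim,
};
```

In this picture, the details that memory managers have in common would live on the framework side of the table, and a developer would write only the functions behind the slots and the fields of the descriptor.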
• Zweilous can run with the vanilla memory manager in one operating system. Zweilous achieves programmability at two levels: the page level and the API level. Changing and debugging memory management is difficult, as many in-kernel debugging/trace tools rely on memory to run. The case is different in Zweilous. The underlying physical memory is well partitioned: important data in the operating system, such as kernel data, is managed by the vanilla manager, which has proved robust and stable over a long time, while the memory of testing applications is handled by the newly implemented memory manager based on Zweilous. Developers can easily introduce and modify new memory management services with far fewer kernel crashes, which greatly reduces development time and labor.

Therefore, we conclude that enhancing large pages with main memory compression using existing infrastructure and tools is even more difficult and convoluted, and that it is advisable to redesign a general-purpose physical memory management framework based on much more flexible and extensible assumptions without bringing instability to the Linux base code.

3.2 Goals and Challenges of Design

We want the code modification in both kernel space and user space to be as small as possible. Thus, 1) in user space, the framework interface should be compatible with the legacy hugepage API; 2) in kernel space, a self-contained framework that does not use existing in-kernel utilities (i.e., functions or APIs provided by the kernel) has no overlap with old, stable kernel code.

Thus, the main goals are to make our framework decoupled and flexible, and to ensure that systems based on Zweilous do not compromise functionality or performance. The new systems can also run simultaneously with other subsystems and/or even the vanilla memory management. We take hardware virtualization support into consideration to make our framework more versatile in cloud computing.

When a large, old and complicated system is retrofitted with a new feature that affects or crosscuts all functions, the retrofit does not make a better new system. There are many such cases in the system design domain. For example, Buddy System, the allocator that was originally designed to manage a single page size, is to be retrofitted with a new feature for large pages. Though the new feature can break the old principles or assumptions, it has to be subordinate to the old ones by reusing the existing page descriptors. That leaves the design phase with less freedom and results in an awkward implementation.

The solution is to put the new features on the same level as the old ones; this is our first-class principle. The key points of our framework are as follows: a) well-defined placeholders of self-contained, customizable metadata/functions that are necessary for a full-featured memory manager, and b) well-partitioned underlying physical memories that are managed by specific managers. We decide to implement the huge page physical memory management with a clean slate, based on the redesigned memory management framework Zweilous.

From practical development with the high-level design, we derive the following metadata/functions that a new memory management framework needs to consider:

• invisibility from the vanilla memory management subsystem.
• a new page descriptor that represents the physical page states.
• a page fault handler branch.
• a page reclaiming mechanism.

While considering the details and implementation of these aspects, system designers and developers should not ignore optimization.

3.2.1 Invisibility From the Vanilla Memory Management Subsystem

To make our framework as decoupled as possible from the vanilla memory management subsystem in Linux, we decide to make the framework completely invisible to the vanilla memory subsystem without making any significant modification to other parts of the operating system. During the booting process of the operating system, some memory needs to be removed from the management of the vanilla memory management subsystem. The reserved memory region is thus invisible to the operating system but visible to our new memory framework. Applications access this memory through the framework. Our new memory framework can then manipulate the reserved memory region exclusively, without interference from any other Linux subsystem.

3.2.2 New Page Descriptor Represents One Physical Page’s States

A normal page descriptor represents the states of one physical page and is the key metadata for the memory manager. This general metadata is modified by many subsystems in a frequent and complicated way during the whole life of the operating system. Therefore, it contains various states concerning many different subsystems in just one data structure.

For example, the complete definition of a struct page has a size of over 4 double words in the x86-64 architecture, even though developers have optimized it leaving almost no means untried. The C union is seen throughout its definition. Which parts can be put together into one union is quite domain-specific knowledge, and there is no guarantee that the length of the structure will remain unchanged. Many states concerning different subsystems are tangled together, which causes readability and maintainability problems in this overused data structure. Moreover, the assumption of a constant page size originates in the general page descriptors.

The new memory management focuses on the memory regions that are detached and invisible from the vanilla memory management subsystem. Given the main design goals of decoupling and flexibility, reusing the old page descriptor, with its readability and maintainability problems and its obsolete assumption of a 4 KB page size, has no merit and is not even necessary. But a page descriptor that holds the information of one physical page is indispensable, therefore we decide to use a completely new page
Fig. 2. Workflows of Hzmem: two paths, two entities, and one end.

entities (a fault handler and a compression daemon), and one end (a compression data manager).

Compression Path. A periodically running compression daemon selects pages that are less likely to be used in the future (we call them cold pages) and passes them to the compression data management component. The compression data manager compresses them losslessly.

Decompression Path. This path starts when a user space application accesses a page that was previously compressed in the above path. When that happens, a special page fault is triggered and the corresponding handler quickly restores the compressed pages through the compression data manager.

We implement 3196 lines of C code (LOC). It runs well in Linux kernel of versions with the vanilla Linux memory manager Buddy System.

4.1 Hugepage Physical Memory Allocator

We take a clean-slate approach and implement the hugepage physical memory allocator from the ground up. Overall, it uses the design principles described in Section 3.

4.1.1 New Page Descriptors

Currently, one important data structure for memory management in Linux is struct page, since many subsystems or functionalities of memory management, such as page allocation/free in Buddy System and page reclaiming, use it for getting/setting page states. In the initial design, each struct page represents one base page. As memory capacity increases, this means a large number of page descriptors, forcing developers to rack their brains to cram more into a single page descriptor just to keep its size small, at the expense of readability and maintainability [28].

Hugepages (Linux’s large page support) use this data structure to reuse page allocation/free from Buddy System by merging 512 base pages into one compound page. Its states are stored in one page descriptor at the head, leaving the other page descriptors unused. 511 of the 512 page descriptors are simply wasted space.

We introduce new page descriptors exclusively for Hzmem. This has two benefits:

• In Hzmem, the vanilla large page support that treats one large page as 512 contiguous base pages is dropped. A newly customized page descriptor is introduced in Hzmem. One page descriptor strictly represents one large page and co-exists with the vanilla page descriptors that represent base pages for other parts of the operating system. The one-to-one mapping gets rid of any “compound” and space-wasting problems.

• The new page descriptors are completely self-contained and specialized, and are not used by other parts of the operating system. The structure contains fewer fields, which makes it much more readable and maintainable.

The smaller number and size of the new page descriptors reduce the memory footprint enormously, an effect that is amplified on machines with a large capacity of physical memory under severe pressure.

4.1.2 Deallocated and Allocated Page Management

Each node of a NUMA system usually has tens of GBs of memory on servers. To manage such large memory efficiently, one NUMA node should not be managed in the same way as in Linux. Zweilous gets rid of Buddy System, and thus many of its key assumptions cannot hamper Hzmem from flexible management of memory.

As shown in Fig. 3, on top of Zweilous, the hugepage memory of one NUMA node contains several sections. Currently, we heuristically set the size of one section to 4 GB. Thus, free pages are scattered into different sections, reducing lock contention among CPUs. Memory management at a smaller granularity achieves better scalability and parallelism.

The state-of-the-art Buddy System manages contiguous blocks of 2^n base pages, where n is called the order. A physical memory block of order n + 1 is exactly twice the size of a physical memory block of order n. To best fit the size of the requested memory, a block of a large order is split into halves of one order lower when necessary. The two halves are called buddies of each other, which is where the name Buddy System comes from. To keep contiguous physical memory as large as possible, two buddies, when both free, are coalesced into one block of one order higher.

Blocks of excessively large orders are seldom requested and bring a lot of overhead when allocated/freed. The upper limit of the order in Linux is empirically 10, which makes the largest contiguous block 4 MB, the size of 2 contiguous hugepages. Thus, it is not advisable to maintain free huge pages in contiguous blocks. To make the implementation of
Hzmem simple, we arrange free hugepages in linked lists. The non-trivial splitting and
coalescing of "Buddy" are averted. In Hzmem, allocation and freeing of hugepages are done in
O(1) time.

4.2 Page Fault Handler
Page faults are categorized into hard and soft. They are distinguished by whether contents must
be read from disk to fill the free page returned by the memory manager; hard page faults, which
require this disk read, therefore make the performance of the system suffer.
Since huge pages do not map persistent storage, we pay attention only to soft page faults that
trigger page table operations in our implementation of Hzmem, which greatly simplifies the code
path of page faults.
There are two cases when processing page faults in Hzmem:
1) The most common one is that a virtual page is accessed for the first time while not resident
   in memory. The huge page fault is a soft page fault, so the new physical memory manager
   allocates a zeroed huge page.
2) The special one is that a protection violation is triggered. Either a shared page has to be
   returned from the page cache or a compressed page has to be decompressed by the compression
   data manager.
Hardware-based virtualization helps to improve virtualization performance and simplify guest OS
implementation. For practical purposes, we also extend Hzmem to support virtualization on
x86-64 CPUs with Intel VT-x.
Intel VT-x introduces many new features. In terms of CPU virtualization, root and non-root
modes guarantee different privileges of CPU execution for isolation and trap-and-emulate; in
terms of memory virtualization, nested paging, called Extended Page Table (EPT), is used for
mapping guest physical memory addresses to host physical memory addresses.
However, it is not possible to apply Hzmem directly. The challenge comes from the different
page sizes of host and guest. Both host and guest can have 2 sizes of pages: 2 MB and 4 KB.
That forms 4 kinds of combinations mathematically, but only 3 kinds exist in reality: a
combination of 2 MB in guest and 4 KB in host does not exist in Linux.
Ideally, everything just works when page sizes match in host and guest. An EPT fault is a VM
exit, and thus the execution falls back to the VMM. In handling the EPT fault to fill the EPT
table, the VMM triggers a host page fault which brings in the "real page". Both page fault
handlers can simply use the corresponding EPT page entry and host page entry of the same size,
which can be dealt with perfectly well by Hzmem or the vanilla manager.
The situation is not straightforward in the third combination: 4 KB in the guest and 2 MB in
the host.
1) Managing Two Sizes in One Memory Manager. In this case, one host page entry of 2 MB can be
   involved in many other EPT faults of 4 KB whose fault addresses fall in the same 2 MB memory
   range. The Hzmem memory manager has to deal with these two kinds of pages at the same time,
   while Hzmem is inherently designed for managing pages with a size of 2 MB.
2) More Complicated Page Fault Handling. When EPT faults in the same 2 MB range occur
   successively, only the first one triggers a host page fault and loads the 2 MB memory pointed
   to by the page table entry. The rest do not have to trigger a host page fault because the
   2 MB memory is already resident. However, they cannot simply point to the same 2 MB page,
   since the fault addresses fall at different offsets within the 2 MB range.
Therefore, some tweaks to page fault handling exist in Hzmem to support nested paging in a
hardware-based virtualization environment.
Our solution to the mismatch of page sizes shows the flexibility of our design. We add one
field to the new page descriptor to contain the offset within the 2 MB block. Because of our
self-contained design, it is convenient to add one field without impact on other subsystems.
We make the page descriptor available to the EPT page fault handler. When the first EPT fault
occurs within one 2 MB memory block, the VMM sets up the EPT page table and the host page table
correctly and loads a free 2 MB page. Other EPT faults within the same 2 MB block can be
handled through the offset carried by the page descriptor without manipulating the host page
table. The offset in the page descriptor is transient, and it is reused by every EPT fault
within the block.

4.3 Page Reclaiming and Compression Data Management
The allocated pages are identified as cold or hot. The cold ones are reclaimed instead of the
hot ones, which helps reduce throttling of the operating system when memory is scarce. Based on
Zweilous, we have our own page descriptors for storing the states of hugepages, and thus a
self-contained page reclaiming mechanism can be established. In every node, a daemon
periodically monitors the usage of huge pages against a watermark that indicates whether memory
is sufficient; if it is not, the daemon triggers page reclaiming. We use the second chance
algorithm [29] taken from vanilla Linux, and further investigation of hot/cold page
identification for specialized optimization is left for future work.
The hugepage compression data manager is responsible for compression and decompression of
hugepages. We choose LZ4, a lossless data compression algorithm with better decompression speed
[30]. Since compressed data is not of a fixed size, we store it in base pages to reduce
fragmentation.

5 EVALUATION

We measure Zweilous and Hzmem with many user applications and benchmarks, comparing against the
state-of-the-art hugetlbfs in Linux. Experiments are performed on a 64-GB-memory server having
16 Intel Xeon E7520 1.87 GHz CPUs. Linux 3.10 and CentOS 7 are used for both host and guest
environments. Base pages are 4 KB and large pages are 2 MB.
After describing some details of our development experience to show the programmability, we
first test Zweilous on native, non-virtualized and virtualized platforms using SPECjvm2008 [31]
and SPECCPU2017 [32] for overheads and STREAM [33] for throughput. Then datasets of the Yelp
Dataset Challenge [34] are used to evaluate the effective memory increase in Hzmem. Finally, a
series of user applications
TABLE 1
Throughput (MB/s) of Benchmark STREAM in Different Sizes (millions) of Vectors
5.4 Virtualized Platform
We test Zweilous on the virtualized platform using libvirt 3.9.0 and QEMU 1.5.3. Both guest and
host use Linux 3.10. QEMU [36] is a generic and open source machine emulator and virtualizer
that can leverage hardware virtualization on the x86-64 platform. It is a user-mode virtual
machine launcher and monitor. To use Hzmem to manage the memory used by the VM, we launch and
map the memory of QEMU via the aforementioned API hugetlbfs.

5.4.1 Start Time
We want to measure how long it takes to create and boot a virtual machine whose memory is
managed by Zweilous, how this scales as the memory capacity of running VMs increases, and how
these compare to an unmodified virtual machine.
In our test, we increase the memory capacity from 2 GB to 32 GB in steps of 2 GB. The results
are shown in Fig. 5. An unmodified virtual machine with 2 GB memory capacity starts in about
1.5 seconds, scaling almost linearly to a maximum of 16 seconds with 32 GB memory capacity. On
the other hand, the Zweilous curve in the figure starts at 3.7 seconds and rises to 14 seconds.
Below 22 GB memory capacity, Zweilous is inferior to the unmodified one in start time. However,
from 22 GB up to 32 GB, Zweilous shows an advantage over the unmodified one.
Zweilous contains only a prototype code base, without the polish that many excellent engineers
have given the unmodified one. However, it has a more advanced design that reduces the
redundancy of the vanilla large page management. Therefore, it scales better than the
unmodified one in the start time of virtual machines.

Section 4.2. In both cases, QEMU in the host creates and maps the main memory of the virtual
machine using the 2 MB page size. In the guest, we configure the benchmark SPECjvm to use two
page sizes to stress the memory subsystems through different page fault code paths.
The first case is that the benchmark uses the 4 KB page size inside the VM. In this case, EPT
faults return execution to the VMM, and the VMM handles the EPT faults at 4 KB size, which
further triggers normal page faults at 2 MB in the host. If successive EPT faults occur within
the same 2 MB memory block, no host page faults are triggered and the correct offsets within
the 2 MB memory block are set without trouble. Fig. 6a shows the results of this case. The best
case is 3.37 percent and the worst is -0.91 percent. Most results are within the range of
plus/minus 1.0 percent. Zweilous achieves performance comparable to the unmodified one.
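The 4 KB-in-guest, 2 MB-in-host fault path can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the Hzmem implementation: the names `HugePageDescriptor`, `EptFaultHandler`, `handle`, and the list of free host pages are hypothetical, and the real handler runs inside the kernel and VMM. The model shows the one property described in the text: only the first EPT fault in a 2 MB block causes a host page fault, while later faults in the same block reuse the resident page and merely refresh the transient offset field.

```python
from dataclasses import dataclass

HUGE_PAGE_SIZE = 2 * 1024 * 1024  # 2 MB page size used by the host
BASE_PAGE_SIZE = 4 * 1024         # 4 KB page size used by the guest


@dataclass
class HugePageDescriptor:
    """Hypothetical stand-in for a per-hugepage descriptor."""
    host_base: int   # host physical address of the resident 2 MB page
    offset: int = 0  # transient field: offset of the most recent 4 KB fault


class EptFaultHandler:
    """Toy model of EPT fault handling for 4 KB guest / 2 MB host pages."""

    def __init__(self, free_host_pages):
        self.free_host_pages = list(free_host_pages)  # free 2 MB pages
        self.resident = {}    # 2 MB block index -> HugePageDescriptor
        self.host_faults = 0  # number of host page faults triggered

    def handle(self, guest_phys_addr):
        """Return the host physical address backing a faulting guest address."""
        block = guest_phys_addr // HUGE_PAGE_SIZE
        desc = self.resident.get(block)
        if desc is None:
            # First fault in this 2 MB block: a host page fault loads a free page.
            self.host_faults += 1
            desc = HugePageDescriptor(host_base=self.free_host_pages.pop(0))
            self.resident[block] = desc
        # Every fault overwrites the transient offset, aligned down to 4 KB.
        within_block = guest_phys_addr % HUGE_PAGE_SIZE
        desc.offset = (within_block // BASE_PAGE_SIZE) * BASE_PAGE_SIZE
        return desc.host_base + desc.offset
```

With two free host pages at 0x40000000 and 0x40200000, faults at guest addresses 0x1234 and 0x105000 fall in the same 2 MB block, so the second one is resolved from the descriptor alone and the host fault counter stays at one; a fault at 0x200000 starts a new block and triggers a second host fault.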
[22] J. Corbet, Huge pages part 1 (Introduction), 2010. [Online]. Available: https://lwn.net/Articles/374424/
[23] N. Gupta, zRAM: Compressed RAM based block devices, Accessed: Jul. 2020. [Online]. Available: https://www.kernel.org/doc/Documentation/blockdev/zram.txt
[24] S. Jennings, zswap, Accessed: Jul. 2020. [Online]. Available: https://www.kernel.org/doc/Documentation/vm/zswap.txt
[25] J. Corbet, zcache: A compressed page cache, 2010. [Online]. Available: https://lwn.net/Articles/397574/
[26] J. Corbet, User-space page fault handling, 2015. [Online]. Available: https://lwn.net/Articles/636226/
[27] J. Corbet, Persistent memory for transient data, 2019. [Online]. Available: https://lwn.net/Articles/777212/
[28] J. Corbet, Cramming more into struct page, 2013. [Online]. Available: https://lwn.net/Articles/565097/
[29] W. Mauerer, Professional Linux Kernel Architecture. Hoboken, NJ, USA: Wiley, 2010.
[30] P. Zaitsev and V. Tkachenko, Evaluating Database Compression Methods: Update, 2016. [Online]. Available: https://www.percona.com/blog/2016/04/13/evaluating-database-compression-methods-update/
[31] SPEC, SPECjvm2008, 2008. [Online]. Available: http://www.spec.org/jvm2008/
[32] SPEC, SPECCPU2017, 2017. [Online]. Available: http://www.spec.org/cpu2017/
[33] McCalpin, STREAM benchmark, 2002. [Online]. Available: http://www.cs.virginia.edu/stream/
[34] Yelp Dataset Challenge, Accessed: Jan. 2017. [Online]. Available: https://www.yelp.com/dataset_challenge/
[35] libhugetlbfs, 2015. [Online]. Available: https://github.com/libhugetlbfs/libhugetlbfs
[36] QEMU, the FAST! processor emulator, Accessed: Jul. 2020. [Online]. Available: http://www.qemu.org

Guoxi Li (Student Member, IEEE) received the BS degree in computer science from Zhejiang
University. He is currently working toward the PhD degree with Zhejiang University. His
research interests include operating systems and system virtualization. He is a student member
of the ACM.

Wenzhi Chen (Member, IEEE) received the PhD degree from the College of Computer Science and
Engineering at Zhejiang University. He is a professor with the College of Computer Science and
Technology, Zhejiang University, and the director of the Information Technology Center of
Zhejiang University. He used to be the vice dean of the College of Computer Science and
Technology. His current research interests include embedded systems and their applications,
computer architecture, computer system software, and information security. He is a member of
ACM and ACM Education Council.

Yang Xiang (Fellow, IEEE) received the PhD degree in computer science from Deakin University,
Australia. He is currently a full professor and the dean of Digital Research & Innovation
Capability Platform, Swinburne University of Technology, Australia. His research interests
include cyber security, which covers network and system security, data analytics, distributed
systems, and networking. He is also leading the Blockchain initiatives at Swinburne. In the
past 20 years, he has been working in the broad area of cyber security, which covers network
and system security, AI, data analytics, and networking. He has published more than 300
research papers in many international journals and conferences. He is the editor-in-chief of
the SpringerBriefs on Cyber Security Systems and Networks. He serves as the associate editor of
IEEE Transactions on Dependable and Secure Computing and IEEE Internet of Things Journal, and
the editor of Journal of Network and Computer Applications. He served as the associate editor
of IEEE Transactions on Computers and IEEE Transactions on Parallel and Distributed Systems. He
is the coordinator, Asia for IEEE Computer Society Technical Committee on Distributed
Processing (TCDP).