
Memory management re-architecture in Godot 4.0

Introduction
It may not seem like it, but the core of Godot is quite old, and it was designed for a time when devices had significant memory constraints. As an example, the current architecture of Godot was designed for the Nintendo DS, and it was even used to publish PSP games. These devices had, at most, a dozen megabytes available to the game.

This is why, at its core, a lot of code in Godot prioritizes memory usage and avoids memory fragmentation by using small allocations and pooling large ones.

Nowadays, the general landscape of computing has changed significantly. Devices have several gigabytes of memory available, so small allocations are rather insignificant. Likewise, most devices are 64-bit, with the few remaining 32-bit devices becoming unsupported soon:

● Windows is deprecating 32-bit this year (2019) by no longer supporting Windows 7.
● Ubuntu has deprecated 32-bit versions of the distribution (leaving only a compatibility layer).
● macOS no longer supports 32-bit.
● Google deprecated 32-bit applications this year (2019). Only low-end devices still support 32-bit.
● iOS no longer supports 32-bit.
● Consoles are, nowadays, all 64-bit.
● Newer Raspberry Pi devices are 64-bit.

Notable exceptions:
● WebAssembly is 32-bit (although games exported for the web are generally simpler due to storage and memory limitations, so these changes would not be a problem for them). Eventually a 64-bit version will come out.

Given this situation, we should assume that in a few years, 32-bit will be completely obsolete. While Godot should still support this architecture for a while, it should definitely no longer be the priority or focus from this point onwards.
The upcoming Godot 4.0 presents a great chance to redesign the memory system, simplifying and improving it by assuming that the large majority of our user base only cares about 64 bits.

Current Situation
Currently, Godot has three types of memory allocation:
● Regular malloc (via memalloc/memfree): a wrapper around malloc/free and new/delete that also keeps track of current and peak memory usage.
● Vector<> template: mostly used for storage and for passing data around. It is reference counted to avoid unnecessary copying. It is used for more or less small allocations, as large allocations can lead to fragmentation.
● Pool vectors: designed for placement in a single contiguous pool which can be compacted if needed. Locking is required to access this memory, to avoid it being moved during the compacting process.

This memory design was very useful for devices with low amounts of memory, or 32-bit devices, which have a very limited address space. With 64 bits, this design makes less sense, as the virtual address space is enormous in comparison.

Use cases
Before proposing a new system, it seems wise to make sure the memory use cases in Godot are well understood. They are explained below.

● Godot is designed so that most of its regular allocations are small, in fact smaller than a memory page. This helps allocators avoid mixing large and small allocations, which is one of the most common sources of memory fragmentation.
● When high performance is not required and memory access does not need to be linear, yet the data must grow and shrink, Godot tends to prefer not allocating large amounts of contiguous memory, preferring lists or red-black trees instead. Memory allocators in operating systems are generally optimized for this use case by binning small allocations together in the same pages.
● Due to the two points above, Godot's architecture guarantees that large allocations will always be arrays.
● In roughly 99% of use cases, Godot uses arrays for:
○ Passing data around
○ Keeping data static in memory
Due to this, both Vector<> and PoolVector<> are reference counted (to avoid unnecessary copies when passing) and use copy-on-write (to prevent the user from breaking the encapsulated data that was passed to some other class).
● In roughly 1% of cases, vectors are grown to large sizes via push_back(). Godot generally tries to avoid this situation during real-time operation because it is slow and memory inefficient.
● PoolVector<> is rarely resized, as the data contained in it is most of the time lifted from somewhere and sent somewhere else without modification.
● There are, however, use cases that are needed (not often, but they have still proven necessary):
○ A std::vector-like behavior, without reference counting or copy-on-write (because they affect performance) and with reserve() capabilities. Again, this would only be used for relatively small allocations. It is not needed that often, but it may happen. Godot generally handles it by just resizing a Vector to a size and then using a pointer to it. A solution more similar to std::vector might be more elegant in these cases.
○ A Vector-like behavior without reference counting or copy-on-write that can push_back elements and grow to large sizes. This is currently not supported in any way by Godot.
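The reference-counted, copy-on-write behavior described above can be illustrated with a minimal sketch. This is not Godot's actual Vector<> implementation; it is a simplified stand-in (using std::shared_ptr for the reference count) that shows the semantics: copies share storage, and the storage is only cloned when write access is requested.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Minimal sketch of copy-on-write semantics. CowVector is a
// hypothetical name, not Godot's real class.
template <typename T>
class CowVector {
    std::shared_ptr<std::vector<T>> data = std::make_shared<std::vector<T>>();

    // Clone the storage only if somebody else also references it.
    void detach() {
        if (data.use_count() > 1)
            data = std::make_shared<std::vector<T>>(*data);
    }

public:
    void push_back(const T &v) { detach(); data->push_back(v); }
    T &write(size_t i) { detach(); return (*data)[i]; }  // write access: may copy
    const T &read(size_t i) const { return (*data)[i]; } // read access: never copies
    size_t size() const { return data->size(); }
    bool shares_storage_with(const CowVector &other) const { return data == other.data; }
};
```

Passing such a vector around is cheap (only the reference count changes), and a receiver that writes to it gets its own copy, leaving the original intact.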

Proposal

1: Merge Vector<> and PoolVector<>

On 64-bit systems, I think it does not make much sense to keep Vector<> and PoolVector<> separate. Technically, the following scheme is ideal on 64-bit systems:

1. For small allocations (less than a page), use malloc().

2. For large allocations, just allocate virtual memory (mmap).

I believe malloc() already does this on modern systems (although it may not on consoles). Likewise, Vector<> could grow in powers of 2 until it reaches a page, and then round the allocation size up to a page (4 KiB). Again, it is quite possible that malloc() already does this on 64-bit systems, so this needs more research.
One thing we discussed at some point, which might be an interesting benefit of implementing this behavior manually, is that if we assume large allocations will use virtual memory directly (mmap or similar), it would be possible to have a Vector<> that internally maps to a file or something similar, thus speeding up loading in some cases and using less memory. So, again, this needs a bit more deliberation.

This behavior is ideal because:

● For small allocations, growing in powers of 2 is faster.
● For large allocations, rounding up to a page size (4 KiB) ensures no fragmentation.

For 32-bit systems it should work about the same, but the small virtual address space will cause fragmentation. I will assume games made for these systems will be rather simple in nature, so it might not be a problem.
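The growth policy proposed above can be sketched as a small helper: grow in powers of two while below a page, then round up to whole pages. The 4 KiB constant is an assumption here; real code would query the OS page size.

```cpp
#include <cassert>
#include <cstddef>

// PAGE_SIZE is assumed to be 4 KiB for this sketch.
constexpr size_t PAGE_SIZE = 4096;

// Given the number of bytes a Vector needs, return how many bytes
// to actually allocate under the proposed policy.
size_t next_alloc_size(size_t bytes) {
    if (bytes == 0)
        return 0;
    if (bytes < PAGE_SIZE) {
        // Small allocation: next power of two.
        size_t p = 1;
        while (p < bytes)
            p <<= 1;
        return p;
    }
    // Large allocation: round up to a multiple of the page size.
    return (bytes + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
}
```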

2: Create a page allocator

There are some cases where we need to grow arrays to large and uncertain sizes. Some of these use cases are detailed below:

● When dealing with RIDs (opaque IDs), Godot 3.x is rather slow because they need to be looked up in a hash map. To avoid this, we can just use an array with O(1) access for the unique IDs. This works as follows:
○ A RID is 64 bits: the upper 32 bits contain a unique resource ID (always incremented), and the lower 32 bits contain an index into the array.
○ This way, when allocating, we find a free slot in the array and get its index, then create a unique ID, store it in the array, and return the RID by ORing together the unique ID and the index.
○ To look it up, we just get the index from the RID and then check that the unique ID matches.
○ If the array is full when we allocate, we just grow it. We never shrink it, as we consider this a worst-case allocation.
● When issuing drawing commands (lines, rects, polygons, etc.) in a CanvasItem in Godot 3.x, these commands are allocated continuously and put on a linked list, so if a lot of drawing commands happen, a lot of allocations also take place. Re-drawing the canvas item clears the list and then allocation begins again, further hurting performance. To improve this situation, the following logic needs to take place:
○ Drawing commands are stored contiguously in a buffer associated with the CanvasItem.
○ If the buffer runs out of space, it is made to grow.
○ Again, this buffer will never shrink because we consider it a worst-case allocation.
● When culling cameras or shadow maps, we never know how many objects we are going to find. The culled items, however, need to be put in a contiguous buffer because they will be processed linearly to take maximum advantage of cache coherency. Additionally, we want to be able to cull all shadow maps from multiple threads to improve performance. To improve this situation we need to:
○ Have a buffer for regular culling that can grow as big as the game needs.
○ Have a buffer for shadow-map culling that can grow as big as needed, to contain all culled objects.
○ Again, this is also considered a worst-case allocation, so this buffer will never shrink.
● This situation repeats all around Godot; as an example, we can make use of these worst-case, grow-but-never-shrink buffers to improve physics collision and solving performance.
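The RID scheme from the first bullet above can be sketched as follows. The class and method names are illustrative, not Godot's actual API; what matters is the split of the 64-bit RID into a 32-bit always-incrementing ID and a 32-bit array index, giving O(1) lookup with stale-handle detection.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the proposed RID owner: upper 32 bits = unique ID,
// lower 32 bits = index into a grow-only array.
class RidOwner {
    struct Slot { uint32_t id = 0; bool used = false; };
    std::vector<Slot> slots; // grows, never shrinks (worst-case allocation)
    uint32_t next_id = 1;    // always incremented, never reused

public:
    uint64_t make_rid() {
        // Find a free slot; grow the array if none is available.
        for (size_t i = 0; i < slots.size(); i++) {
            if (!slots[i].used) {
                slots[i] = { next_id, true };
                return (uint64_t(next_id++) << 32) | uint32_t(i);
            }
        }
        slots.push_back({ next_id, true });
        return (uint64_t(next_id++) << 32) | uint32_t(slots.size() - 1);
    }

    bool is_valid(uint64_t rid) const {
        uint32_t index = uint32_t(rid & 0xFFFFFFFF);
        uint32_t id = uint32_t(rid >> 32);
        // O(1): index into the array, then check the unique ID matches.
        return index < slots.size() && slots[index].used && slots[index].id == id;
    }

    void free_rid(uint64_t rid) {
        if (is_valid(rid))
            slots[rid & 0xFFFFFFFF].used = false; // the slot is reused, the ID never is
    }
};
```

Because freed slots get a fresh ID on reuse, a stale RID held by the caller fails the ID check instead of silently aliasing a new resource.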

Just growing a large buffer is very inefficient memory- and CPU-wise, even using virtual memory and malloc/realloc. As an example, if one of these buffers is 128 MB and it needs to grow to 130 MB, at some point the old 128 MB and the new 130 MB allocations will co-exist in memory (suddenly using 258 MB), and then all content from the old buffer needs to be copied to the new one, which is slow and can stall.

The right solution to this problem is to divide buffers into pages. Instead of one large array, we keep a small array of pointers to pages. If the buffer needs to grow, a new page is eventually allocated, and resizing the page-pointer array is harmless because it is very small in comparison.
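A minimal sketch of this paged-buffer idea is below (names and the page size are illustrative). The key property is that growing allocates exactly one new page and never moves or copies existing elements, avoiding the temporary double allocation described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of a paged buffer: a small array of pointers to
// fixed-size pages. PAGE_ELEMS is an illustrative constant.
template <typename T, size_t PAGE_ELEMS = 1024>
class PagedBuffer {
    std::vector<T *> pages; // small; resizing it is cheap
    size_t count = 0;

public:
    void push_back(const T &v) {
        if (count == pages.size() * PAGE_ELEMS)
            pages.push_back(new T[PAGE_ELEMS]); // grow: one page, no copying of old data
        pages[count / PAGE_ELEMS][count % PAGE_ELEMS] = v;
        count++;
    }
    T &operator[](size_t i) { return pages[i / PAGE_ELEMS][i % PAGE_ELEMS]; }
    size_t size() const { return count; }
    size_t page_count() const { return pages.size(); }
    void clear() { count = 0; } // worst-case allocation: pages are kept, never shrunk
    ~PagedBuffer() {
        for (T *p : pages)
            delete[] p;
    }
};
```

Iteration is no longer over one contiguous block, but within each page access is still linear, which keeps most of the cache-coherency benefit.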

As a lot of these pages need to be allocated and freed, using malloc() is likely not a great idea, because malloc() needs to store extra data to keep track of each allocation, so these arrays may eventually become not very efficient.

To solve this, we could have page pools and allocate them on demand, again on a worst-case basis. A page pool could be something like 64 or 128 MB and contain the pages. Allocating would just return a pointer, and freeing would be easy because we can check the pointer against the known start/end addresses of the page pools.

These pools can be allocated with either malloc() or mmap(), though mmap() might be a better idea because we then know for certain that the memory is virtual. Again, this may not make any difference, and it needs to be researched.
3: Create a StaticVector class

This would behave similarly to std::vector, for cases where the vector needs to be resized (to a not-too-large size), reserved, and quickly addressed. It is basically a more primitive wrapper around malloc/realloc.

We could either use std::vector (with a custom allocator that uses our own memalloc/memfree so we can keep track of memory usage better), or write our own.
