
Hitting 60Hz with the Unreal Engine: Inside the Tech of Mortal Kombat vs DC Universe
Jon Greenberg – MK Team
Nathan Mefford – Chicago ATG
Why Bother?
• In general, “twitch” games require very
high framerate.
• Fast input response demands fast
feedback to player
• Running at 60Hz a basic requirement of
fighting genre.
Why Is 60 So Rare?
• Very few games target 60Hz (< 10% of games)
• Only 16.7 ms in which to do everything vs 33.3
ms at 30Hz. That suggests half the time to do
everything, but this is not correct.
• In general, this means you have ~1/3 the time,
due to fixed cost overhead which can’t be
• Customer doesn’t “care” that you have less time
to do everything – still wants game to look great.
• Game must hit 60Hz on both PS3 and Xbox 360,
and both versions look as close as possible!
Why 1/3 The Time?

• Game must run at >= 60Hz – not allowed to
drop frames (bog).
• This means we have to set aside headroom that
can absorb instantaneous spikes.
• MK vs DC steady state ~= 9.5 ms per frame.
• Allows for a lot of particle effects and variability.
Other genres (even other fighting games) likely
need a great deal less slack.
• Philosophy: Always address worst-case
scenarios up front.
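The budget arithmetic behind these slides can be sketched as below. This is only an illustration using the figures quoted above (60Hz target, ~9.5 ms steady state); the function names are hypothetical and not from the MK codebase.

```cpp
#include <cassert>

// Milliseconds available per frame at a given refresh rate.
inline double FrameBudgetMs(double hz) {
    assert(hz > 0.0);
    return 1000.0 / hz;
}

// Headroom left to absorb spikes once the steady-state cost is paid.
inline double HeadroomMs(double hz, double steadyStateMs) {
    return FrameBudgetMs(hz) - steadyStateMs;
}

// Share of a frame at `hz` that the steady-state cost occupies.
inline double BudgetFraction(double hz, double steadyStateMs) {
    return steadyStateMs / FrameBudgetMs(hz);
}
```

With the slide figures, FrameBudgetMs(60) is about 16.7 ms and HeadroomMs(60, 9.5) is about 7.2 ms. BudgetFraction(30, 9.5) is a little under 0.3, which is one way to read the "~1/3 the time" figure relative to a 30Hz frame.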
The Problem (part 1)
• Midway had decided to use UnrealEngine 3
(UE3) as basic middleware across all internal
games. Using UE3 was required by mgmt.
• UE3 was (is) designed for 30Hz FPS/3rd person
action genre titles.
• We started with the October 2006 (post Gears of
War 1) codebase. Some additional features
taken from Epic à-la-carte. Ex: MITV, file
caching, misc fixes.
• About 22 months to develop the game.
The Problem (part 2)
• UE3 brings a lot to the table (nice tools, wide
feature set) but imposes a lot of heavy fixed costs.
• There are also some choices made in the
engine that have problematic side effects for
60Hz play (UObject overhead, Garbage
Collection, etc).
• Out-of-the-box fixed cost baseline (especially
GPU) too high for a 60Hz title.
• E.g., Oct06 build GPU baseline ~9 ms.
Breaking it Down
• GPU Overhead
• GPU Fixed costs
• General rendering overhead
• Multipass overhead
• Lighting cost
• Particle cost
• CPU Overhead
• Particle cost
• Cloth & Water
• Render thread virtual overhead/state caching
GPU Fixed Costs
• Post-processing
• Usually the biggest fixed cost.
• Combine as many operations together as possible to
hide work (ie, Bloom+DOF+Gamma+Resolution
• Cut as many corners as possible and special case as
necessary – eg. we use 1 of 3 different DOF methods
depending on the case:
• Normal gameplay: classic blur cross-fade
• Main Menu/Cinematics: dilating Poisson disc
• Klose-Kombat: a series of blur planes.
• “Normal” DOF+Bloom effect cost = 1.8 ms
• Bloom is done a little strangely to compensate
for linear color range and not having a separate
• Per environment thresholding value determines which
pixels bloom.
• Thresholding is done inside downsample pass and
written out into the alpha channel as 0 or 1.
• This bloom mask is then blurred along with color.
• We had separate thresholding and strength
values for characters and the general
background to allow the two to be tuned
• Character masks were written/read from stencil
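A CPU-side sketch of the bloom mask idea described above: threshold inside the downsample pass and write the result to alpha as 0 or 1. The real work would be a GPU shader; the 2x2 box filter and the Rec. 709 luma weights are assumptions made for illustration, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Rgba { float r, g, b, a; };

// 2x2 box downsample that also writes a binary bloom mask into alpha.
// Assumption: brightness is Rec. 709 luma; the slides do not name the metric.
inline std::vector<Rgba> DownsampleWithBloomMask(const std::vector<Rgba>& src,
                                                 std::size_t width, std::size_t height,
                                                 float threshold) {
    assert(width % 2 == 0 && height % 2 == 0 && src.size() == width * height);
    const std::size_t outW = width / 2, outH = height / 2;
    std::vector<Rgba> dst(outW * outH);
    for (std::size_t y = 0; y < outH; ++y) {
        for (std::size_t x = 0; x < outW; ++x) {
            Rgba sum{0.0f, 0.0f, 0.0f, 0.0f};
            for (std::size_t dy = 0; dy < 2; ++dy) {
                for (std::size_t dx = 0; dx < 2; ++dx) {
                    const Rgba& p = src[(2 * y + dy) * width + (2 * x + dx)];
                    sum.r += p.r; sum.g += p.g; sum.b += p.b;
                }
            }
            Rgba out{sum.r * 0.25f, sum.g * 0.25f, sum.b * 0.25f, 0.0f};
            const float luma = 0.2126f * out.r + 0.7152f * out.g + 0.0722f * out.b;
            out.a = (luma > threshold) ? 1.0f : 0.0f;  // mask is strictly 0 or 1
            dst[y * outW + x] = out;
        }
    }
    return dst;
}
```

Because the mask lives in alpha, a later blur over all four channels softens the mask together with the colour. Separate character and background thresholds would amount to choosing `threshold` per pixel from a stencil-derived flag.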
Distortion
• Normal UE3 distortion effect has 3 ms overhead!
• Instead, fold Distortion into Translucency.
• Sample from a snapshot of opaque pass, and do
a depth-based selection to prevent near-
• Overhead now just “capturing snapshot” - just a
copy blit of color buffer ~ 0.4ms.
• Now usable everywhere!
• Optionally support recapture of the “snapshot”
per distorting effect to allow for layered distortion
effects as well. Needed for water level.
Motion Blur
• Very expensive to do full-screen.
• Epic doesn’t support motion blurring of skinned
• Instead, motion blur effects done via rendering
velocity-stretched fading geometry.
• Required changing GPU skinning (PC/360) and
Edge (PS3-SPU) to support skinning against
previous bone positions.
• Requires localized blur-only Z-prepass to
prevent additive blur effects from blending badly.
Shadows and MSAA
• Game made use of MSAA-2x on both platforms
• Resolving MSAA is very expensive on PS3.
• Combine full-screen modulated shadow blit with
MSAA color/depth resolve!
• Hide heavy texture bandwidth operations inside
math heavy shadow work. Shadow ALU
overhead high enough that we can also hide the
Distortion copy blit!
• No self-shadowing – disabled via stencil mask.
• Once there’s no self-shadowing anyway, we use
proxy shadow characters.
• Total cost ~= 1.33ms
Fog
• Fullscreen per-pixel fog costs ~2 ms on the GPU.
• Visible vertices < visible pixels!
• Per-pixel fog is often overkill. Replaced with
per-vertex fog and per-object fog (characters).
• To keep per-vertex costs low, only support 2
active fog actors.
• Heightfog is optional, and controlled via static
• Also added optional undulating height fog, via
pulsing sine-waves through the fog height.
• Dramatically cheaper!
General Rendering
• 8 bpc render targets, linear color scale of 0..2.
• We light in a combination of γ=1.0 and γ=2.2,
depending on what we’re lighting, to save cost.
• Opaque: uses MSAA
• Translucent: post-MSAA resolve
• Heavy use of Playstation Edge library for
skinned and world geometry on PS3.
• 3D resolution of the game was 1040x624 which
was then scaled up to allow the HUD to render
at 1280x720.
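As a quick check of what the reduced 3D resolution buys in fill, the pixel counts can be compared directly. The helper names below are hypothetical; the resolutions are the ones quoted above.

```cpp
#include <cassert>

// Total pixels in a render target.
constexpr long PixelCount(long width, long height) { return width * height; }

// Ratio of pixels shaded in one target relative to another.
inline double PixelRatio(long w0, long h0, long w1, long h1) {
    assert(w1 > 0 && h1 > 0);
    return static_cast<double>(PixelCount(w0, h0)) /
           static_cast<double>(PixelCount(w1, h1));
}

constexpr long k3dPixels = PixelCount(1040, 624);   // 648,960
constexpr long kHudPixels = PixelCount(1280, 720);  // 921,600
```

PixelRatio(1040, 624, 1280, 720) comes out at roughly 0.70, so the 3D scene shades about 30% fewer pixels than a native 1280x720 target would.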
Multipass Overhead
• Pass-per-light overhead is simply too high.
• We’re mostly prelit, so we chose forward
• Z-Prepass? Typical depth complexity <
• Loosely sort opaque objects front to back
via “rings of detail”. Removing Z-prepass
saves ~0.75 ms.
• Touch each pixel only once if possible.
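One possible reading of the loose front-to-back sort is sketched below. It assumes "rings of detail" are coarse distance bands around the camera, which the slides do not spell out; `DrawItem`, `RingIndex` and `SortByRing` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct DrawItem {
    int id;
    float distance;  // distance from the camera, in world units
};

// Coarse ring index: objects in the same ring are not ordered against each other.
inline int RingIndex(float distance, float ringWidth) {
    assert(ringWidth > 0.0f);
    return static_cast<int>(std::floor(distance / ringWidth));
}

// Loose front-to-back order: nearest ring first, submission order kept inside a ring.
inline void SortByRing(std::vector<DrawItem>& items, float ringWidth) {
    std::stable_sort(items.begin(), items.end(),
                     [ringWidth](const DrawItem& a, const DrawItem& b) {
                         return RingIndex(a.distance, ringWidth) <
                                RingIndex(b.distance, ringWidth);
                     });
}
```

A coarse sort like this is cheaper than an exact per-object sort, and it is usually enough for early Z rejection to remove most overdraw when depth complexity is low.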
World Lighting (static)
• World is prelit using Illuminate Labs’
Beast, with some “dynamic” RNMs built
with Turtle. Dynamic RNMs are
animated in materials or via MITVs.
• Prelit lighting was a mix of texture and
vertex RNM lighting, with a fast-path
added to support per-vertex diffuse only
RNM evaluation for distant objects.
World Lighting (dynamic)
• Effect point lighting is done via a mix of per-
pixel lighting (floors) and per-vertex (the rest
of the environment).
• To account for maximum load, shaders are
built with three diffuse-only point lights
active and burned into the material
• No branching! All three lights always
• These lights are globally assigned and
managed in 3-deep FIFO.
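The fixed three-light scheme might look roughly like this on the CPU side. It is a sketch under assumptions: the attenuation curve and the `invRadiusSq` field are invented for illustration, and every name is hypothetical. The point it shows is that unused slots stay black, so the shading loop always evaluates all three lights without a per-light branch.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

struct PointLight {
    float pos[3];
    float color[3];
    float invRadiusSq;  // 1 / radius^2 (assumed falloff parameter)
};

class EffectLightFifo {
public:
    static constexpr std::size_t kSlots = 3;

    // A fourth light overwrites the oldest slot.
    void Push(const PointLight& light) {
        assert(light.invRadiusSq >= 0.0f);
        lights_[next_] = light;
        next_ = (next_ + 1) % kSlots;
    }
    const std::array<PointLight, kSlots>& Lights() const { return lights_; }

private:
    std::array<PointLight, kSlots> lights_{};  // zeroed: unused slots are black
    std::size_t next_ = 0;
};

// Diffuse-only sum over all three slots; unused slots add zero.
inline void ShadeDiffuse(const EffectLightFifo& fifo, const float p[3], const float n[3],
                         float out[3]) {
    out[0] = out[1] = out[2] = 0.0f;
    for (const PointLight& l : fifo.Lights()) {
        const float d[3] = {l.pos[0] - p[0], l.pos[1] - p[1], l.pos[2] - p[2]};
        const float dist2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2];
        const float invDist = 1.0f / std::sqrt(dist2 + 1e-6f);
        const float nDotL =
            std::max(0.0f, (n[0] * d[0] + n[1] * d[1] + n[2] * d[2]) * invDist);
        const float atten = std::max(0.0f, 1.0f - dist2 * l.invRadiusSq);
        for (int c = 0; c < 3; ++c) out[c] += l.color[c] * nDotL * atten;
    }
}
```

Paying for three lights at all times makes the worst case the only case, which fits the "address worst-case scenarios up front" philosophy stated earlier.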
Character Lighting (part 1)
Custom lighting model:
• Irradiance volume of SH coefficient sets.
• Eval gradients to determine an SH-set per
• Diffuse light the model using only the first 4
coefficients (“ambient” and “directional” term).
• The 3 effect point lights are evaluated per-vertex
and combined into the final diffuse lighting
• Spec faked via power-scaling of (E•N) and
multiplying by diffuse lighting.
Character Lighting (part 2)
• Skin transmission faked by using (E•N) as lerp
factor between diffuse lighting and SH ambient
• Rim Lighting: power-scaling (1-E•N) for falloff
and then mul by hard thresholding (1-E•N).
• If threshold is raised high enough (~0.7), ends
up looking like chrome mapping!
• Final rendering cost ~= 0.8ms per character
• Character mesh-chunks batch rendered.
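The character lighting terms above can be sketched for a single colour channel as follows. This is an illustration rather than the shipped shader: the SH coefficient order and basis constants follow one common real-SH convention and are an assumption, as are all function names. Skin transmission is left out because the slides do not give the lerp direction.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

inline float Dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline float Saturate(float v) { return std::min(1.0f, std::max(0.0f, v)); }

// First four real SH coefficients: one "ambient" term plus three "directional" terms.
// Assumed order: {Y00, Y1-1 (y), Y10 (z), Y11 (x)} with unsigned basis constants.
inline float EvalShDiffuse(const float sh[4], const Vec3& n) {
    const float kY0 = 0.282095f;
    const float kY1 = 0.488603f;
    return sh[0] * kY0 + kY1 * (sh[1] * n.y + sh[2] * n.z + sh[3] * n.x);
}

// Fake specular: power-scaled (E.N) multiplied by the diffuse lighting.
inline float FakeSpec(const Vec3& e, const Vec3& n, float power, float diffuse) {
    assert(power > 0.0f);
    return std::pow(Saturate(Dot(e, n)), power) * diffuse;
}

// Rim: power-scaled (1 - E.N) falloff, gated by a hard threshold on (1 - E.N).
inline float RimTerm(const Vec3& e, const Vec3& n, float power, float threshold) {
    const float rim = 1.0f - Saturate(Dot(e, n));
    const float gate = (rim >= threshold) ? 1.0f : 0.0f;
    return std::pow(rim, power) * gate;
}
```

Here `e` is the unit vector toward the eye and `n` the unit surface normal. None of the terms needs a light direction beyond what the SH set already encodes, which keeps the per-character cost flat.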
Skin and Metal
The Story So Far…
• So far costs are:
• Misc ~0.5 ms
• Shadowmaps: 0.5 ms
• Characters: 1.6 ms
• Environment: ~4.X ms
• MSAA Resolve/Shadow: 1.3 ms
• PostFX: 1.8 ms
• Total ~9.X ms
• What about particle effects?
Particle Effects
• Very large problem. Cascade not very
• Solution – port Cascade runtime async on
separate worker threads (to SPU on PS3).
• All emitters for a particle system updated
in single block of async work (particles,
emitter state, system state).
• All particle Modules ported to SPU, except
for collision (due to data complexity).
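The "one block of async work per particle system" idea can be sketched with standard C++ threads. This stands in for the actual SPURS/worker-thread implementation, which the slides do not show; the data layout and names are hypothetical, and `std::async` is used purely for illustration.

```cpp
#include <cassert>
#include <future>
#include <vector>

struct Particle { float pos[3]; float vel[3]; float age; };
struct Emitter { std::vector<Particle> particles; };
struct ParticleSystem { std::vector<Emitter> emitters; };

// One block of work: every emitter and particle of a single system.
inline void UpdateSystem(ParticleSystem& system, float dt) {
    for (Emitter& e : system.emitters) {
        for (Particle& p : e.particles) {
            for (int i = 0; i < 3; ++i) p.pos[i] += p.vel[i] * dt;
            p.age += dt;
        }
    }
}

// Kick one async job per system, then wait for all of them before rendering.
inline void UpdateAllAsync(std::vector<ParticleSystem>& systems, float dt) {
    assert(dt >= 0.0f);
    std::vector<std::future<void>> jobs;
    jobs.reserve(systems.size());
    for (ParticleSystem& s : systems) {
        jobs.push_back(std::async(std::launch::async, [&s, dt] { UpdateSystem(s, dt); }));
    }
    for (std::future<void>& j : jobs) j.get();
}
```

Keeping a whole system in one job means no locking between emitters of the same system, and the game thread only pays a per-system cost for launching the job.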
Particle Effects (CPU load)
• All per-particle overhead removed from
Game/Render thread!
• Particle overhead now a simple linear
relationship between system count and
emitter count.
• On PC/360, vertex data for sprites created
JIT by async worker thread.
• No changes/compromises to artist tools or
Particle Effects (SPU load)
• SPUs extremely fast.
• Just used basic C++ code (including
templates and polymorphism). No need to
bother with intrinsics or ASM.
• Same module code runs on PS3/360.
• Complex (dependent) DMAs done
synchronously. Simpler to deal with and
fast enough that it doesn’t matter.
• Update done via SPURS job
Particle Effects (GPU load)
• GPU overhead less straightforward
• Attempt 1: Lie to hardware and tell it
we’re in MSAA-4x on non-MSAA target.
Looks okay on wispy stuff in general
(smoke, fire, etc.), but looks terrible on
Particle Effects (GPU cont…)
• Attempt 2: for somewhat opaque particles, break
effect out into masked pass and unmasked
pass, sorting particles for a system front to back
before rendering to prime Z.
1. Render particles with alpha-test set to =1.0, front to back
2. Render particles with alpha-test set to <1.0, back to front

• Didn’t help! Alpha-test disables ZCull writes,
negating the benefits of the priming pass.
Particle Effects (GPU cont…)
• Attempt 3: Observation – for flipbook
effects, lots of time is wasted
rendering alpha-0 space around
meaningful content.
• Idea: For flipbook effects, reduce
particle dimensions (and UVs) to
bound content of the particular
flipbook page!
• Works great! Dramatic fillrate
improvement from doing this (>50%).
Requires artist to identify channel to
scan for image bounds.
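An offline scan for the tight bounds of a flipbook page could look like the sketch below. It assumes an 8-bit single-channel image laid out row by row and a simple "greater than cutoff" test for meaningful content; the struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Inclusive texel bounds of visible content, local to one flipbook page.
struct PageBounds {
    std::size_t minX, minY, maxX, maxY;
    bool empty;
};

// Scan one page of the artist-chosen channel for texels above `cutoff`.
inline PageBounds FindPageBounds(const std::vector<std::uint8_t>& channel,
                                 std::size_t atlasWidth, std::size_t pageX,
                                 std::size_t pageY, std::size_t pageW, std::size_t pageH,
                                 std::uint8_t cutoff) {
    assert(atlasWidth > 0 && channel.size() % atlasWidth == 0);
    PageBounds b{pageW, pageH, 0, 0, true};
    for (std::size_t y = 0; y < pageH; ++y) {
        for (std::size_t x = 0; x < pageW; ++x) {
            if (channel[(pageY + y) * atlasWidth + (pageX + x)] > cutoff) {
                b.minX = std::min(b.minX, x);
                b.minY = std::min(b.minY, y);
                b.maxX = std::max(b.maxX, x);
                b.maxY = std::max(b.maxY, y);
                b.empty = false;
            }
        }
    }
    return b;
}

// Fraction of the full page quad that the tightened quad still covers.
inline double CoveredFraction(const PageBounds& b, std::size_t pageW, std::size_t pageH) {
    if (b.empty) return 0.0;
    const double w = static_cast<double>(b.maxX - b.minX + 1);
    const double h = static_cast<double>(b.maxY - b.minY + 1);
    return (w * h) / static_cast<double>(pageW * pageH);
}
```

At runtime the sprite's corner offsets and UVs would both be scaled to the stored bounds for the current page, so the texture lands in the same place on screen while far fewer fully transparent pixels are rasterized.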
General Render Thread
• Lots of work to reduce unnecessary operations.
• Render thread virtuals = death by a thousand
paper cuts.
• Cache as much state as possible to reduce
redundant virtual calls. E.g., replaced
FMaterialRenderProxy’s GetMaterial virtual call
with a caching call.
• Remove tons of unneeded repeated calls to
GetXXX() (e.g., GetPixelShader) states from
inside Shader processing.
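The caching pattern described above, in generic form. The classes below are hypothetical stand-ins and do not reproduce UE3's FMaterialRenderProxy interface; they only show a virtual lookup being paid once and then served from a cached pointer.

```cpp
#include <cassert>

struct Material { int id; };

class MaterialProxy {
public:
    virtual ~MaterialProxy() = default;

    // Non-virtual fast path: the virtual lookup runs once, then the pointer is reused.
    const Material* GetMaterialCached() const {
        if (!cached_) cached_ = ResolveMaterial();
        return cached_;
    }

    // Must be called whenever the underlying material changes.
    void InvalidateCache() { cached_ = nullptr; }

protected:
    virtual const Material* ResolveMaterial() const = 0;

private:
    mutable const Material* cached_ = nullptr;
};

// Demo subclass that counts how often the virtual call is actually made.
class CountingProxy : public MaterialProxy {
public:
    explicit CountingProxy(const Material* m) : material_(m) {}
    mutable int resolveCalls = 0;

protected:
    const Material* ResolveMaterial() const override {
        ++resolveCalls;
        return material_;
    }

private:
    const Material* material_;
};
```

The cache is not thread safe as written, which is acceptable only if a single thread (here, the render thread) touches the proxy.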
Misc Further optimizations
• Cloth simulation moved to run async in another
thread (SPU on PS3).
• Epic’s water simulation code ported to run on
SPU on PS3.
• Animation still synchronous Game-thread based,
but doesn’t use AnimTrees. Very limited blend
options for designers.
• No Occlusion pass – Vis is simple frustum
• Lots of work to reduce amount of memory
allocation via pools and isolated heaps. Still,
allocation accounts for 25% of CPU time.
Garbage Collection
• Based on work by Stranglehold team
• Not quite as aggressive as they were, but
removes all live calling of GC from gameplay –
only called when exiting modes.
• Memory management switched to deferred (by a
frame) cleanup of UObjects/AActors.
• All “loaded” data trapped via Rootset
• Introduces UResource class, a reference
counting UObject.
• All USurface derived classes (UMaterial,
UTexture, etc) are reference counted via
UResource to prevent unwanted deletion.
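Reference counting combined with cleanup deferred by one frame might be structured like this. It is a sketch only: `Resource` is a stand-in and not the UResource class, and setting a `destroyed` flag takes the place of real destruction.

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct Resource {  // stand-in for a reference-counted object
    int refCount = 0;
    bool destroyed = false;
};

class DeferredCleanup {
public:
    void AddRef(Resource& r) { ++r.refCount; }

    // Dropping the last reference queues the object; nothing is destroyed this frame.
    void Release(Resource& r) {
        assert(r.refCount > 0);
        if (--r.refCount == 0) pending_.push_back(&r);
    }

    // Call once per frame: destroys what was queued during the previous frame.
    void EndFrame() {
        for (Resource* r : ready_) {
            if (r->refCount == 0) r->destroyed = true;  // skip objects re-acquired meanwhile
        }
        ready_.clear();
        std::swap(ready_, pending_);
    }

private:
    std::vector<Resource*> pending_;  // released this frame
    std::vector<Resource*> ready_;    // released last frame
};
```

The one-frame delay gives anything still in flight, such as render commands referencing the object, time to finish before the memory goes away, without needing a garbage collection pass during gameplay.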
Additional Game Details
• We don’t use UnrealScript. Minimally use
Kismet. Use our own scripting engine (C/C++-
ish) for AI, object management, menu logic, etc.
• Game scripts are expected to manage resource
• Main advantage – dynamically reloadable for
fast iteration!
• MKScripts describe resource usage to determine
cooked resources that need to be added to
Artist Limitations
• UE3 gives artists a lot of rope to hang themselves
• Big thing was to limit who could use the Material
• All character art uses same small set of materials.
• Characters budgeted at 20k polys visible at a time.
• Backgrounds budgeted based on visible object
count and storage limitations more than polycount.
• Environment material/lighting complexity managed
by the background lead to ensure overall
performance hit GPU targets, with various
metrics helping to tell them where they were.
General Recommendations for
hitting 60Hz in UE3
• Budget performance up front!
• Given Edge and 360’s unified shaders, geometry
less of a problem than fillrate.
• Predetermine valid PostFx and hardwire the
majority of permutations.
• Reduce dynamic critical sectioned memory
allocation as much as possible. Massively stalls
all performance.
• Use pool allocators whenever possible, and
watch for reallocs.
• Force designers and artists to run with
performance metrics on!
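A minimal fixed-size pool allocator of the kind recommended above. This is a generic sketch and not the allocator used in the game: it is single-threaded, holds one object type, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

template <typename T>
class FixedPool {
public:
    // All memory is reserved up front; later calls never touch the global heap.
    explicit FixedPool(std::size_t capacity) : storage_(capacity) {
        freeList_.reserve(capacity);
        for (std::size_t i = capacity; i > 0; --i) freeList_.push_back(&storage_[i - 1]);
    }

    // Returns nullptr when the pool is exhausted instead of growing.
    T* Allocate() {
        if (freeList_.empty()) return nullptr;
        Slot* slot = freeList_.back();
        freeList_.pop_back();
        return new (slot->bytes) T();
    }

    void Free(T* p) {
        assert(p != nullptr);
        p->~T();
        freeList_.push_back(reinterpret_cast<Slot*>(p));
    }

    std::size_t FreeCount() const { return freeList_.size(); }

private:
    struct Slot { alignas(T) unsigned char bytes[sizeof(T)]; };
    std::vector<Slot> storage_;
    std::vector<Slot*> freeList_;
};
```

Because the pool never grows, it cannot trigger a realloc or take the global allocator's critical section during gameplay. Running out of slots becomes a budgeting problem that shows up immediately rather than a hidden stall.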
Recommendations for hitting 60Hz
in UE3 on PS3 (well, and 360)
• Consider what can be deferred and/or can be
made to run async and consider moving that
• Consider using Edge on PS3.
• Even sync’d work can be done way faster on
SPU if divided over multiple SPUs/threads!
• Don’t be intimidated by the SPUs on PS3.
Prototype SPU code on 360/PC where it's
easier to debug.
• Template-heavy C++ might not be ideal
performance case for SPUs, but certainly a
LOT better than not using them at all.
Things We Have Yet to Address
• Serialization – as we tend to only stream
content underneath movie playback or
load screens, the CPU impact wasn’t too
problematic for us, though it does impact
load times.
• Animation – need to explore making it run
on worker threads/SPU for deferrable
(background and LOD’d) objects.
• Thanks for listening!