Pitfalls of Object Oriented Programming GCAP 09 PDF

Sony Computer Entertainment Europe
Research & Development Division
Pitfalls of Object Oriented Programming
Tony Albrecht – Technical Consultant

Developer Services
What I will be covering
• A quick look at Object Oriented (OO)
programming
• A common example
• Optimisation of that example
• Summary
Slide 2
Object Oriented (OO) Programming
• What is OO programming?
– a programming paradigm that uses "objects" – data structures
consisting of datafields and methods together with their interactions – to
design applications and computer programs.
(Wikipedia)
• Includes features such as

– Data abstraction
– Encapsulation
– Polymorphism
– Inheritance
Slide 3
What’s OOP for?
• OO programming allows you to think about
problems in terms of objects and their
interactions.
• Each object is (ideally) self contained
– Contains its own code and data.
– Defines an interface to its code and data.
• Each object can be perceived as a „black box‟.
Slide 4
Objects
• If objects are self contained then they can be
– Reused.
– Maintained without side effects.
– Used without understanding internal
implementation/representation.
• This is good, yes?
Slide 5
Are Objects Good?
• Well, yes
• And no.
• First some history.
Slide 6
A Brief History of C++
C++ development started
1979 2009
Slide 7
Named “C++”
1979 1983 2009
Slide 8
First Commercial release
1979 1985 2009
Slide 9
Release of v2.0
1979 1989 2009
Slide 10
Added
• multiple inheritance,
• abstract classes,
Release of v2.0
• static member functions,
• const member functions
• protected members.
1979 1989 2009
Slide 11
Standardised
1979 1998 2009
Slide 12
Updated
1979 2003 2009
Slide 13
C++0x
1979 2009 ?
Slide 14
So what has changed since 1979?
• Many more features have
been added to C++
• CPUs have become much
faster.
• Transition to multiple cores
• Memory has become faster.
http://www.vintagecomputing.com
Slide 15
CPU performance
Slide 16 Computer architecture: a quantitative approach

By John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau
CPU/Memory performance
Slide 17 Computer architecture: a quantitative approach

By John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau
What has changed since 1979?
• One of the biggest changes is that memory
access speeds are far slower (relatively)
– 1980: RAM latency ~ 1 cycle
– 2009: RAM latency ~ 400+ cycles
• What can you do in 400 cycles?
Slide 18
What has this to do with OO?
• OO classes encapsulate code and data.
• So, an instantiated object will generally
contain all data associated with it.
Slide 19
My Claim
• With modern HW (particularly consoles),
excessive encapsulation is BAD.
• Data flow should be fundamental to your

design (Data Oriented Design)
Slide 20
Consider a simple OO Scene Tree
• Base Object class

– Contains general data
• Node
– Container class
• Modifier
– Updates transforms
• Drawable/Cube
– Renders objects
Slide 21
Object
• Each object
– Maintains bounding sphere for culling
– Has transform (local and world)
– Dirty flag (optimisation)
– Pointer to Parent
Slide 22
Objects
Each square is
Class Definition 4 bytes Memory Layout
Slide 23
Nodes
• Each Node is an object, plus
– Has a container of other objects
– Has a visibility flag.
Slide 24
Nodes
Class Definition Memory Layout
Slide 25
Consider the following code…
• Update the world transform and world
space bounding sphere for each object.
Slide 26
• Leaf nodes (objects) return transformed
bounding spheres
Slide 27
• Leaf nodes (objects)
What‟s return transformed
wrong with this
bounding code?
spheres
Slide 28
• Leaf nodes (objects)
If m_Dirty=false thenreturn transformed
we get branch
bounding misprediction
spheres
cycles.
which costs 23 or 24
Slide 29
bounding Calculation
spheres of the world bounding sphere
takes only 12 cycles.
Slide 30
bounding So
spheres
using a dirty flag here is actually
slower than not using one (in the case
where it is false)
Slide 31
Lets illustrate cache usage
Main Memory Each cache line is
128 bytes L2 Cache
Slide 32
Cache usage
Main Memory L2 Cache
parentTransform is already
in the cache (somewhere)
Slide 33
Cache usage
Assume this is a 128byte
Main Memory boundary (start of cacheline) L2 Cache
Slide 34
Cache usage
Load m_Transform into cache
Slide 35
Cache usage
m_WorldTransform is stored via

cache (write-back)
Slide 36
Cache usage
Next it loads m_Objects
Slide 37
Cache usage
Then a pointer is pulled from

somewhere else (Memory
managed by std::vector)
Slide 38
Cache usage
vtbl ptr loaded into Cache
Slide 39
Cache usage
Look up virtual function
Slide 40
Cache usage
Then branch to that code

(load in instructions)
Slide 41
Cache usage
New code checks dirty flag then

sets world bounding sphere
Slide 42
Cache usage
Node‟s World Bounding Sphere

is then Expanded
Slide 43
Cache usage
Then the next Object is

processed
Slide 44
Cache usage
First object costs at least 7

cache misses
Slide 45
Cache usage
Subsequent objects cost at least

2 cache misses each
Slide 46
The Test
• 11,111 nodes/objects in a
tree 5 levels deep
• Every node being
transformed
• Hierarchical culling of tree
• Render method is empty
Slide 47
Performance
This is the time

taken just to
traverse the tree!
Slide 48
Why is it so slow?
~22ms
Slide 49
Look at GetWorldBoundingSphere()
Slide 50
Samples can be a little
misleading at the source
code level
Slide 51
if(!m_Dirty) comparison
Slide 52
Stalls due to the load 2
instructions earlier
Slide 53
Similarly with the matrix
multiply
Slide 54
Some rough calculations
Branch Mispredictions: 50,421 @ 23 cycles each ~= 0.36ms
Slide 55
Some rough calculations
Branch Mispredictions: 50,421 @ 23 cycles each ~= 0.36ms
L2 Cache misses: 36,345 @ 400 cycles each ~= 4.54ms
Slide 56
• From Tuner, ~ 3 L2 cache misses per
object
– These cache misses are mostly sequential
(more than 1 fetch from main memory can
happen at once)
– Code/cache miss/code/cache miss/code…
Slide 57
Slow memory is the problem here
• How can we fix it?

• And still keep the same functionality and
interface?
Slide 58
The first step
• Use homogenous, sequential sets of data
Slide 59
Homogeneous Sequential Data
Slide 60
Generating Contiguous Data
• Use custom allocators
– Minimal impact on existing code
• Allocate contiguous
– Nodes
– Matrices
– Bounding spheres
Slide 61
Performance
19.6ms -> 12.9ms
35% faster just by

moving things around in
memory!
Slide 62
What next?
• Process data in order
• Use implicit structure for hierarchy
– Minimise to and fro from nodes.
• Group logic to optimally use what is already in
cache.
• Remove regularly called virtuals.
Slide 63
We start with
Hierarchy
a parent Node
Node
Slide 64
Hierarchy
Node
Node Which has children Node

nodes
Slide 65
Hierarchy
Node
Node Node
And they have a
parent
Slide 66
Hierarchy
Node
Node Node
Node Node Node Node Node Node Node Node
And they have

children
Slide 67
Hierarchy
Node
Node Node
And they all have

parents
Slide 68
Hierarchy
Node
Node Node
Node Node
Node Node Node
Node Node Node
Node NodeNode
Node Node Node
A lot of this
information can be
inferred
Slide 69
Hierarchy
Node
Use a set of arrays,
one per hierarchy
Node Node level
Slide 70
Hierarchy
Node Parent has 2
:2 children
Node Node Children have 4

:4 children
:4
Slide 71
Hierarchy
Node Ensure nodes and their
data are contiguous in
:2
memory
Node Node
:4
:4
Slide 72
• Make the processing global rather than
local
– Pull the updates out of the objects.
• No more virtuals
– Easier to understand too – all code in one
place.
Slide 73
Need to change some things…
• OO version
– Update transform top down and expand WBS
bottom up
Slide 74
Update
transform Node
Node Node
Slide 75
Update
transform Node
Node Node
Slide 76
Node
Node Node
Update transform and

world bounding sphere
Slide 77
Node
Node Node
Add bounding sphere

of child
Slide 78
Node
Node Node

Slide 79
Node
Node Node
Add bounding sphere

of child
Slide 80
Node
Node Node

Slide 81
Node
Node Node
Add bounding sphere

of child
Slide 82
• Hierarchical bounding spheres pass info up
• Transforms cascade down
• Data use and code is „striped‟.
– Processing is alternating
Slide 83
Conversion to linear
• To do this with a „flat‟ hierarchy, break it
into 2 passes
– Update the transforms and bounding
spheres(from top down)
– Expand bounding spheres (bottom up)
Slide 84
Transform and BS updates
Node
For each node at each level (top down)
:2 {
multiply world transform by parent‟s
Node Node transform wbs by world transform
:4 }
:4
Slide 85
Update bounding sphere hierarchies
Node
For each node at each level (bottom up)
:2 {
add wbs to parent‟s
Node Node } cull wbs against frustum
:4 }
:4
Slide 86
Update Transform and Bounding Sphere
How many children nodes
to process
Slide 87
For each child, update

transform and bounding
sphere
Slide 88
Note the contiguous arrays
Slide 89
So, what’s happening in the cache?
Parent
Unified L2 Cache
Children‟s node
Children
data not needed
Parent Data
Childrens‟ Data
Slide 90
Load parent and its transform
Parent
Unified L2 Cache
Parent Data
Childrens‟ Data
Slide 91
Load child transform and set world transform
Parent
Unified L2 Cache
Parent Data
Childrens‟ Data
Slide 92
Load child BS and set WBS
Parent
Unified L2 Cache
Parent Data
Childrens‟ Data
Slide 93
Parent
Next child is calculated with
no extra cache misses ! Unified L2 Cache
Parent Data
Childrens‟ Data
Slide 94
Parent
The next 2 children incur 2
cache misses in total Unified L2 Cache
Parent Data
Childrens‟ Data
Slide 95
Prefetching
ParentBecause all data is linear, we
Unified L2 Cache
can predict what memory will
be needed in ~400 cycles
and prefetch
Parent Data
Childrens‟ Data
Slide 96
• Tuner scans show about 1.7 cache misses
per node.
• But, these misses are much more frequent
– Code/cache miss/cache miss/code
– Less stalling
Slide 97
Performance
19.6 -> 12.9 -> 4.8ms
Slide 98
Prefetching
• Data accesses are now predictable
• Can use prefetch (dcbt) to warm the cache
– Data streams can be tricky
– Many reasons for stream termination
– Easier to just use dcbt blindly
• (look ahead x number of iterations)
Slide 99
Prefetching example
• Prefetch a predetermined number of
iterations ahead
• Ignore incorrect prefetches
Slide 100
Performance
19.6 -> 12.9 -> 4.8 -> 3.3ms
Slide 101
A Warning on Prefetching
• This example makes very heavy use of the
cache
• This can affect other threads‟ use of the
cache
– Multiple threads with heavy cache use may
thrash the cache
Slide 102
The old scan
~22ms
Slide 103
The new scan
~16.6ms
Slide 104
Up close
~16.6ms
Slide 105
Looking at the code (samples)
Slide 106
Performance counters
Branch mispredictions: 2,867 (cf. 47,000)
L2 cache misses: 16,064 (cf 36,000)
Slide 107
In Summary
• Just reorganising data locations was a win
• Data + code reorganisation= dramatic
improvement.
• + prefetching equals even more WIN.
Slide 108
OO is not necessarily EVIL
• Be careful not to design yourself into a corner
• Consider data in your design
– Can you decouple data from objects?
– …code from objects?
• Be aware of what the compiler and HW are
doing
Slide 109
Its all about the memory
• Optimise for data first, then code.
– Memory access is probably going to be your
biggest bottleneck
• Simplify systems
– KISS
– Easier to optimise, easier to parallelise
Slide 110
Homogeneity
• Keep code and data homogenous
– Avoid introducing variations
– Don‟t test for exceptions – sort by them.
• Not everything needs to be an object
– If you must have a pattern, then consider
using Managers
Slide 111
Remember
• You are writing a GAME
– You have control over the input data
– Don‟t be afraid to preformat it – drastically if
need be.
• Design for specifics, not generics
(generally).
Slide 112
Data Oriented Design Delivers
• Better performance
• Better realisation of code optimisations
• Often simpler code
• More parallelisable code
Slide 113
The END
• Questions?
Slide 114

Pitfalls of Object Oriented Programming GCAP 09 PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Pitfalls of Object Oriented Programming GCAP 09 PDF

Uploaded by

Copyright:

Available Formats

Sony Computer Entertainment Europe

Research & Development Division

Pitfalls of Object Oriented Programming

Tony Albrecht – Technical Consultant

• Includes features such as

• This is good, yes?

• First some history.

C++ development started

1979 1983 2009

First Commercial release

1979 1985 2009

1979 1989 2009

1979 1989 2009

1979 1998 2009

1979 2003 2009

Slide 16 Computer architecture: a quantitative approach

Slide 17 Computer architecture: a quantitative approach

• What can you do in 400 cycles?

• Data flow should be fundamental to your

• Base Object class

Load m_Transform into cache

m_WorldTransform is stored via

Next it loads m_Objects

Then a pointer is pulled from

vtbl ptr loaded into Cache

Look up virtual function

Then branch to that code

New code checks dirty flag then

Node‟s World Bounding Sphere

Then the next Object is

First object costs at least 7

Subsequent objects cost at least

This is the time

• How can we fix it?

35% faster just by

Node Which has children Node

Node Node Node Node Node Node Node Node

And they have

Node Node Node Node Node Node Node Node

And they all have

Node Node Node Node Node Node Node Node

Node Node Children have 4

Node Node Node Node Node Node Node Node

Node Node Node Node Node Node Node Node

Node Node Node Node Node Node Node Node

Node Node Node Node Node Node Node Node

Node Node Node Node Node Node Node Node

Update transform and

Node Node Node Node Node Node Node Node

Add bounding sphere

Node Node Node Node Node Node Node Node

Update transform and

Node Node Node Node Node Node Node Node

Add bounding sphere

Node Node Node Node Node Node Node Node

Update transform and

Node Node Node Node Node Node Node Node

Add bounding sphere

Node Node Node Node Node Node Node Node

Node Node Node Node Node Node Node Node

For each child, update

Note the contiguous arrays

19.6 -> 12.9 -> 4.8ms