(12) United States Patent
     Bolz et al.
(54) OPTIMIZING TRIANGLE TOPOLOGY FOR PATH RENDERING
(71) Applicant: NVIDIA CORPORATION, Santa Clara, CA (US)
(72) Inventors: Jeffrey A. Bolz, Austin, TX (US);
     Mark J. Kilgard, Austin, TX (US)
(73) Assignee: NVIDIA Corporation, Santa Clara, CA (US)
(*) Notice: Subject to any disclaimer, the term of this
     patent is extended or adjusted under 35
     U.S.C. 154(b) by 195 days.
(21) Appl. No.: 13/717,458
(22) Filed: Dec. 17, 2012
(65) Prior Publication Data
     US 2014/0168222 A1    Jun. 19, 2014
(51) Int. Cl.
     G06T 11/40 (2006.01)
     (2006.01)
(52) U.S. Cl.
     CPC G06T 11/40 (2013.01); G06T 11/203 (2013.01)
(58) Field of Classification Search
     CPC G06T 11/203; G06T 11/20; G06T 17/20; G06T 17/205
     USPC 345/423, 441, 42, 17
     See application file for complete search history.
US009558573B2
(10) Patent No.: US 9,558,573 B2
(45) Date of Patent: Jan. 31, 2017
Primary Examiner: Carlos Perromat
(74) Attorney, Agent, or Firm: Artegis Law Group, LLP
(57) ABSTRACT
A technique for efficiently rendering path images tessellates
path contours into triangle fans comprising a set of
representative triangles. Topology of the set of representative
triangles is then optimized for greater rasterization efficiency
by applying a flip operator to selected triangle pairs within
the set of representative triangles. The optimized triangle
pairs are then rendered using a path rendering technique
such as stencil and cover.
17 Claims, 9 Drawing Sheets
[Representative drawing: flip operation 600]
U.S. Patent    Jan. 31, 2017    Sheet 1 of 9    US 9,558,573 B2
[FIG. 1 (Sheet 1 of 9): block diagram of computer system 100, showing CPU 102, system memory 104 with device driver 103, memory bridge 105, communication path 106, I/O bridge 107, input devices 108, display device 110, parallel processing subsystem 112, communication path 113, system disk 114, switch 116, network adapter 118, and add-in cards 120 and 121.]
[FIG. 2 (Sheet 2 of 9): block diagram of parallel processing subsystem 112, showing PPUs 202(0) through 202(U-1) with PP memories 204(0) through 204(U-1). Within PPU 202(0): I/O unit 205, host interface 206, front end 212, task/work unit 207, processing cluster array 230 with GPCs 208(0), 208(1), and so on, crossbar unit 210, and memory interface 214 with partition units coupled to DRAMs 220(0) through 220(D-1). The PPU connects to the memory bridge via communication path 113.]
[FIG. 3 (Sheet 3 of 9): block diagram of SM 310, fed from pipeline manager 305 in GPC 208, showing instruction L1 cache 370, warp scheduler and instruction unit 312, local register file 304, exec units 302(0) through 302(N-1), LSUs, unified address mapping unit 352, memory and cache interconnect 380, shared memory 306, L1 cache 320, MMU 328, a path to and from memory interface 214 via crossbar unit 210, and a path from L1.5 cache 335 in GPC 208.]
[FIG. 4 (Sheet 4 of 9): conceptual diagram of graphics processing pipeline 400. An instruction stream and parameters flow through data assembler 410, vertex processing unit 415, primitive assembler 420, geometry processing unit 425, viewport scale, cull, and clip unit 450, rasterizer 455, fragment processing unit 460, and raster operations unit 465, with access to memory interface 214.]
[FIGS. 5A-5C (Sheet 5 of 9): FIG. 5A shows flip operation 500 applied to a triangle pair with vertices A, B, C, and D (elements 510, 512, 514). FIG. 5B shows per-region processing cost over regions 520(0), 520(1), and 520(2) for the original triangle pair (Total: 30). FIG. 5C shows the corresponding cost for the flipped triangle pair (Total: 21).]
[FIG. 6 (Sheet 6 of 9): flip operation 600 applied to a triangle pair.]
[FIG. 7 (Sheet 7 of 9): sequential flip operation 710 and flip operation 720 applied to a triangle fan anchored at vertex K.]
[FIG. 8 (Sheet 8 of 9): flow diagram of method 800. Steps: receive path image (810); tessellate path image into set of representative triangles; optimize topology for set of representative triangles (830); save optimized set of triangles; render optimized set of triangles (850); done (890).]
[FIG. 9 (Sheet 9 of 9): flow diagram of method 900. Steps: receive set of representative triangles; termination metric satisfied? (920); if no, calculate flip metric for triangle pairs (922), select triangle pair to flip based on flip metric (924), and perform flip on selected triangle pair; if yes, done (990).]
OPTIMIZING TRIANGLE TOPOLOGY FOR
PATH RENDERING
BACKGROUND OF THE INVENTION
Field of the Invention
The present invention generally relates to path rendering
and, more specifically, to optimizing triangle topology for
path rendering.
Description of the Related Art
Path rendering represents one style of resolution-independent
two-dimensional (2D) rendering that forms the basis for
a number of important graphics rendering standards known
in the art as PostScript, Java 2D, Apple's Quartz 2D, PDF,
TrueType fonts, OpenType fonts, PostScript fonts, scalable
vector graphics (SVG), OpenVG, Microsoft's Silverlight,
Adobe Flash, Microsoft's XML Paper Specification,
and more.
One class of techniques for performing path rendering
includes at least a tessellation step and a path coverage step.
Path elements are tessellated into representative triangles in
the tessellation step. The path coverage step draws many
tessellated triangles, and samples covered by these triangles
are counted in a stencil or color buffer, which is used to
determine whether each sample is inside or outside an
associated path. Front-facing triangles increment covered
sample counts and back-facing triangles decrement covered
sample counts. Samples counted as inside a path are rendered
according to an associated path fill color, while
samples counted as outside a path are not rendered to the
path fill color.
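The counting scheme described above can be illustrated with a short Python sketch. It is only a CPU-side model under stated assumptions: counter-clockwise triangles are treated as front-facing, a sample counts as inside when its final count is nonzero, and the helper names (`signed_area2`, `covers`, `fan`, `stencil_count`) are hypothetical rather than taken from the patent.

```python
def signed_area2(a, b, c):
    """Twice the signed area of triangle (a, b, c); positive when counter-clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def covers(tri, p):
    """True when sample p lies inside or on the boundary of tri."""
    d = [signed_area2(tri[k], tri[(k + 1) % 3], p) for k in range(3)]
    return all(v >= 0 for v in d) or all(v <= 0 for v in d)


def fan(anchor, contour):
    """Tessellate a closed contour into a triangle fan around an anchor point."""
    n = len(contour)
    return [(anchor, contour[i], contour[(i + 1) % n]) for i in range(n)]


def stencil_count(triangles, p):
    """Model of the coverage step: +1 per front-facing (assumed CCW) triangle
    covering p, -1 per back-facing triangle covering p."""
    count = 0
    for tri in triangles:
        facing = signed_area2(*tri)
        if facing == 0 or not covers(tri, p):
            continue  # degenerate triangles and uncovered samples change nothing
        count += 1 if facing > 0 else -1
    return count


# A square contour fanned from an anchor outside the square: the back-facing
# triangle cancels the front-facing coverage that spills outside the path.
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
triangles = fan((10, 2), square)
inside = stencil_count(triangles, (2, 2)) != 0   # sample within the square
outside = stencil_count(triangles, (6, 2)) != 0  # sample between square and anchor
```

With the anchor placed outside the square, the sample at (6, 2) is covered by one front-facing and one back-facing triangle, so its count returns to zero and it is classified as outside the path.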
Many common tessellation techniques generate triangle
fans and meshes having very narrow, sliver-like triangles,
which typically render with relatively poor efficiency. As a
consequence, overall path rendering efficiency and performance
may be relatively poor, which can diminish the
quality of user experience.
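One way to see why sliver-like triangles can rasterize inefficiently is to compare covered samples against the tiles a triangle touches. The Python sketch below uses an assumed cost model that is not taken from the patent: samples sit at pixel centers, a tile of `tile` by `tile` pixels counts as touched if any of its pixel centers is covered, and every touched tile costs a full tile of work. The names `tile_usage` and `efficiency` are hypothetical.

```python
import math


def edge(a, b, p):
    """Twice the signed area of triangle (a, b, p)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def covers(tri, p):
    """True when p lies inside or on the boundary of tri."""
    d = [edge(tri[k], tri[(k + 1) % 3], p) for k in range(3)]
    return all(v >= 0 for v in d) or all(v <= 0 for v in d)


def tile_usage(tri, tile=4):
    """Return (covered pixel centers, number of touched tiles) for one triangle."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    covered = 0
    touched = set()
    for j in range(math.floor(min(ys)), math.ceil(max(ys))):
        for i in range(math.floor(min(xs)), math.ceil(max(xs))):
            if covers(tri, (i + 0.5, j + 0.5)):
                covered += 1
                touched.add((i // tile, j // tile))
    return covered, len(touched)


def efficiency(tri, tile=4):
    """Fraction of the pixels in touched tiles that are actually covered."""
    covered, tiles = tile_usage(tri, tile)
    return covered / (tiles * tile * tile) if tiles else 0.0


compact = ((0, 0), (16, 0), (0, 16))  # right triangle, area 128
sliver = ((0, 0), (64, 0), (64, 2))   # long thin triangle, area 64
```

Under this model the compact triangle fills most of each tile it touches, while the sliver touches a long row of tiles and fills only a small part of each.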
As the foregoing illustrates, what is needed in the art is a
technique for improved path rendering efficiency.
SUMMARY OF THE INVENTION
One embodiment of the present invention sets forth a
method for processing a path image for efficient rasterization,
the method comprising tessellating one or more contours
defining the path image into a first set of triangles,
wherein each triangle of the first set of triangles includes a
winding order, generating a second set of triangles that are
optimized to reduce rasterization cost based on topology and
winding order of triangles within the first set of triangles,
and saving the second set of triangles.
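The abstract names a flip operator applied to selected triangle pairs, and FIG. 9 outlines a loop that computes a flip metric, selects a pair, and flips it until a termination condition holds. The Python sketch below is a minimal illustration of that loop under stated assumptions: triangles are index triples into a point list, only pairs with the same winding order and a convex union are flipped, and the flip metric is simply the reduction in shared-edge length, which is a stand-in and not the patent's cost metric. The names `try_flip` and `optimize_topology` are hypothetical.

```python
import math
from itertools import combinations


def area2(p, q, r):
    """Twice the signed area of triangle (p, q, r); the sign encodes winding order."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def try_flip(t1, t2, pts):
    """If t1 and t2 share an edge and can be flipped, return
    (gain, new_t1, new_t2); otherwise return None."""
    for i in range(3):
        a, b, c = t1[i], t1[(i + 1) % 3], t1[(i + 2) % 3]
        for j in range(3):
            # The shared edge runs b->c in t1 and c->b in t2.
            if t2[j] == c and t2[(j + 1) % 3] == b:
                d = t2[(j + 2) % 3]
                s = area2(pts[a], pts[b], pts[c])
                same_winding = s * area2(pts[c], pts[b], pts[d]) > 0
                keeps_winding = (s * area2(pts[a], pts[b], pts[d]) > 0
                                 and s * area2(pts[a], pts[d], pts[c]) > 0)
                if same_winding and keeps_winding:
                    # Assumed metric: how much shorter the new diagonal a-d is.
                    gain = math.dist(pts[b], pts[c]) - math.dist(pts[a], pts[d])
                    return gain, (a, b, d), (a, d, c)
    return None


def optimize_topology(tris, pts, max_flips=1000):
    """Greedy loop: flip the best-scoring pair until no flip improves the metric."""
    tris = list(tris)
    for _ in range(max_flips):
        best = None
        for i, j in combinations(range(len(tris)), 2):
            cand = try_flip(tris[i], tris[j], pts)
            if cand and cand[0] > 1e-12 and (best is None or cand[0] > best[0][0]):
                best = (cand, i, j)
        if best is None:
            break  # termination: no remaining flip reduces the metric
        (_, new_i, new_j), i, j = best
        tris[i], tris[j] = new_i, new_j
    return tris


# Two slivers sharing a long edge (length 20) become two triangles
# sharing a short edge (length 2) after one flip.
points = [(0, 1), (-10, 0), (10, 0), (0, -1)]
optimized = optimize_topology([(0, 1, 2), (2, 1, 3)], points)
```

A flip keeps the same four vertices, the same total signed area, and the same winding order for both triangles, so the stencil counts produced by the pair are unchanged outside degenerate cases; only the shared diagonal moves. Pairs with different facing attributes, which FIG. 6 addresses, are outside the scope of this sketch.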
Other embodiments of the present invention include,
without limitation, a computer-readable storage medium
including instructions that, when executed by a processing
unit, cause the processing unit to perform the techniques
described herein as well as a computing device that includes
a processing unit configured to perform the techniques
described herein.
One advantage of the disclosed technique is that it
improves rendering efficiency of path images rendered by a
graphics processing unit.
BRIEF DESCRIPTION OF THE DRAWINGS
So that the manner in which the above recited features of
the present invention can be understood in detail, a more
particular description of the invention, briefly summarized
above, may be had by reference to embodiments, some of
which are illustrated in the appended drawings. It is to be
noted, however, that the appended drawings illustrate only
typical embodiments of this invention and are therefore not
to be considered limiting of its scope, for the invention may
admit to other equally effective embodiments.
FIG. 1 is a block diagram illustrating a computer system
configured to implement one or more aspects of the present
invention;
FIG. 2 is a block diagram of a parallel processing subsystem
for the computer system of FIG. 1, according to one
embodiment of the present invention;
FIG. 3 is a block diagram of a portion of a streaming
multiprocessor within the general processing cluster of FIG.
2, according to one embodiment of the present invention;
FIG. 4 is a conceptual diagram of a graphics processing
pipeline that one or more of the PPUs of FIG. 2 can be
configured to implement, according to one embodiment of
the present invention;
FIG. 5A illustrates a flip operation on a triangle pair
tessellated from a path element, according to one embodiment
of the present invention;
FIG. 5B illustrates processing cost associated with a
triangle pair, according to one embodiment of the present
invention;
FIG. 5C illustrates processing cost associated with a
flipped triangle pair, according to one embodiment of the
present invention;
FIG. 6 illustrates a flip operation on a triangle pair having
different facing attributes, according to one embodiment of
the present invention;
FIG. 7 illustrates sequential flip operations for improved
overall topology optimization, according to one embodiment
of the present invention;
FIG. 8 is a flow diagram of method steps for performing
path rendering with optimized triangle topology, according
to one embodiment of the present invention; and
FIG. 9 is a flow diagram of method steps for performing
topology optimization, according to one embodiment of the
present invention.
DETAILED DESCRIPTION
In the following description, numerous specific details are
set forth to provide a more thorough understanding of the
present invention. However, it will be apparent to one of
skill in the art that the present invention may be practiced
without one or more of these specific details.
System Overview
FIG. 1 is a block diagram illustrating a computer system
100 configured to implement one or more aspects of the
present invention. Computer system 100 includes a central
processing unit (CPU) 102 and a system memory 104
communicating via an interconnection path that may include
a memory bridge 105. Memory bridge 105, which may be,
e.g., a Northbridge chip, is connected via a bus or other
communication path 106 (e.g., a HyperTransport link) to an
I/O (input/output) bridge 107. I/O bridge 107, which may be,
e.g., a Southbridge chip, receives user input from one or
more user input devices 108 (e.g., keyboard, mouse) and
forwards the input to CPU 102 via communication path 106
and memory bridge 105. A parallel processing subsystem
112 is coupled to memory bridge 105 via a bus or second
communication path 113 (e.g., a Peripheral Component
Interconnect (PCI) Express, Accelerated Graphics Port, or
HyperTransport link). In one embodiment, parallel processing
subsystem 112 is a graphics subsystem that delivers
pixels to a display device 110 that may be any conventional
cathode ray tube, liquid crystal display, light-emitting diode
display, or the like. A system disk 114 is also connected to
I/O bridge 107 and may be configured to store content and
applications and data for use by CPU 102 and parallel
processing subsystem 112. System disk 114 provides
non-volatile storage for applications and data and may include
fixed or removable hard disk drives, flash memory devices,
and CD-ROM (compact disc read-only-memory), DVD-ROM
(digital versatile disc-ROM), Blu-ray, HD-DVD (high
definition DVD), or other magnetic, optical, or solid state
storage devices.
A switch 116 provides connections between I/O bridge
107 and other components such as a network adapter 118
and various add-in cards 120 and 121. Other components
(not explicitly shown), including universal serial bus (USB)
or other port connections, compact disc (CD) drives, digital
versatile disc (DVD) drives, film recording devices, and the
like, may also be connected to I/O bridge 107. The various
communication paths shown in FIG. 1, including the
specifically named communication paths 106 and 113, may be
implemented using any suitable protocols, such as PCI
Express, AGP (Accelerated Graphics Port), HyperTransport,
or any other bus or point-to-point communication
protocol(s), and connections between different devices may
use different protocols as is known in the art.
In one embodiment, the parallel processing subsystem
112 incorporates circuitry optimized for graphics and video
processing, including, for example, video output circuitry,
and constitutes a graphics processing unit (GPU). In another
embodiment, the parallel processing subsystem 112
incorporates circuitry optimized for general purpose processing,
while preserving the underlying computational architecture,
described in greater detail herein. In yet another embodiment,
the parallel processing subsystem 112 may be integrated
with one or more other system elements in a single
subsystem, such as joining the memory bridge 105, CPU
102, and I/O bridge 107 to form a system on chip (SoC).
It will be appreciated that the system shown herein is
illustrative and that variations and modifications are possible.
The connection topology, including the number and
arrangement of bridges, the number of CPUs 102, and the
number of parallel processing subsystems 112, may be
modified as desired. For instance, in some embodiments,
system memory 104 is connected to CPU 102 directly rather
than through a bridge, and other devices communicate with
system memory 104 via memory bridge 105 and CPU 102.
In other alternative topologies, parallel processing subsystem
112 is connected to I/O bridge 107 or directly to CPU
102, rather than to memory bridge 105. In still other
embodiments, I/O bridge 107 and memory bridge 105 might be
integrated into a single chip instead of existing as one or
more discrete devices. Large embodiments may include two
or more CPUs 102 and two or more parallel processing
subsystems 112. The particular components shown herein
are optional; for instance, any number of add-in cards or
peripheral devices might be supported. In some embodiments,
switch 116 is eliminated, and network adapter 118
and add-in cards 120, 121 connect directly to I/O bridge 107.
FIG. 2 illustrates a parallel processing subsystem 112,
according to one embodiment of the present invention. As
shown, parallel processing subsystem 112 includes one or
more parallel processing units (PPUs) 202, each of which is
coupled to a local parallel processing (PP) memory 204. In
general, a parallel processing subsystem includes a number
U of PPUs, where U≥1. (Herein, multiple instances of like
objects are denoted with reference numbers identifying the
object and parenthetical numbers identifying the instance
where needed.) PPUs 202 and parallel processing memories
204 may be implemented using one or more integrated
circuit devices, such as programmable processors, application
specific integrated circuits (ASICs), or memory devices,
or in any other technically feasible fashion.
Referring again to FIG. 1 as well as FIG. 2, in some
embodiments, some or all of PPUs 202 in parallel processing
subsystem 112 are graphics processors with rendering
pipelines that can be configured to perform various operations
related to generating pixel data from graphics data supplied
by CPU 102 and/or system memory 104 via memory bridge
105 and the second communication path 113, interacting
with local parallel processing memory 204 (which can be
used as graphics memory including, e.g., a conventional
frame buffer) to store and update pixel data, delivering pixel
data to display device 110, and the like. In some embodiments,
parallel processing subsystem 112 may include one
or more PPUs 202 that operate as graphics processors and
one or more other PPUs 202 that are used for general-purpose
computations. The PPUs 202 may be identical or
different, and each PPU 202 may have one or more dedicated
parallel processing memory device(s) or no dedicated
parallel processing memory device(s). One or more PPUs
202 in parallel processing subsystem 112 may output data to
display device 110 or each PPU 202 in parallel processing
subsystem 112 may output data to one or more display
devices 110.
In operation, CPU 102 is the master processor of computer
system 100, controlling and coordinating operations of
other system components. In particular, CPU 102 issues
commands that control the operation of PPUs 202. In some
embodiments, CPU 102 writes a stream of commands for
each PPU 202 to a data structure (not explicitly shown in
either FIG. 1 or FIG. 2) that may be located in system
memory 104, parallel processing memory 204, or another
storage location accessible to both CPU 102 and PPU 202.
A pointer to each data structure is written to a pushbuffer to
initiate processing of the stream of commands in the data
structure. The PPU 202 reads command streams from one or
more pushbuffers and then executes commands asynchronously
relative to the operation of CPU 102. Execution
priorities may be specified for each pushbuffer by an
application program via the device driver 103 to control
scheduling of the different pushbuffers.
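The producer and consumer relationship described above can be modeled in a few lines of Python. This is a loose software analogy and not a description of the hardware: a dictionary stands in for the shared storage location, a `queue.Queue` stands in for a single pushbuffer, a worker thread stands in for PPU 202, and the command names are invented for illustration. Per-pushbuffer execution priorities are not modeled.

```python
import queue
import threading


def ppu_worker(pushbuffer, memory, executed):
    """Stand-in for the PPU: read pointers from the pushbuffer and execute the
    command stream each one refers to, independently of the producer."""
    while True:
        pointer = pushbuffer.get()
        if pointer is None:  # hypothetical shutdown marker for this demo
            break
        for command in memory[pointer]:
            executed.append(command)


def run_demo():
    memory = {}                 # storage visible to both "CPU" and "PPU"
    pushbuffer = queue.Queue()  # holds pointers to command data structures
    executed = []

    worker = threading.Thread(target=ppu_worker, args=(pushbuffer, memory, executed))
    worker.start()

    # "CPU" side: write a command stream, then publish a pointer to it.
    memory[0] = ["set_state", "draw"]
    pushbuffer.put(0)
    memory[1] = ["clear"]
    pushbuffer.put(1)

    pushbuffer.put(None)
    worker.join()
    return executed
```

The producer never waits for a command to finish before submitting the next one; it only publishes pointers, which mirrors the asynchronous execution the paragraph describes.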
Referring back now to FIG. 2 as well as FIG. 1, each PPU
202 includes an I/O (input/output) unit 205 that communicates
with the rest of computer system 100 via communication
path 113, which connects to memory bridge 105 (or,
in one alternative embodiment, directly to CPU 102). The
connection of PPU 202 to the rest of computer system 100
may also be varied. In some embodiments, parallel processing
subsystem 112 is implemented as an add-in card that can
be inserted into an expansion slot of computer system 100.
In other embodiments, a PPU 202 can be integrated on a
single chip with a bus bridge, such as memory bridge 105 or
I/O bridge 107. In still other embodiments, some or all
elements of PPU 202 may be integrated on a single chip with
CPU 102.
In one embodiment, communication path 113 is a PCI
Express link, in which dedicated lanes are allocated to each
PPU 202, as is known in the art. Other communication paths
may also be used. An I/O unit 205 generates packets (or
other signals) for transmission on communication path 113
and also receives all incoming packets (or other signals)
from communication path 113, directing the incoming packets
to appropriate components of PPU 202. For example,
commands related to processing tasks may be directed to a
host interface 206, while commands related to memory
operations (e.g., reading from or writing to parallel processing
memory 204) may be directed to a memory crossbar unit
210. Host interface 206 reads each pushbuffer and outputs
the command stream stored in the pushbuffer to a front end
212.
Each PPU 202 advantageously implements a highly parallel
processing architecture. As shown in detail, PPU 202(0)
includes a processing cluster array 230 that includes a
number C of general processing clusters (GPCs) 208, where
C≥1. Each GPC 208 is capable of executing a large number
(e.g., hundreds or thousands) of threads concurrently, where
each thread is an instance of a program. In various applications,
different GPCs 208 may be allocated for processing
different types of programs or for performing different types
of computations. The allocation of GPCs 208 may vary
dependent on the workload arising for each type of program
or computation.
GPCs 208 receive processing tasks to be executed from a
work distribution unit within a task/work unit 207. The work