(12) United States Patent — Bolz et al.
(10) Patent No.: US 9,558,573 B2
(45) Date of Patent: Jan. 31, 2017

(54) OPTIMIZING TRIANGLE TOPOLOGY FOR PATH RENDERING

(71) Applicant: NVIDIA CORPORATION, Santa Clara, CA (US)

(72) Inventors: Jeffrey A. Bolz, Austin, TX (US); Mark J. Kilgard, Austin, TX (US)

(73) Assignee: NVIDIA Corporation, Santa Clara, CA (US)

( * ) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 195 days.

(21) Appl. No.: 13/717,458

(22) Filed: Dec. 17, 2012

(65) Prior Publication Data: US 2014/0168222 A1, Jun. 19, 2014

(51) Int. Cl.: G06T 11/40 (2006.01); G06T 11/20 (2006.01)

(52) U.S. Cl.: CPC ........ G06T 11/40 (2013.01); G06T 11/203 (2013.01)

(58) Field of Classification Search:
    CPC ........ G06T 11/203; G06T 11/20; G06T 17/20; G06T 17/205
    USPC ....... 345/423, 441, 442
    See application file for complete search history.

(56) References Cited

    U.S. PATENT DOCUMENTS

    6,064,771 A     5/2000  Migdal et al.
    7,408,553 B1 *  8/2008  Toksvig et al. .............. 345/441

    OTHER PUBLICATIONS

    Kolingerová, "Simulated Annealing and Genetic Algorithms in Quest of Optimal Triangulations," Generalized Voronoi Diagram, SCI 158, pp. 247-266, Springer-Verlag Berlin Heidelberg, 2009.
    Shewchuk, "Updating and Constructing Constrained Delaunay and Constrained Regular Triangulations by Flips," SoCG '03, ACM, 2003.

    * cited by examiner

Primary Examiner — Carlos Perromat
(74) Attorney, Agent, or Firm — Artegis Law Group, LLP

(57) ABSTRACT

A technique for efficiently rendering path images tessellates path contours into triangle fans comprising a set of representative triangles. The topology of the set of representative triangles is then optimized for greater rasterization efficiency by applying a flip operator to selected triangle pairs within the set of representative triangles. The optimized triangles are then rendered using a path rendering technique such as stencil and cover.

17 Claims, 9 Drawing Sheets

[Representative drawing: Flip Operation 600]

[FIG. 1 (Sheet 1 of 9): Block diagram of computer system 100]
[FIG. 2 (Sheet 2 of 9): Block diagram of parallel processing subsystem 112]
[FIG. 3 (Sheet 3 of 9): Block diagram of a portion of a streaming multiprocessor (SM) 310]
[FIG. 4 (Sheet 4 of 9): Conceptual diagram of graphics processing pipeline 400, including data assembler 410, vertex processing unit 415, primitive assembler 420, geometry processing unit 425, viewport scale, cull, and clip unit 450, rasterizer 455, fragment processing unit 460, raster operations unit 465, and memory interface 214]
[FIG. 5A (Sheet 5 of 9): Flip operation 500 on a triangle pair (triangles 510, 512, 514) defined by vertices A, B, C, and D]
[FIG. 5B: Processing cost associated with the original triangle pair (total: 30)]
[FIG. 5C: Processing cost associated with the flipped triangle pair (total: 21)]
[FIG. 6 (Sheet 6 of 9): Flip operation 600 on a triangle pair having different facing attributes]
[FIG. 7 (Sheet 7 of 9): Sequential flip operations 710 and 720 on triangles defined by vertices K, L, M, and P]
[FIG. 8 (Sheet 8 of 9): Method 800 — receive path image (810); tessellate path image into set of representative triangles; optimize topology for set of representative triangles (830); save optimized set of triangles; render optimized set of triangles (850); done (890)]
[FIG. 9 (Sheet 9 of 9): Method 900 — receive set of representative triangles; termination metric satisfied? (920): if yes, done (990); if no, calculate flip metric for triangle pairs (922), select triangle pair to flip based on flip metric (924), perform flip on selected triangle pair, and repeat]

OPTIMIZING TRIANGLE TOPOLOGY FOR PATH RENDERING

BACKGROUND OF THE INVENTION

Field of the Invention

The present invention generally relates to path rendering and, more specifically, to optimizing triangle topology for path rendering.

Description of the Related Art

Path rendering represents one style of resolution-independent two-dimensional (2D) rendering that forms the basis for a number of important graphics rendering standards known in the art as PostScript, Java 2D, Apple's Quartz 2D, PDF, TrueType fonts, OpenType fonts, PostScript fonts, scalable vector graphics (SVG), OpenVG, Microsoft's Silverlight, Adobe Flash, Microsoft's XML Paper Specification (XPS), and more.

One class of techniques for performing path rendering includes at least a tessellation step and a path coverage step. Path elements are tessellated into representative triangles in the tessellation step. The path coverage step draws the tessellated triangles, and samples covered by these triangles are counted in a stencil or color buffer, which is used to determine whether each sample is inside or outside an associated path. Front-facing triangles increment covered sample counts and back-facing triangles decrement covered sample counts. Samples counted as inside a path are rendered according to an associated path fill color, while samples counted as outside a path are not rendered to the path fill color.

Many common tessellation techniques generate triangle fans and meshes having very narrow, sliver-like triangles, which typically render with relatively poor efficiency. As a consequence, overall path rendering efficiency and performance may be relatively poor, which can diminish the quality of the user experience.

As the foregoing illustrates, what is needed in the art is a technique for improved path rendering efficiency.
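For concreteness, the coverage step described above can be sketched in software. The following C++ fragment accumulates per-sample winding counts in a stencil-like buffer (front-facing triangles increment, back-facing triangles decrement) and then fills samples whose count is nonzero. The brute-force point-in-triangle loop, the data layout, and all names are illustrative assumptions rather than the interface of any particular rendering standard or GPU.

```cpp
#include <cstdint>
#include <vector>

struct Vec2 { float x, y; };
struct Tri  { Vec2 a, b, c; };

// Signed area determines facing: positive for counter-clockwise (front-facing).
static float signedArea(const Tri& t) {
    return 0.5f * ((t.b.x - t.a.x) * (t.c.y - t.a.y) -
                   (t.c.x - t.a.x) * (t.b.y - t.a.y));
}

// Orientation-insensitive point-in-triangle test via edge functions.
static bool covers(const Tri& t, float px, float py) {
    auto edge = [](Vec2 p, Vec2 q, float x, float y) {
        return (q.x - p.x) * (y - p.y) - (q.y - p.y) * (x - p.x);
    };
    float e0 = edge(t.a, t.b, px, py);
    float e1 = edge(t.b, t.c, px, py);
    float e2 = edge(t.c, t.a, px, py);
    return (e0 >= 0 && e1 >= 0 && e2 >= 0) || (e0 <= 0 && e1 <= 0 && e2 <= 0);
}

// "Stencil" pass: accumulate winding counts for one sample per pixel.
std::vector<int> stencilPass(const std::vector<Tri>& tris, int w, int h) {
    std::vector<int> winding(static_cast<size_t>(w) * h, 0);
    for (const Tri& t : tris) {
        int delta = (signedArea(t) > 0.0f) ? +1 : -1;   // front: +1, back: -1
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
                if (covers(t, x + 0.5f, y + 0.5f))
                    winding[static_cast<size_t>(y) * w + x] += delta;
    }
    return winding;
}

// "Cover" pass: write the fill color wherever the winding count is nonzero.
void coverPass(const std::vector<int>& winding, std::vector<uint32_t>& rgba,
               int w, int h, uint32_t fillColor) {
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            if (winding[static_cast<size_t>(y) * w + x] != 0)
                rgba[static_cast<size_t>(y) * w + x] = fillColor;
}
```

In hardware the same counting is performed by the rasterizer and stencil unit, which is one reason sliver triangles are costly: they touch many pixel rows while contributing relatively few covered samples.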
SUMMARY OF THE INVENTION

One embodiment of the present invention sets forth a method for processing a path image for efficient rasterization. The method comprises tessellating one or more contours defining the path image into a first set of triangles, wherein each triangle of the first set of triangles includes a winding order, generating a second set of triangles that are optimized to reduce rasterization cost based on the topology and winding order of triangles within the first set of triangles, and saving the second set of triangles.

Other embodiments of the present invention include, without limitation, a computer-readable storage medium including instructions that, when executed by a processing unit, cause the processing unit to perform the techniques described herein, as well as a computing device that includes a processing unit configured to perform the techniques described herein.

One advantage of the disclosed technique is that it improves the rendering efficiency of path images rendered by a graphics processing unit.

BRIEF DESCRIPTION OF THE DRAWINGS

So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.

FIG. 1 is a block diagram illustrating a computer system configured to implement one or more aspects of the present invention;

FIG. 2 is a block diagram of a parallel processing subsystem for the computer system of FIG. 1, according to one embodiment of the present invention;

FIG. 3 is a block diagram of a portion of a streaming multiprocessor within the general processing cluster of FIG. 2, according to one embodiment of the present invention;

FIG. 4 is a conceptual diagram of a graphics processing pipeline that one or more of the PPUs of FIG. 2 can be configured to implement, according to one embodiment of the present invention;

FIG. 5A illustrates a flip operation on a triangle pair tessellated from a path element, according to one embodiment of the present invention;

FIG. 5B illustrates processing cost associated with a triangle pair, according to one embodiment of the present invention;

FIG. 5C illustrates processing cost associated with a flipped triangle pair, according to one embodiment of the present invention;

FIG. 6 illustrates a flip operation on a triangle pair having different facing attributes, according to one embodiment of the present invention;

FIG. 7 illustrates sequential flip operations for improved overall topology optimization, according to one embodiment of the present invention;

FIG. 8 is a flow diagram of method steps for performing path rendering with optimized triangle topology, according to one embodiment of the present invention; and

FIG. 9 is a flow diagram of method steps for performing topology optimization, according to one embodiment of the present invention.

DETAILED DESCRIPTION

In the following description, numerous specific details are set forth to provide a more thorough understanding of the present invention. However, it will be apparent to one of skill in the art that the present invention may be practiced without one or more of these specific details.
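Before the system overview, the optimization summarized above and outlined in FIGS. 8 and 9 can be illustrated with a short sketch: triangle pairs that share an interior edge are examined, a flip metric is computed for each pair, the pair whose flip most reduces an estimated rasterization cost is flipped, and the loop repeats until no flip improves the estimate (the termination metric). The specific cost estimate used below (pixel rows spanned by each triangle, in the spirit of the totals shown in FIGS. 5B and 5C), the treatment of each pair in isolation, and all type and function names are illustrative assumptions and not the claimed method; in particular, pairs with different facing attributes (FIG. 6) and triangles shared between adjacent pairs require additional bookkeeping that is omitted here.

```cpp
#include <cmath>
#include <vector>

struct Vec2 { float x, y; };

// A triangle pair sharing one interior edge: the shared edge is (s0, s1) and
// the opposite vertices are u and v, giving triangles (s0, s1, u) and (s1, s0, v).
struct TrianglePair { Vec2 s0, s1, u, v; };

// Assumed rasterization-cost estimate for one triangle: number of pixel rows
// its bounding box spans (narrow slivers spanning many rows are expensive).
static int rowCost(Vec2 a, Vec2 b, Vec2 c) {
    float lo = std::fmin(a.y, std::fmin(b.y, c.y));
    float hi = std::fmax(a.y, std::fmax(b.y, c.y));
    return static_cast<int>(std::ceil(hi) - std::floor(lo));
}

static int pairCost(const TrianglePair& p) {
    return rowCost(p.s0, p.s1, p.u) + rowCost(p.s1, p.s0, p.v);
}

// Flip metric: cost after swapping the shared diagonal minus cost before.
// Negative values indicate the flip is an improvement.
static int flipMetric(const TrianglePair& p) {
    TrianglePair flipped{p.u, p.v, p.s1, p.s0};   // new shared edge is (u, v)
    return pairCost(flipped) - pairCost(p);
}

// Greedy loop corresponding to FIG. 9: while some flip still reduces the
// estimated cost, flip the pair with the most negative flip metric.
void optimizeTopology(std::vector<TrianglePair>& pairs) {
    for (;;) {
        int best = -1;
        int bestMetric = 0;
        for (size_t i = 0; i < pairs.size(); ++i) {
            int m = flipMetric(pairs[i]);
            if (m < bestMetric) { bestMetric = m; best = static_cast<int>(i); }
        }
        if (best < 0) break;                       // termination metric satisfied
        TrianglePair& p = pairs[static_cast<size_t>(best)];
        p = TrianglePair{p.u, p.v, p.s1, p.s0};    // perform the flip
    }
}
```

For a pair of same-facing triangles, the flip leaves the covered quadrilateral and its net winding unchanged, so the stencil-and-cover result is preserved while narrow slivers are avoided; pairs with different facing attributes require additional care, as suggested by FIG. 6.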
System Overview

FIG. 1 is a block diagram illustrating a computer system 100 configured to implement one or more aspects of the present invention. Computer system 100 includes a central processing unit (CPU) 102 and a system memory 104 communicating via an interconnection path that may include a memory bridge 105. Memory bridge 105, which may be, e.g., a Northbridge chip, is connected via a bus or other communication path 106 (e.g., a HyperTransport link) to an I/O (input/output) bridge 107. I/O bridge 107, which may be, e.g., a Southbridge chip, receives user input from one or more user input devices 108 (e.g., keyboard, mouse) and forwards the input to CPU 102 via communication path 106 and memory bridge 105. A parallel processing subsystem 112 is coupled to memory bridge 105 via a bus or second communication path 113 (e.g., a Peripheral Component Interconnect (PCI) Express, Accelerated Graphics Port, or HyperTransport link); in one embodiment, parallel processing subsystem 112 is a graphics subsystem that delivers pixels to a display device 110 that may be any conventional cathode ray tube, liquid crystal display, light-emitting diode display, or the like. A system disk 114 is also connected to I/O bridge 107 and may be configured to store content and applications and data for use by CPU 102 and parallel processing subsystem 112. System disk 114 provides non-volatile storage for applications and data and may include fixed or removable hard disk drives, flash memory devices, and CD-ROM (compact disc read-only-memory), DVD-ROM (digital versatile disc-ROM), Blu-ray, HD-DVD (high definition DVD), or other magnetic, optical, or solid state storage devices.

A switch 116 provides connections between I/O bridge 107 and other components such as a network adapter 118 and various add-in cards 120 and 121. Other components (not explicitly shown), including universal serial bus (USB) or other port connections, compact disc (CD) drives, digital versatile disc (DVD) drives, film recording devices, and the like, may also be connected to I/O bridge 107. The various communication paths shown in FIG. 1, including the specifically named communication paths 106 and 113, may be implemented using any suitable protocols, such as PCI Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s), and connections between different devices may use different protocols as is known in the art.

In one embodiment, the parallel processing subsystem 112 incorporates circuitry optimized for graphics and video processing, including, for example, video output circuitry, and constitutes a graphics processing unit (GPU). In another embodiment, the parallel processing subsystem 112 incorporates circuitry optimized for general purpose processing, while preserving the underlying computational architecture, described in greater detail herein. In yet another embodiment, the parallel processing subsystem 112 may be integrated with one or more other system elements in a single subsystem, such as joining the memory bridge 105, CPU 102, and I/O bridge 107 to form a system on chip (SoC).

It will be appreciated that the system shown herein is illustrative and that variations and modifications are possible. The connection topology,
including the number and arrangement of bridges, the number of CPUs 102, and the number of parallel processing subsystems 112, may be modified as desired. For instance, in some embodiments, system memory 104 is connected to CPU 102 directly rather than through a bridge, and other devices communicate with system memory 104 via memory bridge 105 and CPU 102. In other alternative topologies, parallel processing subsystem 112 is connected to I/O bridge 107 or directly to CPU 102, rather than to memory bridge 105. In still other embodiments, I/O bridge 107 and memory bridge 105 might be integrated into a single chip instead of existing as one or more discrete devices. Large embodiments may include two or more CPUs 102 and two or more parallel processing subsystems 112. The particular components shown herein are optional; for instance, any number of add-in cards or peripheral devices might be supported. In some embodiments, switch 116 is eliminated, and network adapter 118 and add-in cards 120, 121 connect directly to I/O bridge 107.

FIG. 2 illustrates a parallel processing subsystem 112, according to one embodiment of the present invention. As shown, parallel processing subsystem 112 includes one or more parallel processing units (PPUs) 202, each of which is coupled to a local parallel processing (PP) memory 204. In general, a parallel processing subsystem includes a number U of PPUs, where U≥1. (Herein, multiple instances of like objects are denoted with reference numbers identifying the object and parenthetical numbers identifying the instance where needed.) PPUs 202 and parallel processing memories 204 may be implemented using one or more integrated circuit devices, such as programmable processors, application specific integrated circuits (ASICs), or memory devices, or in any other technically feasible fashion.

Referring again to FIG. 1 as well as FIG. 2, in some embodiments, some or all of PPUs 202 in parallel processing subsystem 112 are graphics processors with rendering pipelines that can be configured to perform various operations related to generating pixel data from graphics data supplied by CPU 102 and/or system memory 104 via memory bridge 105 and the second communication path 113, interacting with local parallel processing memory 204 (which can be used as graphics memory including, e.g., a conventional frame buffer) to store and update pixel data, delivering pixel data to display device 110, and the like. In some embodiments, parallel processing subsystem 112 may include one or more PPUs 202 that operate as graphics processors and one or more other PPUs 202 that are used for general-purpose computations. The PPUs 202 may be identical or different, and each PPU 202 may have one or more dedicated parallel processing memory device(s) or no dedicated parallel processing memory device(s). One or more PPUs 202 in parallel processing subsystem 112 may output data to display device 110, or each PPU 202 in parallel processing subsystem 112 may output data to one or more display devices 110.

In operation, CPU 102 is the master processor of computer system 100, controlling and coordinating operations of other system components. In particular, CPU 102 issues commands that control the operation of PPUs 202. In some embodiments, CPU 102 writes a stream of commands for each PPU 202 to a data structure (not explicitly shown in either FIG. 1 or FIG. 2) that may be located in system memory 104, parallel processing memory 204, or another storage location accessible to both CPU 102 and PPU 202.
A pointer to each data structure is written to a pushbuffer to initiate processing of the stream of commands in the data structure. The PPU 202 reads command streams from one or more pushbuffers and then executes commands asynchronously relative to the operation of CPU 102. Execution priorities may be specified for each pushbuffer by an application program via the device driver 103 to control scheduling of the different pushbuffers.

Referring back now to FIG. 2 as well as FIG. 1, each PPU 202 includes an I/O (input/output) unit 205 that communicates with the rest of computer system 100 via communication path 113, which connects to memory bridge 105 (or, in one alternative embodiment, directly to CPU 102). The connection of PPU 202 to the rest of computer system 100 may also be varied. In some embodiments, parallel processing subsystem 112 is implemented as an add-in card that can be inserted into an expansion slot of computer system 100. In other embodiments, a PPU 202 can be integrated on a single chip with a bus bridge, such as memory bridge 105 or I/O bridge 107. In still other embodiments, some or all elements of PPU 202 may be integrated on a single chip with CPU 102.

In one embodiment, communication path 113 is a PCI Express link, in which dedicated lanes are allocated to each PPU 202, as is known in the art. Other communication paths may also be used. An I/O unit 205 generates packets (or other signals) for transmission on communication path 113 and also receives all incoming packets (or other signals) from communication path 113, directing the incoming packets to appropriate components of PPU 202. For example, commands related to processing tasks may be directed to a host interface 206, while commands related to memory operations (e.g., reading from or writing to parallel processing memory 204) may be directed to a memory crossbar unit 210. Host interface 206 reads each pushbuffer and outputs the command stream stored in the pushbuffer to a front end 212.

Each PPU 202 advantageously implements a highly parallel processing architecture. As shown in detail, PPU 202(0) includes a processing cluster array 230 that includes a number C of general processing clusters (GPCs) 208, where C≥1. Each GPC 208 is capable of executing a large number (e.g., hundreds or thousands) of threads concurrently, where each thread is an instance of a program. In various applications, different GPCs 208 may be allocated for processing different types of programs or for performing different types of computations. The allocation of GPCs 208 may vary dependent on the workload arising for each type of program or computation.

GPCs 208 receive processing tasks to be executed from a work distribution unit within a task/work unit 207.
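The command-submission model described earlier in this overview, in which CPU 102 appends commands to pushbuffers that the PPU's host interface drains asynchronously, behaves like a single-producer/single-consumer ring buffer. The sketch below is an illustrative model only; the class name, fixed capacity, and command-word representation are assumptions and do not describe the actual hardware interface.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Single-producer/single-consumer ring modeling a pushbuffer: the CPU
// (producer) appends command words and advances a put pointer; the PPU host
// interface (consumer) reads commands asynchronously and advances a get pointer.
class Pushbuffer {
public:
    static constexpr size_t kCapacity = 1024;   // command words (assumed)

    // Producer side (CPU): returns false if the buffer is currently full.
    bool write(uint32_t commandWord) {
        size_t put = put_.load(std::memory_order_relaxed);
        size_t next = (put + 1) % kCapacity;
        if (next == get_.load(std::memory_order_acquire))
            return false;                        // full: caller waits or uses another pushbuffer
        words_[put] = commandWord;
        put_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side (PPU host interface): returns false if no commands are pending.
    bool read(uint32_t& commandWord) {
        size_t get = get_.load(std::memory_order_relaxed);
        if (get == put_.load(std::memory_order_acquire))
            return false;                        // empty: consumer idles or switches streams
        commandWord = words_[get];
        get_.store((get + 1) % kCapacity, std::memory_order_release);
        return true;
    }

private:
    uint32_t words_[kCapacity] = {};
    std::atomic<size_t> put_{0};
    std::atomic<size_t> get_{0};
};
```

Because the consumer drains commands asynchronously, CPU 102 can continue issuing work for other pushbuffers or PPUs; the per-pushbuffer execution priorities mentioned above are the mechanism by which device driver 103 influences which pending stream is serviced next.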
