
UBC Theses and Dissertations

Hybrid rendering: in pursuit of real-time raytracing. Demoullin, Francois M. 2020.


Full Text


Hybrid Rendering: In Pursuit of Real-Time Raytracing

by

Francois M. Demoullin

B.Sc., University of British Columbia, 2017

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Applied Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Electrical and Computer Engineering)

The University of British Columbia (Vancouver)

June 2020

© Francois M. Demoullin, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Hybrid Rendering: In Pursuit of Real-Time Raytracing

submitted by Francois M. Demoullin in partial fulfillment of the requirements for the degree of Master of Applied Science in Electrical and Computer Engineering.

Examining Committee:

Tor M. Aamodt, Electrical and Computer Engineering (Supervisor)

Michiel van de Panne, Computer Science (Supervisory Committee Member)

Prashant Nair, Electrical and Computer Engineering (Supervisory Committee Member)

Abstract

Hybrid rendering combines ray-tracing and rasterization graphics techniques to generate visually accurate, photorealistic computer-generated images at a tight real-time frame rate. This thesis presents contributions to the field of hybrid rendering by introducing an open-source ray-traced ambient occlusion workload, by quantifying the performance tradeoffs of this workload, and by evaluating one promising hardware technique to accelerate real-time raytracing.

In this thesis, we introduce the first open-source implementation of a viewpoint-independent raytraced ambient occlusion GPU compute benchmark. We study this workload using a cycle-level GPU simulator and present the trade-offs between performance and quality.
We show that some ray-traced ambient occlusion is possible on a real-time frame budget but that the full-quality effect is still too computationally expensive for today's GPU architectures.

This thesis provides a limit study of a promising new technique, Hash-Based Ray Path Prediction (HRPP), which exploits the similarity between rays to predict leaf nodes and thereby avoid redundant acceleration structure traversals. Our data shows that acceleration structure traversal consumes a significant proportion of the raytracing rendering time regardless of the platform or the target image quality. Our study quantifies unused ray locality and evaluates the theoretical potential for improved ray traversal performance for both coherent and seemingly incoherent rays. We show that HRPP can skip, on average, 40% of all hit-all traversal computations. We further show that the significant memory overhead, on the order of megabytes, inhibits this technique from being feasible for current architectures.

Lay Summary

Computer graphics is the science of generating pictures on a computer screen. There is an inherent trade-off between making the images look realistic and generating them fast. Current work has pushed both ends of the spectrum: images either look more realistic and take longer, or they are generated faster. Quickly generated images still lack realism compared to slowly generated ones.

This thesis implements a workload that takes a highly realistic effect, raytraced ambient occlusion, and analyzes the feasibility of rendering it quickly. We explore the trade-offs between visual quality and speed of image generation for this particular workload. Based on the outcome of our analysis, we propose new changes to the state-of-the-art architecture of computing cores to facilitate faster visual effects without excessively changing the existing hardware.

Preface

This dissertation is based on a research project conducted by myself under the supervision and guidance of Professor Tor M. Aamodt. I assisted with defining the problem space and was responsible for deriving mathematical solutions, identifying challenges within this problem space, and designing and optimizing the algorithm to evaluate the proposed idea. I also conducted the experiments and collected all the data represented in this dissertation. Tor M. Aamodt provided valuable guidance and direction in identifying the research problems, developing solution methodologies, and documenting the results.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
1 Introduction
  1.1 Contributions
  1.2 Outline
2 Background
  2.1 Rasterization
  2.2 Ray-Tracing
  2.3 Hybrid Rendering
  2.4 Real-Time Raytracing
    2.4.1 CPU Ray-Tracing
    2.4.2 GPU Compute Ray-Tracing
    2.4.3 Custom Raytracing Hardware
  2.5 Ambient Occlusion
3 Ray-Traced Ambient Occlusion GPU Memory Study
  3.1 Ray-Traced Ambient Occlusion Benchmark
    3.1.1 Determining Ray-Origin
    3.1.2 Determining Ray-Direction
    3.1.3 Tracing Rays
  3.2 Visual Ambient Occlusion Output
  3.3 Performance and Memory Behaviour of GPU Ray-Traced Ambient Occlusion
    3.3.1 Performance Evaluation
      3.3.1.1 Effect of Scene Size on Performance
      3.3.1.2 Effect of SPS on Performance
    3.3.2 Memory Evaluation
      3.3.2.1 Memory Access Classification
      3.3.2.2 Cache Performance
      3.3.2.3 Memory Reuse
4 Hash-Based Ray Path Prediction
  4.1 Ray Tracing Performance Evaluation
  4.2 Related Work
  4.3 Hash-Based Ray Path Prediction
    4.3.1 Overview
    4.3.2 Hash-Based Ray Path Prediction (HRPP)
    4.3.3 Precision and Recall
  4.4 HRPP Hash Function
    4.4.1 Hash Function Tradeoff
  4.5 Limit Study
    4.5.1 Overview Table
    4.5.2 Go Up Level
    4.5.3 Samples Per Pixel
  4.6 Conclusion
5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
Bibliography
A Supporting Materials

List of Tables

Table 3.1 Sample scenes used for performance evaluation of our RTAO benchmark
Table 3.2 The time (in ms) to render the complete AO effect at 8 SPS and 1 SPT per scene for the 3 GPU raytracers
Table 3.3 Cache miss rates for various GPU caches for the Dragon and Sponza scenes
Table 3.4 Limit study of the number of DRAM accesses saved with an infinite L2 cache
Table 3.5 Memory reuse study: what fraction of total accesses the top X% of memory addresses account for
Table 4.1 Evaluation of test scenes - resolution 1024×1024 - 8 SPP - Go Up Level 0 - hash precision 6

List of Figures

Figure 2.1 Rasterization graphics projects the primitives onto the screen using the orthographic Model View Projection (MVP) matrix
Figure 2.2 The OpenGL pipeline as implemented on most dedicated graphics processing units [33, 37]
Figure 2.3 Raytracing traces rays into the scene, which are intersected with the geometry and launch secondary rays
Figure 2.4 Bounding Volume Hierarchy (BVH) traversal and intersection
Figure 2.5 The effect of ambient occlusion on a human face [5]. The left image of Figure 2.5a shows the model without the ambient occlusion effect; the face appears flat and unrealistic. The right image of Figure 2.5a shows the same face with the ambient occlusion effect enabled; the face appears vibrant and more realistic compared to the left image. Figure 2.5b shows the ambient occlusion effect that is applied in Figure 2.5a
Figure 2.6 The difference between Screen-Space Ambient Occlusion and Ray-Traced Ambient Occlusion on the Sponza scene as rendered by the Unity Game Engine [76]
Figure 2.7 A simplified example of Ray-Traced Ambient Occlusion showing three levels of occlusion of ambient lighting: 1 is unoccluded, 2 is partially occluded, and 3 is heavily occluded
Figure 3.1 Using the vertex positions as the origins of the rays for the Teapot model. The regular pattern provides poor surface coverage, resulting in an incomplete visual ambient occlusion effect. Each red dot represents the origin of one ray
Figure 3.2 Illustration of samples per triangle (SPT) surface coverage for 4 SPT, 8 SPT, and 16 SPT. Each red dot represents the origin of one ray
Figure 3.3 Using the vertex normal as ray direction yields poor coverage of the hemisphere originating at P. Each green dot represents the destination of the ray direction of one ray
Figure 3.4 Illustration of Samples Per Surfel (SPS) using the stochastic hemisphere sampling approach from Listing 3.1. Each green dot represents the destination of the ray direction of one ray
Figure 3.5 Our Ray-Traced Ambient Occlusion output visualized on 8 scenes. We used 32 SPS and 32 SPT for (a) and (b) and 4 SPS and 4 SPT for (c) through (f)
Figure 3.6 Kernel timing for the Aila, Ylitie, and OptiX raytracers for all 6 benchmark scenes; 8 SPT, 1 SPS. Figure 3.7a shows the number of triangles per scene; Figure 3.7b shows 60 FPS
Figure 3.7 Kernel timing for the Aila, Ylitie, and OptiX raytracers for the Buddha scene for different values of SPS. The kernel performance scales linearly with SPS
Figure 3.8 Distribution of memory accesses generated at the GPU compute cores; 100k rays, 8 SPS, 8 SPT
Figure 3.9 L1 data and L2 cache miss rates per buffer - Dragon and Sponza scenes
Figure 3.10 Memory reuse study: classification of memory addresses belonging to the top X% of all accessed memory addresses
Figure 4.1 Runtime proportions of increasingly complex scenes on the CPU and the GPU - 1024×1024, 8 SPP
Figure 4.2 Binary BVH acceleration structure augmented by HRPP. The interior nodes are represented by their bounding boxes. Leaf nodes have both bounding boxes and triangular scene geometry. HRPP additions are marked in green. The red arrows represent the control flow from root to leaves through HRPP
Figure 4.3 Illustration of our hash function's extraction of bits from an IEEE 754 floating-point value with an example precision of 2 bits. Bits marked in green are included in the hash representation
Figure 4.4 Illustration of the tradeoff between a tight and a loose hash function and its impact on the predictor table properties. Buddha - 1024×1024 - SPP 8
Figure 4.5 Impact of Go Up Level on the number of predictor entries. Buddha - 1024×1024 - hash precision 6
Figure 4.6 Impact of SPP on interior node computations skipped and the number of predictor entries. Buddha - 1024×1024 - hash precision 6
Figure A.1 SPT tradeoff for the dragon scene. SPP is fixed at 32
Figure A.2 SPP tradeoff for the teapot scene. SPT is fixed at 32
Glossary

AABB  Axis-Aligned Bounding-Box
AO    Ambient-Occlusion
API   Application Program Interface
ASIC  Application-Specific Integrated Circuit
BVH   Bounding Volume Hierarchy
CG    Computer Graphics
CPU   Central Processing Unit
CUDA  Compute Unified Device Architecture
DXR   DirectX Ray-tracing
FPGA  Field-Programmable Gate Array
FPS   Frames Per Second
GI    Global Illumination
GPU   Graphics Processing Unit
HRPP  Hash-Based Ray Path Prediction
LDST  Load-Store Queue
MIMD  Multiple-Instructions Multiple Data
MVP   Model-View-Projection
RTAO  Ray-Traced Ambient-Occlusion
SIMD  Single Instruction Multiple Data
SIMT  Single Instruction Multiple Threads
SPP   Samples Per Pixel
SPS   Samples Per Surfel
SPT   Samples Per Triangle
SSAO  Screen-Space Ambient-Occlusion
SSDO  Screen-Space Directional Occlusion

Acknowledgments

I would like to thank my parents Nicole and Marc for their support. I want to give special thanks to my sister, Stephanie, who continues to be the smartest and most impressive person I know. I want to thank everyone in the lab, especially Aamir and Amruth, for being the best friends I could have wished for. I would like to thank Priscie and Osman for being my best friends outside of the lab and for always being there for me. I want to thank Holly for helping me through this degree, for always being there for me, and for enduring my many complaints. I would like to thank Tor for his patient advice and guidance.

Chapter 1

Introduction

This thesis explores real-time rendering, the study of quickly producing visual content on a computer screen. We explore new avenues for increasing the image quality of rendered images without sacrificing the interactive frame rates needed for applications such as video games. This thesis proposes the first open-source ray-traced ambient occlusion benchmark using Graphics Processing Unit (GPU) compute kernels. We explore the memory and compute tradeoffs of ray-traced ambient occlusion on a cycle-level GPU simulator.
We finally propose and evaluate one promising extension to existing GPU hardware architecture to improve ray-tracing speed and efficiency.

This research is driven by the popularity of video games and the discrepancy in image quality between offline renderers and real-time applications. GPUs emerged in the 1980s as specialized graphics accelerators [52]. Consisting of a few thousand transistors, these early GPUs were niche and expensive accelerators used only in highly specialized supercomputers. Nowadays, GPUs are widely popular: commercial vendors like NVIDIA and AMD sell over 10 billion GPUs per year [70], and the video game industry is expected to be worth over 90 billion dollars, with 2.5 billion people playing video games worldwide [88].

Mobile video games pose an even bigger challenge. Mobile games require real-time framerates without placing too much of a burden on the battery of a handheld device like a phone or tablet; they have to be energy efficient to maximize battery life. Over 100 thousand publishers publish mobile game apps, and over 2.4 billion people play mobile games [47].

Computer Graphics (CG) is the study of digitally synthesizing visual content. The generation of images on the computer dates back to the invention of Cathode-Ray tubes [87] and the digital display. The very first graphical applications consisted of the lighting of a single digitally controlled Cathode-Ray tube. The first interactive video game was a virtual tennis game invented by physicist William Higinbotham [18].

The field of Computer Graphics has since come a long way, most notably due to pioneers like Ivan Sutherland and David Evans. Sutherland won the Turing Award for his work on computer graphics such as Sketchpad [75]; Evans innovated the field with his work on UNIX timesharing [27] and the display as we know it today [92].
Arguably, the two most impactful advancements in graphics algorithms can be attributed to Ed Catmull for his work on the rasterization Z-buffer algorithm [17], and to the invention of the ray-tracing algorithm by Whitted [89] and Kay and Kajiya [36]. We discuss the Z-buffer algorithm in Section 2.1; ray-tracing is discussed in Section 2.2. These algorithmic advances have enabled applications such as animated movies and complex interactive video games, bringing joy to millions of people daily. Furthermore, graphical applications have contributed to a variety of fields outside of computer science such as medicine, data science, archaeology, and linguistics.

Rendering images using modern computer graphics algorithms is compute and memory intensive, and efficient ways to process images have been heavily researched. Dedicated hardware to accelerate graphics algorithms has been researched alongside algorithmic advances since the inception of graphics. Nowadays, dedicated graphics hardware is ubiquitous in most consumer-facing devices such as smartphones, laptops, and desktops, and even in data centers and supercomputers. Graphics Processing Units (GPUs), when used for graphics, almost exclusively accelerate the Z-buffer rasterization algorithm, which is used in all modern real-time applications such as video games. GPUs have recently gained popularity in the acceleration of machine learning applications, but they nonetheless devote a significant portion of their die area to specialized graphics hardware [57].

The complexity of rendering gives rise to a fundamental tradeoff in computer graphics: quality vs. speed. The highest-quality images in computer graphics are achieved with state-of-the-art commercial path tracers such as Pixar's Renderman [19] or Disney's Hyperion [16]. The render times of high-quality, movie-grade scenes are on the order of a day and can sometimes exceed weeks per frame.
A cluster of computers computes each frame over multiple days and returns the result, a technique known as offline rendering. Path tracers, which rely on the ray-tracing algorithm, are becoming increasingly complex; novel research into rendering advanced visual effects such as transparent and translucent volumes, fur, and global illumination adds complexity and compute time to each frame. Game engines, on the other hand, have a hard requirement on runtime. For a game to appear reactive, smooth, and enjoyable to the player, a framerate of 60 Frames Per Second (FPS) has to be enforced. This is achieved by reducing the quality of the image and by applying simplifications and approximations of complex visual effects. Rendering at interactive framerates is known as online rendering. Most online renderers, such as the Unity Engine [76] and the Unreal Engine [68], rely on the rasterization algorithm.

As a general rule for runtime complexity, one can approximate ray-tracing to scale linearly with the number of samples used and rasterization to scale linearly with the number of primitives in the scene.

Hybrid rendering is a recently emerged field of research that seeks to combine both raytracing and rasterization in real-time applications. It aims to add high-quality raytraced visual effects to rasterized frames to achieve superior image quality at real-time frame rates; it is discussed extensively in Section 2.3. Most of the work in hybrid rendering focuses on algorithmic improvements [7]; however, very little academic research has been done on integrating raytracing-specific hardware acceleration into rasterization-based GPUs.

1.1 Contributions

The popularity of video games motivates us to research new ways to improve the quality of the gaming experience. Gamers enjoy high visual quality of the rendered images at fast and smooth framerates.
Adding the constraint of energy conservation for hand-held devices, real-time graphics poses a difficult challenge for both computer software and hardware architects.

In this thesis, we explore the bottlenecks of real-time raytracing by focusing on current GPU architectures. We resist the temptation to consider specialized raytracing architectures and rather seek to integrate a limited version of raytracing hardware support into rasterization-based accelerator chips, following the hybrid rendering paradigm.

In this thesis we present:

• An open-source implementation of a Ray-Traced Ambient-Occlusion (RTAO) benchmark for use as a hybrid rendering workload
• The first evaluation of the memory behavior of current Single Instruction Multiple Threads (SIMT) compute raytracing of the highly divergent RTAO benchmark on current GPU architectures
• A novel approach to accelerate Bounding Volume Hierarchy (BVH) traversal by prediction; we show a possible reduction of 40% in interior node computations in theory, but we conclude that the feasibility of our approach is low due to significant memory overhead

1.2 Outline

We discuss relevant background information in Chapter 2: the differences between rasterization graphics, ray tracing, and hybrid rendering, as well as the current state of the art in GPU raytracing and recent advancements in real-time GPU raytracing. We introduce the Ambient Occlusion effect in Section 2.5.

Chapter 3 introduces our GPU ray-traced ambient occlusion benchmark.
We document the setup of the benchmark and the relevant code snippets in Section 3.1, we then evaluate the performance of the benchmark in Section 3.2, and finally we discuss the memory behavior results from a cycle-level GPU simulator in Section 3.3.

Chapter 4 introduces our attempt at a new hardware accelerator to improve both the performance and the energy efficiency of ray tracing applications. Hash-Based Ray Path Prediction (HRPP) aims to skip computation by performing a lookup; the employed method is introduced in Section 4.3. We evaluate our approach in Section 4.5 and show that it is not suited for current GPU architectures due to significant memory overhead.

We conclude this thesis in Chapter 5 by summarizing our findings and pointing to future work that remains to be done on the topic of real-time hybrid rendering.

Chapter 2

Background

2.1 Rasterization

Real-time graphics applications such as video games rely on Catmull's Z-buffer algorithm [17], also known as the rasterization algorithm or the depth-buffer algorithm. The Z-buffer algorithm converts 3D primitives such as triangles into 2D points on the screen. This is done by linear transformations such as the Model-View-Projection (MVP) matrix and the screen-space transformation. Once the primitives are in screen space, their surface is attributed to each pixel and the closest fragment is stored in the Z-buffer. The closest fragment is then shaded to determine the final color of the pixel.

Figure 2.1: Rasterization graphics projects the primitives onto the screen using the orthographic Model View Projection (MVP) matrix. (a) 3D representation of the scene and its projection onto the display; (b) 2D representation of the final scene on the display.

Figure 2.1 illustrates a simplified version of the rasterization algorithm. The scene that a user wants to render is given as a list of triangle primitives, also known as a triangle mesh. Each primitive is then projected by the MVP matrix.
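For concreteness, the projection step can be sketched in a few lines of Python. This is an illustrative sketch only, with names of our own choosing, not code from the thesis (in practice this transform runs in the vertex shader on the GPU): a point is multiplied by the 4×4 MVP matrix in homogeneous coordinates, divided by w, and mapped from normalized device coordinates to pixel coordinates.

```python
def project_to_screen(p, mvp, width, height):
    """Apply a row-major 4x4 MVP matrix to point p = (x, y, z), perform the
    perspective divide, and map normalized device coordinates to pixels."""
    x, y, z = p
    # Homogeneous transform: matrix times (x, y, z, 1).
    clip = [sum(mvp[r][c] * v for c, v in enumerate((x, y, z, 1.0)))
            for r in range(4)]
    ndc = [clip[i] / clip[3] for i in range(3)]   # perspective divide -> [-1, 1]
    sx = (ndc[0] * 0.5 + 0.5) * width             # viewport transform
    sy = (1.0 - (ndc[1] * 0.5 + 0.5)) * height    # screen y grows downward
    return sx, sy, ndc[2]                         # ndc z feeds the Z-buffer

IDENTITY = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
# With an identity MVP matrix, the NDC origin lands at the screen centre.
sx, sy, depth = project_to_screen((0.0, 0.0, 0.0), IDENTITY, 640, 480)
```

A real renderer would build `mvp` as the product of the model, view, and projection matrices and run this per vertex rather than per point.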
The closest fragment for each pixel is determined and shaded to be displayed on the screen, as shown in Figure 2.1a. In a step called rasterization, each pixel on the screen that is affected by the projected triangles is shaded according to the triangle's specification. The shading operation takes the lighting, material, and texture properties of the corresponding primitive into account. The final result is then written to the framebuffer, as shown in Figure 2.1b. Because of the properties of the display, each pixel on the screen can only have one color, which gives rise to unwanted artifacts like aliasing, e.g. the rugged edge of the triangle in Figure 2.1b.

Figure 2.2: The OpenGL pipeline as implemented on most dedicated graphics processing units [33, 37]. The stages are: Vertex Buffer (input), Vertex Shader (programmable), Tessellation Control Shader (programmable), Tessellation Primitive Generation (fixed-function), Tessellation Evaluation Shader (programmable), Geometry Shader (programmable), Rasterization (fixed-function), and Fragment Shader (programmable), whose output goes to the screen.

Figure 2.2 illustrates the sequential steps the geometry goes through in the rasterization pipeline as implemented on modern GPUs. The scene, consisting of triangle primitives, is passed as the input. The rasterization pipeline operates on each triangle independently, making the processing of triangles independent and therefore parallelizable.

The input to the rasterization pipeline is a list of primitives; triangles are most common, but lines and quads are supported. Each primitive is transformed in the programmable vertex shader; the MVP projection to screen space happens in this stage. The programmer supplies a programmable shader as a code snippet, which is then compiled and executed on the GPU cores. Shader code is C-like; OpenGL's shader language is GLSL. After each vertex has passed through the vertex shader, the optional tessellation stage breaks up large triangles and other primitives into smaller, more manageable triangles, which increases temporal and spatial locality during rendering.
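The closest-fragment selection described at the start of this section, the Z-buffer test, amounts to a compare-and-store per fragment. The following is a minimal illustrative Python sketch of that bookkeeping (on real GPUs this is handled by fixed-function depth-test hardware, not software):

```python
import math

WIDTH, HEIGHT = 4, 4
depth_buffer = [[math.inf] * WIDTH for _ in range(HEIGHT)]   # "infinitely far"
frame_buffer = [[(0, 0, 0)] * WIDTH for _ in range(HEIGHT)]  # black screen

def write_fragment(x, y, z, color):
    """Z-buffer test: keep a fragment only if it is closer than what the
    depth buffer already holds for this pixel."""
    if z < depth_buffer[y][x]:
        depth_buffer[y][x] = z
        frame_buffer[y][x] = color

write_fragment(1, 1, 0.8, (255, 0, 0))  # far red fragment: written
write_fragment(1, 1, 0.3, (0, 255, 0))  # closer green fragment: overwrites red
write_fragment(1, 1, 0.5, (0, 0, 255))  # occluded blue fragment: rejected
```

Regardless of the order in which fragments arrive, the pixel ends up holding the nearest one, which is what makes per-triangle processing order-independent.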
Furthermore, tessellation can add geometrical complexity to models. Tessellation is a relatively new addition to OpenGL and happens in 3 distinct stages, some programmable and some fixed-function. The geometry shader is a rarely used shader stage that allows for procedural on-the-fly geometry generation. After the geometry stage, the rasterization stage is executed. Rasterization breaks up large triangles into fragments; each fragment is assigned to one pixel on the screen. Finally, the fragments are shaded in the fragment shader, which assigns a color to each fragment based on lighting calculations and often multiple texture lookups. The bulk of the work in real-time game engines is done in this stage. Fragment shaders are often complex, and each commercial game requires a large amount of fragment shader code to account for all the variations in lighting, models, and textures. For a more in-depth discussion of the OpenGL pipeline we refer to [33, 37].

The independence of primitives throughout the pipeline is the key factor in the success of rasterization. Each primitive can be computed independently of all other primitives. The only point of synchronization in the rasterization pipeline is the framebuffer: multiple fragments may want to write to the same entry in the framebuffer and need to be either blended or rejected based on depth (Z-value). Framebuffer synchronization is handled in hardware in distinct units called early-Z rejection and late-Z rejection [33]. Hardware synchronization abstracts these constraints away from the programmer and ensures parallelization, which in turn guarantees near-full GPU utilization and high render speeds.

Due to the independence of primitives, the rasterization algorithm lends itself well to parallelization and fits the Single Instruction Multiple Data (SIMD) compute model of the GPU. The architecture of the GPU is built around many thousands of in-order cores.
The cores do not execute out-of-order like most Central Processing Unit (CPU) cores do; they are therefore weaker and have higher latency than CPU cores. The GPU, however, is optimized for throughput. During rasterization the cores of the GPU can be kept busy: GPUs can save OpenGL contexts and switch between them to hide the latency of memory fetches. Rasterization is efficient in terms of memory reuse and therefore utilizes the cache hierarchy of the GPU well, exploiting both spatial and temporal locality in texture fetches. Two neighboring fragments are likely to fetch from the same texture and therefore generate highly localized texture fetches, which result in cache hits and thus fast performance [21].

GPU architecture has been adapted for the rasterization pipeline with many dedicated hardware blocks such as clipping, culling, early-Z, and late-Z blocks, in addition to the tessellation and rasterization blocks [33]. The vertex, fragment, and compute shaders run on the SIMT cores of the GPU.

While the rasterization algorithm can render vast worlds at real-time speed, it suffers from restrictions. True Global Illumination (GI) is impossible to solve in the rasterization paradigm. Global illumination is the calculation of emitted light at each point in space, taking into account effects such as shadowing, translucency, and transparency. As per the rendering equation introduced in Section 2.2, true global illumination is computed over every possible light direction and therefore depends on potentially every other primitive in the scene. This violates the independence-of-primitives assumption rasterization relies upon.

Game developers have come up with ways to work around the shortcomings of the rasterization algorithm while maintaining real-time performance. Techniques such as light-maps [59], cascaded shadow-maps [44], and screen-space effects [55] have been introduced.
Screen-space effects are localized effects that are only applied to the geometry visible on the screen. Oftentimes they are implemented as post-processing effects that operate on the framebuffer after it has been composed by the rasterization pipeline. In this thesis, we discuss one such screen-space effect in great detail: Ambient Occlusion [94]. Screen-space effects sacrifice valuable render time yet fail to account for any information beyond the screen, which results in unrealistic effects. Despite these efforts, the visual quality of images rendered by rasterization lags behind the quality of images rendered via the ray-tracing algorithm.

2.2 Ray-Tracing

Ray-tracing is a rendering algorithm first introduced by Whitted [89] and then refined by Kay and Kajiya [36]. Ray-tracing seeks to imitate the physical properties of light by following light paths which start at a light source and illuminate the scene. The physical properties of light, as described by Pharr and Humphreys [63], make this a complex process. Light refracts and diffracts depending on surface properties, light scatters in many directions, and sometimes objects even occlude themselves. Advanced effects such as shadowing, volumetrics, and sub-surface scattering have to be taken into account when modeling light transport.

In practice, ray-tracing is done by generating a ray per pixel, originating at the camera and directed into the scene, as shown in Figure 2.3. A ray is defined by an origin point O in 3D space and a direction vector d such that

P = O + t·d    (2.1)

where t is the ray extent and P is any point along the ray. The ray extent is bound by tmin ≤ t ≤ tmax, which are properties of the ray.

Rays imitate the movement of photons as they are traced in a straight line. Primary rays originate at the camera and are traced into the scene as shown in Figure 2.3. The intersection of the ray and the geometry is determined, and then secondary rays are spawned. The direction of the secondary rays depends on the
The direction of the secondary rays depends on the properties of the material the primary ray has intersected with and on the probability distribution function that is used, as described by Pharr and Humphreys [63]. Secondary rays are then intersected with the geometry of the scene before spawning further rays. A ray can be terminated when a maximum number of bounces has been reached, when the ray misses all geometry and leaves the scene, or when the ray is stochastically terminated.

Shadow rays are a type of secondary ray used to determine the shadowing effect in ray-tracing. At each intersection with geometry, shadow rays (not shown in Figure 2.3 for simplicity) are spawned. Shadow rays determine a direct connection between a surface point and a light source; the result is used in the lighting calculation for the shadow effect. If the path between a light source and a surface point is unobstructed, the point is lit. If the path between a light source and a surface point is obstructed, the point is shaded.

Figure 2.3: Raytracing traces rays into the scene which are intersected with the geometry and launch secondary rays. (a) 3D representation of the scene and the propagation of 2 sample rays and 2 corresponding secondary rays. (b) Result of the 2 rays on the display.

Figure 2.4: Bounding Volume Hierarchy (BVH) traversal and intersection: 1 ray generation from the ray pool, 2 interior node intersection, 3 leaf node intersection.

Every ray that is traced in the scene needs to determine where it intersects the geometry of the scene. This is referred to as "traversal and intersection", which is illustrated in Figure 2.4.

Traversal and intersection without an acceleration structure, as described in [89], checks each primitive for an intersection with the ray. The intersection hit points are then sorted based on their distance to the camera to choose the closest hit point. The runtime of this algorithm is O(n), where n is the number of primitives in the scene.
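The naive O(n) approach can be sketched in a few lines. The following is a hedged illustration, not code from any raytracer discussed here; spheres stand in for scene primitives to keep the per-primitive intersection routine short:

```cpp
#include <cmath>
#include <optional>
#include <vector>

struct Vec3 { float x, y, z; };
struct Ray { Vec3 o, d; float tmin, tmax; };   // origin, unit direction, extent
struct Sphere { Vec3 c; float r; };

// Hit distance t of a ray against one primitive, or nothing on a miss.
std::optional<float> intersect(const Ray& ray, const Sphere& s) {
    Vec3 oc{ray.o.x - s.c.x, ray.o.y - s.c.y, ray.o.z - s.c.z};
    float b = oc.x * ray.d.x + oc.y * ray.d.y + oc.z * ray.d.z;
    float c = oc.x * oc.x + oc.y * oc.y + oc.z * oc.z - s.r * s.r;
    float disc = b * b - c;                    // assumes |d| == 1
    if (disc < 0.0f) return std::nullopt;
    float t = -b - std::sqrt(disc);            // nearer root first
    if (t < ray.tmin) t = -b + std::sqrt(disc);
    if (t < ray.tmin || t > ray.tmax) return std::nullopt;
    return t;
}

// Traversal and intersection without an acceleration structure:
// test every primitive and keep the closest hit -- O(n) per ray.
std::optional<float> closestHit(const Ray& ray, const std::vector<Sphere>& prims) {
    std::optional<float> best;
    for (const Sphere& s : prims)
        if (auto t = intersect(ray, s))
            if (!best || *t < *best) best = t;
    return best;
}
```

Keeping a running minimum avoids an explicit sort of the hit points; the result is the same closest hit.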
A more efficient way to compute the traversal and intersection operation is to use a traversal acceleration structure such as a Bounding Volume Hierarchy (BVH) [32]. The BVH is a spatial tree data structure that is built around the scene and that is composed of ever-tighter bounding volumes. Other traversal structures such as kd-trees [31] and octrees [2] are possible but are not widely used for GPU raytracing. Kd-trees show better traversal speed than BVH trees but have a large construction overhead [80].

The root of the BVH tree encompasses the entire scene; the leaves of the tree are composed of a small number of scene primitives. The interior nodes are composed of water-tight axis-aligned bounding boxes [63]. The BVH needs to be constructed only once per scene. Traversing the BVH has runtime O(log(n)), where n is the number of primitives in the scene.

Figure 2.4 shows the traversal and intersection algorithm using a BVH. During traversal and intersection each ray follows three steps: 1 ray generation, 2 interior node intersection and 3 leaf intersection.

1 illustrates the process of ray generation. Each primary ray originates at the camera and is given a direction and an extent. When rays are traced they spawn new rays such as shadow rays and secondary rays. In GPU raytracers [3, 93] the rays are generated and stored in a ray pool in global memory. The ray pool oftentimes contains over a million rays that are waiting to be traced [46]. The rays are loaded from memory in 1; they are either packaged into warps on the GPU or into ray packets on the CPU before the BVH traversal starts. It is possible to sort rays before packaging them. Sorting rays is essential to assure good performance and low divergence during traversal [30, 93].

2 illustrates the interior node traversal. The root node encompasses the entire scene and is assigned a watertight Axis-Aligned Bounding-Box (AABB). Watertightness [91] assures that no intersections are missed.
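The per-node test in step 2 is commonly implemented with the slab method. The sketch below (plain C++, one possible routine in the spirit of the approach discussed by Ericson [26]; not the thesis's code) returns only a boolean, since the exact hit point is not needed during interior-node traversal:

```cpp
#include <algorithm>

// An AABB stored as two opposing corners (6 floats).
struct AABB { float min[3], max[3]; };

// Slab-method ray/AABB overlap test. The ray is o + t*d with
// t in [tmin, tmax]; each axis slab shrinks [tmin, tmax], and the
// ray hits the box iff the interval stays non-empty.
bool intersectAABB(const float o[3], const float d[3],
                   float tmin, float tmax, const AABB& box) {
    for (int axis = 0; axis < 3; ++axis) {
        // IEEE infinity handles d[axis] == 0 for origins off the slab planes.
        float inv = 1.0f / d[axis];
        float t0 = (box.min[axis] - o[axis]) * inv;
        float t1 = (box.max[axis] - o[axis]) * inv;
        if (inv < 0.0f) std::swap(t0, t1);   // ray travels toward -axis
        tmin = std::max(tmin, t0);
        tmax = std::min(tmax, t1);
        if (tmax < tmin) return false;       // slabs no longer overlap
    }
    return true;
}
```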
AABBs store the bounding box information in as few as 6 floating-point values: (xmin, ymin, zmin) and (xmax, ymax, zmax), which represent the opposing corner vertices of the bounding box. Intersecting a ray with an AABB is a trivial operation and can be done in far fewer instructions than intersecting a ray with a triangle. The implementation of one possible ray-AABB intersection routine is discussed by C. Ericson in [26]. During intersection it is only determined whether the ray intersects the AABB; the exact intersection point is not relevant, which further simplifies the computation. If a ray intersects an AABB, the ray is tested against the node's children. If the node's children are interior nodes, step 2 is repeated. If one of the children is a leaf, the ray moves to step 3. Early rejection of a ray is common in traversal and intersection. If a ray fails to intersect any AABB at a given level of the BVH tree, it is guaranteed that the ray does not intersect any geometric primitives in the scene and the ray misses the scene.

3 illustrates the leaf intersection. Once a ray has passed to the leaf nodes, the intersection step happens. The leaf nodes contain geometric primitives from the scene. During BVH construction, a fixed number of primitives are packaged into each leaf node. The rays that intersect a leaf need to compute the intersection point with each primitive in the leaf. An algorithm to intersect a ray with a triangle is discussed by C. Ericson [26]. If multiple intersections are found, the closest intersection is returned. This requires an additional sorting step per leaf node as well as an additional sorting step across leaf nodes.

There are two types of rays during the raytracing operation: closest-hit rays and any-hit rays. Closest-hit rays are used for primary and secondary rays; these rays need to find the closest intersection to the ray origin and therefore require ordered traversal through the BVH.
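The contrast between the two query types comes down to the loop's exit policy. A hedged sketch, with a hypothetical `hit` callback standing in for a per-candidate intersection test (not the API of any raytracer discussed here):

```cpp
#include <functional>

// Any-hit query (e.g. shadow rays): unlike a closest-hit query, which
// must examine every candidate to find the minimum t, an any-hit query
// may exit on the first intersection found and needs no sorting.
bool anyHit(int numCandidates, const std::function<bool(int)>& hit) {
    for (int i = 0; i < numCandidates; ++i)
        if (hit(i))
            return true;   // early out: which hit is closest is irrelevant
    return false;
}
```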
Any-hit rays, on the other hand, simply determine whether the ray intersects the scene or not. They do not require finding the closest intersection and omit the sorting step required for closest-hit rays.

The biggest quality knob in ray-tracing applications is the Samples Per Pixel (SPP) count. Tracing a single ray per pixel only explores one possible light path, which produces a noisy and incomplete image. To decrease the amount of noise and increase the visual accuracy of the rendered image, a higher SPP count is chosen. Production movie renderers generally trace 2048 SPP [20], while the smallest real-time ray-tracing applications trace as little as 1/4 SPP [7].

The strength of ray-tracing is that it accounts for truly global light transport. Unlike the screen-space approximations discussed in Section 2.1, ray-tracing does not suffer from excluding information that is not visible on the screen, yielding higher visual quality [50]. Each ray can potentially touch every aspect of the scene, in no predetermined order. This is a challenge for current memory systems: ray-tracing does not exploit temporal and spatial locality.

Two rays are considered coherent if they have similar origins and directions and are traced at the same time. Coherent rays exploit the cache hierarchy; they touch similar geometry with similar materials and similar textures. Divergent rays are rays that have distinct origins and distinct directions; such rays pose a significant performance burden because the current memory systems of CPUs and GPUs are not well suited to accommodate them [23].

2.3 Hybrid Rendering

Hybrid rendering combines ray-tracing and rasterization graphics techniques to generate visually accurate photorealistic computer-generated images at a tight real-time frame rate. Raytracing (Section 2.2) excels at computing photorealistic images by allowing for reflection and refraction rays. Raytracing enables true global illumination in computer-generated images by imitating how light propagates in the real world.
However, tracing rays is very expensive both in terms of energy consumption and render time [40].

Rasterization (Section 2.1), on the other hand, sacrifices some visual quality for render speed. Rasterized images achieve a relatively high degree of visual accuracy in real-time. Rasterization is primarily used in the video game industry, where frame rates of 60 frames per second are needed to achieve smooth gameplay. Raster-based rendering has benefited from the custom-tailored, ever-evolving hardware architecture of Graphics Processing Units (GPUs).

Recent advances in GPU architecture have incorporated ray-tracing elements into current rasterization-based GPUs [43, 60]. Graphics Application Program Interfaces (APIs) such as NVIDIA's OptiX [60] or Microsoft's DirectX Ray-tracing (DXR) expose raytracing capabilities to real-time graphics applications. While the current state of GPU hardware does not yet allow full images to be raytraced in real-time [3], there appears to be an industry trend towards combining some form of partial ray-tracing with rasterization-based graphics [57]. This combination is typically referred to as hybrid rendering. To date, very little academic research has explored the design space of hybrid rendering.

One foray into hybrid rendering was done by EA's Project PICA PICA [7]. Raytracing effects such as transparency, translucency, shadows, ambient occlusion, and global illumination were implemented on top of a state-of-the-art commercial game engine. Unfortunately, the implementation is closed source and heavily relies on Microsoft's closed-source DXR API. While PICA PICA describes how to implement hybrid raytracing algorithmically, it does not allow for detailed profiling and is, therefore, ill-suited as a tool for academic research.

2.4 Real-Time Raytracing

Real-time raytracing has been the holy grail of graphics since the invention of ray-tracing but remains unachieved.
In this section, we document previous efforts to introduce real-time raytracing and their shortcomings.

2.4.1 CPU Ray-Tracing

CPUs are the default processing unit for raytracing. Open-source raytracers such as PBRT [63] and Mitsuba [56] rely solely on CPUs and do not provide a GPU backend. They are nonetheless parallelized by taking advantage of multiple CPU cores and the SSE extensions to the x86 instruction set. CPUs are widely used in the movie and VFX industry because of their programmability and ease of use.

Commercial frameworks such as Intel's OSPRay [85] take advantage of the SIMD nature of the CPU cores to improve raytracing runtime. OSPRay is heavily used as a tool for visualization, while Intel's Embree API [84] provides the state-of-the-art in CPU raytracing with near-real-time frame rates for simple scenes.

Wald et al. [82] speed up CPU raytracing by combining rays into coherent ray-packets [34] which have similar origins and directions and map to the SSE lanes of the CPU. This technique takes advantage of the limited parallelism of the CPU to trace coherent rays and to reduce the memory overhead from fetching scene data. Rays become incoherent as they bounce around the scene, which is detrimental to the performance of packet tracing. Schemes that dynamically break and reorganize packets have been proposed by Boulos et al. [12].

Another avenue of active research on CPU raytracing is raytracing on a cluster [35, 48]. A cluster is a combination of multiple compute nodes. The ray-tracing computation is split up among the nodes and the final result is combined once all the nodes have completed their respective chunk of the overall workload. Ray-tracing lends itself well to parallelization.
Rays are independent of each other. Nonetheless, splitting up the work among cluster nodes is non-trivial: memory overhead and compute overhead between nodes are the bottleneck for performance [10, 54, 81].

2.4.2 GPU Compute Ray-Tracing

Purcell and Hanrahan [65] were the first to parallelize raytracing applications on a stream processor. With the rise of programmable GPU cores [66], raytracing speed on GPUs surpassed raytracing speed on CPUs for the first time [61, 66].

GPU compute gave rise to two distinct approaches to parallelizing ray-tracing: ray-to-thread mapping and thread-to-node mapping.

The ray-to-thread mapping maps one ray to each thread and was popularized by Aila et al. [3, 4]. GPUs operate on thread groups, or warps, of 32 threads. Aila et al. propose to map 32 threads into one warp and let them compute the traversal and intersection computation in parallel. In their work, the acceleration structure is chosen to be a binary BVH. The implementation relies on CUDA compute kernels with persistent threads. This approach was later extended by Ylitie et al. [93], who show significant performance improvements by increasing the branching factor of the BVH to eight. Furthermore, Ylitie et al. propose a mechanism for dynamically fetching new work which better load-balances the GPU compute units.

The thread-to-node mapping was introduced by Garanzha and Loop [30] and relies on each thread intersecting one node of the acceleration structure. The control code in this approach becomes increasingly complex as the traversal proceeds through the tree [93]. Furthermore, this approach requires front-to-back sorting of the child bounding boxes to assure correctness.

The efficient mapping of the acceleration structure used in raytracing to GPU memory is an active area of research. Kd-trees gained popularity for raytracing in early research [29], but lately BVH trees have dominated [4]. Ylitie et al. [93] greatly compressed the traffic associated with interior nodes.
Nonetheless, GPU raytracing remains memory bound for large scenes due to high demands for scene memory traffic and acceleration structure traffic, as demonstrated in a later chapter.

2.4.3 Custom Raytracing Hardware

Various custom Field-Programmable Gate Array (FPGA) and Application-Specific Integrated Circuit (ASIC) implementations of custom raytracing chips have been proposed.

Schmittler, Wald, and Slusallek propose SaarCOR [69], a custom hardware architecture that uses SIMD-like lanes to accommodate ray-packets. Woop, Schmittler, and Slusallek extend this work and introduce the RPU [90], a custom CPU/GPU co-designed FPGA chip running at only 66 MHz that renders images at up to 20 frames per second.

Nah et al. propose the T&I Engine [53], a custom pipeline designed for mobile phones that computes the traversal and intersection step of the kd-tree traversal. The T&I Engine proposes a ray accumulation unit that acts as a buffer for rays before scheduling coherent rays. Their technique allows for effective memory latency hiding.

Barringer, Andersson, and Akenine-Möller propose Ray-Accelerator [8], a custom chip that combines the CPU and the GPU into one integrated circuit. Communication between the CPU and the GPU is facilitated using shared memory. The CPU is responsible for sorting the rays before sending batches of rays to the GPU for coherent bulk shading.

Recent research work on real-time raytracing has proposed a custom Multiple-Instruction Multiple-Data (MIMD) compute paradigm [38]. MIMD does not suffer from the lock-step execution restrictions of SIMD compute architectures. The MIMD paradigm has shown some promise in reducing the energy requirement for raytracing by scheduling and pre-fetching memory accesses [71, 73, 78]. Shkurko et al. [71] propose a MIMD architecture that allows for perfect pre-fetching of both ray input data and acceleration structure data, thereby effectively eliminating data stalls.
The closed-source nature of the SimTRaX simulator [72] used for these works makes this avenue of research more difficult to pursue.

A different avenue of research within the domain of custom hardware acceleration has been BVH construction. Kopta et al. [39] propose an extension to existing architectures that allows for fast BVH updates in dynamic scenes. Viitanen et al. [79] show a promising FPGA approach to building the BVH from scratch and updating it for dynamic elements at real-time framerates.

While the performance and the power efficiency of these custom implementations largely exceed those of GPUs and CPUs, they have not yet found widespread adoption in commercial applications.

2.5 Ambient Occlusion

Ambient Occlusion (AO), the measure of how exposed each point is to ambient lighting, is a technique heavily used in computer graphics. It is a vital visual effect in both real-time applications such as video games and offline rendering applications such as movie frames [62]. AO was first introduced by Zhukov et al. [94] and later popularized in offline rendering by the movie industry [20, 63].

A(P, n) = (1/pi) * integral over Ω of V(P, ω) (ω · n) dω (2.2)

Equation 2.2 [94] shows the mathematical expression of the AO value at a point P with surface normal n. Here, P is a point on the surface of an object in the scene and n is the normal vector of the surface at P. V(P, ω) is the visibility function: it has a value of zero when no geometry is visible in direction ω, and one otherwise. The integral over Ω is taken over the hemisphere oriented according to the surface normal n. Evaluating the expression in Equation 2.2 exactly is not possible on existing computers because one cannot integrate over an infinite number of directions; therefore, Equation 2.2 is approximated by sampling stochastically using Monte Carlo sampling [41, 63].

Figure 2.5 shows the impact of the AO effect on the rendered image of a human face.
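In code, the Monte Carlo approximation of Equation 2.2 reduces to an average of visibility samples. A hedged sketch (with hypothetical `sampleDir` and `occludedIn` callbacks standing in for the sampler and for a ray cast; not code from any renderer discussed here): drawing directions from a cosine-weighted hemisphere distribution, whose density is cos(theta)/pi, makes the (ω · n) factor and the 1/pi normalization cancel.

```cpp
#include <functional>

// Monte Carlo estimate of the AO integral in Equation 2.2 with k
// cosine-weighted samples. Following the text's convention, V = 1 when
// geometry occludes direction w, so A = 1 means fully occluded and
// A = 0 means fully exposed to ambient lighting.
float estimateAO(int k,
                 const std::function<void(float w[3])>& sampleDir,
                 const std::function<bool(const float w[3])>& occludedIn) {
    float sum = 0.0f;
    for (int i = 0; i < k; ++i) {
        float w[3];
        sampleDir(w);                    // cosine-weighted direction about n
        if (occludedIn(w)) sum += 1.0f;  // V(P, w) = 1: an occluder was hit
    }
    return sum / k;                      // cos and 1/pi cancel with the pdf
}
```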
The images in Figure 2.5 were rendered using Autodesk's commercial renderer [5], which relies on the rasterization algorithm.

Figure 2.5: The effect of ambient occlusion on a human face [5]. The left image of Figure 2.5a shows the model without the ambient occlusion effect; the face appears flat and unrealistic. The right image of Figure 2.5a shows the same face with the ambient occlusion effect enabled; the face appears vibrant and more realistic compared to the left image. Figure 2.5b shows the ambient occlusion effect that is applied in Figure 2.5a.

As shown in Figure 2.5a, computer-generated images appear flat and unrealistic to the human eye without AO. Ambient occlusion imitates a natural light phenomenon that can be observed all around us; good examples of the presence of ambient occlusion are the creases on a human's forehead or the folds below a human nose, as shown in Figure 2.5a. These naturally occurring troughs prevent ambient lighting from entering them and therefore appear darker in color than points on a flat surface that are exposed to direct ambient lighting. Figure 2.5b shows the map of the ambient occlusion effect that is applied in Figure 2.5a.

Raytracing applications do not need to compute ambient occlusion separately. The natural propagation of light gives rise to the AO phenomenon; raytracing imitates the natural propagation of light and therefore renders the AO effect naturally, as part of the rendering process. Rasterization does not naturally take AO into account, so real-time graphics developers have resorted to adding the AO effect as a post-processing step to increase the realism of the final output.

Real-time ambient occlusion has been a much-researched topic in recent years [9, 51, 67, 77]. The 60 fps requirement often found in the video game industry created the demand for real-time ambient occlusion.
Techniques such as Screen-Space Ambient Occlusion (SSAO) [74] and Screen-Space Directional Occlusion (SSDO) [86] have found wide usage in video games. These effects are, however, performance-oriented approximations of the true ambient occlusion effect that operate on the Z-buffer rather than on the full geometry of the scene. Figure 2.6a shows an example of the SSAO output on the Sponza scene as rendered by Unity [76].

The highest visual quality of ambient occlusion is achieved using Ray-Traced Ambient Occlusion (RTAO) [41]. RTAO is a post-processing effect sometimes used in the latency-insensitive world of movies and offline rendering but is generally considered too computationally expensive for real-time applications. RTAO accounts for the entire scene, making AO a global effect, computed only for the fragments visible on the screen. Figure 2.6b shows an example of the RTAO output on the Sponza scene as rendered by Unity [76]. Figure 2.6 shows the subtle difference between the SSAO and the RTAO effect.

Figure 2.6: The difference between Screen-Space Ambient Occlusion and Ray-Traced Ambient Occlusion on the Sponza scene as rendered by the Unity Game Engine [76].

Figure 2.7: A simplified example of Ray-Traced Ambient Occlusion showing three levels of occlusion of ambient lighting. 1 is unoccluded, 2 is partially occluded and 3 is heavily occluded.

Figure 2.7 shows a simplistic 2D example of RTAO. Ambient occlusion is a per-surface-point effect and has to be computed for each visible surface point. Figure 2.7 illustrates 3 distinct examples of such surface points. 1 is a point on a flat surface; no geometry occludes 1 and therefore 1's ambient color should not be attenuated during rendering. RTAO determines the absence of any AO attenuation at 1 by shooting a given number (k) of rays originating at the surface with a given ray extent (tmin, tmax).
The ray directions roughly follow the surface normal but deviate according to a stochastic distribution, as described by Barré-Brisebois et al. in [7]. The rays originating at 1 do not intersect any surrounding geometry; therefore, 1 is unoccluded.

2 is partially occluded and its ambient color should be partially attenuated. RTAO determines this partial occlusion by shooting k rays with 2 as their origin. As illustrated, half the rays hit surrounding geometry, e.g. the wall to the right of 2.

3 is completely occluded and its color should be heavily attenuated. RTAO determines this from the large majority of the rays originating at 3 that intersect the surrounding geometry.

Chapter 3

Ray-Traced Ambient Occlusion GPU Memory Study

The Ambient Occlusion (AO, Section 2.5) effect is crucial to visual quality in rasterization graphics, where it is implemented as a post-processing effect. We study one such implementation, Ray-Traced Ambient Occlusion (RTAO), for the following reasons:

• RTAO yields the highest visual quality by imitating the true physics of light particles; RTAO is a global effect and does not suffer from the limitations of screen-space effects.

• AO is a post-processing effect; therefore, it is trivial to replace existing effects such as SSAO in rasterization-based game engines with RTAO. RTAO is an active area of research for hybrid rendering ([7], Section 2.3).

• RTAO is a relevant benchmark when studying divergent rays on a GPU simulator such as GPGPU-Sim [6] because AO generates viewpoint-independent, highly divergent rays which do not need to bounce.

• We do not limit our RTAO to screen-space as most rasterization engines do. We ray trace the entire model for a flexible, viewpoint-independent workload representative of truly global raytracing effects such as global illumination.

In this chapter, we first introduce our open-source implementation of a viewpoint-independent RTAO GPU compute benchmark (Section 3.1). We show the visual output in Section 3.2.
We then analyze the GPU compute runtime, assessing the feasibility of real-time hybrid RTAO effects, and study the benchmark's memory behavior using GPGPU-Sim (Section 3.3).

3.1 Ray-Traced Ambient Occlusion Benchmark

Ambient occlusion is omnipresent in rendering engines. Rasterization-based engines like Unreal or Unity include multiple ambient occlusion techniques. Unreal Engine provides limited support for ray-traced ambient occlusion with tight integration into Microsoft's real-time DXR API. DXR is closed-source and does not provide the flexibility to be used in academia. Fully-fledged open-source CPU raytracers such as PBRT [63] and Mitsuba [56] include built-in support for ray-traced CPU ambient occlusion.

Zhukov et al. [94] first defined the concept of ray-traced ambient occlusion for offline rendering. Laine and Karras [41] present two methods for the generation of approximate ambient occlusion effects. The goal of their work is to achieve faster ambient occlusion in real-time applications by approximating the ambient occlusion effect. Their work compares their proposed techniques against ray-traced ambient occlusion, which serves as the ground truth. Visual quality degradation is computed using the absolute error, i.e. an absolute pixel-by-pixel difference between their output and the ground truth. The results presented by Laine and Karras show that approximate ambient occlusion methods can be from 2× to 30× faster than ray-traced ambient occlusion. The main factors behind the large range of speedup are scene size and scene complexity.

Unfortunately, to the best of our knowledge, no open-source GPU raytracer provides an ambient occlusion implementation for academic usage. Therefore, we have made the code for our ambient occlusion benchmark publicly available [22]. We design a benchmark that allows the user to load any model and to generate the ray-traced ambient occlusion for that model on the GPU using three GPU raytracers.
The user is given the freedom to adjust the most important parameters such as Samples Per Surfel (SPS), Samples Per Triangle (SPT), maximum ray count, and ray extent.

SPS differs from the previously introduced SPP. Our benchmark is implemented as a viewpoint-independent global effect. We do not start rays at the camera and shoot them toward a pixelated display. We instead compute ray origins on the surface of the model; each ray origin is called a surfel. The number of rays that are traced per surfel is given by the number of Samples Per Surfel (SPS). Traditional raytracing applications start all rays at the camera and shoot them toward each pixel on the screen; there, the number of rays per pixel is given by the number of Samples Per Pixel (SPP).

3.1.1 Determining Ray-Origin

The first step in generating ambient occlusion rays for our RTAO benchmark is to load a model of the scene geometry. The scene geometry is defined as a list of triangles, which is further split into a vertex list and a connectivity list. We adopt the Object File Format (.obj) [14] model loading routine, allowing us to load triangular meshes. Our model loader requires vertex positions, connectivity, and vertex normals in the Object File Format.

From the triangular mesh, we determine the origin points of the rays. The rays should cover the surface area of the triangle mesh uniformly, such that we get uniform ambient occlusion information for the entire surface area of the model.

Although it would be easy to simply use the vertex positions of the mesh as the origins for the AO rays, this would lead to a sub-par ambient occlusion effect, as illustrated in Figure 3.1. The best way to sample a mesh for ray-tracing is an active field of research [28], with new sampling algorithms presented constantly. We follow the approach described in [7] and try to get an even cover of the surface area of the model.

We achieve better surface coverage by randomly sampling each triangle primitive of the model.
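A sketch of this per-triangle sampling, using the square-root warp of Osada et al. [58] that the equations below formalize (hypothetical `Vec3` helper; not the benchmark's actual code):

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// Barycentric combination a*A + b*B + c*C.
Vec3 combine(float a, float b, float c, Vec3 A, Vec3 B, Vec3 C) {
    return {a * A.x + b * B.x + c * C.x,
            a * A.y + b * B.y + c * C.y,
            a * A.z + b * B.z + c * C.z};
}

// Uniform sample of triangle (A, B, C) from two uniform random numbers
// r1, r2 in [0, 1]. The sqrt(r1) warp makes the samples uniform in area;
// applying the same weights to the vertex normals gives the normal at P.
Vec3 sampleTriangle(float r1, float r2, Vec3 A, Vec3 B, Vec3 C) {
    float s = std::sqrt(r1);
    return combine(1.0f - s, s * (1.0f - r2), s * r2, A, B, C);
}
```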
For each triangle with vertices (A, B, C) we generate two random numbers r1 and r2 between 0 and 1 and generate a new point P within the triangle by applying Equation 3.1 [58]. An extensive discussion of sampling density and sampling distribution is given by Osada et al. [58] in their Section 4.2. Similarly, we generate the normal N at point P from the vertex normals (AN, BN, CN) as shown in Equation 3.2 [58].

P = (1 − √r1)A + √r1(1 − r2)B + √r1 r2 C (3.1)

N = (1 − √r1)AN + √r1(1 − r2)BN + √r1 r2 CN (3.2)

Figure 3.1: Using the vertex positions as the origins of the rays for the Teapot model. The regular pattern provides poor surface coverage, resulting in incomplete coverage and subsequently an incomplete visual ambient occlusion effect. Each red dot represents the origin of one ray.

The tightness of the coverage of origin points depends on the Samples Per Triangle (SPT). A higher number of samples results in a tighter surface cover and a larger number of rays. The tradeoff of a higher SPT is shown in Figure 3.2.

Figure 3.2: Illustration of Samples Per Triangle (SPT) surface coverage for (a) 4 SPT, (b) 8 SPT, and (c) 16 SPT. Each red dot represents the origin of one ray.

3.1.2 Determining Ray-Direction

Once the ray origins have been established (Section 3.1.1), the rays' directions have to be determined.

It is possible to take the normalized surface normal N from Equation 3.2 as the ray direction. This would, however, yield an incomplete coverage of the hemisphere originating at P, as illustrated in Figure 3.3.

Figure 3.3: Using the vertex normal as the ray direction yields poor coverage of the hemisphere originating at P. Each green dot represents the destination of the ray direction of one ray.

To compute ray directions accurately, we stochastically sample the hemisphere using a cosine-weighted distribution function [25].
It is to be noted that the distribution function does not take into account surface properties such as texture roughness and reflectivity; this is because the ambient occlusion is chosen to be independent of these surface properties [41]. We first generate an orthonormal basis around the normal N [24]. We then stochastically generate a tangent vector using the cosine-weighted hemisphere distribution. Finally, we translate the tangent vector into the normal space using the orthonormal basis.

True ambient occlusion requires multiple independent stochastic samples from the same origin. From one origin point on the surface model, multiple rays should be shot in distinct directions according to the cosine-weighted hemisphere distribution. The number of rays is given by the number of Samples Per Surfel (SPS).

The pseudo-code in Listing 3.1 illustrates the algorithm. In Line 1, we iterate over each ray. For the remainder of the loop, the ray's normal is referred to as N. In Line 5 we generate the tangent basis of N. A tangent basis is formed of three normalized orthogonal vectors N, b1 and b2. We then iterate over each sample in Line 7, generating a stochastic sample in Line 10 and transforming it into the normal space using the orthonormal basis in Line 14.

1  for ray with normal N
2  {
3    // get orthonormal basis of the normal
4    float3 b1, b2
5    GetTangentBasis(N, b1, b2);
6
7    for i in range(0, SPS)
8    {
9      // sample the hemisphere, get a tangent
10     float3 tangent = CosineSampleHemisphere();
11
12     // translate tangent into world_normal space
13     // tangent * float3x3(row0 = b1, row1 = b2, row2 = n)
14     float4 ray_direction = tangent * to_matrix(b1, b2, n);
15   }
16 }

Listing 3.1: Generation of the ray direction based on ray normal N
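Listing 3.1 leaves GetTangentBasis and CosineSampleHemisphere abstract. One possible CPU-side sketch of the two helpers is given below; the basis uses a common branchless construction and the hemisphere sample uses Malley's method. These are assumptions for illustration, not necessarily the routine cited as [24] or the benchmark's exact code:

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

const float kPi = 3.14159265358979f;

// Build an orthonormal basis (b1, b2, n) around a unit normal n
// (branchless construction; one common choice for the helper in
// Listing 3.1).
void GetTangentBasis(const Vec3& n, Vec3& b1, Vec3& b2) {
    float sign = std::copysign(1.0f, n.z);
    float a = -1.0f / (sign + n.z);
    float b = n.x * n.y * a;
    b1 = {1.0f + sign * n.x * n.x * a, sign * b, -sign * n.x};
    b2 = {b, sign + n.y * n.y * a, -n.y};
}

// Cosine-weighted hemisphere sample about +z from two uniform random
// numbers u1, u2 (Malley's method: sample a disk, project upward).
Vec3 CosineSampleHemisphere(float u1, float u2) {
    float r = std::sqrt(u1);
    float phi = 2.0f * kPi * u2;
    return {r * std::cos(phi), r * std::sin(phi),
            std::sqrt(std::max(0.0f, 1.0f - u1))};
}
```

Transforming the sampled tangent into world space is then the matrix product on Line 14 of Listing 3.1.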
Figure 3.4(b) illustrates theeffect of doubling SPS to 2, this configuration provides sufficient ray-distributionfor high-quality AO.(a) 1 SPS (b) 2 SPSFigure 3.4: Illustration of Samples Per Surfel (SPS) using the stochastichemisphere sampling approach from Listing 3.1. Each green dot rep-resents the destination of the ray direction of one ray.293.1.3 Tracing RaysOnce all the rays have been generated with origin and directions, they are ready tobe traced using existing raytracers.We generate the BVH acceleration structure using Intel Embree [84]. Embreetakes the model of the scene as it’s input. It then generates a high-quality BVH onthe CPU which is flattened into two buffers: an interior node buffer containing thebounding boxes of the BVH and a leaf node buffer containing the scene geometry.We then port this acceleration structure to the GPU by transferring the buffers toGPU device memory. We use three existing intersections and trace algorithms:Aila et al. [4], Ylitie et al. [93] and NVIDIA’s OptiX API [60]. We now discussthese raytracers.Aila et al. [4] published a raytracing GPU implementation in 2010. Their ray-tracer uses a binary BVH acceleration structure. The kernel is implemented as apersistent thread kernel, work is fetched on completion of previous rays. Becausethis implementation follows the one-ray-per-thread paradigm, a traversal stack ismaintained. We implement the Aila raytracer in Compute Unified Device Architec-ture (CUDA) as a compute kernel2 [22], we use CUDA-10 as shown in Listing 3.2.This raytracer uses two while loops to complete the raytracing work. The pre-liminary while loop in Line 1 in Listing 3.2 determines the termination of a ray.This while loop only terminates if a ray does not intersect with any more nodes,i.e. the traversal stack is empty. The two subsequent while loops in Line 2 andLine 6 do intersections of interior nodes and leaf nodes respectively. 
This ordering assures that all interior nodes have been intersected before intersecting any leaves. Performance measurements presented by Aila et al. [4] show that this separation between leaf and interior nodes utilizes the GPU cache hierarchy most efficiently and therefore provides the best performance.

1   while (ray not terminated)
2     while (interior nodes to be intersected)
3       intersect interior nodes
4       maintain traversal stack
5
6     while (leaf nodes to be intersected)
7       intersect primitives

Listing 3.2: While-While loop for the Aila et al. [4] raytracer

Ylitie et al. [93] published an improvement on the Aila raytracer in 2017. The improved raytracer uses an 8-wide BVH acceleration structure. Using a wide acceleration structure has the advantage that it shortens the number of interior nodes each ray has to intersect before reaching a leaf. Ylitie et al. further compress the interior and the leaf nodes, reducing the memory overhead of each traversal. The traversal order for leaves is not fixed; leaves are traversed from front to back according to ray octant. Their memory compression scheme reduces the stack memory traffic, which was identified as a major memory bottleneck by Aila et al., to a negligible amount.

We implement the Ylitie raytracer in CUDA as a compute kernel [22]. Listing 3.3 shows pseudo-code of the CUDA compute kernel presented by Ylitie et al. The stack is again managed manually; the traversal stack S on Line 2 is accompanied by two additional stacks, G and Gt, which indicate the current leaf or interior nodes to operate on. Line 5 shows that the kernel is implemented as an infinite while loop that does not terminate until the exit condition in Line 27 is met. The kernel is implemented as a variation of Aila's while-while implementation. First, all interior nodes are traversed in Line 7, then all leaf nodes are traversed in Line 18.
Finally, either a new node is popped from the stack or the ray terminates in Line 25.

1   r = get_ray()   // get ray from ray pool
2   S = {}          // traversal stack
3   G = {root}      // current node to operate on
4   Gt = {}         // current leaf node
5   while(1)
6     /* interior node loop */
7     if G is interior node
8       n = get_closest_node(G, r)
9       G = G - {n}
10      if G is not empty
11        S.push(G)
12      G = intersect_children(n, r)
13    else // G is leaf node
14      Gt = G
15      G = {}
16
17    /* leaf node loop */
18    while Gt is not empty
19      t = get_next_triangle(Gt)
20      Gt = Gt - {t}
21      intersect_triangle(t, r)
22
23    /* terminal condition */
24    if G is empty and S is not empty
25      G = pop_stack(S)
26    else
27      return // done!

Listing 3.3: Ylitie et al. [93] GPU BVH traversal raytracer

Throughout Listing 3.3 we refer to five helper functions that we did not include code for. get_closest_node(G, r) chooses the closest node to the ray r's origin among all the child nodes of a given interior node G. intersect_children(n, r) intersects all child nodes of an interior node n with a ray r. get_next_triangle(Gt) pops the next triangle from the triangle stack Gt. intersect_triangle(t, r) intersects a triangle t with a ray r. pop_stack(S) pops a new node, either a triangle node or an interior node, from the stack S. The implementation can be found in [22].

Finally, we use NVIDIA's OptiX API [60] as a reference. OptiX is the state-of-the-art industrial GPU raytracing API. OptiX manages its own acceleration structure, traversal and intersection operations and does not give much flexibility to the programmers in manipulating the acceleration structure or the traversal algorithm. The closed-source nature of OptiX is therefore not suited for academic research. We nonetheless compare our benchmarks for both performance and correctness against OptiX.

Using the OptiX API for our RTAO query requires a few distinct steps, as shown in Listing 3.4. Lines 1-5 set up the model. We feed both the vertex buffer and the index buffer of the model to the OptiX API. Lines 7-9 set up the rays.
We feed the ray data to OptiX and set up a structure for the result data. Finally, Line 11 executes the query, which translates to tracing the rays. OptiX abstracts all tracing logic away from the programmer; we do not know what the execute() function does under the hood.

1   optix::prime::Context OptiXContext =
        optix::prime::Context::create(RTP_CONTEXT_TYPE_CUDA);
2   optix::prime::Model SceneModel = OptiXContext->createModel();
3   SceneModel->setTriangles(rg.IndexBuffer.size(), RTP_BUFFER_TYPE_HOST, ...,
        rg.VertexBuffer.size(), RTP_BUFFER_TYPE_HOST, ...);
4   SceneModel->update(RTP_MODEL_HINT_NONE);
5   SceneModel->finish();
6
7   optix::prime::Query query =
        SceneModel->createQuery(RTP_QUERY_TYPE_CLOSEST);
8   query->setRays(numRays,
        RTP_BUFFER_FORMAT_RAY_ORIGIN_TMIN_DIRECTION_TMAX,
        RTP_BUFFER_TYPE_CUDA_LINEAR, rg.cudaRays);
9   query->setHits(numRays, RTP_BUFFER_FORMAT_HIT_T_TRIID_U_V,
        RTP_BUFFER_TYPE_CUDA_LINEAR, cudaHits);
10
11  query->execute();

Listing 3.4: OptiX Prime [60] query implementation

3.2 Visual Ambient Occlusion Output

We visualize the output from our ambient occlusion benchmark using the GPU raytracers with Blender [11], an open-source 3D rendering tool. The raytracer outputs a buffer with an entry for each ray; each entry corresponds to a floating-point value. An entry is either 0, indicating that the ray missed all geometry, or a floating-point value between tmin and tmax indicating that the ray has hit geometry within its extent.

The output of our ambient occlusion benchmark is a Polygon File Format [15] (.ply) file. Each line in the file corresponds to one 3D point P on the surface of the model. The number of points in the file is calculated as:

    NumberOfSamples = NumberOfTriangles × SPT    (3.3)

Each line in the output file furthermore contains a gray-scale color in the range [0-255]. The color is representative of the ambient occlusion value at the given surfel. The higher the value, the lighter the color: darker points exhibit less ambient occlusion than lighter points.
This is counter-intuitive to the natural phenomenon of ambient occlusion: the more a point is occluded, the darker it should be represented. Our benchmark, however, aims to visualize ambient occlusion independent of all other visual effects. We find that debuggability and visual clarity are increased by inverting the gray-scale such that a white point represents a perfectly occluded point and a black point represents a barely occluded point. If a point is perfectly unoccluded, it is omitted from our visual representation.

The color of each sample point is determined by averaging the number of rays that hit an object, as shown in Listing 3.5. Line 4 is a bound check on the intersection distance; this check makes sure that the intersection distance is within the extent of the ray. If a sample is perfectly occluded, sum equals SPS in Line 7, so the corresponding point receives a normalized color of 1, which our inverted gray-scale renders as white. If a sample is perfectly unoccluded, sum equals 0 in Line 7; in this case, the check on Line 9 fails and the sample is omitted from the visual representation.

1   for each sample point P
2     sum = 0
3     for each sample in SPS
4       if intersection_distance > tmin and intersection_distance < tmax
5         sum++
6
7     grayscale_color = sum / SPS
8
9     if (sum > 0)
10      write P to file
11      Pcolor = grayscale_color

Listing 3.5: Assigning color to each sample point

In Figure 3.5 we show the visual output from our raytraced ambient occlusion benchmark on scenes of various complexities and triangle counts. The visual output is a point cloud rendered in Blender. All scenes are taken from [49]. The summary of the benchmark scenes is shown in Table 3.1.

We chose this point cloud visual representation to illustrate the 3D model. We do not limit our ambient occlusion to pure screen-space because we want to study the memory behavior for the entire scene, to get more representative results of a complete ray-traced model.
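The coloring scheme of Listing 3.5, together with the gray-scale inversion described above, can be sketched in Python; PLY writing is reduced to returning the point records, and all names are illustrative:

```python
def color_sample_points(points, tmin, tmax, sps):
    """Average per-surfel hit counts into an inverted gray-scale AO color.

    points maps each sample point P to its list of per-ray intersection
    distances (0.0 when the ray missed all geometry).
    Returns (P, color) records for every point with any occlusion;
    fully unoccluded points are omitted, as in Listing 3.5.
    """
    records = []
    for p, distances in points.items():
        hits = sum(1 for d in distances if tmin < d < tmax)
        if hits > 0:
            # inverted gray-scale: 255 (white) == perfectly occluded
            records.append((p, round(255 * hits / sps)))
    return records
```

A surfel whose rays all hit nearby geometry receives color 255, while a surfel whose rays all miss is dropped from the output entirely.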
We chose to omit any points that do not exhibit any ambient occlusion to make sure the points that do exhibit ambient occlusion are visible to the user of our benchmark. This allows for increased debuggability.

Scene Name    Number of triangles    BVH Node Memory (MB)    Triangle Node Memory (MB)
Teapot        15k                    0.09                    0.41
Dragon        300k                   1.6                     5.7
Sponza        262k                   4.6                     17.77
Buddha        1M                     60.2                    103.9
Hairball      2.8M                   156                     356.8
San Miguel    7.8M                   176                     574.5

Table 3.1: Sample scenes used for performance evaluation of our RTAO benchmark

Adding the ambient occlusion output to existing rasterization engines is possible with some modifications to our benchmark. First of all, our benchmark renders the entire scene; the first step would be to limit the samples to only the triangles visible on the screen at each frame. Secondly, one would have to disable the existing screen-space ambient occlusion of the rasterization engine and replace it with our modified screen-space ambient occlusion benchmark. Finally, the output from the ambient occlusion and the rasterization output should be blended in an additive manner. While adding our effect to rasterization engines is possible, we do not aim to do that. Our benchmark purposely traces the entire surface area of the model rather than tracing just the screen-space area in order to guide GPU architecture research. We aim to improve GPU architecture by looking at the memory behavior of the entire scene, just like true global illumination would. Reducing our benchmark to screen-space only would give an incomplete picture of the feasibility of real-time raytracing.

Teapot is a tiny example scene; it is not to be used as a benchmark but rather as a tool for quick visualization and conceptual debugging. Both Hairball and San Miguel are large scenes of high geometrical complexity; they are not meant for real-time rendering as they are too complex.
They are included nonetheless as a benchmark for future-oriented hardware and as a tough workload illustrating that our benchmark can handle complex scenes.

(a) Utah Teapot  (b) Dragon  (c) Sponza  (d) Buddha  (e) Hairball  (f) San Miguel
Figure 3.5: Our Ray-Traced Ambient Occlusion output visualized on six scenes. We used 32 SPS and 32 SPT for (a) and (b) and 4 SPS and 4 SPT for (c) through (f).

The biggest impact on the quality of the AO effect comes from the values of Samples per Surfel (SPS) and Samples per Triangle (SPT), as shown in Figure A.2 and Figure A.1. The user of the benchmark can choose the appropriate values: the higher the values chosen, the higher the quality of the ambient occlusion effect, but the more rays are traced and the longer the runtime.

As mentioned in Chapter 1, there is an inherent tradeoff in Computer Graphics: quality vs speed. While the highest visual quality is achieved with high SPS and SPT values, runtime scales linearly with both SPS and SPT, as discussed in Section 3.3. Our benchmark illustrates this trade-off: for performance reasons it is crucial not to waste resources, keeping SPS and SPT low, but for the highest visual quality both SPS and SPT have to be kept high.

Figure A.2 illustrates the tradeoff of the SPS values on the teapot scene; SPT is held constant at 32. One sample per surfel yields poor ambient occlusion output: with 1 SPS the color algorithm described in Listing 3.5 outputs a binary value, i.e. there either is occlusion or there is none. 32 SPS, on the other end of the spectrum, provides values in the range 0-32; this value indicates the amount of ambient occlusion and allows for a much more accurate visual effect.

Figure A.1 illustrates the tradeoff in choosing the SPT value on the dragon scene while keeping the SPS value constant at 32. SPT roughly translates to the coverage of the surface area of the model. A small SPT number provides sufficient coverage if the model is composed of many small triangles.
A small SPT value provides poor coverage for models with large flat triangles. We nonetheless suggest keeping the SPT value low: areas with a high density of small triangles in a model suggest sharp angles and complex geometry, and these are the areas where the AO effect is most pronounced. Large flat surfaces generally do not exhibit any significant visual AO effect.

3.3 Performance and Memory Behaviour of GPU Ray-Traced Ambient Occlusion

We explained the setup of our RTAO benchmark in Section 3.1. In this section, we measure the performance of the RTAO benchmark on a high level by timing the CUDA kernels with NVProf and on a low level using the GPGPU-Sim simulator [6].

3.3.1 Performance Evaluation

Table 3.2 shows the time it takes to render the complete AO effect for our benchmark scenes. The test configuration is set to 8 SPS and 1 SPT. The results produced are a high-quality AO effect. Naturally, it takes longer to trace large scenes: larger scenes have more triangles and therefore more rays are traced.

Scene Name    Aila et al. [4]    Ylitie et al. [93]    NVIDIA OptiX Prime [60]
Teapot        0.223              0.144                 0.199
Dragon        3.28               1.819                 1.99
Sponza        3.488              4.043                 3.19
Buddha        20.283             10.778                8.99
San Miguel    872.3              392.6                 363.57
Hairball      1141.019           395.724               966.537

Table 3.2: The time (in ms) to render the complete AO effect at 8 SPS and 1 SPT per scene for the 3 GPU raytracers

Across the raytracers, a trend can be observed. The Aila et al. raytracer, using a binary BVH, generally performs slower than Ylitie et al. and OptiX Prime. OptiX performs the best of all our raytracers. We suspect OptiX uses highly optimized CUDA kernels and pre-processes the rays to increase ray coherence and ray locality; the closed-source nature of OptiX prevents us from confirming this speculation. Effect of Scene Size on Performance

Figure 3.6 shows the performance of each of the three GPU raytracers on each of the six sample scenes.
These performance numbers were measured by running the RTAO benchmark on real hardware. The GPU used for all measurements was an NVIDIA 2080, a Turing-architecture GPU from 2019. This GPU is a commercially available, high-end desktop GPU found in up-to-date gaming PCs. The numbers, therefore, represent the upper end of the desktop performance scale. We timed the Aila et al. and the Ylitie et al. kernels using the CUDA timing and profiling tool NVProf [57]. We timed the OptiX implementation using CUDA timing events [57] to generate the wall time; for OptiX we had to resort to wall time because the OptiX API does not allow for in-depth profiling.

(a) Kernel timing vs # triangles in scene  (b) Kernel timing, 60 fps cutoff; right-most bars are cut off
Figure 3.6: Kernel timing for the Aila, Ylitie and OptiX raytracers for all 6 benchmark scenes, 8 SPS, 1 SPT. Figure 3.6a shows the number of triangles per scene, Figure 3.6b shows the 60 fps cutoff.

Figure 3.6a shows the performance results with respect to the number of triangles in the scenes. We observe a trend: the number of triangles grows exponentially while the runtime grows linearly. This is in line with the traversal cost of the BVH acceleration structure, which is O(log(n)), where n is the number of primitives in the scene. Comparing all three raytracers, it can be observed that Ylitie et al. and OptiX Prime far outperform the Aila et al. raytracer. Interestingly, Ylitie et al. outperform OptiX Prime on the large and complex Hairball scene due to their optimization for highly diverged workloads: Ylitie et al. trade performance on coherent rays for better performance on incoherent rays.

Figure 3.6b shows the timing of the kernels against the 60 fps cutoff enforced in video games.
On a high-end GPU, tracing complete AO stays within the cutoff only for the small scenes of Dragon, Sponza, and Buddha. Complex scenes like San Miguel and Hairball exceed the cutoff by over 10×. These complex scenes remain far beyond the reach of current GPU architectures.

(a) Kernel timing vs SPS (ms), with the 60 fps cutoff for reference  (b) Kernel timing vs SPS in million rays per second (MRays/s)
Figure 3.7: Kernel timing for the Aila, Ylitie, and OptiX raytracers for the Buddha scene for different values of SPS. Kernel runtime scales linearly with SPS. Effect of SPS on Performance

Figure 3.7 shows the runtime of the RTAO benchmark for the Buddha scene with various numbers of SPS. SPS is the biggest knob for visual output quality in ray tracing (Section 2.5, Section 2.2): visual quality increases with a higher number of SPS. Figure 3.7a shows that runtime scales linearly with the SPS count. Figure 3.7a also shows the real-time line of 60 fps for the Buddha scene. It is possible to trace at most 8 SPS while maintaining 60 fps if we were to use the entire compute capability of a high-end GPU. AO is a post-processing effect and only a small addition to the overall rendering pipeline; it is, therefore, not feasible to utilize the entire GPU. Figure 3.7a shows that adding the AO effect to rasterization engines as a post-processing effect has to be reduced to low SPS numbers in the range of one to four.

Figure 3.7b shows another popular performance metric in raytracing: million rays per second (MRays/s). While Figure 3.7b shows the same data as Figure 3.7a, a different trend becomes visible: the higher the number of SPS, the higher the MRays/s. This is due to ray coherence. A large SPS generates multiple rays with similar origins and similar directions.
Especially the OptiX raytracer capitalizes on this: when tracing 32 SPS, OptiX is almost 30% faster than when tracing 16 SPS.

3.3.2 Memory Evaluation

We use GPGPU-Sim [6] version 3.2, modified to enable CUDA 10, to analyze the microarchitectural behavior of our GPU AO benchmark on modern GPU architectures. GPGPU-Sim is a cycle-level simulator that simulates the CUDA compute kernels running on a modern GPU and provides detailed statistics.

As a case study, we evaluate the Ylitie et al. [93] raytracer for the Dragon and Sponza scenes. We chose the Ylitie et al. raytracer over the Aila et al. raytracer because it performs better on modern hardware; the Ylitie et al. kernel is the fastest open-source GPU raytracer to date.

In this section, we evaluate the Dragon and the Sponza scenes because they are mid-range scenes. Both scenes are representative of moderate mobile workloads, i.e. we would expect to be able to run scenes of this complexity on today's phones and tablets at 60 fps. Evaluating too small a scene yields unrepresentative results: raytracing applications are memory-bound, but when raytracing small scenes the GPU is compute-bound [3]. The reason for this discrepancy is that for small scenes the entire acceleration structure and the leaf nodes fit in the L2 cache. This is not a realistic workload; today's video game scenes and movie scenes are much larger than any L2 cache. Both the Sponza and the Dragon scene exceed the L2 cache of conventional GPUs and are therefore appropriate scenes for a memory study. We refrain from evaluating large scenes because the trends observed for memory-bound scenes hold for even larger scenes. Memory Access Classification

All memory accesses are generated at one of the many GPU streaming multiprocessors. We log these memory accesses, when running the AO benchmark using CUDA compute kernels, before they enter the Load-Store queue (LDST).
When memory accesses are generated, they have not yet entered the cache hierarchy. Memory accesses are classified by buffer: each memory access can be assigned to one of five buffers:

1. Interior Nodes - a flattened buffer representing the interior nodes of the BVH tree. Each node stores its AABB values, the child node pointers, and additional information such as depth and child node information.

2. Leaf Nodes - a flattened buffer representing the triangle information of the scene. Each node stores triangle information for all triangles contained in the leaf node.

3. Ray Input - each entry represents a ray. A ray is represented by eight floating-point values: three values for the ray origin O, three values for the ray direction d and two values for the ray extent tmin, tmax.

4. Ray Output - each entry is the result of the corresponding ray from the ray buffer. The ray result is composed of a single floating-point value and the triangle id of the hit.

5. Traversal Stack - the traversal stack is maintained for acceleration structure traversal. The traversal stack is kept in shared memory but can occasionally spill to DRAM.

Figure 3.8 shows the distribution of the memory accesses generated at the Load-Store Queue (LDST) for each of these buffers. From Figure 3.8 it becomes evident that the largest share of memory accesses go to the interior nodes of the acceleration structure, followed by memory accesses for the triangles of the scene. This shows that most memory accesses are generated when traversing the tree, which is a result of the pointer-chasing nature of traversing a tree structure. The second largest cause of memory accesses is the triangle geometry, followed by the ray buffer and the ray result buffer. It is to be noted that the traversal stack causes a negligible amount of memory traffic. This is because Ylitie et al. [93] have significantly compressed the stack traffic. Cache Performance

We study the cache performance of GPU architectures.
GPGPU-Sim emulates the cache hierarchy of NVIDIA GPUs. Each compute core has a 128 kB L1 data cache that is independent of all other L1 caches. Furthermore, there are separate caches, such as the instruction cache, the constant cache, and the texture cache. In the Ylitie et al. [93] raytracer that we use for our cache evaluation, the texture and the constant cache are unused. The L2 cache is unified on NVIDIA GPUs. The L2 cache size used for simulation in GPGPU-Sim is 4 MB.

(a) Dragon  (b) Sponza
Figure 3.8: Distribution of memory accesses generated at the GPU compute cores. 100k rays, 8 SPS, 8 SPT.

                                   L1D     L1I     Unified L2
Miss Rate Dragon (%)               76      0.8     32
Number of accesses Dragon (M)      2.9     51.6    1.3
Miss Rate Sponza (%)               75.1    0.8     48
Number of accesses Sponza (M)      3.2     53.3    1.4

Table 3.3: Cache miss rates for various GPU caches for the Dragon and Sponza scenes

Table 3.3 shows the miss rates of the caches as measured by GPGPU-Sim. The L1 data cache, which funnels BVH interior nodes, scene geometry and ray data to the cores, has a very high miss rate of about 75%.

The L2 cache miss rate ranges from 32% for the smaller Dragon scene to 48% for the larger Sponza scene. The increase in the L2 miss rate is directly proportional to the scene size: the larger the scene, the larger the BVH tree, and the more pressure is put on the L2. This finding is in line with our observations from Section, where we found that kernel performance scales linearly with scene size. From Table 3.3 we can conclude that this correlation is due to a higher number of DRAM accesses for larger scenes.

Figure 3.9 shows the miss rates of the caches per buffer. We plot the L1 data cache miss rates for both the Dragon scene in Figure 3.9a and the Sponza scene in Figure 3.9b. From the bars, we can see that the L1 data cache has a high miss rate.
All accesses to the ray input and the ray result data result in 100% misses (blue and red bars). This is expected because these are cold cache misses: the ray input and result data are accessed only once per trace operation and see no subsequent reuse. The BVH data (yellow bar) is the only buffer that sees substantial reuse in the L1 cache. The hit rate for the BVH buffer is around 68% for both the Dragon and the Sponza scene; scene size therefore does not matter when it comes to L1 hit rates. The triangle buffer (green bar) experiences some reuse, although its L1 miss rate hovers around 92% for both Dragon and Sponza. Figure 3.9a and Figure 3.9b confirm that the L1 is of little use except when storing BVH data.

Figure 3.9c and Figure 3.9d show the unified L2 cache miss rates. Similarly to the L1, the L2 cache does not see any hits for the ray input or the ray output. The L2 cache, however, has a significant hit rate for the BVH data (yellow bar), the stack data (orange bar), and the instructions (light blue bar). The most important observation is that the BVH miss rate in the L2 cache is under 10% for the Dragon scene, but 25% for the Sponza scene. This discrepancy can be attributed to the model size: the Sponza model is larger than the Dragon model, so more data needs to be transferred from DRAM to the L2 cache, evicting data which could otherwise have produced hits. The number of cache misses increases with scene size. The increase in cache misses from Dragon to Sponza is more pronounced in the primitive buffer (green bar): the miss rate for the primitive buffer of the Dragon scene is 42%, while the miss rate for the larger Sponza scene is 86%.

Figure 3.9 shows that the main reason for the performance reduction with increased scene size is the number of conflict misses and the amount of data touched by the compute cores.
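The per-buffer hit-rate breakdown above can be reproduced offline from a logged address trace with a small cache model. Below is a Python sketch of an LRU set-associative cache; the geometry parameters and the trace format are illustrative, not those of the simulated GPU:

```python
from collections import OrderedDict

class LRUCache:
    """Set-associative cache with LRU replacement, counting hits and misses."""
    def __init__(self, num_sets, ways, line_bytes):
        self.num_sets, self.ways, self.line_bytes = num_sets, ways, line_bytes
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = self.misses = 0

    def access(self, addr):
        line = addr // self.line_bytes
        s = self.sets[line % self.num_sets]
        if line in s:
            s.move_to_end(line)          # refresh LRU position
            self.hits += 1
            return True
        if len(s) == self.ways:
            s.popitem(last=False)        # evict the least recently used line
        s[line] = None
        self.misses += 1
        return False

def miss_rate_per_buffer(trace, cache):
    """trace is a list of (buffer_name, address) pairs, in access order."""
    stats = {}
    for buf, addr in trace:
        h, m = stats.get(buf, (0, 0))
        hit = cache.access(addr)
        stats[buf] = (h + hit, m + (not hit))
    return {buf: m / (h + m) for buf, (h, m) in stats.items()}
```

Feeding the logged LDST trace, tagged with the five buffer names of the classification above, through such a model yields a per-buffer miss-rate breakdown of the same shape as Figure 3.9.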
 Memory Reuse

In this section, we present our memory reuse study of the Dragon scene using our RTAO benchmark and GPGPU-Sim.

(a) Dragon - L1 data cache  (b) Sponza - L1 data cache  (c) Dragon - L2  (d) Sponza - L2
Figure 3.9: L1 data and L2 cache miss rates per buffer - Dragon and Sponza scene.

We analyze the amount of memory reuse that is available in the RTAO compute kernels. Memory reuse is defined as the number of memory addresses that are accessed more than once during the execution of the kernels.

Table 3.4 shows how many fewer DRAM accesses would be made if the L2 were infinitely sized. In real GPUs the L2 has to evict entries due to limited capacity; this, in turn, results in capacity misses, which increase the number of DRAM accesses. This study shows that a significant number of DRAM accesses could be saved with an infinite L2 cache, which shows that a significant amount of memory reuse goes lost on current GPUs.

We differentiate between perfect ray sorting, whereby rays are sorted by origin and direction to maximize ray locality, and random ray ordering, whereby rays are randomly shuffled before being traced to minimize ray locality. Perfect ray sorting assures that rays with nearby origins and similar directions are traced in the same warps. Perfectly sorted rays are coherent.
This sorting scheme lowers the number of DRAM accesses that have to be made because the cache exploits the ray coherence via temporal locality. Randomized ray ordering assures that there is minimal coherence between rays. This scheme results in more DRAM accesses because warps pack incoherent rays, which reduces the cache hit rate. Most current ray-tracing workloads are composed of highly incoherent rays.

Ray order              Perfect ray sorting    Random ray ordering
Fewer DRAM accesses    6×                     8.5×

Table 3.4: Limit study of the number of DRAM accesses saved with an infinite L2 cache

From Table 3.4 we see that if the rays are perfectly sorted, a practically impossible scenario, we can save up to 6× the number of DRAM accesses with an infinitely large L2 cache. If the rays are randomly shuffled, a much more realistic scenario for modern divergent raytracing workloads, we could save up to 8.5× the number of DRAM accesses given an infinitely sized L2 cache.

Table 3.4 demonstrates that there is significant room for optimization in reducing the overall DRAM accesses in raytracing applications. The cache hierarchy of current GPUs does not extract enough memory locality.

Finally, we further analyze the distribution of memory addresses that exhibit the highest amount of reuse. A memory address exhibits high reuse if it is accessed multiple times. A perfect cache would take a cold miss on the first access, and all subsequent accesses would result in cache hits. Using GPGPU-Sim we log all memory addresses at the LDST queue and record how often each address is accessed.

                          fraction of memory accesses (%)
top 1% of addresses       49.2
top 5% of addresses       66.9
top 10% of addresses      74.9
top 20% of addresses      82.1
top 30% of addresses      85.3
top 40% of addresses      87.5

Table 3.5: Memory reuse study: fraction of total accesses accounted for by the top X% of memory addresses

Table 3.5 shows what fraction of total accesses the top X% of memory addresses account for.
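The top-X% figures of Table 3.5 follow directly from the logged per-address access counts; a Python sketch of the computation, with an illustrative trace format:

```python
from collections import Counter

def top_fraction(trace, percent):
    """Fraction of all accesses going to the most-accessed `percent`% of addresses.

    trace is the flat sequence of accessed addresses, in any order.
    """
    counts = Counter(trace)                        # accesses per distinct address
    ranked = sorted(counts.values(), reverse=True) # hottest addresses first
    top_n = max(1, round(len(ranked) * percent / 100))
    return sum(ranked[:top_n]) / len(trace)
```

Applied to the logged LDST trace, top_fraction(trace, 1) corresponds to the first row of Table 3.5, and so on for the remaining rows.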
From the table, we can see that the top 1% of addresses account for roughly 50% of all memory accesses. This means that if we managed to store the correct 1% of all memory addresses in the cache, we could get a cache hit rate of 50%.

Figure 3.10 shows the classification according to the buffers used in raytracing. For example, Figure 3.10a corresponds to the 1% of addresses that are most frequently accessed (Section Furthermore, Figure 3.10a shows that among all the memory addresses that belong to the group of the top 1% most accessed memory addresses, 98% belong to the Interior Node buffer. This indicates that the best utilization of the cache hierarchy is to store the BVH acceleration structure. In Figure 3.10b we show the distribution of the 5% most reused memory addresses. We can see that while most of these accesses are for the BVH acceleration structure, some heavily reused addresses are for the triangle primitives.

We, therefore, conclude that a lot of memory address locality goes lost during raytracing applications because of misses in the cache hierarchy.

(a) 1%  (b) 5%  (c) 10%  (d) 20%  (e) 30%  (f) 40%
Figure 3.10: Memory reuse study: classification of memory addresses belonging to the top X% of all accessed memory addresses.

Chapter 4

Hash-Based Ray Path Prediction

State-of-the-art ray tracing techniques operate on hierarchical acceleration structures such as BVH trees, which wrap objects in a scene into bounding volumes of decreasing sizes. Acceleration structures reduce the number of ray-scene intersections that a ray has to perform to find the intersecting object.
However, we observe a large amount of redundancy when rays traverse these acceleration structures. While modern acceleration structures exploit the spatial organization of the scene, they neglect similarities between rays that traverse the structures and thereby cause redundant traversals.

This chapter provides a limit study of a promising new technique, Hash-Based Ray Path Prediction (HRPP), which exploits the similarity between rays to predict leaf nodes and thereby avoid redundant acceleration structure traversals. Our data shows that acceleration structure traversal consumes a significant proportion of the ray tracing rendering time regardless of the platform or the target image quality. Our study quantifies unused ray locality and evaluates the theoretical potential for improved ray traversal performance for both coherent and seemingly incoherent rays. We show that HRPP can skip, on average, 40% of all hit-all traversal computations.

Ray tracing techniques [89] employ hierarchical acceleration structures, such as Bounding Volume Hierarchies (BVH), that capture spatial locality by subdividing scenes into a hierarchy of ever tighter bounding boxes. These acceleration structures, i.e., traversal trees, reduce the subset of a scene that a ray has to intersect. However, reducing ray-scene calculations comes at the cost of additional ray-box intersections that precede the ray-scene intersection computations. Consequently, a trade-off has to be made between the depth and the width of a traversal tree, where the branching factor determines the depth and width of the tree. Wide trees are shallow and able to quickly traverse a ray to a leaf node for ray-scene intersection computations.
On the other hand, deep trees need to traverse many interior levels to reach a leaf node but entail fewer ray-scene calculations.

This chapter proposes and studies the potential of a new technique, hash-based ray path prediction (HRPP), which reduces the cost of traversing deep trees by exploiting ray locality, whereby rays with close-by origins and similar directions follow a similar path through the tree. HRPP exploits the ray locality present throughout a scene traversal to skip redundant ray-box intersections.

Extensive work has been done in recent years on exploiting ray coherence for efficient traversal by mapping coherent rays to parallel hardware such as SIMD units and GPU warps [64, 82, 83]. HRPP is different in that it proposes to skip redundant ray traversal computations and to prevent rays with high similarity to previous rays from even entering the acceleration structure. Skipping interior node traversal avoids all DRAM traffic associated with interior nodes and reduces the overall computation needed to ray trace a frame.

Similar to the limit study by Lam and Wilson [42], which presents an upper bound on control flow parallelism, we present a limit study on how many interior nodes can be skipped by exploiting ray locality using HRPP.

This chapter makes the following contributions:

• It highlights the potential of ray path prediction for accelerating hierarchical tree traversal;

• It proposes a hash function that takes into account ray properties so it can be used for predicting the path of similar rays;

• It shows a theoretical limit study that quantifies ray locality and evaluates its potential for performance improvements using HRPP, which was able to avoid 30% of ray-box intersections.
This limit study serves as a basis for a future hardware-based implementation that can harness HRPP to improve ray tracing performance.

4.1 Ray Tracing Performance Evaluation

We studied the performance characteristics of hierarchical acceleration structures using two BVH-based ray tracing implementations: the CPU-based PBRT [63] renderer and Aila's [4] GPU-based renderer. We use the eight example scenes shown in Figure 4.1a.

PBRT differs from Aila's in that it renders scenes at a higher visual quality. Aila's GPU implementation is optimized for speed and real-time image generation; hence, less effort is put into sophisticated sampling and lighting computations, while texture lookups are omitted.

Figure 4.1b shows the proportion of the total run-time consumed by ray-box intersections (interior nodes) and ray-scene intersections (leaf nodes) in both implementations on increasingly complex scenes. The proportion labeled 'Other' in Figure 4.1b is spent on path tracing tasks such as sampling, texture lookups, or lighting computations. The proportion of time spent computing ray-box and ray-triangle intersections is correlated with the geometrical complexity of the scene.

For the CPU-based PBRT, the time spent traversing the BVH tree is, on average, 40% of the total execution time, where around 80% of the traversal time is spent on ray-box intersections.

For Aila's GPU-based ray tracer, the proportion of time spent on BVH traversal is, as expected, higher than for PBRT. Figure 4.1b shows that the GPU path tracer spends upwards of 95% of the execution time performing tree traversal. Similar to PBRT, the most time-consuming step is ray-box intersections.

The results above show that tree traversal remains the most time-consuming step in ray tracing, even across benchmarks that use very different rendering implementations and target different levels of image quality.
Thus, accelerating the traversal process would benefit both real-time ray tracing (e.g., Aila's GPU implementation) and offline ray tracing (e.g., PBRT), as in both cases calculating ray-box intersections remains the most costly part of the process.

In this chapter, we provide a study showing the potential of using Hash-Based Ray Path Prediction to reduce the number of ray-box intersection calculations (i.e., interior node traversals). By taking into account the properties of rays and their paths through a scene, we can predict which part of a scene a ray intersects, thus avoiding traversing internal tree levels to reach target leaf nodes.

[Figure 4.1a shows the eight test scenes: CPU Teapot (2k triangles), CPU Killeroo (600k triangles), CPU Buddha (1M triangles), CPU Sportscar (53M triangles), CPU San Miguel (7.8M triangles), GPU Dragon (800k triangles), GPU Sponza (250k triangles), and GPU Hairball (1.4M triangles). Figure 4.1b plots, per scene, the share of total runtime spent on leaf node intersection, interior node intersection, and other work.]

Figure 4.1: Runtime proportions of increasingly complex scenes on the CPU and the GPU - 1024×1024, 8 SPP

4.2 Related Work

Ray Locality. Pharr et al. [64] observed that primary rays and simple light rays exhibit ray locality but chose not to further explore this type of locality. Boulos et al. [13] worked on pre-processing rays in the ray pool to accelerate ray traversal; however, their technique is limited to offline rendering and infeasible for online rendering. Wald et al. [82, 83] use ray locality for packet tracing but do not skip redundant computations.

GPU BVH Traversal. Aila and Laine [4] provided the most widely used GPU benchmark for ray traversal. Aila and Karras [3] propose to reduce the memory traffic during tree traversal via changes to the GPU architecture to facilitate treelet compaction. Aila and Laine's GPU ray tracing benchmark has been further extended recently by Ylitie et al.
[93], where compressing the memory traffic of an 8-wide BVH improved performance by a factor of up to 3.3 due to the memory reduction. Lier et al. [45] applied SIMD-style ray tracing to wide BVH trees to improve incoherent ray traversal performance by 35%-45% on average.

Hardware Accelerated Ray Tracing. Shkurko et al. [71] achieved performance gains by buffering rays inside the acceleration structure, thereby achieving perfect prefetching and avoiding duplicate memory fetches. Their technique assumes custom MIMD hardware and fixed function logic. More recently, Lu et al. [46] made changes to the GPU SIMD architecture to minimize thread divergence and thereby improve traversal speed. Thread divergence is avoided by preventing certain threads in warps from intersecting scene data while others are traversing or fetching new ray data. This is achieved by reserving buffers that act as ray-queues. All rays that are about to intersect triangle nodes are stored in a ray-queue. Once the number of rays in the queue reaches the warp size, the rays are packaged into a warp and scheduled to execute on the streaming multiprocessor.

4.3 Hash-Based Ray Path Prediction

[Figure 4.2 sketches a binary BVH augmented with an HRPP predictor: rays leaving the ray pool are hashed on (ray_origin, ray_direction), looked up in the prediction table, and, on a hit, routed directly toward the predicted nodes.]

Figure 4.2: Binary BVH Acceleration Structure augmented by HRPP. The interior nodes are represented by their bounding boxes. Leaf nodes have both bounding boxes and triangular scene geometry. HRPP additions are marked in green. The red arrows represent the control flow from root to leaves through HRPP.

4.3.1 Overview

Figure 4.2 shows an example of a binary BVH tree as used in PBRT and Aila's GPU raytracer. The traversal of rays through the acceleration structure starts at the ray pool, which dynamically collects all the rays to be traced throughout the rendering process. Once the rays exit the ray pool, they are traced through the BVH tree in a top-down fashion.
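A top-down traversal of this kind can be sketched as a loop over an explicit traversal stack. The sketch below is ours, not PBRT's or Aila's: the `Node` layout, the slab-based ray-box test, and the simplification that a 'hit-any' query returns the first leaf whose bounding box the ray enters (the ray-triangle test inside the leaf is omitted for brevity) are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <vector>

struct AABB { float lo[3], hi[3]; };
// tri >= 0 marks a leaf holding one triangle index; interior nodes use left/right.
struct Node { AABB box; int left, right, tri; };

// Slab test: does the ray o + t*d (t >= 0) enter the box?
bool intersectBox(const AABB& b, const float o[3], const float d[3]) {
    float tmin = 0.f, tmax = 1e30f;
    for (int i = 0; i < 3; ++i) {
        float inv = 1.f / d[i];
        float t0 = (b.lo[i] - o[i]) * inv, t1 = (b.hi[i] - o[i]) * inv;
        if (t0 > t1) std::swap(t0, t1);
        tmin = std::max(tmin, t0);
        tmax = std::min(tmax, t1);
    }
    return tmin <= tmax;
}

// Top-down 'hit-any'-style traversal with an explicit stack; returns the
// triangle index of the first leaf whose box is hit, or -1 on a total miss.
int traverse(const std::vector<Node>& tree, const float o[3], const float d[3]) {
    std::vector<int> stack = {0};          // start at the root
    while (!stack.empty()) {
        int idx = stack.back(); stack.pop_back();
        const Node& n = tree[idx];
        if (!intersectBox(n.box, o, d)) continue;  // interior ray-box test
        if (n.tri >= 0) return n.tri;              // leaf: scene intersection
        stack.push_back(n.left);
        stack.push_back(n.right);
    }
    return -1;
}
```

Every iteration that pops an interior node pays for one ray-box test; these are exactly the computations HRPP aims to skip.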
On the CPU, ray traversal bundles groups of rays into packets for SIMD-parallelized traversal [83]. On the GPU, rays are bundled into much wider packets called warps, which match the underlying architecture of GPUs [3].

A ray that intersects the scene geometry has to traverse up to the full depth of the tree before calculating any scene intersections. Traversing interior nodes, performing ray-box intersections along the way, is a costly operation that provides no merit other than guiding rays to leaf nodes, where the essential scene intersections take place.

Shadow rays follow the 'hit-any' paradigm, whereby they search for any scene intersection. Most other rays, however, need to determine the closest intersection to the ray's origin and, therefore, follow the 'hit-all' paradigm.

If two rays exhibit locality, they are mapped to similar leaf nodes. Trivially, primary rays all start at the camera and their directions remain similar; consequently, primary rays are highly coherent and exhibit locality with each other. Subsequent rays, such as reflection rays, shadow rays, and incoherent global illumination rays, also exhibit locality. Many secondary rays, which originated from distinct primary rays, intersect similar surfaces in the scene or are traced towards similar light sources. Even seemingly incoherent rays can show similarity with other incoherent rays throughout the rendering of a frame. This type of locality, which goes beyond primary rays, is unexploited in today's raytracers.

4.3.2 Hash-Based Ray Path Prediction (HRPP)

HRPP skips traversal operations by predicting ray paths. HRPP predicts which nodes a ray is likely to intersect based on information gathered from previous similar rays. HRPP's direct flow guides rays directly from the ray pool towards the leaf nodes, as shown in Figure 4.2 by the dashed red lines. Rather than letting rays enter the acceleration structure via the ray pool, HRPP computes a ray's hash value.
The hash function employed by HRPP uses the ray's physical properties of origin and direction to associate a unique number with each ray. This number serves as a lookup index into HRPP's predictor table. Hash functions are discussed in more detail in Section 4.4.

Once rays are hashed, the predictor table serves as a storage of mappings from unique ray hash values to node indices, which can be used as pointers into the tree. Before any ray enters the tree, the ray's hash value is computed, followed by a lookup in the predictor table. If the hash value is present in the table, a prediction for the ray is made. The prediction is then evaluated. If the evaluation results in a valid scene intersection, the traversal of all interior nodes is skipped. Otherwise, the ray is added back into the ray pool and traverses the tree as it would without HRPP.

The predictor predicts either a leaf node or an interior node, as illustrated in Figure 4.2. When an interior node is predicted, the ray still has to traverse the interior layers preceding the target leaf node.

4.3.3 Precision and Recall

As stated in the previous section, an HRPP prediction can either hit or miss the target leaf node. We intuitively define HRPP hits as positive predictions and HRPP misses as negative predictions. We further differentiate between four scenarios: true positives, false positives, true negatives, and false negatives. We discuss how each of these scenarios can arise and how to deal with them to guarantee correctness.

True positives occur when a ray hits in the predictor and the prediction is deemed correct after evaluation. A prediction is determined to be correct when a ray-scene intersection is found in the predicted leaf node. This is the best-case outcome, whereby the traversal of the tree is successfully skipped.
It is to be noted, however, that there is a slim possibility of a true positive leading to the wrong visual output in the case where the ray is required to find the closest intersection, i.e., the 'hit-all' paradigm: prediction can cause a ray to not find the closest intersection. This problem is nonexistent for 'hit-any' rays such as shadow rays.

On the other hand, false positives occur when a ray hits in the predictor table but the prediction made by HRPP is incorrect. Upon a predictor hit, the ray is tested against the geometry in the predicted node(s). If the ray misses in the node(s), the prediction is incorrect. In this case, the ray is added back into the ray pool and traverses the acceleration structure as it would without HRPP. The performance cost of a false positive is the cost of the lookup in HRPP plus the cost of intersecting the triangles in the wrongly predicted leaf nodes.

True negatives and false negatives are similar to each other in that both simply miss in the predictor, but they do so for different reasons. A true negative misses because no similar ray has been encountered previously. A false negative occurs when an entry has been evicted from the predictor; these misses could have been avoided by an infinitely sized predictor table. Our current implementation lacks a replacement policy, as we focus on evaluating the limits of our technique. We will therefore not differentiate between true negatives and false negatives. The penalty for negatives is the cost of the predictor table lookup, which includes the cost of computing the hash function. Once a ray has been established as being falsely predicted, it is re-added to the ray pool awaiting traversal through the tree.

We implement HRPP as an extension to PBRT. Our implementation is a software implementation that uses the C++ standard template library to implement our custom hash table.
The hash table uses chaining, meaning that if multiple rays map to the same entry in the hash table, their predicted nodes are appended to that entry. During prediction, if a ray's hash hits in the predictor, all associated predictions must be validated until a hit is found.

4.4 HRPP Hash Function

Hashing is used to map ray properties to a unique value that serves as an index into the predictor table. In our case, a good hash function maps rays with similar properties to the same entry in the predictor table.

[Figure 4.3 shows a 32-bit IEEE 754 float split into 1 sign bit, 8 exponent bits, and 23 mantissa bits; with a precision of 2 bits, the sign bit and the top 2 bits of both the exponent and the mantissa are included in the hash.]

Figure 4.3: Illustration of our hash function's extraction of bits from an IEEE 754 floating point with an example precision of 2 bits. Bits marked in green are included in the hash representation.

Most path tracers encode ray origin and direction as single-precision floating point numbers in the IEEE 754 [1] representation. HRPP's hash function extracts the relevant bits from the IEEE 754 encoding to create the hash index. Furthermore, we propose to implement the hash encoder as a fixed-function hardware block for maximal performance. Figure 4.3 illustrates the proposed hash encoding using an example with a hash precision of 2 bits. In this example, the predictor index is composed of the sign bit and the 2 most significant bits of both the exponent and the mantissa. A hash precision of 3 bits would be composed of the sign bit and 3 bits from each of the exponent and the mantissa.

The same hash encoding is used for all 6 floats of each ray, where 3 floats encode the direction and 3 floats encode the origin. Finally, we swizzle the origin hash values with the direction hash values to obtain a unique predictor table entry that fits into at most 48 bits.
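The per-float bit extraction described above can be realized in software as follows. This is a sketch, not the thesis's exact implementation: the function name matches the `map_float_to_hash` helper assumed in Listing 4.1, but the precision `k` is made an explicit parameter here, and `std::memcpy` is used to read the IEEE 754 bit pattern without undefined behavior.

```cpp
#include <cstdint>
#include <cstring>

// For a hash precision of k bits (k <= 7 so the result fits in 16 bits),
// keep the sign bit plus the k most significant bits of the exponent and
// of the mantissa of an IEEE 754 single-precision float.
uint16_t map_float_to_hash(float value, unsigned k) {
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));   // raw IEEE 754 bit pattern

    uint32_t sign     = bits >> 31;             // 1 sign bit
    uint32_t exponent = (bits >> 23) & 0xFFu;   // 8 exponent bits
    uint32_t mantissa = bits & 0x7FFFFFu;       // 23 mantissa bits

    uint32_t exp_top = exponent >> (8 - k);     // k MSBs of the exponent
    uint32_t man_top = mantissa >> (23 - k);    // k MSBs of the mantissa

    // Concatenate sign | exponent MSBs | mantissa MSBs: 1 + 2k bits total.
    return (uint16_t)((sign << (2 * k)) | (exp_top << k) | man_top);
}
```

By construction, nearby values collide: with k = 2, for example, 1.0f and 1.1f share the same exponent MSBs and mantissa MSBs and therefore the same hash, while 1.0f and -1.0f differ in the sign bit and hash differently.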
The pseudo-code for the hash function is as follows:

    // extract up to 16 bits from the IEEE 754 float representation
    uint16_t hash_o_x = map_float_to_hash(ray.origin.x);
    uint16_t hash_o_y = map_float_to_hash(ray.origin.y);
    uint16_t hash_o_z = map_float_to_hash(ray.origin.z);
    uint16_t hash_d_x = map_float_to_hash(ray.direction.x);
    uint16_t hash_d_y = map_float_to_hash(ray.direction.y);
    uint16_t hash_d_z = map_float_to_hash(ray.direction.z);

    // xor origin and direction hashes to save space
    uint16_t hash_0 = hash_o_x ^ hash_d_z;
    uint16_t hash_1 = hash_o_y ^ hash_d_y;
    uint16_t hash_2 = hash_o_z ^ hash_d_x;

    // concatenate the three 16-bit hashes into a unique index
    // of at most 48 bits
    uint64_t predictor_table_index =
        ((uint64_t)hash_0 << 0)  |
        ((uint64_t)hash_1 << 16) |
        ((uint64_t)hash_2 << 32);

Listing 4.1: Creating a unique ray hash index from ray origin and ray direction

While this encoding minimizes hash conflicts, it deliberately does not avoid them; we want similar rays to result in hash conflicts by design. When encountering a hash conflict, the leaf node of the conflicting ray is inserted into the existing predictor entry. No predictor entry can hold duplicate nodes; therefore, the leaf is only inserted if it is not already present in the predictor entry.

4.4.1 Hash Function Tradeoff

In this section, we discuss the tradeoff between using loose vs. tight hash functions. We define a hash function as loose if it maps too many rays to the same hash index. As a result, the number of false positives increases and the overall number of skipped computations decreases. A symptom of this is that the entries in the predictor table become long, because the predictor table associates many distinct leaves with each hash index. On a predictor hit, a ray has to check against all nodes in the table entry. Long table entries result in a large amount of predictor-induced overhead computation and few skipped nodes.

On the other hand, a tight hash function will only map rays to the same index if they are very similar.
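The conflict handling just described, where a conflicting ray's leaf node is appended to an existing entry only if it is not already present, can be sketched on top of the C++ standard library as follows. The class and member names are ours, chosen for illustration; the thesis's actual table is a custom STL-based structure.

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Sketch of a chaining predictor table: each 48-bit ray hash maps to a
// list of candidate node indices that must all be validated on a hit.
class PredictorTable {
public:
    // Record that a ray with this hash reached this node; duplicates
    // within one entry are suppressed, as described in the text.
    void insert(uint64_t ray_hash, int node_index) {
        std::vector<int>& entry = table_[ray_hash];
        if (std::find(entry.begin(), entry.end(), node_index) == entry.end())
            entry.push_back(node_index);
    }

    // Returns the candidate nodes to validate, or nullptr on a predictor miss.
    const std::vector<int>* lookup(uint64_t ray_hash) const {
        auto it = table_.find(ray_hash);
        return it == table_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<uint64_t, std::vector<int>> table_;
};
```

The length of each `std::vector` entry here is exactly the "leaves per entry" metric discussed in this section: a loose hash grows these chains, while a tight hash keeps them short at the cost of more distinct table entries.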
When using a tight hash function, HRPP uses significantly more memory due to the larger number of distinct hash values stored in the predictor table. Tight hashing produces few positives and, as a result, few computations are skipped. The upside of a tight hash function, however, is that almost all positives are true positives.

Figure 4.4 illustrates this tradeoff for the Buddha scene rendered in PBRT with 8 samples per pixel at a resolution of 1024×1024. The number of precision bits used in the hash function and the size of the predictor table are directly correlated, as shown in Figure 4.4 (a). In this particular example, the number of intersections skipped peaks at 5 bits. A hash function that is too tight (e.g., 6 bits and up) increases the size of the predictor table without additional computations skipped. A hash function that is too loose (e.g., 4 bits and below) produces a large number of hash conflicts. This is illustrated in Figure 4.4 (b), where the average number of leaves per entry increases significantly, to over 600, at a precision of 4 bits.

[Figure 4.4a plots interior node computations skipped (%) and predictor table size (MB) against the number of hash precision bits (8 down to 4); Figure 4.4b plots the average number of leaves per entry, which grows from 1.05 at 8 bits through 1.08, 4.23, and 63.89 to 605.51 at 4 bits.]

Figure 4.4: Illustration of the tradeoff between a tight and a loose hash function and its impact on the predictor table properties.
Buddha - 1024×1024 - spp 8

Scene                               Teapot  Killeroo Simple  Killeroo Metal  Buddha  Sportscar
number of triangles                 2k      60k              500k            1M      53M
lighting complexity                 simple  simple           complex         medium  complex
max BVH depth                       16      24               27              30      32
hit-any savings (%)                 4       48               <1              23      <1
hit-any predictor table size (MB)   16      10.8             31              18      21
hit-all savings (%)                 69      52               30              41      32
hit-all predictor table size (MB)   2       84               47              18      18

Table 4.1: Evaluation of test scenes - resolution 1024×1024 - 8 spp - Go Up Level 0 - hash precision 6

4.5 Limit Study

We implement a software version of HRPP as an extension of the PBRT renderer to quantify the ray locality present in various example scenes. We present an extensive discussion of which parameters are relevant to HRPP and quantify their effect on prediction accuracy.

While previous limit studies, such as the one by Lam and Wilson [42], show the potential for improvement over the state of the art by quantifying the presence of a new type of locality, they do not provide a practical implementation. This chapter, however, presents both a limit study similar to [42] and HRPP as a technique that can be adopted in hardware to produce a speedup over current state-of-the-art raytracers.

For this limit study, our software implementation has the drawback of using a large amount of additional memory, often exceeding 100MB. A future hardware implementation of HRPP can address the memory problem by implementing a replacement policy for the HRPP predictor table. Future work will study the impact of a replacement policy on the accuracy and the memory consumption of HRPP.

4.5.1 Overview Table

Results are summarized in Table 4.1. We evaluate HRPP on five scenes with unified binary BVH trees of increasing geometrical and illumination complexity. We evaluate two distinct HRPP predictors, one for hit-any and one for hit-all rays, to avoid hash conflicts between hit-all and hit-any rays.
The two predictors are identical but are used to store two distinct categories of rays.

We find that the average number of interior nodes skipped by HRPP for hit-all rays is above 30% across all test scenes. This shows that we could skip up to 30% of all interior node computations and the associated memory traffic by using HRPP for all hit-all rays. We conclude that hit-all rays are highly suitable for HRPP prediction.

We find that for scenes with complex lighting models, such as the metal Killeroo scene and the sports car scene, hit-any HRPP achieves less than 1% skipped interior nodes due to the low locality of global illumination rays. However, on the geometrically complex Buddha scene, we skip 30% of all hit-any interior node computations. We conclude that hit-any rays exhibit little ray locality in scenes with complex lighting models while exhibiting significant ray locality if the lighting model is simple.

4.5.2 Go Up Level

[Figure 4.5 plots interior node computations skipped (%) and the predictor table size against Go Up Levels 0 through 4.]

Figure 4.5: Impact of Go Up Level on the number of predictor entries. Buddha - 1024×1024 - hash precision 6

We define HRPP's Go Up Level as the level of the acceleration structure tree that the predictor table predicts. A Go Up Level of 0 predicts the acceleration structure's leaf nodes. A Go Up Level of 1 predicts the parent nodes of the leaf nodes. A Go Up Level of 2 predicts the grand-parent nodes of the leaf nodes, etc.

As shown in Figure 4.5, the most significant impact of the predictor's Go Up Level is on the amount of memory used for the predictor table. A Go Up Level of 1 results in fewer entries, and therefore less memory used, at the cost of a small reduction in skipped computations as compared to a Go Up Level of 0.

4.5.3 Samples Per Pixel

[Figure 4.6 plots interior node computations skipped and the number of predictor entries as a function of the number of samples per pixel.]

Figure 4.6: Impact of SPP on interior node computations skipped and the number of predictor entries.
Buddha - 1024×1024 - hash precision 6

We study the impact of the number of Samples Per Pixel (SPP) on the efficiency of HRPP in Figure 4.6. Counterintuitively, the number of SPP is inversely correlated with the efficiency of HRPP because of hash conflicts, which result in a large amount of chaining in the predictor hash table. As the number of hash conflicts increases, the amount of memory used for HRPP increases, as shown in Figure 4.6. We expect a replacement policy for HRPP to counter this trend but leave the evaluation of replacement policies as future work.

4.6 Conclusion

This chapter explores the benefits of exploiting ray locality to accelerate BVH traversal by skipping interior nodes. In current acceleration structures this ray locality goes unutilized; we show that there is a speedup to be gained from exploiting it.

Our limit study uncovers significant potential for speedup by skipping, on average, 30% of all hit-all rays during the rendering of one frame. We explore the design space of ray prediction by quantifying the impact of hash precision, scene complexity, illumination complexity, and samples per pixel, as well as HRPP's Go Up Level. We propose a hash mapping from IEEE 754 floating point values and explore the tradeoffs for efficient hashing. Finally, we provide directions for future work that addresses the current shortcomings of HRPP, such as memory consumption and hash conflicts.

Chapter 5

Conclusion and Future Work

5.1 Conclusion

In this thesis, we have explored the bottlenecks of real-time raytracing by focusing on current GPU architectures. We have presented:

• An open-source implementation of a Ray-Traced Ambient-Occlusion benchmark for use as a hybrid rendering workload. We have shown that our benchmark can accurately produce high-quality ambient occlusion for scenes that are still too complex for real-time rendering.
We provide three state-of-the-art GPU compute raytracing kernels: two academic kernels and one kernel that relies on the state-of-the-art commercially available API.

• The first publicly described microarchitecture simulator-based evaluation of the memory behavior of current SIMT compute raytracing of the highly divergent RTAO benchmark on current GPU architectures. We have quantified the effect of scene size, samples per pixel, and samples per triangle, and we have characterized the detailed memory reuse patterns using a detailed GPU simulator.

• We have proposed a novel approach to accelerate BVH traversal by prediction; we show a possible reduction of 40% in interior node computations in theory, but we conclude that the feasibility of our approach is low due to significant memory overhead.

5.2 Future Work

While the scope of this thesis is limited to current GPU architectures and current efforts to advance hybrid rendering, we have identified avenues for future work.

• Evaluate other raytracing effects for hybrid rendering. We should evaluate ray-traced shadows, ray-traced reflection and refraction, ray-traced volume rendering, and ray-traced global illumination.

• Evaluate which areas or effects benefit the most from raytracing effects.

• Study future hardware architectures that follow the MIMD paradigm, or custom FPGA and ASIC hardware architectures for raytracing.

Bibliography

[1] IEEE standard for binary floating-point arithmetic. Institute of Electrical and Electronics Engineers, New York, 1985. Note: Standard 754-1985. → page 59

[2] M. Agate, R. L. Grimsdale, and P. F. Lister. The HERO Algorithm for Ray-Tracing Octrees. In Proceedings of the Fourth Eurographics Conference on Advances in Computer Graphics Hardware, EGGH'89, pages 61–73, Goslar, DEU, 1989. Eurographics Association. → page 12

[3] T. Aila and T. Karras. Architecture Considerations for Tracing Incoherent Rays.
In Proceedings of the Conference on High Performance Graphics, HPG '10, pages 113–122, Aire-la-Ville, Switzerland, 2010. Eurographics Association. → pages 12, 14, 16, 41, 54, 55

[4] T. Aila and S. Laine. Understanding the Efficiency of Ray Traversal on GPUs. In Proceedings of the Conference on High Performance Graphics 2009, HPG '09, pages 145–149, 2009. → pages 16, 30, 31, 38, 51, 54

[5] AUTODESK Inc. Autodesk Mudbox documentation, 2014. Visited on 2020-06-10. → pages x, 19, 20

[6] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In 2009 IEEE International Symposium on Performance Analysis of Systems and Software, pages 163–174, April 2009. doi:10.1109/ISPASS.2009.4919648. → pages 23, 38, 41

[7] C. Barré-Brisebois, H. Halén, G. Wihlidal, A. Lauritzen, J. Bekkers, T. Stachowiak, and J. Andersson. Hybrid Rendering for Real-Time Ray Tracing, pages 437–473. Apress, Berkeley, CA, 2019. → pages 3, 13, 15, 22, 23, 26

[8] R. Barringer, M. Andersson, and T. Akenine-Möller. Ray Accelerator: Efficient and Flexible Ray Tracing on a Heterogeneous Architecture. Computer Graphics Forum, 36, 10 2016. doi:10.1111/cgf.13071. → page 17

[9] L. Bavoil and M. Sainz. Multi-Layer Dual-Resolution Screen-Space Ambient Occlusion. In SIGGRAPH 2009: Talks, SIGGRAPH '09, 2009. doi:10.1145/1597990.1598035. → page 20

[10] C. Benthin, I. Wald, and P. Slusallek. A Scalable Approach to Interactive Global Illumination. Computer Graphics Forum, 22(3):621–630, 2003. doi:10.1111/1467-8659.t01-2-00710. → page 16

[11] Blender Online Community. Blender - a 3D modelling and rendering package. Blender Foundation, Blender Institute, Amsterdam. Visited on 2020-06-10. → page 33

[12] S. Boulos, D. Edwards, J. D. Lacewell, J. Kniss, J. Kautz, P. Shirley, and I. Wald. Packet-based Whitted and Distribution Ray Tracing.
In Proc. Graphics Interface, May 2007. → page 15

[13] S. Boulos, I. Wald, and C. Benthin. Adaptive ray packet reordering. 2008 IEEE Symposium on Interactive Ray Tracing, pages 131–138, 2008. → page 54

[14] P. Bourke. Object Files (.obj), 1998. Visited on 2020-06-10. → page 26

[15] P. Bourke. Polygon File Format (.ply), 1998. Visited on 2020-06-10. → page 33

[16] B. Burley, D. Adler, M. J.-Y. Chiang, H. Driskill, R. Habel, P. Kelly, P. Kutz, Y. K. Li, and D. Teece. The Design and Evolution of Disney's Hyperion Renderer. ACM Trans. Graph., 37(3), July 2018. doi:10.1145/3182159. → page 3

[17] E. E. Catmull. A Subdivision Algorithm for Computer Display of Curved Surfaces. PhD thesis, 1974. AAI7504786. → pages 2, 6

[18] A. Chodos. APS NEWS, 2008. Volume 17, No. 9. → page 2

[19] P. Christensen, J. Fong, J. Shade, W. Wooten, B. Schubert, A. Kensler, S. Friedman, C. Kilpatrick, C. Ramshaw, M. Bannister, et al. RenderMan: An Advanced Path-Tracing Architecture for Movie Rendering. ACM Trans. Graph., 37(3), Aug. 2018. doi:10.1145/3182162. → page 3

[20] P. H. Christensen and W. Jarosz. The Path to Path-Traced Movies. Found. Trends. Comput. Graph. Vis., 10(2):103–175, Oct. 2016. doi:10.1561/0600000073. → pages 13, 18

[21] R. L. Cook, L. Carpenter, and E. Catmull. The Reyes Image Rendering Architecture. SIGGRAPH Comput. Graph., 21(4):95–102, Aug. 1987. doi:10.1145/37402.37414. → page 9

[22] F. Demoullin. RTAO Benchmark, 2020. Visited on 2020-06-10. → pages 25, 30, 31

[23] F. Demoullin, A. Gubran, and T. Aamodt. Hash-Based Ray Path Prediction: Skipping BVH Traversal Computation by Exploiting Ray Locality, 2019. arXiv:1910.01304 [cs.GR]. → page 14

[24] T. Duff, J. Burgess, P. Christensen, C. Hery, A. Kensler, M. Liani, and R. Villemin. Building an Orthonormal Basis, Revisited. Journal of Computer Graphics Techniques (JCGT), 6(1):1–8, March 2017. → page 28

[25] P. Dutré, P. Bekaert, and K. Bala.
Advanced Global Illumination, AK Peters. CRC Press, Boca Raton, FL, 10:b10632, 2006. → page 28

[26] C. Ericson. Real-Time Collision Detection. CRC Press, Inc., USA, 2004. ISBN 1558607323. → page 13

[27] D. C. Evans and J. Y. Leclerc. Address mapping and the control of access in an interactive computer. In American Federation of Information Processing Societies: Proceedings of the AFIPS '67 Spring Joint Computer Conference, April 18-20, 1967, Atlantic City, New Jersey, USA, volume 30 of AFIPS Conference Proceedings, pages 23–30, 1967. doi:10.1145/1465482.1465486. → page 2

[28] H. Feichtinger and K. Gröchenig. Theory and Practice of Irregular Sampling. Wavelets: Mathematics and Applications, 01 1994. → page 26

[29] T. Foley and J. Sugerman. KD-Tree Acceleration Structures for a GPU Raytracer. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, HWWS '05, pages 15–22, New York, NY, USA, 2005. Association for Computing Machinery. doi:10.1145/1071866.1071869. → page 16

[30] K. Garanzha and C. Loop. Fast Ray Sorting and Breadth-First Packet Traversal for GPU Ray Tracing. Comput. Graph. Forum, 29:289–298, 05 2010. doi:10.1111/j.1467-8659.2009.01598.x. → pages 12, 16

[31] F. Gieseke, J. Heinermann, C. Oancea, and C. Igel. Buffer K-d Trees: Processing Massive Nearest Neighbor Queries on GPUs. In Proceedings of the 31st International Conference on Machine Learning - Volume 32, ICML'14, pages I-172–I-180, 2014. → page 12

[32] J. Goldsmith and J. Salmon. Automatic Creation of Object Hierarchies for Ray Tracing. IEEE Computer Graphics and Applications, 7(5):14–20, 1987. → page 12

[33] A. A. Gubran and T. M. Aamodt. Emerald: Graphics Modeling for SoC Systems. In Proceedings of the 46th International Symposium on Computer Architecture, ISCA '19, pages 169–182, 2019. doi:10.1145/3307650.3322221. → pages x, 7, 8, 9

[34] J. Günther, S. Popov, H.-P. Seidel, and P. Slusallek. Realtime ray tracing on GPU with BVH-based packet traversal.
IEEE Symposium on Interactive Ray Tracing 2007, RT'07, pages 113–118, 09 2007. doi:10.1109/RT.2007.4342598. → page 15

[35] T. Ize, C. Brownlee, and C. D. Hansen. Real-time ray tracer for visualizing massive models on a cluster. In Proceedings of the 11th Eurographics Conference on Parallel Graphics and Visualization, EGPGV '11, pages 61–69, 2011. → page 16

[36] T. L. Kay and J. T. Kajiya. Ray Tracing Complex Scenes. SIGGRAPH Comput. Graph., 20(4):269–278, Aug. 1986. ISSN 0097-8930. doi:10.1145/15886.15916. → pages 2, 10

[37] Khronos Group. The OpenGL Graphics System: A Specification (Version 4.5), 2017. Visited on 2020-06-10. → pages x, 7, 8

[38] D. Kopta, J. Spjut, E. Brunvand, and A. Davis. Efficient MIMD architectures for high-performance ray tracing. In IEEE International Conference on Computer Design (ICCD), 2010. → page 17

[39] D. Kopta, T. Ize, J. Spjut, E. Brunvand, A. Davis, and A. Kensler. Fast, Effective BVH Updates for Animated Scenes. In Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, I3D '12, pages 197–204, 2012. doi:10.1145/2159616.2159649. → page 18

[40] D. Kopta, K. Shkurko, J. Spjut, E. Brunvand, and A. Davis. Memory Considerations for Low Energy Ray Tracing. Comput. Graph. Forum, 34(1):47–59, Feb. 2015. doi:10.1111/cgf.12458. → page 14

[41] S. Laine and T. Karras. Two Methods for Fast Ray-Cast Ambient Occlusion. In Proceedings of the 21st Eurographics Conference on Rendering, EGSR'10, pages 1325–1333, 2010. doi:10.1111/j.1467-8659.2010.01728.x. → pages 18, 20, 25, 28

[42] M. S. Lam and R. P. Wilson. Limits of Control Flow on Parallelism. In Proceedings of the 19th Annual International Symposium on Computer Architecture, ISCA '92, pages 46–57, 1992. doi:10.1145/139669.139702. → pages 50, 63

[43] W.-J. Lee, Y. Shin, J. Lee, J.-W. Kim, J.-H. Nah, S. Jung, S. Lee, H.-S. Park, and T.-D. Han. SGRT: A Mobile GPU Architecture for Real-Time Ray Tracing.
In Proceedings of the 5th High-Performance Graphics Conference,HPG ’13, page 109–119, 2013. doi:10.1145/2492045.2492057. URL → page 14[44] X. Liang, S. Ma, L.-X. Cen, and Z. Yu. Light Space Cascaded Shadow MapsAlgorithm for Real Time Rendering. Journal of Computer Science andTechnology, 26:176–186, 01 2011. doi:10.1007/s11390-011-9424-7. →page 9[45] A. Lier, M. Stamminger, and K. Selgrad. CPU-style SIMD Ray Traversal onGPUs. In Proceedings of the Conference on High-Performance Graphics,HPG ’18, pages 7:1–7:4, 2018. doi:10.1145/3231578.3231583. URL → page 54[46] Y. Lu¨, L. Huang, L. Shen, and Z. Wang. Unleashing the Power of GPU forPhysically-based Rendering via Dynamic Ray Shuffling. In Proceedings ofthe 50th Annual IEEE/ACM International Symposium on Microarchitecture,MICRO-50 ’17, pages 560–573, 2017. doi:10.1145/3123939.3124532. URL → pages 12, 54[47] L. Lutz. Mobile gaming industry analysis & trends in 2020, 2020. URL on 2020-06-10. → page 2[48] J. Mache, R. Broadhurst, and J. Ely. Ray Tracing on Cluster Computers. InPDPTA, 2000. → page 16[49] M. McGuire. Computer Graphics Archive, July 2017. URL visited on 2020-06-10. → page 34[50] M. McGuire and M. Mara. Efficient GPU Screen-Space Ray Tracing.Journal of Computer Graphics Techniques (JCGT), 3(4):73–85, December2014. URL → page 14[51] M. McGuire, M. Mara, and D. Luebke. Scalable Ambient Obscurance. InProceedings of the Fourth ACM SIGGRAPH / Eurographics Conference onHigh-Performance Graphics, EGGH-HPG’12, page 97–103, 2012. → page20[52] M. J. Misˇic´, M. Durevic´, and M. V. Tomasˇevic´. Evolution and trends inGPU computing. In 2012 Proceedings of the 35th International ConventionMIPRO, pages 289–294, 2012. → page 174[53] J.-H. Nah, J.-S. Park, C. Park, J.-W. Kim, Y.-H. Jung, W.-C. Park, and T.-D.Han. T&I Engine: Traversal and Intersection Engine for HardwareAccelerated Ray Tracing. ACM Trans. Graph., 30(6):1–10, Dec. 2011.doi:10.1145/2070781.2024194. URL → page 17[54] P. A. Navra´til, D. S. Fussell, C. 
Lin, and H. Childs. Dynamic Scheduling forLarge-Scale Distributed-Memory Ray Tracing. In Eurographics Symposiumon Parallel Graphics and Visualization. The Eurographics Association,2012. doi:10.2312/EGPGV/EGPGV12/061-070. → page 16[55] H. Nguyen. Gpu Gems 3. Addison-Wesley Professional, first edition, 2007.ISBN 9780321545428. → page 9[56] M. Nimier-David, D. Vicini, T. Zeltner, and W. Jakob. Mitsuba 2: ARetargetable Forward and Inverse Renderer. ACM Trans. Graph., 38(6),Nov. 2019. doi:10.1145/3355089.3356498. URL → pages 15, 25[57] NVIDIA Inc. NVIDIA Turing GPU Architecture, 2019. URL visitedon 2020-06-10. → pages 2, 14, 39[58] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin. Shape Distributions.ACM Trans. Graph., 21(4):807–832, Oct. 2002.doi:10.1145/571647.571648. URL→ page 26[59] B. O¨ztu¨rk and A. O. Akyu¨z. Semi-Dynamic Light Maps. In ACMSIGGRAPH 2017 Posters, SIGGRAPH ’17, 2017.doi:10.1145/3102163.3102202. URL → page 9[60] S. G. Parker, J. Bigler, A. Dietrich, H. Friedrich, J. Hoberock, D. Luebke,D. McAllister, M. McGuire, K. Morley, A. Robison, and M. Stich. OptiX: AGeneral Purpose Ray Tracing Engine. ACM Trans. Graph., 29(4):66:1–66:13, July 2010. doi:10.1145/1778765.1778803. URL → pages 14, 30, 31, 33, 38[61] S. G. Parker, H. Friedrich, D. Luebke, K. Morley, J. Bigler, J. Hoberock,D. McAllister, A. Robison, A. Dietrich, G. Humphreys, and et al. GPU RayTracing. Commun. ACM, 56(5):93–101, May 2013.75doi:10.1145/2447976.2447997. URL → page 16[62] M. Pharr. On the Importance of Sampling, pages 207–222. Apress, Berkeley,CA, 2019. ISBN 978-1-4842-4427-2. doi:10.1007/978-1-4842-4427-2 15.URL 15. → page 18[63] M. Pharr and G. Humphreys. Physically Based Rendering, Second Edition:From Theory To Implementation. Morgan Kaufmann Publishers Inc., SanFrancisco, CA, USA, 2nd edition, 2010. ISBN 0123750792,9780123750792. → pages 10, 12, 15, 18, 25, 51[64] M. Pharr, C. Kolb, R. Gershbein, and P. Hanrahan. Rendering ComplexScenes with Memory-coherent Ray Tracing. 
In Proceedings of the 24thAnnual Conference on Computer Graphics and Interactive Techniques,SIGGRAPH ’97, pages 101–108, 1997. doi:10.1145/258734.258791. URL → pages 50, 54[65] T. J. Purcell. Ray Tracing on a Stream Processor. PhD thesis, Stanford, CA,USA, 2004. AAI3128683. → page 16[66] T. J. Purcell, I. Buck, W. R. Mark, and P. Hanrahan. Ray Tracing onProgrammable Graphics Hardware. ACM Transactions on Graphics, 21(3):703–712, July 2002. → page 16[67] R. Sajko and Z. Mihajlovic. Fast Image-Based Ambient Occlusion IBAO.IJVR, 10:61–65, 01 2011. doi:10.20870/IJVR.2011.10.4.2830. → page 20[68] A. Sanders. An Introduction to Unreal Engine 4. A. K. Peters, Ltd., USA,2016. ISBN 1498765092. → page 3[69] J. Schmittler, I. Wald, and P. Slusallek. SaarCOR: A Hardware Architecturefor Ray Tracing. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICSConference on Graphics Hardware, HWWS ’02, page 27–36, 2002. → page17[70] A. Shilov. Discrete Desktop GPU Market Trends Q3 2016: GPU ShipmentsHit Two-Year High, 2016. URL visited on 2020-06-10. →page 1[71] K. Shkurko, T. Grant, D. Kopta, I. Mallett, C. Yuksel, and E. Brunvand.Dual Streaming for Hardware-accelerated Ray Tracing. In Proceedings of76High Performance Graphics, HPG ’17, pages 12:1–12:11, 2017.doi:10.1145/3105762.3105771. URL → pages 17, 54[72] K. Shkurko, T. Grant, E. Brunvand, D. Kopta, J. Spjut, E. Vasiou, I. Mallett,and C. Yuksel. SimTRaX: Simulation Infrastructure for ExploringThousands of Cores. In Proceedings of the 2018 on Great Lakes Symposiumon VLSI, GLSVLSI ’18, page 503–506, 2018.doi:10.1145/3194554.3194650. URL → page 18[73] J. Spjut, D. Kopta, E. Brunvand, and A. Davis. A Mobile AcceleratorArchitecture for Ray Tracing. In 3rd Workshop on SoCs, HeterogeneousArchitectures and Workloads (SHAW-3), 2012. → page 17[74] M. Sunet and P.-P. Vazquez. Optimized Screen-Space Ambient Occlusion inMobile Devices. 
In Proceedings of the 21st International Conference onWeb3D Technology, Web3D ’16, page 127–135, 2016.doi:10.1145/2945292.2945300. URL → page 20[75] I. E. Sutherland. Sketchpad: A man-machine graphical communicationsystem. In Proceedings of the May 21-23, 1963, Spring Joint ComputerConference, AFIPS ’63 (Spring), page 329–346, 1963.doi:10.1145/1461551.1461591. URL → page 2[76] Unity Technologies. Unity Game Engine Documentation. URL visitedon 2020-06-10. → pages x, 3, 20, 21[77] K. Vardis, G. Papaioannou, and A. Gaitatzes. Multi-View AmbientOcclusion with Importance Sampling. In Proceedings of the ACMSIGGRAPH Symposium on Interactive 3D Graphics and Games, I3D ’13,page 111–118, 2013. doi:10.1145/2448196.2448214. URL → page 20[78] E. Vasiou, K. Shkurko, E. Brunvand, and C. Yuksel. Mach-RT: A ManyChip Architecture for High-Performance Ray Tracing. In High-PerformanceGraphics (HPG 2019), 2019. → page 1777[79] T. Viitanen, M. Koskela, P. Ja¨a¨skela¨inen, A. Tervo, and J. Takala.PLOCTree: A Fast, High-Quality Hardware BVH Builder. Proc. ACMComput. Graph. Interact. Tech., 1(2), Aug. 2018. doi:10.1145/3233309.URL → page 18[80] I. Wald and V. Havran. On building fast kd-Trees for Ray Tracing, and ondoing that in O(N log N). In 2006 IEEE Symposium on Interactive RayTracing, pages 61–69, 2006. → page 12[81] I. Wald, P. Slusallek, and C. Benthin. Interactive distributed ray tracing ofhighly complex models. In Rendering Techniques 2001, pages 277–288.Springer, 2001. → page 16[82] I. Wald, P. Slusallek, C. Benthin, and M. Wagner. Interactive Rendering withCoherent Ray Tracing. Computer Graphics Forum, 20(3):153–165, 2001.doi:10.1111/1467-8659.00508. URL → pages15, 50, 54[83] I. Wald, C. P. Gribble, S. Boulos, and A. Kensler. SIMD Ray StreamTracing-SIMD ray traversal with generalized ray packets and on-the-flyre-ordering. Informe Te´cnico, SCI Institute, 2007. → pages 50, 54, 55[84] I. Wald, S. Woop, C. Benthin, G. S. Johnson, and M. Ernst. 
Embree: AKernel Framework for Efficient CPU Ray Tracing. ACM Trans. Graph., 33(4), July 2014. doi:10.1145/2601097.2601199. URL → pages 15, 30[85] I. Wald, G. Johnson, J. Amstutz, C. Brownlee, A. Knoll, J. Jeffers,J. Gunther, and P. Navratil. OSPRay - A CPU Ray Tracing Framework forScientific Visualization. IEEE Transactions on Visualization and ComputerGraphics, 23(1):931–940, Jan. 2017. doi:10.1109/TVCG.2016.2599041.URL → page 15[86] M. Wawrzonowski and D. Szajerman. Optimization of screen-spacedirectional occlusion algorithms. Open Physics, 17:519–526, 09 2019.doi:10.1515/phys-2019-0054. → page 20[87] M. H. Weik. Cathode-Ray Tube, pages 180–180. Springer US, Boston, MA,2001. ISBN 978-1-4020-0613-5. doi:10.1007/1-4020-0613-6 2295. URL 2295. → page 278[88] WEPC. 2020 Video Game Industry Statistics, Trends & Data, 2020. URL visited on 2020-06-10.→ page 1[89] T. Whitted. An Improved Illumination Model for Shaded Display. Commun.ACM, 23(6):343–349, June 1980. doi:10.1145/358876.358882. URL → pages 2, 10, 11, 49[90] S. Woop, J. Schmittler, and P. Slusallek. RPU: A programmable RayProcessing Unit for realtime ray tracing. ACM Trans. Graph., 24:434–444,07 2005. doi:10.1145/1186822.1073211. → page 17[91] S. Woop, C. Benthin, and I. Wald. Watertight ray/triangle intersection.Journal of Computer Graphics Techniques (JCGT), 2(1):65–82, June 2013.ISSN 2331-7418. URL → page 12[92] C. Wylie, G. W. Romney, D. C. Evans, and A. Erdahl. Half-tone perspectivedrawings by computer. In American Federation of Information ProcessingSocieties: Proceedings of the AFIPS ’67 Fall Joint Computer Conference,November 14-16, 1967, Anaheim, California, USA, volume 31 of AFIPSConference Proceedings, pages 49–58, 1967.doi:10.1145/1465611.1465619. URL → page 2[93] H. Ylitie, T. Karras, and S. Laine. Efficient Incoherent Ray Traversal onGPUs Through Compressed Wide BVHs. In Proceedings of HighPerformance Graphics, HPG ’17, pages 4:1–4:13, 2017.doi:10.1145/3105762.3105773. 
URL → pages12, 16, 30, 32, 38, 41, 42, 54[94] S. Zhukov, A. Iones, and G. Kronin. An ambient light illumination model.In Rendering Techniques ’98, pages 45–55, Vienna, 1998. → pages 9, 18, 2579Appendix ASupporting Materials80(a) 1 SPT (b) 2 SPT(c) 4 SPT (d) 8 SPT(e) 16 SPT (f) 32 SPTFigure A.1: SPT tradeoff for the dragon scene. SPP is fixed at 32.81(a) 1 SPP (b) 2 SPP(c) 4 SPP (d) 8 SPP(e) 16 SPP (f) 32 SPPFigure A.2: SPP tradeoff for the teapot scene. SPT is fixed at 32.82

