{"http:\/\/dx.doi.org\/10.14288\/1.0421312":{"http:\/\/vivoweb.org\/ontology\/core#departmentOrSchool":[{"value":"Applied Science, Faculty of","type":"literal","lang":"en"},{"value":"Electrical and Computer Engineering, Department of","type":"literal","lang":"en"}],"http:\/\/www.europeana.eu\/schemas\/edm\/dataProvider":[{"value":"DSpace","type":"literal","lang":"en"}],"https:\/\/open.library.ubc.ca\/terms#degreeCampus":[{"value":"UBCV","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/creator":[{"value":"Ng, Christopher Cheuk Wing","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/issued":[{"value":"2022-10-19T16:35:19Z","type":"literal","lang":"en"},{"value":"2022","type":"literal","lang":"en"}],"http:\/\/vivoweb.org\/ontology\/core#relatedDegree":[{"value":"Master of Applied Science - MASc","type":"literal","lang":"en"}],"https:\/\/open.library.ubc.ca\/terms#degreeGrantor":[{"value":"University of British Columbia","type":"literal","lang":"en"}],"http:\/\/purl.org\/dc\/terms\/description":[{"value":"The emergence of deep learning has launched many works in deep learning accelerators. To fully realize the potential of these accelerators, dataflow mapping must be optimized in order to reduce the number of memory accesses. Dataflow mapping is crucial to improving the performance of deep learning workloads, but mapping optimization is a difficult problem due to the enormous, non-convex, and non-differentiable search space. As workloads become larger and larger, the problem becomes harder while the importance of dataflow increases.\r\n\r\nTo tackle the problem, prior work reduces the search space using empirically driven, or arbitrary heuristics. However, these heuristics are either too simple and the optimization process is still too slow, or are too aggressive and remove optimal mappings. Prior work also explored using black-box optimizers, but reformulating the problem into the input of these black-box optimizers is not always feasible and scalable, leading to sub-optimal or even invalid solutions.\r\n\r\nIn this thesis, we tackle the problem by first formally analyzing how the different aspects of mapping (tiling, ordering, unrolling) algebraically affect memory reuse and performance in order to identify sub-optimal spaces. Next, we introduce new state-space representations and traversal methods to enable the pruning of these spaces, which dramatically reduces the search space without rejecting the best solutions. 
The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled "Analytically Driven Software/Hardware Co-Design for Accelerating Tensor Workloads", submitted by Christopher Cheuk Wing Ng in partial fulfillment of the requirements for the degree of Master of Applied Science in Electrical and Computer Engineering.

Examining Committee:
Mieszko Lis, Electrical and Computer Engineering (Supervisor)
Alexandra Fedorova, Electrical and Computer Engineering (Supervisory Committee Member)
Prashant Nair, Electrical and Computer Engineering (Supervisory Committee Member)

Abstract

The emergence of deep learning has launched many works in deep learning accelerators. To fully realize the potential of these accelerators, dataflow mapping must be optimized in order to reduce the number of memory accesses. Dataflow mapping is crucial to improving the performance of deep learning workloads, but mapping optimization is a difficult problem due to the enormous, non-convex, and non-differentiable search space. As workloads grow larger, the problem becomes harder while the importance of dataflow increases.

To tackle the problem, prior work reduces the search space using empirically driven or arbitrary heuristics. However, these heuristics are either too simple, leaving the optimization process too slow, or too aggressive, removing optimal mappings. Prior work has also explored black-box optimizers, but reformulating the problem into the input of these optimizers is not always feasible or scalable, leading to sub-optimal or even invalid solutions.

In this thesis, we tackle the problem by first formally analyzing how the different aspects of mapping (tiling, ordering, unrolling) algebraically affect memory reuse and performance, in order to identify sub-optimal spaces. Next, we introduce new state-space representations and traversal methods to enable the pruning of these spaces, which dramatically reduces the search space without rejecting the best solutions. Finally, we extend these analyses and techniques to tackle problems closely related to mapping optimization, such as memory configuration optimization.
Sunstone, our proof-of-concept implementation, improves the optimization time for some of the complex tensor operations by up to 10× compared to prior work, and can yield mappings with up to 1.5-2.5× lower EDP.

Lay Summary

Deep learning powers many applications around us, from speech translation to driver assistance. In order to run them efficiently without using too much energy, a lot of attention is placed on reordering the instructions in the underlying workload, which can improve energy efficiency by more than two times. However, finding the best instruction ordering is a difficult problem due to the sheer number of orderings available. Many prior works either take a long time to generate good orderings, or generate mediocre orderings in a short amount of time. In this thesis, we aim to solve this issue by developing an optimizer that generates good orderings quickly: we formally analyze how different parts of the ordering affect performance, and exploit these observations to explore only orderings that are potentially optimal.

Preface

This dissertation is based on joint work done by me and PhD student Mohammad-Hossein Olyaiy. We both contributed to the problem formulation, analysis, ideas, and evaluation. Specifically, we both focused on analyzing the tiling properties, while Mohammad focused on the spatial unrolling properties and I focused on the ordering and inter-level pruning analysis. I also focused on adapting the analyses for generic workloads, and on the memory configuration search analyses and case study. Parts of chapters 1-4 and 6 were adapted from a manuscript written by me, Mohammad, Professor Mieszko Lis, and Professor Alexandra Fedorova. Professor Mieszko Lis also served as the advisor of this project, provided invaluable feedback during discussions, and helped edit this work.

Table of Contents

Abstract
Lay Summary
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
  1.1 The Difficulties of Mapping Optimization
  1.2 Tackling the Mapping Optimization Problem
  1.3 Our Contributions
  1.4 Thesis Organization
2 Background
  2.1 Tensor Workloads
  2.2 DNN Accelerator Architecture
  2.3 Dataflow Mapping
    2.3.1 Tiling
    2.3.2 Tile Traversal Order
    2.3.3 Spatial Unrolling
    2.3.4 Dataflow Walk-through
    2.3.5 Dataflow and Performance
  2.4 Summary
3 Related Work
  3.1 Random Search
  3.2 Directed Search
  3.3 Black-box Optimizers
  3.4 Other Miscellaneous Works
  3.5 Summary
4 Analytically Driven Mapping Flow
  4.1 Mapping Flow of Sunstone
  4.2 Parameter Representation
  4.3 Inferring Reuse
  4.4 Loop Order (Inter-tile Reuse)
    4.4.1 Insights
    4.4.2 Differences vs. Prior Work
    4.4.3 Representation
    4.4.4 Pruning
    4.4.5 Differences vs. Prior Work
  4.5 L1 Tile Size Optimization
    4.5.1 Insights
    4.5.2 Representation
    4.5.3 Differences vs. Prior Work
  4.6 Spatial Unrolling
    4.6.1 Insight
    4.6.2 Differences vs. Prior Work
  4.7 Optimizing the L2 Level and Beyond
  4.8 Dynamic Inter-level Pruning
    4.8.1 Differences vs. Prior Work
  4.9 Summary
5 Extending Mapping Flow for Architecture Search
  5.1 Memory Sizing on Mapping and Performance
  5.2 Extending Sunstone for Architecture Search
    5.2.1 Tile Space Traversal
    5.2.2 Inter-level Pruning
    5.2.3 Limitations
    5.2.4 Differences from Prior Work
  5.3 Summary
6 Mapping Evaluation
  6.1 Methodology
    6.1.1 Benchmarks
    6.1.2 Architectures
    6.1.3 Prior Art
    6.1.4 Metrics
  6.2 Prior Search-type Optimizers
    6.2.1 EDP on DNN Layers
    6.2.2 Time to Solution on DNN Layers
  6.3 Black-box Optimizers on DNN Layers
    6.3.1 CoSA
    6.3.2 Mindmappings
  6.4 Non-DNN Workloads
  6.5 Ablation Studies
  6.6 Summary
7 Case Study: Using Mapping to Prune Memory Configuration Space
  7.1 Background
  7.2 Methodology
  7.3 Results and Discussion
    7.3.1 Reduction of Memory Configuration Space
    7.3.2 Reduction of Total Map Space
  7.4 Summary
8 Conclusion and Future Work
Bibliography
A Pseudo-Code for Algorithms

List of Tables

Table 1.1: Mapping-space size and expected runtime comparison between layers from state-of-the-art networks from different years. The expected runtime is calculated by assuming a single mapping evaluation takes 0.3 milliseconds and that 32 threads evaluate mappings in parallel. These numbers are empirically obtained from running prior work [46].
Table 2.1: Comparison between the number of accesses for the tiled and untiled mapping.
Table 4.1: Inferred reuse of each tensor in equation 4.1.
Table 4.2: The ordering space size and the size of the pruned ordering subset for various tensor workloads.
Table 6.1: Evaluated benchmarks.
Table 6.2: Tensors and matrices used for various benchmarks. The 3D tensors are used for MTTKRP and TTMc, whereas the 2D matrices are for SDDMM. The tensors and matrices are from FROSTT [55] and SparseSuite [15], respectively. While these tensors are sparse, we still treat the workload as dense.
Table 6.3: Evaluated accelerator configurations.
Table 6.4: Hyperparameters for fast and slow configurations for Timeloop (TL) and dMazeRunner (dMaze). For TL, TO = timeout and VC = victory condition. For dMaze, util. = minimum utilization threshold.
Table 7.1: Configuration for the L1 buffer size optimization problem.

List of Figures

Figure 1.1: Example of a typical accelerator and a mapped tensor workload (in this case, convolution).
Figure 1.2: Three Inception V3 [57] layers mapped with CoSA [26] and dMazeRunner [14] with fast and slow settings each. There is no "golden" parameter setting: each tool either returns invalid mappings or bad EDP.
Figure 2.1: A generic spatial accelerator architecture with per-PE (L1) memories and a shared on-chip L2 memory.
Figure 2.2: Operation space of a 1D convolution. Ifmap is shifted, replicated, and padded by 0 along the sliding-window dimension R.
Figure 2.3: 1D convolution operation space after tiling. The P dimension is divided into P_L2 tiles of size P_L1. Similarly, the K dimension is tiled into K_L2 tiles of size K_L1.
Figure 2.4: Illustration of the data movement for the mapping example.
Figure 4.1: The stages in the Sunstone search process.
Figure 4.2: Representing the order space of the 1D convolution problem as a trie, and how it can be pruned.
Figure 4.3: L1 tile size search and pruning. The workload is the 1D convolution where P = 14, K = 4, C = 4, R = 3, and the L1 size is 8 entries.
Figure 4.4: How ranking the partial mappings first and using the ideal L2 cost can be used to prune the tiling space. This analytically driven method conservatively prunes sub-optimal tiles.
Figure 5.1: Example illustrating the redundant design points from executing a sweep through the L1 sizes and running the mapper through each memory size.
Figure 5.2: Example illustrating how memory search can be integrated with Sunstone.
Figure 6.2: EDP of mappings from Sunstone vs. prior art.
Figure 6.4: Compilation time of Sunstone vs. prior art.
Figure 6.5: Solution EDP: Sunstone vs. CoSA.
Figure 6.6: Optimization time: Sunstone vs. CoSA.
Figure 6.7: Normalized EDP: various CoSA configurations vs. Sunstone.
Figure 6.8: Sunstone vs. Mindmappings.
Figure 6.9: Normalized EDP on non-DNN workloads (top: edge-class, bottom: datacenter-class).
Figure 6.10: Time-to-solution on non-DNN workloads (top: edge-class, bottom: datacenter-class).
Figure 6.11: Effect of Sunstone techniques on search space size and result EDP. Note the log scale.
Figure 7.1: Energy breakdown of the 4 commonly stacked layers in the ResNet [23] family of networks.
Figure 7.2: The number of memory configurations explored in Sunstone-arch vs. grid search.
Figure 7.3: The number of explored L1 tiles in Sunstone-arch vs. grid search.
Figure 7.4: The number of explored L2 tiles in Sunstone-arch vs. grid search.

Acknowledgments

First and foremost, I would like to thank my family and friends who supported me. It has been a turbulent journey with many ups and downs, but your support and encouragement kept me sane throughout the whole ride.

Next, I would like to thank Mieszko Lis for guiding me through this academic journey, demonstrating to me what academic research is like and what it takes to succeed in this domain. Your advice and lessons were invaluable, and I will take them with me for the rest of my technical career.

Finally, I would like to thank the colleagues and friends I made along the journey. Our technical discussions and your feedback have been critical to my development as a learner, and I will cherish the relationships we made along the way.

Chapter 1: Introduction

Deep learning permeates our world, its applications spanning from everyday conveniences such as speech and visual recognition [16, 20], language translation [6], recommendation models [43], and driving assistance [61], to industrial domains such as medical research [48, 51] and finance [24]. At the heart of these applications are tensor workloads, which typically consist of independent, deeply nested loops (see chapter 2 and figure 1.1).

The ubiquity and computational demands of deep learning have yielded a cornucopia of dedicated tensor-compute accelerators, most comprising an array of processing elements (PEs) arranged in a 2D grid [1, 2, 11, 12, 18, 21, 22, 27, 44, 45, 64, etc.], as shown in figure 1.1.
Depending on their architectural parameters, such as area, memory, and array sizes, these accelerators target either datacenter or mobile use. Regardless, these accelerators all focus on optimizing tensor workloads. The specialized nature and amount of parallelism of these accelerators allow for many different ways to map and schedule the instructions of the inherently parallel workloads onto these machines. Without dependencies, the loop operations of these workloads can be freely reordered while still computing the correct values. How these loops are mapped to the 2D array of PEs, known as the dataflow, has a huge impact on the execution time and energy efficiency [11, 31, 35, 40, 46, 47]. Due to the different type and amount of data reuse available for each mapping (which we will expand on in chapter 2), the energy difference between two functionally equivalent mappings can be up to 10×.

[Figure 1.1: Example of a typical accelerator and a mapped tensor workload (in this case, convolution).]

1.1 The Difficulties of Mapping Optimization

Finding the best mapping is a hard problem: the search space is non-convex, non-differentiable, and astronomically large. Firstly, for a d-dimensional problem, the loop ordering space for a single memory level is O(d!), meaning the full loop ordering space of a 3-level memory hierarchy (like many conventional accelerators) is O((d!)^3). Secondly, the tiling and unrolling space scales poorly with the factorability of the problem dimensions, and the tiling space scales further with the number of memory levels. Combining all three spaces yields on the order of 10^19 solutions even for a single CONV layer [9], so simply enumerating and evaluating every solution is intractable. With deep learning models getting larger and larger, the number of solutions will only continue to increase. Additionally, deep learning training workloads are being deployed onto these accelerators; since training is performed in batches, it involves additional dimensions, further increasing the number of solutions. Table 1.1 compares the mapping-space size and expected runtime of different types of layers across the years. Note that it is already infeasible to enumerate and evaluate every mapping for outdated networks, and the problem scales extremely poorly as networks get bigger and more complex.

    Layer                            Number of Mappings   Seconds     Years
    AlexNet conv2 (2012)             1.9×10^17            1.7×10^12   5.4×10^4
    ResNet conv2_1 (2015)            8.6×10^17            7.7×10^12   2.4×10^5
    ResNet conv2_1 (2015, Batched)   3.6×10^21            3.4×10^14   1.1×10^9

Table 1.1: Mapping-space size and expected runtime comparison between layers from state-of-the-art networks from different years. The expected runtime is calculated by assuming a single mapping evaluation takes 0.3 milliseconds and that 32 threads evaluate mappings in parallel. These numbers are empirically obtained from running prior work [46].
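To get a feel for these magnitudes, the combinatorics can be sketched in a few lines. The sketch below is not the methodology behind Table 1.1 (those numbers were obtained empirically from prior work [46]); it is a back-of-the-envelope count under the assumptions stated above: an O((d!)^3) ordering space, three memory levels, equal-sized tiles, and no unrolling or validity constraints. The CONV dimension values are illustrative.

    import math

    def ordered_factorizations(n, levels):
        """Count ordered factor tuples (f1, ..., f_levels) whose product is n,
        i.e., the ways to split one dimension into equal tiles across levels."""
        if levels == 1:
            return 1
        return sum(ordered_factorizations(n // d, levels - 1)
                   for d in range(1, n + 1) if n % d == 0)

    def mapping_space_size(dims, levels=3):
        tilings = math.prod(ordered_factorizations(v, levels) for v in dims.values())
        orderings = math.factorial(len(dims)) ** levels   # O((d!)^3) per the text
        return tilings * orderings

    # Illustrative CONV-layer dimensions (batch size 1).
    dims = {"K": 256, "C": 96, "P": 27, "Q": 27, "R": 5, "S": 5}
    n = mapping_space_size(dims)
    years = n * 0.3e-3 / 32 / (3600 * 24 * 365)   # 0.3 ms/mapping, 32 threads
    print(f"{n:.1e} mappings, ~{years:.1e} years to enumerate")

Even this simplified count lands far beyond what exhaustive evaluation can cover, which is the point of Table 1.1.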
The use of mapping optimizers is not limited to improving the performance of a given accelerator design. Since architecture performance is closely tied to the dataflow, the mapping problem is itself a sub-problem of the larger problem of architecture search, or hardware resource configuration, which involves deciding how to allocate given hardware resources (e.g., silicon area) to different parts of the accelerator. Evaluating each accelerator design candidate requires many invocations of some dataflow mapper [41, 66, 67] in order to accurately capture the best possible performance. A mapping optimizer that generates poor mappings therefore degrades the quality of the architecture search's solutions, while a slow mapping optimizer becomes a major bottleneck, since the architecture search tool explores many design candidates.

Furthermore, prior work has explored using accelerator performance to prune neural networks during training, with mapping optimizers invoked throughout the training process [65]. Again, a slow mapping optimizer becomes a bottleneck in the already long process of neural network training.

1.2 Tackling the Mapping Optimization Problem

While several prior proposals [9, 14, 25, 26, 28, 41, 46, 66] have attacked this dataflow mapping problem, all suffer from significant shortcomings. Some tools rely on a few simple heuristics [46, 66] to prune the search space (as shown in chapter 2); unfortunately, these do not always lead to optimal solutions, and in any case the limited pruning leaves the tools too slow to be used as components inside architecture searches [31] or energy-efficient pruning [65]. Many other tools, such as some black-box optimizers [25, 26, 28] and some brute-force approaches [14, 41], employ more aggressive heuristics, usually derived from empirical data; this can lead to suboptimal solutions as well, or sometimes no solution at all, depending on what parameters are chosen.

[Figure 1.2: Three Inception V3 [57] layers mapped with CoSA [26] and dMazeRunner [14] with fast and slow settings each. There is no "golden" parameter setting: each tool either returns invalid mappings or bad EDP.]

For these tools, it is not at all clear how to set the various empirically derived parameters for different workloads. Figure 1.2 shows the runtime and mapping EDP obtained with the recent tools CoSA [26] and dMazeRunner [14] on three layers from Inception V3 [57]. We run the tools with the default configuration from each paper and with a "fast" or "slow" variant that improves either runtime or EDP (see section 6.4 for the dMazeRunner details; for CoSA, we sweep the weights of the objective function).

Observe that there is no "golden" setting for either tool. In both cases, the more aggressive settings can return invalid mappings for some layers (tiles that do not fit in on-chip memories, or no dataflow that meets all the utilization constraints), and cannot be safely used in the general case. Both CoSA and dMazeRunner generate suboptimal-EDP or invalid solutions even with the "slow" settings (e.g., 3x3 early and 3x3 middle, respectively). To ensure the best results, the user would need to run different tools with a range of different parameters, compounding the already complex mapping problem.

Finally, such empirical heuristics are typically hardcoded for CONV layers [23], and do not generalize well to different types of workloads, such as the matricized tensor times Khatri-Rao product (MTTKRP) [54], Tensor-Times-Matrix chain (TTMc) [3], Matrix-Multiply Chain (MMC) [39], Sampled Dense-Dense Matrix Product (SDDMM) [8], and Tensor Contraction Layers (TCL) [33]. These kernels are bottlenecks for tensor factorizations and contractions, operations prevalent in the machine learning and optimization domains [17, 33, 36, 52, 56, 59].
Adapting the empirically driven, convolution-specific heuristics to these workloads involves tuning the hyper-parameters in order to find the optimal configurations.

In summary, prior works each suffer from some shortcoming: depending on their heuristics and hyperparameters, they either generate good solutions at the expense of long time-to-solution, generate mediocre solutions quickly, or generate good solutions quickly for only certain workloads before the hyper-parameters require changing.

1.3 Our Contributions

In this thesis, we eschew empirically derived heuristics and instead focus on a systematic, algebraic analysis of reuse patterns. Specifically, we aim to answer the following research questions:

RQ1: How can we formally analyze how each mapping part (tiling, ordering, spatial unrolling) affects reuse and energy for generic tensor workloads?
RQ2: How can we leverage the algebraic properties that govern the amount of reuse in order to identify and prune provably sub-optimal spaces?
RQ3: What representations and space traversal methods are required to efficiently enable this pruning?
RQ4: How can we extend the algebraic analysis, state-space representations, and pruning techniques to aid architecture search (precisely, how can we extend these analyses and techniques to prune the joint mapping-architecture space)?

To accomplish this, we first break down the mapping problem into stages (not unlike prior work), and shed light on how ordering, tiling, and unrolling algebraically affect energy and performance for generic tensor workloads, in order to identify sub-optimal spaces within each mapping subspace. Leveraging these observations, we then introduce new state-space representations that enable us to efficiently identify and prune these sub-optimal spaces. With efficient techniques to prune each stage, we combine them into a mapper that either prunes more of the search space than other tools, or prunes it more intelligently (i.e., keeping better solutions). Furthermore, these techniques apply to a broad range of tensor workloads. Finally, we extend these analytical observations and techniques to tackle parts of the architecture search (specifically, how much area to allocate for PE memories).

1.4 Thesis Organization

The rest of this thesis is organized as follows. Chapter 2 contains background information on tensor workloads, spatial accelerators, and dataflow, and describes how dataflow affects execution performance. Chapter 3 discusses related work and its shortcomings. Chapter 4 covers how performance can be inferred from the workload representation, and how this can be exploited to prune sub-optimal mappings via new state-space representations. Chapter 5 covers how certain techniques can be extended to tackle parts of architecture search. Chapter 6 contains the evaluation and discussion of our prototype implementation against prior work. Chapter 7 contains a case study demonstrating the effectiveness of extending the mapping techniques for architecture search. Chapter 8 concludes this thesis and discusses future work.

Chapter 2: Background

In this chapter, we introduce the target applications (tensor workloads), the target architecture (spatial accelerators), and the fundamental building blocks of dataflow (tiling, ordering, and unrolling). We also illustrate how dataflow affects performance.

2.1 Tensor Workloads

Tensor workloads are the backbone of deep learning systems.
For instance, multi-layer perceptrons (MLPs) can be formulated as a matrix multiply:

    O[i,k] = Σ_j A[i,j] × B[j,k],

while convolutions make up most of the layers in many image- and vision-related networks:

    ofmap[k,p,q,n] = Σ_r Σ_s Σ_c ifmap[p+r, q+s, c, n] × weight[r,s,c,k]

(MLPs can be reduced to convolutions by setting r = s = p = q = 1). To lower the computational demands, many convolution variants are used, such as pointwise layers (where r = s = 1) and depthwise layers (where c = k = 1, but with a new depth dimension).

In addition, tensor workloads are at the heart of other optimization workloads [17, 33, 36, 52, 56, 59], such as MTTKRP [54] and TTMc [3] (kernels for tensor decompositions, commonly used in domains with high-dimensional data such as recommender systems), SDDMM (used in alternating least squares, a common optimization kernel for measuring loss), MMC (used in natural language processing), and TCL (an experimental DNN layer). These workloads are similar to convolutions. For instance, the definition of MTTKRP is

    O[i,j] = Σ_k Σ_l A[i,k,l] × B[k,j] × C[l,j]

The for-loop notation corresponding to convolutions and MTTKRP is shown in algorithms 1 and 2, respectively. These workloads typically contain a set of loops whose loop body is a single instruction. Observe that for both workloads, the instruction in the loop body has the same structure: products of multiple inputs are added to a single tensor. Since the product operands have no dependencies, and since additions are associative, these loops (and, more broadly, the unrolled instructions) can be freely reordered.

Algorithm 1: Convolution implemented as for-loops

    for c ← [0 : C) do
      for k ← [0 : K) do
        for p ← [0 : P) do
          for q ← [0 : Q) do
            for r ← [0 : R) do
              for s ← [0 : S) do
                for n ← [0 : N) do
                  ofmap[k,p,q,n] += ifmap[p+r, q+s, c, n] × weight[r,s,c,k]

Algorithm 2: MTTKRP implemented as for-loops

    for i ← [0 : I) do
      for j ← [0 : J) do
        for k ← [0 : K) do
          for l ← [0 : L) do
            O[i,j] += A[i,k,l] × B[k,j] × C[l,j]

The mathematical definitions of these tensor workloads can be found in section 6.1.
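The reordering freedom claimed above is easy to check empirically. The following small NumPy sketch (our own illustration, not part of the thesis's tooling) runs MTTKRP with all 4! loop orders and confirms they agree up to floating-point rounding:

    import itertools
    import numpy as np

    I, J, K, L = 3, 4, 5, 6
    rng = np.random.default_rng(0)
    A = rng.random((I, K, L))
    B = rng.random((K, J))
    C = rng.random((L, J))

    def mttkrp(loop_order):
        """MTTKRP with the four loops nested in the given order."""
        O = np.zeros((I, J))
        ranges = {"i": range(I), "j": range(J), "k": range(K), "l": range(L)}
        for idx in itertools.product(*(ranges[d] for d in loop_order)):
            v = dict(zip(loop_order, idx))
            i, j, k, l = v["i"], v["j"], v["k"], v["l"]
            O[i, j] += A[i, k, l] * B[k, j] * C[l, j]
        return O

    ref = mttkrp("ijkl")
    for order in itertools.permutations("ijkl"):
        assert np.allclose(mttkrp("".join(order)), ref)  # any order: same result
    print("all 24 orderings agree")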
2.2 DNN Accelerator Architecture

[Figure 2.1: A generic spatial accelerator architecture with per-PE (L1) memories and a shared on-chip L2 memory.]

Figure 2.1 shows the architecture of a typical DNN accelerator [1, 11, 12, 18, 22, 45, 64, etc.], implemented as a 2D array of processing elements (PEs). Many systolic-array-based designs (such as TPUs [27]) can be similarly abstracted. Each PE has a multiply-and-accumulate (MAC) functional unit and local (L1-level) memories; these either consist of separate memories for each datatype (i.e., ifmap, weights, and ofmap), or are unified memories that store all three datatypes. L1 is commonly double-buffered to overlap computation and memory refill. Note that the MAC can be replaced by any functional unit or pipeline that implements the instructions that make up the loop body.

Accelerators also commonly include a larger memory shared among the PEs (L2-level), as well as a large off-chip DRAM. The PEs and L2 are typically interconnected via one simple on-chip interconnect per datatype [11].

This architecture supports three kinds of reuse. First, operands may be reused spatially, that is, broadcast to a subset of (or all) PEs. L1 memories support short-term temporal reuse of operands within each PE, and the global L2 memory supports longer-term temporal reuse for multiple PEs in order to avoid expensive DRAM accesses (similar to a last-level cache).

The desired computation (e.g., DNN inference) is tiled and mapped to these structures to maximize hardware utilization and minimize energy, through a process called dataflow mapping, described below.

2.3 Dataflow Mapping

Recall that dataflow mapping can have a major influence on performance (up to 2× [46]). Dataflow mapping consists of tiling, loop ordering, and spatial unrolling. To explain each with a concrete example, we consider the convolution of a 1D tensor ifmap with K 1D filters of length R, defined by their weights:

    ofmap[k,p] = Σ_r ifmap[p+r] × weight[k,r]

Typically, this is expressed as a nested loop [46]:

    1: for k ← [0, K) do
    2:   for p ← [0, P) do
    3:     ofmap[k,p] ← 0
    4:     for r ← [0, R) do
    5:       ofmap[k,p] += ifmap[p+r] × weight[k,r]

This defines a 3D operation space of K × P × R MAC operations, shown in figure 2.2a, where the operands for each MAC can be obtained by projecting its point in the operation space onto the "walls." The walls thus correspond to the ifmap, ofmap, and weight tensors.

[Figure 2.2: Operation space of a 1D convolution ((a) operation space, (b) operation space tiling). Ifmap is shifted, replicated, and padded by 0 along the sliding-window dimension R.]

2.3.1 Tiling

The per-PE (L1) memories are far too small to contain the entire ifmap, ofmap, and weight tensors. To support temporal reuse, the operation space must be tiled into L1 tiles with memory footprints that fit in the L1 memories.

Figure 2.2b shows this for the running convolution example. The volume of each tile shows the MAC operations performed in this tile, while the surfaces W1, O1, and I1 correspond to the regions of the weight, ofmap, and ifmap tensors accessed. If these are stored in L1 memories, they can be reused: e.g., W1 can be temporally reused across the P extent of the tile.

This corresponds to the following pseudocode, where the K dimension is divided into K_L2 equal tiles of size K_L1, and the P dimension into P_L2 equal tiles of size P_L1:

    1: for k2 ← [0, K_L2) do       # L2 tile
    2:   for p2 ← [0, P_L2) do
    3:     for k1 ← [0, K_L1) do   # L1 tile
    4:       for p1 ← [0, P_L1) do
    5:         k ← k2 × K_L1 + k1
    6:         p ← p2 × P_L1 + p1
    7:         ofmap[k,p] ← 0
    8:         for r ← [0, R) do
    9:           ofmap[k,p] += ifmap[p+r] × weight[k,r]
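A direct Python transcription of this tiled pseudocode (with hypothetical sizes of our choosing) confirms that tiling only reorders the MAC operations without changing the result:

    import numpy as np

    K, P, R = 4, 8, 3
    KL1, PL1 = 2, 2
    KL2, PL2 = K // KL1, P // PL1
    rng = np.random.default_rng(1)
    ifmap = rng.random(P + R - 1)
    weight = rng.random((K, R))

    # Untiled reference.
    ref = np.zeros((K, P))
    for k in range(K):
        for p in range(P):
            for r in range(R):
                ref[k, p] += ifmap[p + r] * weight[k, r]

    # Tiled version: order K_L2 P_L2 K_L1 P_L1 R, as in the pseudocode.
    ofmap = np.zeros((K, P))
    for k2 in range(KL2):          # L2 tile loops
        for p2 in range(PL2):
            for k1 in range(KL1):  # L1 tile loops
                for p1 in range(PL1):
                    k, p = k2 * KL1 + k1, p2 * PL1 + p1
                    for r in range(R):
                        ofmap[k, p] += ifmap[p + r] * weight[k, r]

    assert np.allclose(ofmap, ref)  # tiling reorders, but does not change, the MACs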
2.3.2 Tile Traversal Order

In addition to the intra-tile reuse described above, some tensor regions can be further temporally reused over multiple tiles. In figure 2.2b, for example, region W1 can remain in L1 if tile 3 is processed in the same PE right after tile 4 (as it is in the pseudocode).

We can control what is reused between tiles by changing the tile traversal order.

[Figure 2.3: 1D convolution operation space after tiling. The P dimension is divided into P_L2 tiles of size P_L1. Similarly, the K dimension is tiled into K_L2 tiles of size K_L1.]

We will write orders by listing loop bounds outermost-to-innermost, so the pseudocode above is K_L2 P_L2 K_L1 P_L1 R. If we swap lines 1 and 2 (order P_L2 K_L2 K_L1 P_L1 R), tile 2 will be processed right after tile 4, reusing the I1 region of ifmap in L1.

2.3.3 Spatial Unrolling

Finally, loops can be spatially unrolled so that different tiles are assigned to different PEs. For example, in figure 2.3 the K dimension is unrolled spatially across two PEs, so tiles 1 and 3 will be computed by PEs 1 and 2 in the first step, and then tiles 2 and 4 in the next. This allows inter-tile spatial reuse: each pair of tiles processed concurrently by the PEs accesses the same region of the ifmap, which can be broadcast to both PEs.

Considering all combinations of tiling, traversal order, and unrolling results in an enormous search space for operations such as convolution. For example, the third layer of ResNet [23] yields 2.6×10^10 possible configurations for execution on an Eyeriss-like accelerator [11], even with batch size 1 and only considering equally-sized tiles.

2.3.4 Dataflow Walk-through

Here, we put the previously discussed concepts together in a full example, using the following mapping on a 2-PE, 2-level system:

    1: for k2 ← [0, 2) do            # L2 tile
    2:   #pragma unroll              # spatial
    3:   for psp ← [0, 2) do
    4:     for k1 ← [0, 2) do        # L1 tile
    5:       for p1 ← [0, 2) do
    6:         k ← k2 × 2 + k1
    7:         p ← psp × 2 + p1
    8:         ofmap[k,p] ← 0
    9:         for r ← [0, 3) do
   10:           ofmap[k,p] += ifmap[p+r] × weight[k,r]

Figure 2.4 illustrates the work tiles of each processing pass, and how they affect the number of accesses needed to execute the workload. In the first processing pass (k2 = 0), each PE works on separate ifmap and ofmap tiles, so those are uniquely sent to each PE, and the number of L2 reads and L1 fills is just the size of each work tile scaled by the number of PEs (more intricate accelerator designs can exploit the overlapping region between the PEs' ifmap tiles, but for simplicity we assume in this example that the overlapping regions are redundantly read). However, the weight tiles are identical between the two PEs; the weights are therefore fetched from L2 only once and broadcast to the PEs (spatial reuse). Observe that within each PE, only 14 ifmap, weight, and ofmap elements are filled for 12 MAC operations due to intra-tile reuse (as opposed to 36 fills: one each of ifmap, ofmap, and weight per MAC).

In the second processing pass (k2 = 1), each PE still works on separate ifmap and ofmap tiles. However, the ifmap tile is the same across processing passes, so it is temporally reused (inter-tile reuse), while the ofmap tiles are fetched anew. The weight tiles differ across processing passes, so they are fetched again as well (though spatial reuse is still present).

Now that we have the number of accesses, we can compare it to an untiled version of the same workload, in which every MAC operation always reads its operands from L2 into L1 before consuming them, with no notion of reuse.
Table 2.1 demonstrates the differences.

[Figure 2.4: Illustration of the data movement for the mapping example, tracking per-pass and total ifmap/weight/ofmap L1 fills, L2 reads, ofmap updates, and writebacks for each PE.]

                        Un-tiled   Tiled   Reduction
    L2 Reads
      ifmap             48         8       6×
      weight            48         12      4×
      ofmap             48         16      3×
      Total             144        36      4×
    L1 Fills
      ifmap             48         8       6×
      weight            48         24      2×
      ofmap             48         16      3×
      Total             144        48      3×
    L1→L2 Writeback     48         16      3×

Table 2.1: Comparison between the number of accesses for the tiled and untiled mapping.
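The counts in Table 2.1 follow mechanically from the tile footprints and the reuse decisions described in the walkthrough. The sketch below (our own arithmetic illustration, not Sunstone's cost model) hard-codes those decisions (ifmap reused across passes, weight broadcast across PEs, ofmap neither) and reproduces the table's L2-read and L1-fill rows:

    K, P, R = 4, 4, 3                  # walkthrough dimensions
    KL1, PL1, PEs, passes = 2, 2, 2, 2
    macs = K * P * R                   # 48 MACs

    # Per-pass, per-PE tile footprints.
    ifmap_tile  = PL1 + R - 1          # 4 elements
    weight_tile = KL1 * R              # 6 elements
    ofmap_tile  = KL1 * PL1            # 4 elements

    untiled = dict(ifmap=macs, weight=macs, ofmap=macs)  # one fetch per operand per MAC

    l2_reads = dict(
        ifmap=ifmap_tile * PEs * 1,        # distinct per PE, reused across passes
        weight=weight_tile * 1 * passes,   # broadcast to both PEs, refetched per pass
        ofmap=ofmap_tile * PEs * passes,   # distinct per PE and per pass
    )
    l1_fills = dict(
        ifmap=ifmap_tile * PEs * 1,
        weight=weight_tile * PEs * passes, # a broadcast still fills each PE's L1
        ofmap=ofmap_tile * PEs * passes,
    )
    for t in ("ifmap", "weight", "ofmap"):
        print(f"{t}: L2 reads {l2_reads[t]} ({untiled[t] // l2_reads[t]}x), "
              f"L1 fills {l1_fills[t]} ({untiled[t] // l1_fills[t]}x)")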
2.3.5 Dataflow and Performance

The previous section shows how dataflow affects the number of accesses, and hence the energy consumption; dataflow also affects the latency of the workload. Since the execution of these workloads is deterministic (no control flow) and every memory request is statically known, the main source of stalling within each PE is outstanding memory requests. Even with double-buffering, if the operands are consumed faster than the data can reach the PEs, the PEs will inevitably stall. By increasing reuse and minimizing the number of accesses, the chance of stalling is minimized, ultimately improving the latency of the workload.

2.4 Summary

In this chapter, we introduced the relevant background on dataflow mapping and how it affects performance. Next, we introduce prior and related work.

Chapter 3: Related Work

In this chapter, we discuss how prior work has tackled the mapping problem, and its shortcomings.

3.1 Random Search

Timeloop [46] is an analytical framework to model the performance of a user-defined mapping on a user-defined architecture. While the modeling tool is analytical, the provided mapper is a basic one that randomly samples design points of the mapping space and keeps the best seen so far. With the design space being so explosively large, the search terminates when certain user-defined conditions are met (e.g., after a certain number of invalid mappings has been sampled). Users can also set mapping constraints (e.g., only explore certain spatial unrollings) in order to speed up the search.

While Timeloop can optimize the mapping of arbitrary tensor computations, it suffers from major shortcomings. First, the random search inherently lacks any heuristics or techniques to shrink the map space; instead, it relies on user-set constraints. It also does not use any information from prior samples to inform or guide the search, and good mappings are only found thanks to hard-to-reach termination conditions that extend the search to explore more mappings. We show later how these termination conditions influence the time-to-solution and the performance of the resulting mappings.

3.2 Directed Search

dMazeRunner [14], Marvel [9], Interstellar [66], and ZigZag [41] tackle the mapping problem with their own sets of heuristics to shrink the map space, in order to give the search some direction as to which mappings can be avoided.

dMazeRunner [14] first shrinks the ordering space by only evaluating a subset of the orderings. The subset is manually determined, and is specific to convolutions. It also uses buffer capacity to prune the search space. However, this threshold is determined by the user, which has major implications for the time-to-solution and for whether a mapping is even found. As shown in chapter 1, setting this threshold too low leads to long runtimes, and setting it too high can lead to sub-optimal mappings, or even no mappings at all.

Marvel [9] shrinks the map space by also employing threshold-based heuristics for the spatial unrolling. It further employs a level-by-level approach, first finding the optimal mapping between the DRAM and L2 before finding the optimal mapping for the rest of the architecture. While this level-by-level approach is promising, it makes greedy assumptions and can lead to sub-optimal mappings. In addition, Marvel still lacks techniques to shrink the other spaces, such as ordering and tiling.

Like Marvel, ZigZag [41] employs a greedy level-by-level approach, but it starts from the innermost memory level instead of the outermost. However, like Marvel, it still lacks techniques to shrink the other spaces, such as tiling.

Interstellar [66] shrinks the map space by empirically demonstrating that spatial unrolling does not matter for convolutional workloads, and suggests picking any spatial unrolling scheme from the start. However, we show later that this scheme does not always work for every tensor workload, or even for every convolutional workload. Interstellar, like Marvel and ZigZag, also lacks techniques to shrink the other spaces.

3.3 Black-box Optimizers

MindMappings [25] and CoSA [26] rely on black-box optimizers to solve the mapping optimization problem, as opposed to enumeration plus search.

MindMappings first approximates the cost function with a multi-layer perceptron (MLP), then runs gradient descent on it in order to minimize the cost. However, this technique requires a plethora of training examples for each type of tensor workload, and it is unclear how well it adapts to different architectures.

CoSA reduces the mapping problem to a Mixed-Integer Program (MIP) before using off-the-shelf solvers to find the minimum cost (which is approximated by buffer capacity, memory traffic, and utilization). However, we show in chapter 6 that, because the convolutional window size cannot be expressed in the MIP constraints, some of the optimized mappings it generates are invalid. In addition, as shown in chapter 1, the optimized mappings' performance is sensitive to the hyper-parameters, which will require tuning for future workloads and architectures.

3.4 Other Miscellaneous Works

Some works [4, 19, 37, 58] use polyhedral compilation techniques.
These either target different platforms (e.g., GPUs [19, 58] or CPUs [38]), or are too generalized [4, 37] and suffer from inaccurate cost models, which can lead to suboptimal solutions [5, 38].

Other works use black-box optimization as well, but with different or additional objectives in mind. DiGamma [30] uses genetic algorithms to tackle the mapping problem and the architecture search problem together. TVM [10] uses gradient tree boosting for mapping optimization on CPUs and GPUs. ConfuciuX [29] uses reinforcement learning mainly to tackle architecture search, but extends it to tackle spatial unrolling.

3.5 Summary

In this chapter, we introduced prior work and its shortcomings. Specifically, existing solutions suffer from either long runtimes or mediocre solutions, depending on the hyper-parameters, and the generated mappings can even be invalid.

Chapter 4: Analytically Driven Mapping Flow

In this chapter, we start by analyzing how the different parts of a mapping (e.g., tiling and ordering) affect data movement (RQ1, RQ2), before introducing new representations in order to prune spaces that we can analytically guarantee to be sub-optimal (RQ3).

4.1 Mapping Flow of Sunstone

Our prototype implementation, which we call Sunstone, follows the flow shown in figure 4.1. It takes a level-by-level approach, exploring the mapping space of the innermost memory levels first and building upon the potentially optimal mappings of these innermost levels.

[Figure 4.1: The stages in the Sunstone search process: problem and architecture description, reuse inference, loop ordering search, L1 tiling search and spatial unrolling engines, per-level cost engines and inter-level pruning, repeated for levels L1, ..., LX to produce the dataflow mapping.]

The tool first accepts a description of the target architecture as well as a description of the tensor workload (stage 1, section 4.2), and infers which tensors can be reused across which dimensions (stage 2, section 4.3). Classifying the dimensions helps us better understand the workload properties, and how the rest of the map space can be pruned.

Next, it determines a set of promising loop orderings (stage 3, section 4.4) by pruning the O(d!) ordering space, using a new ordering-space representation to identify and remove provably redundant and sub-optimal orderings. Note that this stage makes no assumptions about the tiling, so the set of promising loop orderings is the same across the different memory tiling levels; this stage therefore only needs to be done once.

Afterwards, it finds a set of spatial unrolling candidates (i.e., dimensions that will be spatially unrolled across the PE array; stage 4, section 4.6), again removing provably sub-optimal spatial unrollings, and viable tiling candidates for L1 (stage 5, section 4.5).¹ Again, using several observations on how these spaces affect data movement, we identify and remove provably redundant and sub-optimal tilings and spatial unrollings. This is crucial to keeping the tool scalable with respect to the number of memory levels, since pruning the map space early means that less of the higher memory levels' space needs to be explored. The spatial unrolling is done early in order to maximize the utilization of the compute resources, which improves latency; for clarity of explanation, however, we describe it after the tiling and ordering stages.

Next, we estimate the energy of each tiling candidate and its respective unrolling candidates, and apply a variant of alpha-beta pruning to remove suboptimal solutions (stage 6, section 4.8). This keeps the tool scalable with respect to the number of memory levels, since the mapping process is repeated for each L1 tile at the next level.

Finally, we repeat the tiling and pruning stages at each enclosing memory level (L2, L3, ...; stage 7, section 4.7) for each solution candidate, using the inter-tile traversal ordering of the prior level as the intra-tile order for the current level (e.g., the intra-tile traversal at L2 equals the inter-tile order at L1).

In the rest of this chapter, we explain each of these steps in detail. As the running example, we will use a slightly more complex version of the 1D convolution example from chapter 2, where the ifmap vector has C channels and the ofmap vector has K channels:

    ofmap[k,p] = Σ_c Σ_r ifmap[c, p+r] × weight[c,k,r].    (4.1)

¹ The spatial unrolling step may occur later in the sequence if the PE has multiple private memory levels.
4.2 Parameter Representation

Many prior works [14, 26, 66] hardcode assumptions about how operands interact, and so only support optimization of a small subset of tensor workloads (mostly CONV and fully connected layers).

Sunstone instead accepts both the architecture description and the problem description as inputs (like Timeloop [46]). For example, the following describes an accelerator with a 4×2 PE grid, private L1s with 16 words for weights and ifmaps and 64 words for ofmaps, a NoC with 4 words/cycle read and write bandwidth, and a 1024-word unified L2 shared among all PEs:

    accelerator = {
        X_PEs: 4, Y_PEs: 2,
        spatial_level: 1,      # memory level where spatial fanout happens
        bandwidth: (4, 4),     # NoC read/write BW
        memory_sizes: [
            (16, 16, 64),      # size of each L1 buffer
            1024,              # size of unified L2
        ]
    }

Meanwhile, the following describes an operation on two input operands and one output, where the first operand is 2D and indexed by C and some combination of the indices P and R (e.g., p + r), the second is 3D and indexed by K, C, and R, and the output is 2D and indexed by K and P (this corresponds to equation 4.1). Here, the indices are bound by 0 ≤ K, C < 4, 0 ≤ P < 7, and 0 ≤ R < 3:

    dimensions = {K: 4, C: 4, P: 7, R: 3}
    tensor_description = {
        operand1 = [C, (P, R)],
        operand2 = [K, C, R],
        output   = [K, P]
    }

4.3 Inferring Reuse

From this problem description, Sunstone infers reuse as follows. First, observe that the location accessed in each tensor can change only when the problem dimensions that index this tensor (the tensor's indexing dimensions) change. In contrast, when any of the tensor's non-indexing dimensions change, the location in the tensor stays the same. For example, in equation 4.1, C, R, and P are indexing dimensions for ifmap, while K is a non-indexing dimension. It follows that a tensor can be fully reused across any non-indexing dimension.

A second kind of reuse arises when a tensor index is an arithmetic combination of problem dimensions, such as the p + r index of ifmap in equation 4.1. This is because p + r can have the same value for multiple values of p and r. Typically this involves less reuse than non-indexing dimensions provide (e.g., in equation 4.1, ifmap[0] from tile 1 is not reused but ifmap[1] and ifmap[2] are), so we refer to this as partial reuse.

In general, inferring this reuse requires algebraic analysis of the indexing expression [5, 62]. Sunstone, however, only considers the common case where the partial reuse is due to commutative arithmetic combinations of multiple problem dimensions. Table 4.1 shows the relationship between problem dimensions and reuse inferred from equation 4.1.

    tensor   indexed by   fully reused by   partially reused by
    ofmap    k, p         c, r              -
    ifmap    c, p, r      k                 p, r
    weight   c, k, r      p                 -

Table 4.1: Inferred reuse of each tensor in equation 4.1.
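This inference is simple enough to sketch directly from the problem description of section 4.2. The code below is our own illustration (the encoding of the description is hypothetical, mirroring the listing above, and the names are not Sunstone's); it reproduces Table 4.1:

    # Dimension names are uppercase; a tuple like ("P", "R") marks a commutative
    # arithmetic combination of dimensions, i.e., a source of partial reuse.
    problem_dims = ("K", "C", "P", "R")
    tensors = {
        "ifmap":  ["C", ("P", "R")],
        "weight": ["K", "C", "R"],
        "ofmap":  ["K", "P"],
    }

    def infer_reuse(tensors, dims):
        table = {}
        for name, index in tensors.items():
            indexing, partial = set(), set()
            for term in index:
                if isinstance(term, tuple):   # e.g., ifmap's p + r
                    indexing |= set(term)
                    partial |= set(term)      # reused only partially
                else:
                    indexing.add(term)
            full = set(dims) - indexing       # non-indexing dims: full reuse
            table[name] = (sorted(indexing), sorted(full), sorted(partial))
        return table

    for name, (idx, full, part) in infer_reuse(tensors, problem_dims).items():
        print(f"{name}: indexed by {idx}, fully reused by {full}, partially by {part}")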
This isbecause p+r can have the same value for multiple values of p and r. Typically thisinvolves less reuse than is due to non-indexing dimensions \u2014 e.g., in equation 4.1,ifmap[0] from tile 1 is not reused but ifmap[1] and ifmap[2] are \u2014 so we refer tothis as partial reuse.In general, inferring this reuse requires algebraic analysis on the indexing ex-pression [5, 62]. Sunstone, however, only considers the common case where thepartial reuse is due to communtative arithmetic combinations of multiple problemdimensions. Table 4.1 shows the relationship between problem dimensions andreuse inferred from equation 4.1.4.4 Loop order (inter-tile reuse)Recall from chapter 2 that the order in which loops are nested determines the inter-tile reuse. In this section, we show how Sunstone examines the potential ordersand rejects those that exhibit strictly worse reuse.4.4.1 InsightsSunstone relies on the following observations, which we explain by discussing thetraversal between L1 tiles (i.e., ordering at the L2 level) in the running example(equation 4.1). Below, we have isolated the L2 level of an example dataflow forthis computation:231: for k2\u2190 [0,KL2) do2: for p2\u2190 [0,PL2) do3: for c2\u2190 [0,CL2) do4: for r2\u2190 [0,RL2) do5: L1 tile computationL2 tileL1 tileObserve that while K is a non-indexing dimension of ifmap, in this loop orderifmap actually cannot be reused across K. This is because the loop that iterates overC (line 3) is inside the K loop (line 1) \u2014 that is, within each iteration of the K loop,there are multiple iterations of the C loop that replace the ifmap tensor in L1 andprevent reuse between K iterations. In the general case, this observation becomes:OBSERVATION 1For a non-indexing dimension of an operand to actually reuse the operand, itmust either be the inner-most loop, or the loops inside it must be limited to theother non-indexing dimensions of that same operand.Next, note that while the innermost loops R (line 4) and C (line 3) lead to thereuse of the ofmap tensor, reordering the loops above them (i.e., lines 1 and 2) doesnot impact the number of accesses to weight and ifmap even though these tensorscould potentially be reused across P and K respectively. This is because C and Rindex weight and ifmap, and they load different tiles of these tensors within eachiteration of P and K, destroying any potential reuse. The general case is:OBSERVATION 2Only a subset of the loops \u2014 precisely, the innermost loops that reuse the sametensor \u2014 determine the reuse, and hence only the ordering of those loops needsto be optimized.Lastly, observe that C (line 3) and R (line 4) fully reuse ofmap, while R alsopartially reuses ifmap. If we make R the innermost loop, we fully reuse ofmapacross R and C and we partially reuse ifmap across R. On the other hand, if Cis inner-most (i.e., lines 4 and 3 are swapped), the partial reuse of ifmap will bedestroyed because C is an indexing dimension of ifmap. In general:OBSERVATION 3Even when considering only the inner loops that reuse the same tensor, certainordering of these inner loops may lead to less reuse than others.24xxxxxxRKP: if xxCKxxPK P: ifxxCR C: ofxxRC R: ofxxxK K: if xxxR R: of,  R: ifxxxC C: ofxxxP P: w, P:ifxxPRxxKR xxRPxxKCxxPC xxCPxxKPInnermostoutermost1243 56Kept Pruned7 8Figure 4.2: Representing the order space of the 1D convolution problem as atrie, and how it can be pruned.4.4.2 Differences vs. 
4.4.2 Differences vs. prior work

Observations similar to 1 and 2 were made by prior work [14, 38, 41] for the specific case of convolution and FC layers; we generalize them to any nested loops within our target workloads. Observation 3 does not appear to have been previously published.

4.4.3 Representation

To take advantage of these observations, Sunstone represents the loop ordering search space by a trie, illustrated in figure 4.2 on the running example from equation 4.1.

Each node represents a partially-determined loop order, and is annotated with the available reuse. At the root, the dimensions of all four nested loops are undetermined (represented by x). The immediate children represent the possible choices for the innermost loop: e.g., xxxC means C is traversed in the innermost loop while the outer loops are undetermined. Their children, in turn, represent the traversal order of the innermost loop and the next-innermost loop: e.g., xxRK traverses K as the innermost loop and R in the next-innermost, and so on.

Each node is annotated with the operand(s) that can be reused: in ❶ ofmap (of) is reused when the innermost loop traverses the input channel (xxxC), in ❷ both ofmap and ifmap are reused when the innermost loop is the filter dimension R, and so on. Note that reuse can remain or disappear at higher levels due to Observation 1: for example, in node ❸, the ofmap reuse across C is available because all innermost loops also reuse ofmap ❷, while the weight reuse across P in ❺ is not available because weight is not reused across R in ❷.

4.4.4 Pruning

Once the trie has been constructed, some nodes can be pruned as strictly worse, relying on the two rules below. (Our implementation in fact never generates these nodes, but we discuss the concepts in terms of pruning for clarity.)

First, any nodes that offer no further reuse compared to their parent node can be pruned, since none of their children will offer any further reuse either (Observation 2). For example, xxCK ❼ is pruned away because C reuses ofmap, but ofmap reuse has already been destroyed by the inner K loop. Based on the same argument, xxKR ❽ and xxPR ❺ will also be pruned.

Second, if two child nodes either (i) lead to reusing the same tensor α from the same dimensions A and B (although with different innermost orderings, such as xxAB and xxBA), or (ii) one node leads to reusing the same tensor α as the other but also additionally reuses another tensor β, then one of the children can be pruned (Observation 3). Figure 4.2 shows that since xxCR ❸ reuses the same operands as xxxC ❶ (i.e., ofmap via R and C), but also leads to additional partial reuse of ifmap via the R dimension, the xxxC node is pruned away ❻.

Sunstone is conservative and only prunes orderings that provably lead to suboptimal reuse, or to reuse that is already being offered by another ordering. For example, in figure 4.2, both xxRK and xxPK are kept: even though both reuse ifmap, the partial reuse is offered by different dimensions, so one could lead to more reuse than the other depending on the P and R dimensions. We will see how these dimensions affect the reuse in section 4.5, but since we are looking for the set of promising orderings for any tiling, we make no assumptions about the dimension values.

    Workload              Dimensions   Ordering Space   Potential Orderings
    Batched Convolution   7            5040 (7!)        8
    MTTKRP                4            24 (4!)          7
    TTMc                  5            120 (5!)         10
    SDDMM                 3            6 (3!)           3
    TCL                   6            720 (6!)         36

    Table 4.2: The ordering space size and the size of the potential-ordering subset for various tensor workloads.
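The sketch below condenses the trie construction and the two pruning rules into a few lines. It tracks full (non-indexing) reuse only and compares siblings only; Sunstone additionally tracks partial reuse and compares equal-reuse permutations (rule (i)), so this is an illustration of the idea rather than the tool's implementation.

    def order_reuse(prefix, indexing):
        """Set of (tensor, dim) pairs actually reused by `prefix` (innermost
        loop first): per Observation 1, reuse of a tensor survives only
        until the first loop that indexes it."""
        reused = set()
        for tensor, idx in indexing.items():
            for dim in prefix:
                if dim in idx:
                    break
                reused.add((tensor, dim))
        return reused

    def promising_prefixes(dims, indexing):
        frontier, done = [()], []
        while frontier:
            prefix = frontier.pop()
            base = order_reuse(prefix, indexing)
            grown = [prefix + (d,) for d in dims if d not in prefix]
            # Rule 1: keep an extension only if the new loop adds reuse.
            grown = [e for e in grown if order_reuse(e, indexing) > base]
            # Rule 2: drop extensions whose reuse a sibling strictly dominates.
            kept = [e for e in grown
                    if not any(order_reuse(e, indexing) < order_reuse(o, indexing)
                               for o in grown)]
            if kept:
                frontier.extend(kept)
            else:
                done.append(prefix)   # outer loop order is now irrelevant (Obs. 2)
        return done

    indexing = {"ifmap": {"C", "P", "R"}, "weight": {"K", "C", "R"},
                "ofmap": {"K", "P"}}
    print(promising_prefixes(["K", "C", "P", "R"], indexing))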
Removing provably-worse orderings significantly prunes the ordering space. For example, for batched convolution, the set of potential orderings is pruned from 7! = 5040 to 10 orderings. Table 4.2 demonstrates how much of the ordering space is pruned for various tensor workloads.

4.4.5 Differences vs. prior work

Some prior works [14, 38, 41] manually hand-tune the ordering space via enumeration for the special case of convolutions. In contrast, we describe a technique to determine these orderings automatically; as such, our method immediately applies to other generic tensor computations without the need for manual per-workload analysis.

4.5 L1 tile size optimization

Next, Sunstone relies on algebraic analysis to select efficient L1 memory tile configurations, which in turn determine both the intra-L1 reuse and what can be reused at higher levels. We again explain this in terms of pruning, although Sunstone never actually generates the suboptimal configurations.

4.5.1 Insights

To determine suboptimal L1 tilings, let us consider the following dataflow for equation 4.1 (lines 1–3: L2 tile; lines 4–8: L1 tile):

    1: for p2 ← [0, P_L2) do
    2:   for k2 ← [0, K_L2) do
    3:     for c2 ← [0, C_L2) do
    4:       for p1 ← [0, P_L1) do
    5:         for k1 ← [0, K_L1) do
    6:           for c1 ← [0, C_L1) do
    7:             for r1 ← [0, R) do
    8:               computation

Here, the L1 tile sizes are P_L1 × K_L1 for ofmap, C_L1 × K_L1 × R for weight, and (P_L1 + R − 1) × C_L1 for ifmap, and the total number of L1 tile iterations is P_L2 × K_L2 × C_L2. Thus, to execute the full workload, the total number of L2 memory accesses would be #passes × tile size, broken down as:

    ifmap:   K_L2 × P_L2 × C_L2 × (P_L1 + R − 1) × C_L1 = K_L2 × C × P_L2 × (P_L1 + R − 1)    (4.2)
    weight:  K_L2 × P_L2 × C_L2 × (C_L1 × K_L1 × R) = C × K × R × P_L2  (problem dimensions only, times P_L2)    (4.3)
    ofmap:   K_L2 × P_L2 × (P_L1 × K_L1) = P × K  (the C_L2 factor drops out because ofmap is reused across it)    (4.4)

Here, ofmap is reused C_L2 times (that is, ofmap remains in L1 between L1 tile iterations), because C is the innermost L2 loop. The total L2 access count is the sum of equations 4.2, 4.3, and 4.4:

    L2 accesses = K_L2 × C × P_L2 × (P_L1 + R − 1) + C × K × R × P_L2 + P × K    (4.5)

For the best L1 reuse, our task is to minimize this, under the constraint that the L1 tiles of all datatypes fit in the L1 memories; e.g., C_L1 × K_L1 × R must not exceed the L1 weight buffer.

What are the degrees of freedom here? Equations 4.2, 4.3, and 4.4 involve either full problem dimensions (e.g., C, K and R in equation 4.3), which we cannot change, or loop bounds (e.g., P_L2 in equation 4.3), which we can select to change L1 tile dimensions. For example, the ofmap access count P × K only includes full problem dimensions (equation 4.4), so we cannot change it by altering L1 tile dimensions. Our options are therefore to decrease K_L2 or P_L2 to minimize ifmap and weight fetches.

Now, consider for the sake of argument two configurations where (a) K_L2 = 2 or (b) K_L2 = 3, and no other loop bounds change. If both (a) and (b) fit in the L1 memories, then (a) offers strictly more reuse (and lower energy), because there are fewer ifmap and weight fetches.
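These access counts are easy to evaluate mechanically. The sketch below implements equations 4.2–4.4 for the loop nest above (C innermost at L2) and reproduces the argument that a larger K_L1 (smaller K_L2) can only reduce traffic; capacity constraints are deliberately ignored here, and the function is illustrative rather than Sunstone's code.

    def l2_accesses(P, K, C, R, P_L1, K_L1, C_L1):
        P_L2, K_L2, C_L2 = P // P_L1, K // K_L1, C // C_L1
        ifmap  = K_L2 * P_L2 * C_L2 * (P_L1 + R - 1) * C_L1   # eq. 4.2
        weight = K_L2 * P_L2 * C_L2 * (C_L1 * K_L1 * R)       # eq. 4.3 = C*K*R*P_L2
        ofmap  = K_L2 * P_L2 * (P_L1 * K_L1)                  # eq. 4.4 = P*K
        return ifmap + weight + ofmap                         # eq. 4.5

    # Doubling K_L1 (halving K_L2) leaves the weight and ofmap traffic
    # unchanged but halves the ifmap traffic, as in the (a)-vs-(b) argument:
    print(l2_accesses(14, 4, 4, 3, P_L1=2, K_L1=2, C_L1=1))   # K_L2 = 2 -> 616
    print(l2_accesses(14, 4, 4, 3, P_L1=2, K_L1=1, C_L1=1))   # K_L2 = 4 -> 840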
We can generalize this observation and use it to prune suboptimal L1 tile sizes:

OBSERVATION 4: For any given tile T, if any of its L1 dimensions can be enlarged while still fitting in the L1, the larger tile will have fewer data accesses; therefore T can be pruned.

4.5.2 Representation

To take advantage of observation 4, we again formulate the problem as a search tree, this time rooted at the smallest L1 tile possible (where every dimension is 1). Figure 4.3 shows this for the running example, where the total problem dimensions are P = 14, K = 4, R = 3, C = 4 and the unified L1 memory has 8 entries.

Each node is an L1 tile size candidate, annotated with its L1 tile dimensions and its L1 memory footprint (we show a unified L1 here for clarity; there would be separate per-datatype footprints if L1 memories were separated by datatype). Each of its children is a candidate that is identical to the parent node except for one dimension, which is enlarged to the next higher factor of the corresponding problem dimension. For example, node ❶ represents the L1 tile with C_L1 = 2, P_L1 = 1, K_L1 = 1, R_L1 = 1, while its child ❷ is the same except P_L1 = 2.

Based on observation 4, nodes that have at least one child that still fits in L1 can be pruned, because the child offers strictly more reuse. For example, ❷ still fits in L1 and has more reuse than its parent ❶, so ❶ can be pruned.

In contrast, node ❷ cannot be enlarged in any dimension without exceeding the L1 capacity. It is therefore a candidate for the optimal L1 tile, and remains unpruned.

[Figure 4.3: L1 tile size search and pruning. The workload is the 1D convolution where P = 14, K = 4, C = 4, R = 3, and the L1 size is 8 entries.]

Note that we can only use this method to draw conclusions between a node and its descendants (such as ❶ and ❷). Our pruning rule cannot draw further conclusions about nodes where different dimensions have been enlarged: for example, ❷ and ❸ cannot be directly compared without knowing the next-level (e.g., L2) tiling, so both nodes are kept.

Applying this technique, Sunstone reduces the L1 tile search space for ResNet-18 [23] convolution layers by up to 80% (vs. all valid L1 tile candidates).

Finally, for each remaining L1 tile, we compute the number of memory accesses under each remaining loop ordering (section 4.4), and pair the L1 tile with the ordering that leads to the fewest memory accesses.
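The search tree and the Observation 4 pruning can be sketched as follows, assuming tiles are (K_L1, P_L1, C_L1, R_L1) tuples and using the unified-L1 footprint of the running example; function names are ours, not Sunstone's.

    def factors(n):
        return [f for f in range(1, n + 1) if n % f == 0]

    def next_factor(total, current):
        bigger = [f for f in factors(total) if f > current]
        return bigger[0] if bigger else None

    def footprint(tile):                      # unified-L1 footprint for eq. 4.1
        K, P, C, R = tile
        return (P + R - 1) * C + C * K * R + P * K   # ifmap + weight + ofmap

    def promising_tiles(dims, capacity):      # dims = (K, P, C, R)
        frontier, kept, seen = [(1, 1, 1, 1)], set(), set()
        while frontier:
            tile = frontier.pop()
            children = []
            for i, total in enumerate(dims):
                nxt = next_factor(total, tile[i])
                if nxt is not None:
                    child = tile[:i] + (nxt,) + tile[i + 1:]
                    if footprint(child) <= capacity:
                        children.append(child)
            if children:                      # an enlarged tile still fits,
                for c in children:            # so the parent is pruned (Obs. 4)
                    if c not in seen:
                        seen.add(c)
                        frontier.append(c)
            else:
                kept.add(tile)                # maximal tile: kept as candidate
        return kept

    # Figure 4.3's setup: K=4, P=14, C=4, R=3 with an 8-entry unified L1.
    print(promising_tiles((4, 14, 4, 3), capacity=8))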
4.5.3 Differences vs. prior work

Some prior works [14, 26] rely on hyperparameters (like the % of L1 memory utilization [14]) to prune tiling candidates. Unfortunately, it is not clear how to set these, and the "best" settings vary per layer (see figure 1.2); as a result, these approaches frequently either miss the optimal tiling or fail to fully prune the search space. In contrast, Sunstone's pruning relies on an analysis of the L2 access equations, and has no hyperparameters that must be set. In addition, Sunstone works out of the box for other tensor workloads (see section 4.2).

4.6 Spatial unrolling

At this point, we have a set of candidate L1 tiles, each with its optimal L2 loop order. For each of these options, we find the best ways to spatially distribute the work among multiple PEs.

For clarity of analysis, we demonstrate this step between the L1 and L2 levels in this thesis, but Sunstone enumerates and prunes the spatial unrolling candidates before tiling the L1. If there are other instances of spatial unrolling, we likewise enumerate and prune the spatial unrolling candidates first, before tiling the corresponding private memory level.

4.6.1 Insight

Spatially unrolling some loops can reduce the number of L2 accesses: if the same data is needed by multiple PEs at the same time, it can be broadcast to all of them (see chapter 2).

We continue the running example, now adding a spatial tile in each dimension, so that now P = P_L2 × P_spatial × P_L1, etc. (lines 1–3: L2 tile; lines 4–6: spatial; lines 7–11: L1 tile):

    1: for k2 ← [0, K_L2) do
    2:   for p2 ← [0, P_L2) do
    3:     for c2 ← [0, C_L2) do
    4:       for kspatial ← [0, K_spatial) do
    5:         for pspatial ← [0, P_spatial) do
    6:           for cspatial ← [0, C_spatial) do
    7:             for k1 ← [0, K_L1) do
    8:               for p1 ← [0, P_L1) do
    9:                 for c1 ← [0, C_L1) do
    10:                  for r ← [0, R) do
    11:                    compute

To account for the spatial unrolling, we expand the equations from section 4.5.1:

    ifmap:   K_L2 × P_L2 × C_L2 × (P_spatial × P_L1 + R − 1) × C_spatial × C_L1 = K_L2 × C × P_L2 × (P_spatial × P_L1 + R − 1)    (4.6)
    weight:  K_L2 × P_L2 × C_L2 × (C_spatial × C_L1 × K_spatial × K_L1 × R) = C × K × R × P_L2    (4.7)
    ofmap:   K_L2 × P_L2 × (P_spatial × P_L1 × K_spatial × K_L1) = P × K  (the C_L2 factor drops out because ofmap is reused across it)    (4.8)

Observe that the L2 access count for each tensor is affected only by the spatially unrolled dimensions that index that tensor. For example, P_spatial does not affect weight accesses, because weights are not indexed by P and can be broadcast to all PEs across which P is unrolled.

Once again, ofmap is temporally reused across L1 tiles, and thus C_L2 does not affect the total number of L2 accesses (i.e., the sum of equations 4.6, 4.7, and 4.8). Therefore, to reduce the total access count, we must reduce some combination of P_L2 and K_L2. This time, however, each candidate tile has P_L1 and K_L1 already determined in section 4.5, so those cannot change. Instead, we can unroll P and K spatially, i.e., maximize P_spatial, K_spatial, or some combination of those. Sunstone does not make any conclusion about the combination of factors that should be unrolled (i.e., it tries all the possible combinations of those factors), but rather infers which dimensions should and should not be unrolled (e.g., C should not be unrolled in this example, since it is non-indexing for the tensor reused temporally across L1 tiles). In general:

OBSERVATION 5: To maximize the spatial reuse when unrolling dimensions, the non-indexing dimensions of the operand(s) that are temporally reused across L1 tiles should not be spatially unrolled, since they do not reduce the total number of accesses any further.

Conversely, since we enumerate the spatial unrollings first, the combinations of dimensions that are spatially unrolled should be the indexing dimensions of the same tensor. This helps operands that are not temporally reused across L1 tiles get spatially reused.
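The expanded equations can be checked with a small extension of the earlier access-count sketch (again illustrative only). Running it shows that unrolling K reduces the total, while unrolling C leaves it unchanged, matching Observation 5.

    # Equations 4.6-4.8: the earlier sketch extended with spatial factors
    # (P = P_L2 * P_sp * P_L1, and so on).

    def l2_accesses_spatial(P, K, C, R, P_L1, K_L1, C_L1, P_sp, K_sp, C_sp):
        P_L2 = P // (P_sp * P_L1)
        K_L2 = K // (K_sp * K_L1)
        C_L2 = C // (C_sp * C_L1)
        ifmap  = K_L2 * P_L2 * C_L2 * (P_sp * P_L1 + R - 1) * C_sp * C_L1  # eq. 4.6
        weight = K_L2 * P_L2 * C_L2 * (C_sp * C_L1 * K_sp * K_L1 * R)      # eq. 4.7
        ofmap  = K_L2 * P_L2 * (P_sp * P_L1 * K_sp * K_L1)                 # eq. 4.8
        return ifmap + weight + ofmap

    # Unrolling K spatially lowers K_L2 and with it the ifmap traffic, while
    # unrolling C leaves the total unchanged, since C is non-indexing for
    # ofmap, the tensor temporally reused across L1 tiles (Observation 5):
    print(l2_accesses_spatial(14, 4, 4, 3, 2, 1, 1, P_sp=1, K_sp=2, C_sp=1))  # 616
    print(l2_accesses_spatial(14, 4, 4, 3, 2, 1, 1, P_sp=1, K_sp=1, C_sp=2))  # 840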
In the context of Sunstone, we first enumerate and prune the possible spatial unrollings by attempting to unroll each set of indexing dimensions of each tensor to reach 100% PE utilization. If that is not possible due to the dimension factors, we resort to unrolling each set of indexing dimensions fully, until expanding any dimension leads to invalid mappings (like the L1 case, but with the number of PEs instead of the buffer capacity), in order to minimize the data movement (according to the equations above). However, since spatial unrolling also determines the latency via MAC utilization, it is possible to sacrifice data movement for higher MAC utilization, and hence better performance. As a result, we also unroll each set of indexing dimensions while supplementing the unrolling with other dimensions in order to reach 100% utilization (but the product of the indexing dimensions in the final unrolling should still be greater than that of the other dimensions, to ensure low data movement).

In general, we want to prioritize both latency and energy improvements, but when that is not possible due to the available factors, we prioritize one or the other and dynamically determine the better trade-off.

Then, for each unrolling, we create the search tree described in section 4.5, since each unrolling leads to a different remaining sub-problem to be tiled.

With our unrolling technique, Sunstone can prune up to 90% of unrolling candidates for ResNet-18 [23] convolution layers and a 14×12 PE array, similar to the one used in [11].

4.6.2 Differences vs. prior work

In contrast to Sunstone, prior works [14, 46, 66] lack an effective unrolling strategy. These either examine all [46] or most [14] of the unrolling possibilities, which makes the tools slow when there are many possibilities, or only a specific subset of possibilities [66], which may lead to suboptimal mappings, as we will later show in sections 6.2.2 and 6.2.1 respectively.

4.7 Optimizing the L2 level and beyond

At this point, we have a set of candidate L1 tiles, each with its optimal L2 loop ordering and potential spatial unrollings.

For each candidate, Sunstone generates the potential L2 tile sizes in the same way as for L1 (section 4.5), exploring and pruning the tile space using the same search tree representation; again, each L2 tile is paired with its optimal L3 loop ordering. (The spatial unrolling step is only repeated if the next memory level (L3) is shared among multiple L2s.)

Sunstone repeats this process until tile candidates have been computed for all on-chip memory levels.

Note that this will likely result in multiple L2 tile options for each L1 tile candidate, and so on; we show how to avoid evaluating all of them below.

4.8 Dynamic inter-level pruning

Rather than examine all possible combinations of tile sizes (L1 tiles, L2 tiles, etc.), Sunstone assigns a cost to the tile sizes and evaluates them in ascending order, while employing a variant of alpha-beta pruning [32] to dynamically reject suboptimal tiles.

To do this, we first assign a cost to each L1 tile, equal to the energy incurred by the transactions between the L1 and L2 (the number of transactions is obtained from the equations in section 4.5.1); we refer to this cost as the L1-cost. Similarly, we refer to the energy from the transactions between the L2 and DRAM (or L3) as the L2-cost, determined by the L2 equivalents of the same equations, and so on.

Next, we also compute an ideal bound for the L2-cost, by assuming each operand is only ever fetched once (cf. section 4.5.1); we call this L2-ideal. This number is unrealistic, because specific loop orders reuse only subsets of operands; however, we can use it to prune the search space as follows.
We start the search by examining the L1 tile with the lowest L1-cost. We generate and examine all of the L2 tile candidates for this L1 tile, select the L2 tile with the minimum L2-cost, and obtain the current-best L1+L2 cost candidate by adding the L1- and L2-costs.

Next, we examine the L1 tile with the next-lowest L1-cost. We compute the ideal L1+L2 cost by adding the actual L1-cost to the L2-ideal lower bound. If this cost exceeds the current-best L1+L2 cost, we can reject the current L1 tile, because no possible L2 ordering can have a lower L2 cost than the ideal. Moreover, we can also reject the remaining L1 tiles, because all of them have a higher L1 cost and therefore a higher L1+L2 lower bound.

If, on the other hand, the ideal cost is less than the current best, we evaluate all L2 tile options for this L1 tile, update the current-best L1+L2 cost as before, and continue the process. Figure 4.4 graphically demonstrates this approach.

[Figure 4.4: How ranking the partial mappings first and using the ideal L2-cost can be used to prune the tiling space. This analytically-driven method conservatively prunes sub-optimal tiles.]

Applying this technique on ResNet-18 [23], Sunstone reduces the number of searched L1 tiles by 50–80%, while reducing the number of evaluated L2 tiles by 60–90%.
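The whole procedure amounts to a small branch-and-bound loop. Below is a compact sketch under assumed interfaces (a list of (cost, tile) pairs and an l2_options callable); it illustrates section 4.8 and is not Sunstone's actual code.

    def search(l1_tiles, l2_options, l2_ideal):
        """l1_tiles: (l1_cost, tile) pairs; l2_options(tile): L2-costs of the
        tile's L2 candidates; l2_ideal: bound assuming each operand is
        fetched exactly once."""
        best, best_cost = None, float("inf")
        for l1_cost, tile in sorted(l1_tiles):       # ascending L1-cost
            if l1_cost + l2_ideal >= best_cost:
                break   # reject this tile and all remaining ones: even an
                        # ideal L2 cannot beat the current best
            for l2_cost in l2_options(tile):
                if l1_cost + l2_cost < best_cost:
                    best, best_cost = (tile, l2_cost), l1_cost + l2_cost
        return best, best_cost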
4.8.1 Differences vs. prior work

Some prior works [9, 41] also follow a level-by-level approach. However, these works greedily keep only the best tiling at each memory level, whereas our more dynamic approach remains conservative and does not discard tiles that could still be globally optimal.

4.9 Summary

In this chapter, we start with a simple convolution example, and systematically demonstrate how reuse and performance can be analytically derived and optimized. We also demonstrate how the pruning of sub-optimal spaces can be realized via new state-space representations and traversal methods.

Chapter 5
Extending Mapping Flow for Architecture Search

The previous chapter introduces a variety of techniques to analyze and prune the mapping space for a fixed architecture. In this chapter, we address RQ4: with mapping and architecture so closely linked, how can we extend these observations and techniques to aid the architecture search problem (more specifically, deciding what the memory sizes should be)?

More precisely, prior work [41, 66] performs memory size optimization by first enumerating the possible memory configurations and then running its mapper on each configuration; but are there sub-optimal or redundant spaces that can be pruned within these repeated calls to the mapper with the same tensor workload? To answer this, we first analyze how memory sizes affect reuse and energy, before demonstrating how extending our techniques can remove sub-optimal spaces.

5.1 Memory Sizing on Mapping and Performance

In this section, we observe how different memory sizes affect the mapping space and performance. We begin by looking at how memory size affects performance for the same mapping. For tensor workloads, energy comprises the number of data accesses and the energy-per-access.

For a fixed architecture, the energy-per-access is fixed, so we find the dataflow with the least number of data accesses. Likewise, for a fixed dataflow mapping, the number of data accesses is fixed, so we find the architecture with the lowest energy-per-access. This usually corresponds to the smallest memory configuration that can still hold every tile at every level.

With this observation, we analyze multiple mapper calls with the same problem but different memory configurations, in order to identify sub-optimal design points. We do this using a trivial example in figure 5.1. The example represents the tiling space the same way as Sunstone. For simplicity, we assume that the only parameter in the memory configuration is the L1 size.

In this example, every node represents a joint L1-tile + L1-size design point. It also assumes that we can choose between a 3-, 4-, and 8-entry L1, and, for simplicity, that Sunstone is not used (every mapping that satisfies the buffer capacity constraints is enumerated and evaluated). The green nodes in this figure represent the visited nodes in the repeated mapper calls.

[Figure 5.1: Example illustrating the redundant design points from sweeping through the L1 sizes and running the mapper for each memory size.]

We start with the leftmost box (3-entry L1), where only the root node is traversed ❶, since the rest of the nodes already exceed the memory capacity. Next, we repeat the traversal, but for the 4-entry L1, where again only the root node is traversed ❷. This means that the set of valid tiles explored by assuming a 3-entry L1 and a 4-entry L1 is the same.

We also know that this tile will perform better on a 3-entry L1, due to its lower energy-per-access. In fact, a 4-entry L1 does not enable any new valid dataflows compared to the 3-entry L1. That is, every tile that is valid for the 4-entry L1 will fit in the 3-entry L1. As a result, any design point that pairs a dataflow with the 4-entry L1 can be pruned ❸.

Similarly, throughout multiple mapper calls, this tile will be evaluated with the 8-entry L1 as well ❹, but again, we know it pairs better with the 3-entry L1; hence, we do not need to evaluate this design point, and only the remaining green nodes need to be evaluated ❺.

This example illustrates our next observation:

OBSERVATION 6: Since the tiling configuration space traversed by different memory configurations is the same, and since identifying the smallest memory size for a given mapping is straightforward, each tile only needs to be traversed once, and evaluated with the smallest memory configuration.

This observation suggests that the memory configuration should be co-optimized with the mapping.
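The "straightforward" step in Observation 6 looks like this in code: a sketch (ours) that picks the smallest available memory size holding a tile, reusing the footprint helper from the section 4.5 sketch.

    def smallest_memory(tile, sizes):            # sizes, e.g., [3, 4, 8]
        fitting = [s for s in sorted(sizes) if s >= footprint(tile)]
        return fitting[0] if fitting else None   # None: the tile fits no memory

    # In figure 5.1, the root tile has footprint 3, so it only ever needs the
    # 3-entry L1; evaluating it with the 4- or 8-entry L1 is redundant.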
5.2 Extending Sunstone for Architecture Search

5.2.1 Tile Space Traversal

To extend Sunstone to exploit observation 6, we slightly modify how we traverse the tiling space. We call this extension Sunstone-arch.

Given a set of memory sizes and a problem to tile, we still start at the smallest node, but at each node we calculate the smallest memory size required to hold the given tile. Next, we adhere to our tile traversal technique from section 4.5, checking whether any of the node's children fit in that same memory. If at least one child fits, the parent node is pruned. If none of the children fit, we keep the node, but we also keep exploring the child nodes, repeating the process of determining the smallest size that holds each tile and checking its children. By doing so, we continue to use our mapping technique while considering each tile only with its optimal memory size. In fact, this method can skip evaluating memory sizes entirely when they enable no promising dataflow beyond those already supported by smaller memory sizes.

[Figure 5.2: Example illustrating how the memory search can be integrated with Sunstone.]

Figure 5.2 illustrates this with a 3-, 4-, and 8-entry L1. The example (and search tree) is almost identical to the one shown in section 4.5, but each node is now annotated (in brackets) with the smallest memory size that can fit the given tile (tiles that do not fit the largest available memory have no corresponding memory size).

We first start at the root ❶. Since the memory size for the root is 3, and every child of the root ❷ ❸ ❹ requires a larger memory size, we keep the root, marking it as a potentially good design point when paired with a 3-entry L1. We then continue to explore the root's children. For nodes ❷, ❸, and ❹, an 8-entry L1 is required, and at least one of their children (❻, ❼) also fits in an 8-entry L1. By our mapping technique, these nodes should be pruned, since we know those child mappings will have fewer accesses. In contrast, node ❺ requires an 8-entry L1, while none of its children fit in any memory, so this node is also kept. Finally, none of the children of nodes ❻ and ❼ fit in any memory; hence these nodes are kept. By integrating the architecture search with the mapping search, we have effectively pruned all design points containing the 4-entry L1.
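Putting the pieces together, the joint traversal can be sketched as below, reusing next_factor, footprint, and smallest_memory from the earlier sketches; this is an illustration of the figure 5.2 walk, not Sunstone-arch's code.

    def promising_design_points(dims, sizes):
        frontier = [(1,) * len(dims)]
        seen, kept = set(frontier), {}
        while frontier:
            tile = frontier.pop()
            mem = smallest_memory(tile, sizes)
            if mem is None:
                continue                        # larger children fit nothing either
            children = []
            for i, total in enumerate(dims):
                nxt = next_factor(total, tile[i])
                if nxt is not None:
                    children.append(tile[:i] + (nxt,) + tile[i + 1:])
            # Section 4.5 rule applied per memory size: prune `tile` if some
            # enlarged child still fits the same smallest memory.
            if not any(smallest_memory(c, sizes) == mem for c in children):
                kept[tile] = mem                # keep tile with its optimal memory
            for c in children:                  # keep exploring either way
                if c not in seen:
                    seen.add(c)
                    frontier.append(c)
        return kept

    # Figure 5.2's setup: 3-, 4-, and 8-entry L1 choices; the 4-entry L1
    # ends up paired with no tile at all.
    print(promising_design_points((4, 14, 4, 3), sizes=[3, 4, 8]))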
5.2.2 Inter-level Pruning

Recall from section 5.1 that the energy-per-access of the memory levels is used to calculate an ideal cost, in order to prune tiles between memory levels. For example, given a set of L1 tiles, we use the energy-per-access of the L2 and L3 to calculate an ideal cost, which can be used to prune the set of L1 tiles while we explore the L2 tiling of each L1 tile.

However, what if we try to search for the L2 memory size? The L2 energy-per-access will depend on the L2 tile, which we have yet to explore, yet whether the L2 tiling exploration is carried out depends on this L2 energy-per-access. In order to minimize the chance of pruning an optimal tile early, we take a conservative approach, and use the L2 energy-per-access of the smallest possible L2. The trade-off is that the ideal cost is now a looser lower bound, meaning fewer tiles are pruned.

Likewise, if the L3 memory size is explored, we use the smallest L3 size's energy-per-access while calculating the ideal cost.

Other than this modification, the inter-level pruning operates the same, even if the mappings are paired with different architectures. For instance, even if the L1 size is searched, once we find all the promising tiles and their appropriate L1 sizes, we still sort them based on cost and prune them while exploring each of their L2 tilings.

Alternatively, for easier extension to optimizing the memory size for a set of workloads instead of a single workload, we can perform the inter-level pruning using a more hierarchical approach. First, we bin the set of promising tiles based on their memory sizes. Next, we assign to each memory size a cost equal to the lowest cost among the tiles within its bin, and we rank the memory sizes based on this cost. We then carry out the inter-level pruning on the tiles in the bin with the lowest cost. Once that pruning is complete, we update the global best cost, which is used to prune the other memory sizes. Finally, we repeat this process for the bin of tiles with the next-lowest cost; but before beginning the inter-level pruning for each subsequent bin of tiles, we check whether the cost of the bin plus the ideal cost exceeds the global best cost. If that is the case, the search is terminated, and the rest of the memory sizes and their tiles can be pruned. This allows dynamic pruning both of tiles within each memory configuration and of whole memory configurations.

5.2.3 Limitations

Single Layer Optimization

If we run Sunstone-arch for each layer, an optimal memory size will be discovered for each layer. However, accelerators are never designed with a single layer in mind. To further extend Sunstone-arch to optimize the memory size for a set of layers, we could explore the mapping and architecture space of multiple problems together, and carry out any cost-related pruning by considering the cost across all workloads instead of just one. This can be done by extending the hierarchical approach from above.

First, we bin the tiles based on memory size (we refer to each bin as M_0, M_1, ..., M_n), and within each bin, we further bin the tiles based on workload (we refer to these as M_i,j for memory size i and workload j). Next, for each bin, we assign a cost, which is the lowest cost among the tiles in that bin. We then assign a cost to each memory size bin, which is the sum of the costs for each workload in the same memory bin:

    Cost(M_i) = Cost(M_i,0) + Cost(M_i,1) + ···

Finally, like the hierarchical approach, we start with the memory size with the lowest cost, and deploy the inter-tile pruning for each workload. Once we find the optimal L2 mapping for each workload for the first memory size, we update the global cost and repeat for the next memory size (and as above, we use this global cost to prune the memory sizes).

While this assumes that the best memory configuration across all workloads will be the best memory size for at least one of the workloads, we believe it is a valid assumption to make. Oftentimes, there will be layers that are repeatedly stacked, or that have substantially more MACs than the other layers, and optimizing the configuration for those layers should yield the most performance benefits.
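The binning and ranking step can be sketched as follows; the record format (mem_size, workload, cost) is an assumption for illustration, not Sunstone-arch's data structures.

    def rank_memory_sizes(tiles):
        """tiles: iterable of (mem_size, workload, cost) records for the
        promising tiles; returns memory sizes ranked by
        Cost(M_i) = sum_j Cost(M_i,j)."""
        best = {}                                   # (mem, workload) -> min cost
        for mem, workload, cost in tiles:
            key = (mem, workload)
            best[key] = min(cost, best.get(key, float("inf")))
        totals = {}                                 # mem -> Cost(M_i)
        for (mem, _), c in best.items():
            totals[mem] = totals.get(mem, 0) + c
        return sorted(totals.items(), key=lambda kv: kv[1])  # cheapest first

    # Bins are then processed in this order: once a full evaluation sets a
    # global best, any later bin whose Cost(M_i) plus the ideal cost exceeds
    # it is pruned, together with all remaining bins.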
Coarse Memory Choices

The memory size optimization extension exploits the fact that the optimal sets of tiles across memory sizes may overlap, and those tiles only need to be evaluated once, with the smallest memory size (since we know evaluating them with any larger memory will be sub-optimal). However, if the list of memory size choices is coarse, and the difference between the memory sizes is massive, there will not be any overlap, and the joint search is no better than grid search.

Fortunately, due to the inter-level pruning, it is still possible for certain design points that would be evaluated in the grid search to be pruned, if the performance gap between the design points is large enough. For example, a sub-optimal mapping for an 8-entry L1 may not be pruned in the grid search, since it is compared only to the optimal 8-entry L1 mapping; but it may be pruned in Sunstone-arch during the inter-level pruning, since the sub-optimal mapping is now pruned based on the best mapping across all L1 sizes. We show later that this is the case in practice.

5.2.4 Differences from Prior Work

Interstellar and ZigZag both extend their mappers to perform architecture search (specifically, memory configuration search), but they do so by enumerating the memory configurations and running their mapper on each configuration, while keeping track of the best memory configuration and mapping. In contrast, we extend the mapper to prune parts of the joint space that we have analytically determined to be sub-optimal.

5.3 Summary

In this chapter, we address RQ4 by extending the analyses and techniques introduced in the previous chapter to tackle the memory configuration search problem. Specifically, we extend the tiling and inter-level pruning techniques. By integrating the memory configuration into the state representations, we can identify sub-optimal design points that can be detected and pruned using our traversal method.

Chapter 6
Mapping Evaluation

In this chapter, we evaluate Sunstone against a variety of prior work to answer RQ1–3. Precisely, we want to see whether Sunstone prunes any optimal points, and how fast Sunstone can be once it prunes the sub-optimal spaces. We also include an ablation study to demonstrate the space pruning due to each of the introduced techniques.

6.1 Methodology

6.1.1 Benchmarks

We use a broad set of workloads, listed in table 6.1. These include DNN kernels like convolution (Conv), point-wise layers (PW), and depth-wise layers (DW) from ResNet18 [23], MobileNet v2 [49], and Inception v3 [57], as well as other tensor workloads described in section 4.2.

6.1.2 Architectures

We evaluate Sunstone on two representative architectures: an edge-class configuration and a data-center-class configuration, detailed in table 6.3. For each, we evaluate two L1 memory options: three tensor-specific L1s (as in Eyeriss [11]), and a single unified L1, used for evaluation in several prior works [9, 14, 66].

    Workload   Algebraic Definition                                  Application                     Application Instance
    Conv       ofmap[p,q,k,n] = ifmap[p+r, q+s, c, n] × w[r,s,c,k]   CNN                             ResNet [23], Inception-v3 [57]
    PW         ofmap[p,q,k,n] = ifmap[p,q,c,n] × w[c,k]              CNN                             ResNet [23], Inception-v3 [57]
    DW         ofmap[p,q,d,n] = ifmap[p+r, q+s, d, n] × w[r,s,d]     Edge-device networks            MobileNet-v2 [49]
    FC         ofmap[k,n] = ifmap[c,n] × w[c,k]                      Neural networks                 ResNet [23], Inception-v3 [57]
    MTTKRP     out[i,j] = A[i,k,l] × B[k,j] × C[l,j]                 CP decomposition                refer to table 6.2
    SDDMM      out[i,j] = A[i,j] × B[i,k] × C[k,j]                   Alternating least squares       refer to table 6.2
    TTMc       out[i,l,m] = A[i,j,k] × B[j,l] × C[k,m]               Tucker decomposition            refer to table 6.2
    MMc        out[i,l] = A[i,j] × B[j,k] × C[k,l]                   NLP (Transformers)              Attention model [60]
    TCL        out[l,m,n] = A[i,j,k] × B[i,l] × C[j,m] × D[k,n]      Tensor contraction layer [33]   AlexNet [34], VGG [53]

    Table 6.1: Evaluated benchmarks
    Tensor/Matrix    Dimensions
    Nell2 [55]       12092 × 9184 × 28818
    netflix [55]     480189 × 17770 × 2048
    poisson-1 [55]   256 × 256 × 256
    bcsstk17 [15]    10974 × 10974
    cant [15]        62451 × 62451

    Table 6.2: Tensors and matrices used for various benchmarks. The 3D tensors are used for MTTKRP and TTMc, whereas the 2D matrices are for SDDMM. The tensors and matrices are from FROSTT [55] and SuiteSparse [15], respectively. While these tensors are sparse, we still treat the workloads as dense.

6.1.3 Prior art

We compare against Timeloop [46] (TL), dMazeRunner [14] (dMaze), Interstellar [66] (INTER), mindmappings [25], and CoSA [26], the current state-of-the-art DNN mappers, using code from their respective repositories linked in their papers. For TL and dMaze, we use fast (-fast) and slow (-slow) configurations (see table 6.4); dMaze-slow is the default configuration from the repository [13], and is perhaps already tuned for DNNs on our evaluated architectures. dMaze-fast is set to close the time-to-solution gap between dMazeRunner and Sunstone via its parameters (at the expense of potentially invalid solutions). For CoSA, we use the default configuration. For INTER, we preset the spatial unrolling to CK as prescribed in the paper [66], but allow unrolling other dimensions whenever CK cannot fully utilize the PE grid. As TL can be extremely slow, we additionally terminate TL after one hour for each layer, and take the best mapping found so far.

We use 8 threads for Sunstone and all prior work (except INTER, which does not support multi-threading), and evaluate on an 8-core CPU.

    Name               Edge-Device                               Data-Center
                       (Eyeriss-Like / Unified L1)               (Eyeriss-Like / Unified L1)
    MAC                Single 16-bit MAC per PE
    PE grid            14×12 / 16×16                             32×32 / 32×32
    L1 (words)         (Weights: 192, Ofmap: 16, Ifmap: 12) / Unified: 256
    L2                 Unified: 108 KB / 128 KB                  Unified: 3.1 MB / 3.1 MB
    BW (words/cycle)   (Read: 9, Write: 9) /                     (Read: 64, Write: 64) /
                       (Read: 9, Write: 9)                       (Read: 64, Write: 64)
    NoC                Interleaved multi-cast;
                       inter-PE ofmap communication

    Table 6.3: Evaluated accelerator configurations

    Prior Work                   Fast/Aggressive   Slow/Conservative
    Timeloop [46] (TL)
      TO                         20000             80000
      VC                         25                1500
    dMazeRunner [14] (dMaze)
      L1 util.                   80%               80%
      L2 util.                   50%               50%
      PE util.                   100%              80%
      spatial reduction          not allowed       allowed

    Table 6.4: Hyperparameters for the fast and slow configurations of Timeloop (TL) and dMazeRunner (dMaze). For TL, TO = timeout and VC = victory condition. For dMaze, util. = minimum utilization threshold.

6.1.4 Metrics

For fair comparison across tools, we evaluate each proposed mapping using the hardware-validated cost model of Timeloop [46], with access energies generated by Accelergy [63] (which itself relies on Cacti [42] for SRAM and Aladdin [50] for other components). We model an interconnect similar to that of Eyeriss [11] using Accelergy, and include this in the total energy cost. Similar to prior work [14], we use energy-delay product (EDP) as the key figure of merit, in order to capture how dataflow mapping affects both the latency and the energy of the workload execution.

[Figure 6.1: (a) Inference on MobileNet v2 layers; edge-class, unified L1; invalid = no mapping can meet the minimum utilization constraints. (b) Inference of ResNet18 layers; edge-class, per-datatype L1s (not supported by INTER and dMaze).]

6.2 Prior search-type optimizers

6.2.1 EDP on DNN layers

Figures 6.1 and 6.2 show the EDP for the evaluated systems for all four architecture configurations.
Neither INTER nor dMaze supports per-datatype L1 buffers, so we leave them out of those configurations.

Overall, Sunstone consistently matches or outperforms prior art, often producing mappings with better EDP by a large margin.

In the fast configuration, which has a runtime closer to Sunstone's when it works (see section 6.2.2), dMaze returns invalid mappings on a number of layers (figures 6.1a and 6.2a). That is because its aggressive fast-config termination conditions do not appear to generalize well to different workloads (e.g., DW and smaller convolution layers do not utilize the whole PE array). Overall, its results are highly inconsistent: its solution on b7-PW1 in figure 6.1a is close to optimal, but it cannot find a mapping for any of the DW layers.

[Figure 6.2: EDP of mappings from Sunstone vs. prior art. (a) Weight update (batch 16) of Inception v3 layers; datacenter-class, unified L1; invalid = no mapping meets the minimum utilization constraints, no mapping can use the preset unrolling, or the returned mapping does not correspond to the original computation. (b) Weight update (batch 16) of Inception v3 layers; datacenter-class, per-datatype L1s (not supported by INTER and dMaze).]

dMaze-slow produces valid results for most layers, but is, on b7-PW1 for example, over an order of magnitude slower than both the fast configuration and Sunstone (section 6.2.2). Even so, Sunstone still produces mappings that match or outperform dMaze-slow's.

Finally, unlike Sunstone, dMaze appears to assume that convolutions are always symmetric, so we were not able to use it to obtain valid mappings for the asymmetric convolution layers 1×7 deep and 3×1 deep (figure 6.2a).

TL is able to handle both unified and dedicated L1 buffers, and always returns valid mappings. However, being the earliest and a relatively simple mapper, it either takes a long time or produces poor-EDP solutions. In the fast configuration, the EDP of the returned mappings is generally poor, because TL prematurely terminates without exploring much of the optimization space: its solutions can have EDP up to 17× worse than Sunstone's (e.g., 3×3 early in figure 6.2b).

The TL-slow setup explores more space and finds better solutions, but can still end up with mappings worse than Sunstone's. For example, Sunstone returns mappings with 1.9× lower EDP on 3×3 early of Inception v3 (figure 6.2b). This is largely due to TL's undirected random-search approach: for example, its solution to 3×3 early utilizes only 8% of the total L1 buffer capacity, with tiles that could have been enlarged via rules like observation 4 from section 4.5.1. In contrast, the mapping from Sunstone utilizes 87% of the total L1 buffer capacity.

Finally, mappings from INTER have poor EDP on several layers, especially in figure 6.2a. This can be attributed to its spatial heuristic. For example, on 1×1 common, INTER reuses ofmap both temporally and spatially, due to the preset CK dataflow and the factorability of the problem dimensions. This goes against our observation 5 (section 4.6), and yields poor EDP. Sunstone instead temporally reuses the ofmap tensor while spatially reusing the ifmap and weight tensors, achieving 1.49× better EDP. While INTER is able to find better solutions for the edge-class accelerator (figure 6.1a) than for the datacenter-class accelerator (figure 6.2a), its solutions still fall short of Sunstone's in terms of EDP in some cases, for example by 1.17× for b17 dw.

6.2.2 Time to solution on DNN layers

Since mapping is crucial to evaluating architecture performance [46], and is usually a sub-problem of a much larger problem (e.g., architecture search or neural architecture search), we also use time-to-solution as a figure of merit. The time taken to return a mapping is shown in figures 6.3 and 6.4. Overall, Sunstone outperforms most other tools, often by a large margin.

Both TL-fast and TL-slow are significantly slower than Sunstone, due to TL not using any heuristics: it relies instead on random search and stringent termination conditions to continue long enough to find acceptable mappings.

As noted earlier, dMaze-fast returns invalid mappings for a large number of the benchmark layers. In some instances, such as the later layers in figure 6.4a, even the fast configuration is far slower than Sunstone, perhaps on account of the lack of effective spatial-unrolling techniques (as these layers have more spatial unrolling options). dMaze-slow is usually significantly slower, by as much as 90× relative to Sunstone.

[Figures 6.3 and 6.4: Compilation time of Sunstone vs. prior art. (a) Inference of MobileNet v2 layers, edge class, unified L1; (b) Inference of ResNet18 layers, datacenter class, split L1; (a) Weight update (batch 16), Inception v3, datacenter configuration, unified L1; (b) Weight update (batch 16), Inception v3, datacenter configuration, split L1.]

INTER is also generally slower than Sunstone (e.g., 133× slower when optimizing 3×3 deep), even though it only searches a preset CK unrolling. This is attributable to the lack of pruning strategies for loop ordering and tiling. When INTER finishes close to Sunstone (e.g., the depthwise layers of MobileNet v2, figure 6.3a), the mappings returned have worse EDP: e.g., on b17-dw, INTER yields a mapping with 1.17× worse EDP than Sunstone.

6.3 Black-box optimizers on DNN layers

In this section, we compare Sunstone against black-box optimizers, such as CoSA [26] and mindmappings [25].

6.3.1 CoSA

CoSA represents the problem as a mixed-integer programming (MIP) problem and uses off-the-shelf solvers like Gurobi [7] to find the solution. Since the utilization percentage of each tensor needs to be specified for each memory level, we first set it to an equal split among all tensors, before exploring different splits to see how they affect the quality of the solution.

A key shortcoming of CoSA is that the sliding-window reuse of ifmap is difficult to formulate as MIP constraints. As a result, CoSA often returns invalid solutions with ifmap tiles that are too large for the memory hierarchy. For instance, using Timeloop's configuration file of Eyeriss [11] with CoSA does not return valid mappings. As such, we cannot compare CoSA's performance on Eyeriss-like accelerators. In addition, CoSA only returns valid solutions for four layers of Inception v3 (out of the eight we benchmarked), due to the sliding-window reuse issue, which is particularly prominent in training layers.

[Figure 6.5: Solution EDP: Sunstone vs. CoSA. (a) Inference of MobileNet v2; edge-class configuration. (b) Weight update (batch 16) of Inception v3; datacenter-class.]

EDP on DNN layers

Figure 6.5 shows the EDP of the mappings found by Sunstone and CoSA on the edge-class and datacenter-class configurations.
In general, Sunstone produces mappings with better EDP than those found by CoSA. The EDP is close for the early layers of batched training and the MobileNet v2 layers, but the gap widens for the deep batched-training layers. This is difficult to root-cause because CoSA uses a black-box optimizer; however, we believe that the gap is due to CoSA using memory utilization and NoC traffic proxies as weighted objectives, instead of directly using data movement.

Time to solution on DNN layers

Figure 6.6 shows the time-to-solution for Sunstone and CoSA. Most of the time, Sunstone completes faster than CoSA. Again, because CoSA relies on a black-box optimizer, it is difficult to determine the runtime bottleneck and how it relates to the workload and accelerator. The runtime of Sunstone, on the other hand, tends to be directly proportional to how factorable the workload dimensions are.

[Figure 6.6: Optimization time: Sunstone vs. CoSA. (a) Inference of MobileNet v2; edge-class configuration. (b) Weight update (batch 16) of Inception v3; datacenter-class.]

Parameter Sweep: Buffer Utilization Per Tensor

To see how much of the EDP gap is due to CoSA's buffer-utilization hyper-parameters, we rerun CoSA, but with one of the tensors at 50% of the L1 instead of an equal split. Figure 6.7 shows the results. By changing the hyper-parameters, a better EDP can be obtained that matches Sunstone's performance (e.g., b7-pw2, 5×5 early, 3×3 middle). In fact, certain hyper-parameters are able to generate valid solutions for some layers where the equal-split configuration cannot do so (e.g., 5×5 early, 1×7 deep). However, for other layers, running the optimizer under different hyper-parameters still yields suboptimal solutions. The optimal hyper-parameters also vary across workloads (e.g., 50% Ofmap performs well for 5×5 early, but struggles for 3×3 middle, which prefers 50% Weight), and unless the right setting is known a priori, finding better solutions requires hyper-parameter sweeps for each new workload. In fact, L1 utilization per tensor is only one hyper-parameter "knob" to tune; L2 utilization per tensor can also be adjusted to find potentially better solutions, at the expense of time-to-solution. In contrast, Sunstone does not require any hyper-parameter tuning.

[Figure 6.7: Normalized EDP: various CoSA configurations vs. Sunstone. (a) Inference of MobileNet v2; edge-class configuration. (b) Weight update (batch 16) of Inception v3; datacenter-class.]

6.3.2 Mindmappings

Mindmappings approximates EDP with an MLP, and uses gradient descent to optimize the EDP. In addition to loop ordering, tiling, and spatial unrolling, Mindmappings also determines the number of banks each tensor occupies within the memory levels. To compare against Mindmappings, we use the same architecture as in their code repository.¹ In addition, we modify the size calculation of Sunstone to account for the banking structure of the memory levels (i.e., each bank should only hold one tensor in order to avoid bank conflicts, in addition to the capacity constraints). We also do not compare time-to-solution, since Mindmappings' runtime depends on whether a GPU is used to perform the back-propagation (though the time it takes to invoke the GPU kernel calls, as well as the kernel runtime, consistently exceeds the runtime of Sunstone).

[Figure 6.8: Sunstone vs. mindmappings.]

Figure 6.8 compares Sunstone's results against Mindmappings, where Mindmappings consistently generates poor solutions (likely due to the implementation bug noted below¹). Like CoSA, due to the black-box nature of the MLP, it is hard to diagnose why the generated solutions are poor, and whether certain bottlenecks are not handled properly by the solution. In fact, due to the feature engineering, it is difficult and non-intuitive to interpret why the trained weights converge to their values, and what they mean in the context of mapping search.

¹It is possible that there is a bug in the implementation, where the mapping is never updated by the gradient descent and the final solution is essentially just the randomly generated mapping used as a starting point. Applying the "logical" fix resulted in even worse solutions, so the original "buggy" implementation is the one being evaluated.

[Figure 6.9: Normalized EDP on non-DNN workloads (top: edge-class, bottom: datacenter-class).]

6.4 Non-DNN workloads

In this section, we demonstrate Sunstone on the non-DNN tensor workloads from table 6.1. We evaluate MTTKRP, TTMc, and SDDMM with ranks 32, 8, and 512, respectively, on the two architectures with a unified L1, and assume that each PE has a datapath that can fully consume every operand and produce one partial output each cycle when operating at line rate. As TL is the only prior work that natively supports these tensor workloads, we compare only with its slow configuration.

Figures 6.9 and 6.10 show that Sunstone outperforms TL in both solution EDP and time-to-solution. Again, this can be attributed to TL's simple search strategy; as a result, it must search more configurations to discover better mappings.

[Figure 6.10: Time-to-solution on non-DNN workloads (top: edge-class, bottom: datacenter-class).]

The EDP advantage of Sunstone is up to 2× on the edge-class configuration, while on the datacenter-class configuration the time-to-solution advantage over TL-slow is much larger. This can be attributed to the less constrictive nature of the datacenter-class accelerator: with more resources, random mappings explored by TL are more likely to be valid, and the termination condition based on invalid mappings is not met as quickly.

6.5 Ablation Studies

To determine the effectiveness of the techniques Sunstone uses to prune the optimization space, we performed an ablation study. As benchmarks, we used representative convolution layers from the beginning (Conv2), the middle (Conv10), and the end (Conv15) of ResNet18, batched regular (3×3 deep) and asymmetric (1×7 deep) conv layers from Inception v3, and pointwise layers from MobileNet v2 (b2 PW1 and b6 PW1). We targeted the edge-class architecture with per-datatype L1s (table 6.3).

[Figure 6.11: Effect of Sunstone techniques on search space size and result EDP. Note the log scale.]
Figure 6.11 shows the results on a log scale.

We start by showing the size of the optimization space for an exhaustive search. Next, we introduce our methods one by one into the optimization process. First, we prune the ordering space as discussed in section 4.4 (loop order). Second, we reduce the number of candidates by alpha-beta-style pruning as discussed in section 4.8 (alpha-beta-like). Third, we prune the spatial unrolling space based on the observation in section 4.6 (spatial). Finally, we add the tile optimization from section 4.5 (L1).

For the right-most bars, we also show the returned solution EDP; for the full search space and loop-order-only, the search space remains too large to search in reasonable time. When all of our techniques are in place, the space is pruned by a factor of 10^6 to 10^8, with negligible EDP change (geomean 1.0).

The ablation study shows that the techniques we outline in this thesis are able to dramatically reduce the mapping search space without losing track of the optimal solutions.

6.6 Summary

In this chapter, we demonstrate that our techniques often match or outperform prior work under different hyper-parameters, while pruning much more of the map space, as evidenced by the faster time-to-solution. We also demonstrate, through the ablation study, how much of the map space is pruned by each of the different techniques.

Chapter 7
Case Study: Using Mapping to Prune Memory Configuration Space

In this chapter, we address RQ4 by evaluating the techniques introduced in chapter 5 to see how they perform in practice. Specifically, we want to measure how much of the memory configuration space, as well as the joint memory-configuration-and-mapping space, can be pruned just from leveraging the simple observations that we discussed in chapter 5.

First, we describe the problem space of this case study (specifically, which memory configuration we are optimizing) and its methodology. Then, we demonstrate and discuss the results.

7.1 Background

To find a memory configuration to optimize, we first investigate the remaining bottlenecks that mapping cannot solve, by examining the energy breakdown of the solutions generated by Sunstone, shown in figure 7.1. Note that the major energy bottlenecks are the MAC energy (which depends on the MAC unit circuitry and the bit-width of the data), and the accesses between the L1 and the MAC unit, which depend on the problem dimensions and not the mapping itself. In order to lower this energy, the L1 energy-per-access must be lowered by reducing the size of the L1, at the expense of reuse. Prior work [66] observed something similar, and proposes to shrink the innermost memory to 8 bytes; here we take the orthogonal approach of splitting the unified L1 into tensor-specific buffers instead, like prior accelerator designs [11, 22].

[Figure 7.1: Energy breakdown of the 4 commonly stacked layers in the ResNet [23] family of networks.]

Converting the unified-L1 abstraction to a tensor-specific buffer design introduces another wrinkle in the memory optimization problem, since each buffer now has a list of sizes to choose from. Specifically, given an area budget for each PE and a set of memory sizes, how should we split the memory across the operands?

7.2 Methodology

We use Sunstone-arch to optimize both the mapping and the buffer splits of the accelerator for the 4 commonly stacked convolution layers of the ResNet family [23]. Table 7.1 shows the architecture configuration. The total area of the tensor buffers is constrained to be less than the area of a single 128-word buffer. We compare against a grid-search baseline (similar to prior work [41, 66]), where we first enumerate every memory split configuration (which we refer to as an L1-configuration) that fits within the area budget, before running Sunstone on each and selecting the optimal memory configuration and mapping.

    Configuration
    PE grid      16×16
    L2           128 KB
    L1 Ifmap     choice of 8, 16, 32, 64, 128 words
    L1 Weight    choice of 8, 16, 32, 64, 128 words
    L1 Ofmap     choice of 8, 16, 32, 64, 128 words

    Table 7.1: Configuration for the L1 buffer size optimization problem

7.3 Results and Discussion

Sunstone-arch finds the same mapping and L1-configuration as the baseline for three out of the four layers (conv2–conv4), while for conv1, Sunstone-arch finds a design point that is within 0.1% of the solution found by the baseline. However, Sunstone-arch's time-to-solution is far lower than the baseline's, which can be attributed to the joint search-space pruning enabled by our techniques. Next, we identify precisely how much pruning can be done in the different spaces (joint space vs. memory configuration space).

7.3.1 Reduction of Memory Configuration Space

First, we examine how many of the memory configurations are pruned. Figure 7.2 shows the amount of pruning in the L1-configuration space as a result of the joint search. Recall that this happens when every tile in a given L1-configuration's set of promising tiles belongs to the set of promising tiles of some other L1-configuration. This tends to occur for the last two layers (up to 30% pruned for conv4), but few to no configurations are pruned for the earlier layers. This may be attributed to the factorability of the earlier layers: with more factors available to choose from, and smaller gaps between the factors, any change to the L1-configuration admits more unique mappings, and hence there is a higher probability that the promising tiles cannot fit in the smaller L1-configurations. This also suggests that for certain workloads whose dimensions are not that factorable (like the later conv layers), the memory configuration granularity can be set even coarser, since many configurations that are close together do not actually enable more of the mapping space.

[Figure 7.2: The number of memory configurations explored in Sunstone-arch vs. grid search.]

7.3.2 Reduction of Total Map Space

Here, we demonstrate how much of the joint space is pruned. We show the amounts of both pruned L1 and pruned L2 tilings, to distinguish the effects of the different techniques.

L1 Tile Pruning

Figure 7.3 illustrates how many promising L1 tiles are evaluated under Sunstone-arch versus the baseline. Sunstone-arch evaluates up to 7× fewer L1 tiles compared to the baseline. Interestingly, the benchmark with the most pruning is conv1, which had no memory configuration pruning (see figure 7.2).
This implies that, even if every memory configuration enables new mappings, there is still a large amount of overlap between the valid map spaces of these memory configurations.

[Figure 7.3: The number of explored L1 tiles in Sunstone-arch vs. grid search.]

[Figure 7.4: The number of explored L2 tiles in Sunstone-arch vs. grid search.]

Another explanation for this observation is that, since conv1 is the most factorable problem, its baseline L1 tiling space is also much larger than those of the other benchmarks, so the decrease in map space is amplified in this particular benchmark.

From this, we see that Sunstone-arch can efficiently identify and prune the overlapping map spaces between the memory configurations, which is crucial for scalability: in the level-by-level mapping scheme, more pruned L1 tiles mean fewer L2 tiles to explore. This suggests that finer-grained memory configuration searches could be enabled by this, without trading off too much runtime.

L2 Tile Pruning

Figure 7.4 illustrates the number of L2 tiles that are evaluated. To separate the pruning due to the reduction of explored L1 tiles from the pruning due to the hierarchical inter-level pruning (where whole configurations can be pruned), the configuration pruning is disabled for the middle bar.

Echoing the takeaways from the L1 tile pruning results, the reduction of the L1 tiling space has a major impact on the reduction of the L2 tiling space, since fewer L1 tiles are expanded upon, as shown by the gap between the leftmost and middle bars.

However, once the configuration pruning is enabled as well, even more L2 tiles can be pruned (as few as 3 memory configurations are evaluated before pruning for conv3). This can be attributed to the fact that the bottleneck L1-to-MAC energy is accounted for, so certain large L1-configurations, where this energy becomes a major bottleneck without offering enough reuse between the L1 and L2, can be conservatively pruned early without having to explore their L2 tiles.

This also suggests that more architectural decisions within a given memory level could potentially be integrated into the mapping search, allowing the inter-level pruning to conservatively remove sub-optimal configurations early (provided that the cost implications exist, and that a lower bound can be derived and incorporated into the overall cost model).

7.4 Summary

In this case study, we address RQ4 by evaluating our techniques from chapter 5, showing that fast time-to-solution via co-optimization is possible without sacrificing performance. We also demonstrate that there are opportunities to prune sub-optimal parts of the joint memory-mapping search space, and show the feasibility of integrating various parts of the architecture search into the mapping optimization problem.

Chapter 8
Conclusion and Future Work

Dataflow optimization is critical to accelerating many emerging DNN and tensor workloads. In this thesis, we tackle this problem using a more analytical approach, by first breaking down the problem into stages and formally analyzing the algebraic properties of how each stage affects reuse and performance. We also leverage these properties to identify sub-optimal map spaces, which can be pruned using different state-space representations and traversal schemes.
7.4 Summary

In this case study, we address RQ4 by evaluating our techniques from Chapter 5, showing that fast time-to-solution via co-optimization is possible without sacrificing performance. We also demonstrate that there are opportunities to prune sub-optimal parts of the joint memory-mapping search space, and show the feasibility of integrating various parts of the architecture search into the mapping optimization problem.

Chapter 8

Conclusion and Future Work

Dataflow optimization is critical to accelerating many emerging DNN and tensor workloads. In this thesis, we tackle this problem using a more analytical approach: we first break the problem down into stages and formally analyze the algebraic properties of how each stage affects reuse and performance. We then leverage these properties to identify sub-optimal map spaces, which can be pruned using different state-space representations and traversal schemes. Furthermore, we demonstrate that these analyses and techniques can be extended to tackle parts of the architecture search problem by incorporating the memory configuration search into the state-space representation and traversal.

Using these pruning techniques, we build a mapper that outperforms prior work by up to 10× in optimization time, and yields mappings with up to 1.5–2.5× lower EDP. More importantly, we show that breaking a complex search problem down into stages and applying rudimentary algebraic analysis to each stage can reveal many properties of the sub-optimal solutions that are easy to identify now, but were not so obvious before. By leveraging this new-found knowledge, and rethinking how we represent and traverse these different stages, we can prune a significant amount of the search space and thereby speed up the search, demonstrating that empirically-based heuristics and black-box optimizers (which do not fully or directly exploit these observations) need not be relied upon so heavily.

There are two directions for future work. The first is to use this work as a foundation to tackle other subproblems in architecture search, such as the number of PEs or the optimal number of memory levels in the hierarchy. Many prior works in this space focus on black-box optimizers [29, 30], and only recently has the co-optimization of mapping and architecture been explored [30], so there is an opportunity to apply more disciplined, analytical, and transparent techniques to this space.

The second is to extend these techniques (or explore further analytical techniques) for sparse workloads. With the emergence of sparse neural networks and tensor workloads, future work includes mapping that accounts for sparsity, which brings new challenges such as load balancing in the presence of unstructured sparsity. Unlike dense workloads, where the mapper only needs to be run once for each set of problem dimensions, changes to sparsity patterns can lead to different optimal mappings; in settings where the sparsity pattern is constantly changing (such as during DNN training), time-to-solution is key to ensuring that obtaining the optimal mapping does not itself become a bottleneck.

Bibliography

[1] V. Akhlaghi, A. Yazdanbakhsh, K. Samadi, R. K. Gupta, and H. Esmaeilzadeh. SnaPEA: Predictive early activation for reducing computation in deep convolutional neural networks. In Proceedings of the ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pages 662–673, 2018. doi:10.1109/ISCA.2018.00061. → pages 1, 9

[2] M. Alwani, H. Chen, M. Ferdman, and P. Milder. Fused-layer CNN accelerators. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1–12, 2016. doi:10.1109/MICRO.2016.7783725. → page 1

[3] W. Austin, G. Ballard, and T. G. Kolda. Parallel tensor compression for large-scale scientific data. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 912–922. IEEE, 2016. → pages 5, 8

[4] R. Baghdadi, U. Beaugnon, A. Cohen, T. Grosser, M. Kruse, C. Reddy, S. Verdoolaege, A. Betts, A. F. Donaldson, J. Ketema, J. Absar, S. Van Haastregt, A. Kravets, A. Lokhmotov, R. David, and E. Hajiyev. PENCIL: A platform-neutral compute intermediate language for accelerator programming. In Proceedings of the International Conference on Parallel Architecture and Compilation (PACT), pages 138–149, 2015. doi:10.1109/PACT.2015.17. → page 18
[5] R. Baghdadi, J. Ray, M. B. Romdhane, E. Del Sozzo, A. Akkas, Y. Zhang, P. Suriana, S. Kamil, and S. Amarasinghe. Tiramisu: A polyhedral compiler for expressing fast and portable code. In Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 193–205. IEEE Press, 2019. ISBN 9781728114361. → pages 18, 23

[6] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014. → page 1

[7] B. Bixby. The Gurobi optimizer. Transp. Research Part B, 41(2):159–178, 2007. → page 52

[8] J. Canny and H. Zhao. Big data analytics with small footprint: Squaring the cloud. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 95–103, 2013. → page 5

[9] P. Chatarasi, H. Kwon, N. Raina, S. Malik, V. Haridas, A. Parashar, M. Pellauer, T. Krishna, and V. Sarkar. Marvel: A data-centric compiler for DNN operators on spatial accelerators. arXiv preprint arXiv:2002.07752, 2020. → pages 2, 3, 17, 35, 45

[10] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, 2018. → page 18

[11] Y.-H. Chen, J. Emer, and V. Sze. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In Proceedings of the ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pages 367–379, 2016. doi:10.1109/ISCA.2016.40. → pages 1, 9, 12, 33, 45, 47, 52, 62

[12] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 52(1):127–138, 2016. → pages 1, 9

[13] S. Dave and MPSLab. dMazeRunner: Dataflow acceleration optimization infrastructure for coarse-grained programmable accelerators. URL https://github.com/MPSLab-ASU/dMazeRunner. → page 46

[14] S. Dave, Y. Kim, S. Avancha, K. Lee, and A. Shrivastava. dMazeRunner: Executing perfectly nested loops on dataflow accelerators. ACM Transactions on Embedded Computing Systems (TECS), 18(5s):1–27, 2019. → pages xii, 3, 4, 17, 22, 25, 27, 31, 34, 45, 46, 47

[15] T. A. Davis and Y. Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS), 38(1):1–25, 2011. → pages x, 46

[16] L. Deng, G. Hinton, and B. Kingsbury. New types of deep neural network learning for speech recognition and related applications: An overview. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8599–8603. IEEE, 2013. → page 1

[17] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. Advances in Neural Information Processing Systems, 27, 2014. → pages 5, 7

[18] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam. ShiDianNao: Shifting vision processing closer to the sensor. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA), pages 92–104, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 9781450334020. doi:10.1145/2749469.2750389. URL https://doi.org/10.1145/2749469.2750389. → pages 1, 9
[19] V. Elango, N. Rubin, M. Ravishankar, H. Sandanagobalane, and V. Grover. Diesel: DSL for linear algebra and neural net computations on GPUs. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL 2018), pages 42–51, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450358347. doi:10.1145/3211346.3211354. URL https://doi.org/10.1145/3211346.3211354. → page 18

[20] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2147–2154, 2014. → page 1

[21] M. Gao, X. Yang, J. Pu, M. Horowitz, and C. Kozyrakis. Tangram: Optimized coarse-grained dataflow for scalable NN accelerators. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 807–820, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450362405. doi:10.1145/3297858.3304014. URL https://doi.org/10.1145/3297858.3304014. → page 1

[22] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: Efficient inference engine on compressed deep neural network. ACM SIGARCH Computer Architecture News, 44(3):243–254, 2016. → pages 1, 9, 62

[23] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. → pages xiii, 4, 12, 30, 33, 35, 45, 46, 62

[24] J. B. Heaton, N. G. Polson, and J. H. Witte. Deep learning for finance: deep portfolios. Applied Stochastic Models in Business and Industry, 33(1):3–12, 2017. → page 1

[25] K. Hegde, P.-A. Tsai, S. Huang, V. Chandra, A. Parashar, and C. W. Fletcher. Mind Mappings: enabling efficient algorithm-accelerator mapping space search. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 943–958, 2021. → pages 3, 17, 46, 52

[26] Q. Huang, A. Kalaiah, M. Kang, J. Demmel, G. Dinh, J. Wawrzynek, T. Norell, and Y. S. Shao. CoSA: Scheduling by constrained optimization for spatial accelerators. In Proceedings of the ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pages 554–566. IEEE, 2021. → pages xii, 3, 4, 17, 22, 31, 46, 52

[27] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), pages 1–12, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450348928. doi:10.1145/3079856.3080246. URL https://doi.org/10.1145/3079856.3080246. → pages 1, 9
[28] S.-C. Kao and T. Krishna. GAMMA: Automating the HW mapping of DNN models on accelerators via genetic algorithm. In Proceedings of the 39th International Conference on Computer-Aided Design (ICCAD), New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450380263. doi:10.1145/3400302.3415639. URL https://doi.org/10.1145/3400302.3415639. → page 3

[29] S.-C. Kao, G. Jeong, and T. Krishna. ConfuciuX: Autonomous hardware resource assignment for DNN accelerators using reinforcement learning. In Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 622–636. IEEE, 2020. → pages 18, 69

[30] S.-C. Kao, M. Pellauer, A. Parashar, and T. Krishna. DiGamma: Domain-aware genetic algorithm for HW-mapping co-optimization for DNN accelerators. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 232–237. IEEE, 2022. → pages 18, 69

[31] L. Ke, X. He, and X. Zhang. NNest: Early-stage design space exploration tool for neural network inference accelerators. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450357043. doi:10.1145/3218603.3218647. URL https://doi.org/10.1145/3218603.3218647. → pages 1, 3

[32] D. E. Knuth and R. W. Moore. An analysis of alpha-beta pruning. Artificial Intelligence, 6(4):293–326, 1975. → page 34

[33] J. Kossaifi, A. Khanna, Z. Lipton, T. Furlanello, and A. Anandkumar. Tensor contraction layers for parsimonious deep nets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR), pages 26–32, 2017. → pages 5, 8, 46

[34] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012. → page 46

[35] H. Kwon, A. Samajdar, and T. Krishna. MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects, pages 461–475. Association for Computing Machinery, New York, NY, USA, 2018. ISBN 9781450349116. URL https://doi.org/10.1145/3173162.3173176. → page 1

[36] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014. → pages 5, 8

[37] C. Lengauer. Polly: performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters, 22, 2012. doi:10.1142/S0129626412500107. → page 18

[38] R. Li, Y. Xu, A. Sukumaran-Rajam, A. Rountev, and P. Sadayappan. Analytical characterization and design space exploration for optimization of CNNs. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 928–942, 2021. → pages 18, 25, 27

[39] L. Lu, N. Guan, Y. Wang, L. Jia, Z. Luo, J. Yin, J. Cong, and Y. Liang. TENET: A framework for modeling tensor dataflow based on relation-centric notation. In Proceedings of the ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pages 720–733. IEEE, 2021. → page 5
[40] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li. FlexFlow: A flexible dataflow accelerator architecture for convolutional neural networks. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 553–564, 2017. doi:10.1109/HPCA.2017.29. → page 1

[41] L. Mei, P. Houshmand, V. Jain, S. Giraldo, and M. Verhelst. ZigZag: Enlarging joint architecture-mapping design space exploration for DNN accelerators. IEEE Transactions on Computers, 70(8):1160–1174, 2021. → pages 3, 17, 25, 27, 35, 37, 63

[42] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi. CACTI 6.0: A tool to model large caches. HP Laboratories, 27:28, 2009. → page 47

[43] M. Naumov, D. Mudigere, H.-J. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C.-J. Wu, A. G. Azzolini, et al. Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091, 2019. → page 1

[44] M. Olyaiy, C. Ng, and M. Lis. Accelerating DNNs inference with predictive layer fusion. In Proceedings of the ACM International Conference on Supercomputing (ICS), pages 291–303, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383356. doi:10.1145/3447818.3460378. URL https://doi.org/10.1145/3447818.3460378. → page 1

[45] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally. SCNN: An accelerator for compressed-sparse convolutional neural networks. ACM SIGARCH Computer Architecture News, 45(2):27–40, 2017. → pages 1, 9

[46] A. Parashar, P. Raina, Y. S. Shao, Y.-H. Chen, V. A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S. W. Keckler, and J. Emer. Timeloop: A systematic approach to DNN accelerator evaluation. In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 304–315. IEEE, 2019. → pages x, 1, 3, 10, 16, 22, 34, 46, 47, 50

[47] B. Pradelle, B. Meister, M. Baskaran, J. Springer, and R. Lethin. Polyhedral Optimization of TensorFlow Computation Graphs, pages 74–89. 2019. ISBN 978-981-13-6209-5. doi:10.1007/978-3-030-17872-7_5. → page 1

[48] D. Ravì, C. Wong, F. Deligianni, M. Berthelot, J. Andreu-Perez, B. Lo, and G.-Z. Yang. Deep learning for health informatics. IEEE Journal of Biomedical and Health Informatics, 21(1):4–21, 2016. → page 1

[49] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4510–4520, 2018. → pages 45, 46

[50] Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks. Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures. In Proceedings of the ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), pages 97–108. IEEE, 2014. → page 47

[51] D. Shen, G. Wu, and H.-I. Suk. Deep learning in medical image analysis. Annual Review of Biomedical Engineering, 19:221, 2017. → page 1

[52] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos. Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing, 65(13):3551–3582, 2017. → pages 5, 8

[53] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. → page 46
[54] A. K. Smilde, P. Geladi, and R. Bro. Multi-way Analysis: Applications in the Chemical Sciences. John Wiley & Sons, 2005. → pages 5, 8

[55] S. Smith, J. W. Choi, J. Li, R. Vuduc, J. Park, X. Liu, and G. Karypis. FROSTT: The formidable repository of open sparse tensors and tools, 2017. URL http://frostt.io/. → pages x, 46

[56] J.-T. Sun, H.-J. Zeng, H. Liu, Y. Lu, and Z. Chen. CubeSVD: a novel approach to personalized web search. In Proceedings of the 14th International Conference on World Wide Web (WWW '05), pages 382–390, 2005. → pages 5, 8

[57] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016. → pages xii, 4, 45, 46

[58] N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen. Tensor Comprehensions: Framework-agnostic high-performance machine learning abstractions, 2018. URL https://arxiv.org/abs/1802.04730. → page 18

[59] M. A. O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: TensorFaces. In European Conference on Computer Vision (ECCV), pages 447–460. Springer, 2002. → pages 5, 8

[60] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017. → page 46

[61] H. J. Vishnukumar, B. Butting, C. Müller, and E. Sax. Machine learning and deep neural network-artificial intelligence core for lab and real-world test and validation for ADAS and autonomous vehicles: AI for efficient and quality test and validation. In 2017 Intelligent Systems Conference (IntelliSys), pages 714–721. IEEE, 2017. → page 1

[62] M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation (PLDI), pages 30–44, New York, NY, USA, 1991. Association for Computing Machinery. ISBN 0897914287. doi:10.1145/113445.113449. URL https://doi.org/10.1145/113445.113449. → page 23

[63] Y. N. Wu, J. S. Emer, and V. Sze. Accelergy: An architecture-level energy estimation methodology for accelerator designs. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 1–8. IEEE, 2019. → page 47

[64] D. Yang, A. Ghasemazar, X. Ren, M. Golub, G. Lemieux, and M. Lis. Procrustes: a dataflow and accelerator for sparse deep neural network training. In Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 711–724. IEEE, 2020. → pages 1, 9

[65] T.-J. Yang, Y.-H. Chen, and V. Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6071–6079, 2017. doi:10.1109/CVPR.2017.643. → page 3

[66] X. Yang, M. Gao, Q. Liu, J. Setter, J. Pu, A. Nayak, S. Bell, K. Cao, H. Ha, P. Raina, et al. Interstellar: Using Halide's scheduling language to analyze DNN accelerators. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 369–383, 2020. → pages 3, 17, 22, 34, 37, 45, 46, 62, 63
[67] D. Zhang, S. Huda, E. Songhori, K. Prabhu, Q. Le, A. Goldie, and A. Mirhoseini. A full-stack search technique for domain optimized deep learning accelerators. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 27–42, 2022. → page 3

Appendix A

Pseudo-Code for Algorithms

Here, we show the high-level pseudo-code for the different parts of the mapping flow.

Algorithm 3: Generate ordering nodes and annotate them with reuse. This graph (precisely, the root node of this graph) is then passed into another algorithm to prune and remove sub-optimal orders.

    function GENERATE_ORDERING_TREE(tensors)
        for t in tensors do
            full_reuse_dims ← {full reuse dims of tensor t}
            part_reuse_dims ← {partial reuse dims of tensor t}
            for part_dim in part_reuse_dims do
                if node s is not created then
                    create node s
                annotate node s with tensor t
            for dims in {power set of full_reuse_dims} do
                for s in {permutations of dims} do
                    if node s is not created then
                        create node s
                    annotate node s with tensor t
                    for part_dim in part_reuse_dims do
                        if node (s + part_dim) is not created then
                            create node (s + part_dim)
                        annotate node (s + part_dim) with tensor t

Algorithm 4: Recursively identify sub-optimal orderings by comparing the reuse across each node's children.

    function FIND_UNIQUE_ORDERS(root)
        if root is a leaf node then
            return {root}
        unique ← {}
        for child in {root's children} do
            unique ← unique ∪ FIND_UNIQUE_ORDERS(child)
        for child0, child1 in {pairs of nodes in unique} do
            if child0's reuse is a subset of child1's reuse then
                remove child0 from unique
            else if child1's reuse is a subset of child0's reuse then
                remove child1 from unique
        return unique

Algorithm 5: Generate L1-tile nodes on the fly and prune the sub-optimal and invalid nodes.

    function PRUNE_TILES(prob, capacity)
        kept_nodes ← {}
        nodes_to_visit ← {root (all factors = 1)}
        while nodes_to_visit is not empty do
            node ← head of nodes_to_visit
            pop nodes_to_visit
            all_overflow ← True
            for c in {node's children (enlarge by each dimension)} do
                if c fits in capacity then
                    all_overflow ← False
                    if c has not been visited then
                        add c to nodes_to_visit
            if all_overflow then
                add node to kept_nodes
        return kept_nodes

Algorithm 6: Generate unrolling candidates based on the problem and the number of PEs available.

    function GET_SPATIAL(prob, PEs)
        unrollings ← {}
        for t in tensors do
            unroll ← {unroll index dims of t to occupy 100% of PEs}
            if unroll is empty then
                unroll ← {unroll index dims of t to occupy as many PEs as possible}
                unroll ← unroll ∪ {unroll mostly index dims, with some other dims, to occupy 100% of PEs}
            unrollings ← unrollings ∪ unroll
        return unrollings

Algorithm 7: Take a problem, spatially unroll and tile the innermost memory level (L1), and pair each tile with the best ordering.

    function PRUNE_L1(prob, PEs, capacity)
        for unrolling in GET_SPATIAL(prob, PEs) do
            subprob ← factor out unrolling from prob
            for tile in PRUNE_TILES(subprob, capacity) do
                best_cost ← ∞
                for ord in orderings do
                    cost ← # accesses with ord, scaled by energy (using the polynomials from Section 4)
                    if cost < best_cost then
                        best_map ← (tile, ord)
                        best_cost ← cost
                add best_map to kept_tiles
                add best_cost to costs
        return kept_tiles, costs
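Before moving on to the next-level tiling in Algorithm 8, note that the inner ordering selection of Algorithm 7 amounts to an argmin over the pruned orderings. A minimal Python sketch follows; cost_of stands in for the energy-scaled access-count polynomials of Section 4, and all names are illustrative rather than Sunstone's actual interface.

    def best_ordering(tile, orderings, cost_of):
        # For a fixed L1 tile, exhaustively score the (already pruned) loop
        # orderings and keep the cheapest (ordering, cost) pair.
        best_ord, best_cost = None, float("inf")
        for ordering in orderings:
            cost = cost_of(tile, ordering)
            if cost < best_cost:
                best_ord, best_cost = ordering, cost
        return best_ord, best_cost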
Algorithm 8: Accept partial mappings and tile the next-level memory. The algorithm employs an alpha-beta variant in order to dynamically prune provably sub-optimal tiles.

    function PRUNE_LX(prob, tiles, costs, PEs, capacity)
        sorted_tiles ← (tile, cost) pairs from tiles, costs, sorted by ascending cost
        opt_overhead ← lower bound of accesses scaled by energy (Section 4)
        for (tile, cost) in sorted_tiles do
            if (cost + opt_overhead) > best_global_cost then
                break
            subprob0 ← factor out tile from prob
            for unrolling in GET_SPATIAL(subprob0, PEs) do
                subprob1 ← factor out unrolling from subprob0
                for tile in PRUNE_TILES(subprob1, capacity) do
                    best_cost ← ∞
                    for ord in orderings do
                        cost ← # accesses with ord, scaled by energy (using the polynomials from Section 4)
                        if cost < best_cost then
                            best_map ← (tile, ord)
                            best_cost ← cost
                    add best_map to kept_tiles
                    add best_cost to costs
                    if best_cost < best_global_cost then
                        best_global_cost ← best_cost
        return kept_tiles, costs

Algorithm 9: Generate tile nodes, together with their corresponding memory, on the fly and prune the sub-optimal nodes and memory sizes.

    function PRUNE_TILES(prob, mem_choices)
        kept_nodes ← {}
        nodes_to_visit ← {root (all factors = 1)}
        while nodes_to_visit is not empty do
            node ← head of nodes_to_visit
            pop nodes_to_visit
            mem ← smallest size from mem_choices that fits node
            all_overflow ← True
            for c in {node's children (enlarge by each dimension)} do
                if c fits in the largest memory in mem_choices then
                    child_mem ← smallest size from mem_choices that fits c
                    if child_mem == mem then
                        all_overflow ← False
                    if c has not been visited then
                        add c to nodes_to_visit
            if all_overflow then
                add node to kept_nodes
        return kept_nodes
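For reference, a minimal executable sketch of the frontier traversal shared by Algorithms 5 and 9 is given below. It assumes power-of-two tiling factors (children enlarge one factor at a time by doubling) and a footprint function standing in for the real buffer-occupancy model; these assumptions and all names are illustrative, not Sunstone's implementation.

    import math
    from collections import deque

    def children(node, dims):
        # Illustrative child generator: enlarge one tiling factor at a time
        # (here by doubling) while it still divides the problem dimension.
        for i, (factor, dim) in enumerate(zip(node, dims)):
            if factor * 2 <= dim:
                yield node[:i] + (factor * 2,) + node[i + 1:]

    def prune_tiles(dims, capacity, footprint):
        # Keep only "maximal" tiles: nodes none of whose enlarged children
        # still fit in `capacity` (mirrors Algorithm 5's all_overflow test).
        root = tuple(1 for _ in dims)
        to_visit, seen, kept = deque([root]), {root}, []
        while to_visit:
            node = to_visit.popleft()
            all_overflow = True
            for c in children(node, dims):
                if footprint(c) <= capacity:
                    all_overflow = False
                    if c not in seen:
                        seen.add(c)
                        to_visit.append(c)
            if all_overflow:
                kept.append(node)
        return kept

    # Toy usage: an 8x8 problem with the tile's element count as a stand-in
    # footprint keeps exactly the largest tiles that fit in a 16-word buffer.
    print(prune_tiles((8, 8), capacity=16,
                      footprint=lambda tile: math.prod(tile)))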