Dynamic Race Detection for Non-Coherent Accelerators

by

May Young

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Computer Science)

The University of British Columbia
(Vancouver)

May 2020

© May Young, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Dynamic Race Detection for Non-Coherent Accelerators

submitted by May Young in partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Examining Committee:

Alan Hu, Computer Science
Supervisor

Guy Lemieux, Electrical and Computer Engineering
Supervisory Committee Member

Abstract

Modern System-on-Chip (SoC) designs are increasingly complex and heterogeneous, featuring specialized hardware accelerators and processor cores that access shared memory. Non-coherence is one common memory coherence model for such accelerators. The advantage of non-coherence in accelerators is that it cuts the cost of extra hardware that would otherwise be needed for memory coherence. However, the disadvantage is that the programmer must know where and when to synchronize between the accelerator and the processor. Getting this synchronization correct can be difficult.

We propose a novel approach to find data races in software for non-coherent accelerators in heterogeneous systems. The intuition underlying our approach is that a sufficiently precise abstraction of the hardware's behaviour is straightforward to derive, based on common idioms for memory access.

To demonstrate this, we derived simple rules and axioms to model the interaction between a processor and a massively parallel accelerator provided by a commercial FPGA-based accelerator vendor. We implemented these rules in a prototype tool for dynamic race detection, and performed experiments on eleven examples provided by the vendor. The tool is able to detect previously unknown races in two of the eleven examples, and additional races if we introduce bugs in the synchronization in the code.

Lay Summary

Modern systems-on-chip (SoCs) are integrated circuits that combine various computer components onto a single chip. Their designs are increasingly complex and heterogeneous, with specialized hardware called accelerators and processor cores accessing shared memory. In non-coherent, heterogeneous systems, the programmer must know where and when to synchronize between the accelerator and the processor. Getting this synchronization correct is extremely difficult. Doing it incorrectly can result in dangerous bugs known as data races.

We propose a novel approach to analyze code for data races in non-coherent accelerators. We derived simple rules to model the interaction between a processor and an accelerator provided by a commercial vendor. We implemented these rules in a prototype tool, and performed experiments on eleven examples provided by the vendor. The tool detected previously unknown races in two of the eleven examples, and additional races if we introduce synchronization bugs in the code.

Preface

The work presented in this thesis was performed in collaboration with my supervisor, Dr. Alan Hu, and also with Dr. Guy Lemieux, who is the CEO of VectorBlox and a Professor in the Department of Electrical and Computer Engineering.
Noneof the text of the dissertation is taken directly from previously published or collab-orative articles.The theoretical framework in Chapter 3 was designed by Alan Hu and me, andwas influenced by input and feedback from Guy Lemieux and Sam Bayless. Theproofs of the theorems in Section 3.4 were done by Alan Hu and me. The dynamicrace detection tool in Chapter 4 and the experiments in Chapter 5 were primarilyimplemented and performed by me. The vector power experiment in Chapter 5was completed by Guy Lemieux and me.vTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixGlossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.1 Data Races . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.2 Non-Coherent Accelerators . . . . . . . . . . . . . . . . . . . . . 53 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83.1 Simple Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 93.2 Examples of Basic Rules . . . . . . . . . . . . . . . . . . . . . . 153.3 Full Ruleset for a Non-Coherent Accelerator . . . . . . . . . . . . 183.4 Theorems and Proofs . . . . . . . . . . . . . . . . . . . . . . . . 24vi4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324.1 VectorBlox Architecture . . . . . . . . . . . . . . . . . . . . . . 334.2 Instrumentation and Execution . . . . . . . . . . . . . . . . . . . 344.3 Trace Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 365 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 375.2 Time Required for Analysis . . . . . . . . . . . . . . . . . . . . . 375.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 395.4 Injection of Bugs in Examples . . . . . . . . . . . . . . . . . . . 436 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50viiList of TablesTable 3.1 Node semantics . . . . . . . . . . . . . . . . . . . . . . . . . 12Table 5.1 Benchmark examples . . . . . . . . . . . . . . . . . . . . . . 38viiiList of FiguresFigure 2.1 Common data race example . . . . . . . . . . . . . . . . . . 5Figure 2.2 VectorBlox architecture . . . . . . . . . . . . . . . . . . . . . 6Figure 3.1 A simple program . . . . . . . . . . . . . . . . . . . . . . . . 10Figure 3.2 Graph of memory operations . . . . . . . . . . . . . . . . . . 11Figure 3.3 Steps to build a graph . . . . . . . . . . . . . . . . . . . . . . 15Figure 3.4 Arrow from wb node to alloc node . . . . . . . . . . . . . . . 24Figure 3.5 Theorem 3.4.3 . . . . . . . . . . . . . . . . . . . . . . . . . 26Figure 3.6 Propagate writebacks . . . . . . . 
. . . . . . . . . . . . . . . 30Figure 4.1 Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33Figure 4.2 Instrumented simple program . . . . . . . . . . . . . . . . . 35Figure 4.3 Trace file . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35Figure 5.1 Analysis runtime including the largest example . . . . . . . . 39Figure 5.2 Vector addition graph showing a race . . . . . . . . . . . . . 42Figure 5.3 Sync in vbw vec power t example . . . . . . . . . . . . . . . 45Figure 5.4 Missing sync in vbw vec power t example . . . . . . . . . . 45ixGlossaryAPI Application Programming InterfaceCPU Central Processing UnitDMA Direct Memory AccessFFT Fast Fourier TransformFIR Finite Impulse ResponseFIFO First-In, First-OutSOC System-on-ChipxAcknowledgmentsI would like to thank my supervisor, Alan Hu, for his guidance and advice. Iwould also like to thank Guy Lemieux for introducing the problem, giving methe opportunity to present my thesis in the industry, and being the second reader.Finally, I would like to thank my family, friends, and lab-mates who supported methroughout my Master’s.xiChapter 1IntroductionModern System-on-Chip (SoC) designs are increasingly complex and heteroge-neous, featuring specialized hardware accelerators and processor cores. This islargely due to the demise of Dennard scaling and today’s growing demands ofperformance and energy efficiency, which are leading towards specialization andaccelerator-rich architectures.Heterogeneous computing architectures generally consist of at least two com-ponents: a processor and an accelerator, which improves overall performance [17].In order to obtain the performance boost, the processor offloads some computationfor an application to the accelerator. During the computation, the processor doesnot need to wait for the results; rather, it can continue executing other instructionsin parallel, or even contribute to the processing as well. Taking advantage of boththe processor and the accelerator simultaneously exploits the heterogeneity of theplatform, resulting in performance improvement and efficient use of the platformin question.Accelerators and processors commonly communicate with each other throughshared memory. This obviously introduces the problem of memory coherence,since processors or accelerators may access stale data. In particular, non-coherentaccelerators read from and write to main memory via Direct Memory Access(DMA), bypassing the processor’s cache(s). Thus, the processor must wait forall DMA transfers to complete before accessing the data in main memory. Non-coherent accelerators have the advantages of using less complex designs and fewer1resources to build, compared to both hardware and software cache coherence pro-tocols.However, using non-coherent accelerators puts more burden on the program-mer to ensure that the program they are writing is race-free. For example, consideran architecture consisting of one general-purpose Central Processing Unit (CPU)with its own cache and an accelerator that cannot communicate with the CPU orthe CPU cache. The CPU, the CPU cache, and the accelerator are each connectedto main memory. Then, the conservative programmer would flush the entire cachebefore each DMA read and before each DMA write in a program. In contrast,the aggressive programmer would flush only the dirty data from the cache beforeeach DMA read, and flush any valid data before each DMA write. 
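To make the contrast concrete, the sketch below shows how the two disciplines might look around a single offload. It is only an illustration under assumed primitives: flush_entire_cache, flush_cache_range, dma_read_async, dma_write_async, and dma_sync are placeholder names, not the vendor's API.

    #include <cstddef>

    // Placeholder primitives (assumptions for illustration, not a real API).
    extern void flush_entire_cache(void);
    extern void flush_cache_range(void *addr, std::size_t len);          // writeback + invalidate
    extern void dma_read_async(void *spad, void *mem, std::size_t len);  // accelerator reads shared memory
    extern void dma_write_async(void *mem, void *spad, std::size_t len); // accelerator writes shared memory
    extern void dma_sync(void);                                          // wait for all DMA to finish

    void conservative(char *in, char *out, void *spad, std::size_t n) {
        flush_entire_cache();            // flush everything before every DMA transfer
        dma_read_async(spad, in, n);
        flush_entire_cache();
        dma_write_async(out, spad, n);
        dma_sync();
    }

    void aggressive(char *in, char *out, void *spad, std::size_t n) {
        flush_cache_range(in, n);        // only 'in' is dirty: the CPU just wrote it
        dma_read_async(spad, in, n);
        flush_cache_range(out, n);       // 'out' may be valid in the cache; drop it so the
                                         // CPU cannot later read a stale cached copy
        dma_write_async(out, spad, n);
        dma_sync();
    }

The aggressive version issues fewer and smaller flushes, but it is correct only if the programmer's beliefs about the cache contents are correct, which is exactly the burden described next.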
Therefore, theaggressive programmer must know what is in the cache before they can safely exe-cute a DMA operation (read/write). Of course, there may be a type of programmerthat falls in between these two sides of the spectrum.Two problems can arise from programming in a non-coherent, heterogeneousenvironment: (1) there can be unnecessary synchronization operations that the pro-grammer included because they are not confident of what is in the cache at a certaintime; or (2) there can be missing synchronization operations because they failed toproperly analyze the code. This results in either inefficiency or potentially incorrectbehaviour.In this thesis, we propose a novel approach to analyze code in software fornon-coherent accelerators in heterogeneous systems. We design and implement aprototype tool that analyzes a trace of the memory operations in a single executionof a program and detects races — if any — in it. We use a semi-automatic processto instrument the program to print out each memory operation into a trace text filewhen it is compiled and executed. The benefits of our approach over existing mem-ory coherence schemes are that it is generalizable to any non-coherent architecture,does not require additional hardware or new instructions, and is sound with respectto an execution. A limitation of our approach is that it verifies only that singleexecution.Overall, the major contributions of this thesis include:• A novel approach to find data races in software for non-coherent accelerators2in heterogeneous systems.• Proofs of theorems that show the correctness and soundness (to an extent) ofour solution.• A prototype of a dynamic race detection tool, which analyzed eleven open-source examples provided by a commercial FPGA-based accelerator vendor.The rest of the thesis is organized as follows: Chapter 2 provides backgroundknowledge needed to understand our solution; Chapter 3 describes our theoreticalframework; Chapter 4 presents and explains the implementation of the dynamicrace detection tool; Chapter 5 reports the examples and results used in the experi-ments; Chapter 6 reviews related work; and Chapter 7 concludes.3Chapter 2Background2.1 Data RacesWith more than one thread or hardware component being able to directly accessa shared resource such as main memory, the problem of managing concurrent ac-cesses arises. Thus, errors in synchronization can cause data races, which nega-tively impacts correctness. We formally define a data race as below.Definition 2.1.1 (Data Race). A data race occurs when there are two or morepossibly unordered operations that access the same memory location, where atleast one of these operations is a write, and there is no happened-before relation(explained further below) that forces an ordering between these operations.One way to detect data races is to utilize Lamport‘s happened-before relation[9], denoted by →. This relation is a strict partial ordering of events in a systemsuch that: (1) If a and b are events in the same process, and a comes before b, thena→ b; (2) If a is the sending of a message and b is the receipt of the same message,then a→ b; and (3) If a→ b and b→ c, then a→ c.To detect data races, memory operations can be treated as events. For example,a program first performs a cached write a, followed by an uncached read b. Thecached write happens before the uncached read due to program order. 
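Stated as code, Definition 2.1.1 is a simple predicate over two memory events. The sketch below is illustrative only; the Event type and the happensBefore reachability check are placeholders standing in for the graph machinery developed in Chapter 3.

    #include <cstdint>

    struct Event {
        bool     isWrite;      // at least one of the two events must be a write
        uint64_t lo, hi;       // inclusive address range touched by the event
    };

    // Placeholder: true iff there is a happened-before path from a to b.
    bool happensBefore(const Event &a, const Event &b);

    bool overlaps(const Event &a, const Event &b) {
        return a.lo <= b.hi && b.lo <= a.hi;
    }

    // Definition 2.1.1: same location, at least one write, no ordering either way.
    bool isDataRace(const Event &a, const Event &b) {
        return overlaps(a, b) &&
               (a.isWrite || b.isWrite) &&
               !happensBefore(a, b) && !happensBefore(b, a);
    }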
If we wereto put these operations into a simple graph, there would be a node representing thecached write and another node representing the uncached read. An arrow would4(a) Process 1 writes the value of x afterProcess 2, so the final value of x is 6.(b) Process 2 writes the value of x afterProcess 1, so the final value of x is 7.Figure 2.1: This shows a data race between two different processes in regardsto who writes the value of the shared variable x last. Both (a) and (b)assume that Process 1 always reads x before Process 2.be drawn from the cached write node to the uncached read node, where the arrowsymbolizes that the write a happens before the read b.There is a data race between two nodes in a graph if they access the samememory location, at least one is a write, and there is no happened-before relationthat forces an ordering. A common example of a data race involves two differentprocesses where each reads a shared variable x, increments it, and writes the newvalue back to x. Consider the example in Figure 2.1 where Process 1 adds 1 to xand Process 2 adds 2. If x was initially 5, without any synchronization in place, thefinal value of x could be 6 or 7 (Figure 2.1). Sometimes, it could be 6 (Figure 2.1a).Other times, it could be 7 (Figure 2.1b). Thus, it is a race between Process 1 andProcess 2 to change the value of x; this is a data race. In this example, assume thatthe correct result is 6, which means that, when these two processes are executed, attimes, x contains the correct value; other times, it contains the wrong value.2.2 Non-Coherent AcceleratorsThe main idea behind non-coherent accelerators is that they access shared memorythrough DMA, without having to go through the processor’s cache(s). This freesup the processor to continue executing in parallel with the accelerator. It also5Figure 2.2: Example of an accelerator architecture taken from VectorBlox[1]. This is a VectorBlox MXP [15] system with ARM Cortex-A9 inZynq-7000 FPGAs.gives the accelerator much faster access to memory. A typical architecture is onethat is used at VectorBlox, a commercial FPGA-based accelerator vendor that wecollaborated with and whose examples we used. Figure 2.2 shows an ARM-basedarchitecture from VectorBlox that contains one processor and one accelerator on asingle FPGA board. The CPU accesses shared main memory through its caches.On the other hand, the vector engine accesses main memory via DMA and onlywhen it receives such instructions from the CPU. In other words, the CPU offloadssome computation to the accelerator by placing the instructions associated withthis computation in a DMA First-In, First-Out (FIFO) queue, as seen in Figure 2.2,from which the accelerator (i.e., the vector engine) grabs them to do its work.According to [11], “a memory is coherent if the value returned by a read op-eration is always the same as the value written by the most recent write operationto the same address”. An accelerator architecture has more than one access path toshared memory. Figure 2.2 exemplifies this as two different entities — a processorand an accelerator — each have their own memory access path.Giri et al. studied three accelerator cache-coherence models from an SoC per-spective: non-coherent, coherent with the last-level cache (LLC-coherent), and6fully-coherent [8]. Non-coherent accelerators access shared memory via DMA,bypassing the cache hierarchy. Therefore, the cache hierarchy must flush the ac-celerator data region before accelerator execution. 
Fully-coherent accelerators arecoherent with the private caches of the processors, commonly through the MESIor MOESI protocol, and do not require any cache flushing prior to acceleratorexecution. LLC-coherent accelerators are an intermediate model, as they are co-herent with the LLC (via DMA requests) but not coherent with the private cachesof the processors (hence, cache flushing is needed before accelerator execution).One finding by Giri et al. concluded that non-coherent accelerators are more ef-fective for large workloads, which takes advantage of the higher throughput that isachieved from larger DMA bursts.In addition, the non-coherent model has less hardware overhead than one withsome type of memory coherence protocol in place. No extra hardware requiredalso means a more cost-efficient model. However, a programmer must know whenand where to synchronize between the accelerator and the processor. Getting thissynchronization correct can be difficult.7Chapter 3TheoryIn order to detect races (which are defined in Definition 2.1.1), we construct agraph of all memory operations in a program, where a node in the graph representsa memory operation (or other relevant operation). An arrow from a node x toanother node y indicates a happens-before relation; that is, x→ y. The graph isused to track enforced ordering of memory operations. The overall task of ouranalysis is simply to check the graph for two unordered nodes that access the samememory location, of which at least one is a write.Our theoretical framework exists to make this task scalable as well as provablysound, i.e., it cannot miss a possible race in a program execution. Other importantobjectives are that our analysis not raise too many false alarms, and that it notrequire excessively detailed hardware modeling.A necessary condition for scalability is that our analysis be done “on-the-fly”,i.e., creating the graph node-by-node according to the program execution and de-tecting races as soon as possible, instead of trying to post-process an enormousexecution trace after-the-fact. However, a challenge arises because memory op-erations do not have an obvious time at which they might occur. For example, acache line might fill from main memory at some unspecified time before a cachedread that would fill in that same line occurs. In fact, if the cache does prefetch-ing, the allocate may happen without any CPU load instruction at all. Similarly,writebacks from the cache might happen at a not precisely specified time after thecorresponding program instruction. Therefore, two problems might occur with on-8the-fly analysis: (1) false races could be flagged if nodes are added to the graph tooearly, because the ordering becomes evident only after the current nodes are cre-ated; and (2) races might not be detected if the conflicting nodes are added to thegraph too late, after other nodes that enforce an ordering (which retroactively closea window of time during which a race could have happened). To prevent these twosituations from occurring, we enforce the following invariant:Definition 3.0.1. A node exists in the graph iff a memory operation can happenand the value being read or written by the operation is visible to the CPU or theaccelerator. (In other words, the operation corresponding to a node could happen assoon as the node exists in the graph, but not before that point in time. 
Additionally, the result of that operation must be observable by the CPU or accelerator.) For example, since we don't want to model the details of a cache's prefetching and allocation policies, we must conservatively assume that a cache line might be prefetched from memory at any point in time. But if we created nodes for all of these prefetch operations that might happen, we would create countless false race detections between these mythical prefetches and any write operation to the same address. Instead, we wait until we encounter an operation where the CPU does a cached read, and then retroactively create a node for the cache line allocation that must have happened before the cached read can happen. We prove in this chapter that these sorts of modeling decisions are sound.

What exactly are the nodes being created? We track cached reads, cached writes, uncached reads, uncached writes, DMA reads, DMA writes, flush instructions explicitly written by a programmer in a program, and sync operations. The aforementioned cached/uncached operations relate to the cpu-cache (i.e., the CPU's data cache).

3.1 Simple Example

Perhaps it would be easier to introduce our theory with an example: Figure 3.1 shows a simple program that performs a Fast Fourier Transform (FFT) on input microphone data. The flush of the array a, which stores the data, from the CPU cache at line 13 ensures that the accelerator's DMA read on line 15 gets the correct values from main memory. Furthermore, the sync operation at line 19 guarantees that all DMA operations have finished, so the CPU can safely access the operated-on data, such as the read on line 21.

Figure 3.1: A simple program containing several types of memory operations that we capture in a graph. The CPU reads in data from a microphone (lines 8-11), offloads the Fast Fourier Transform computation to the accelerator (lines 15-18), and finally reads the first value of the returned data (line 22). The accelerator copies the mic data into its internal (scratchpad) memory (line 15), performs work on it (lines 16-17), and then copies the transformed data back into main memory for the CPU to access (line 18).

Assuming the same typical architecture described in Section 2.2, the CPU issues and executes the operations specified in an input program (e.g., cached write, do dma read, etc.), which can then cause the cache or the accelerator to perform an appropriate operation (e.g., a writeback or a DMA read, respectively). These operations are defined in Table 3.1.

Suppose that the hexadecimals in Figure 3.2 are the memory addresses — or more precisely, the address range — of the shared resource among the CPU and the accelerator, that is, array a. Figure 3.2 shows a cached write (in the address range 0xc0f45c70 to 0xc0f45c71, shortened to 0xc70 to 0xc71 from here on) as the first node in the graph, since it is the first memory operation in the program in Figure 3.1.

Figure 3.2: Graph of all memory operations from the program in Figure 3.1. White nodes are operations issued and executed by the CPU; light grey nodes are those executed by the CPU cache; and dark grey nodes are those executed by the accelerator.

This happens before the next operation (a cache flush); hence, there is an arrow from the cached write node to the cache flusha node. Additionally, the CPU cache actually writes the value at addresses 0xc70 to 0xc71 into the cache, which generates a new cw node (the top right light grey node).
The cache will eventually write back thevalue to main memory at some unspecified time but after the cache brought thevalue into the cache; thus, cw→ writeback wb. Finally, wb→ cache flush becausethe writeback must occur before or at the same time as the cache flush. If it did nothappen before the flush, the cache flush would force the pending writeback data tobe written back to main memory as part of the flush.11These nodes and arrows were built using rules that we have painstakingly de-vised, described later in this section (Section 3.2 and Section 3.3). We constructedthe rest of the graph in a similar way. The steps are visualized in Figure 3.3.Table 3.1: Node semanticsType of Node Semanticscached read CPU reads data from cachecr Cache returns data to CPUalloc Cache reads data from shared memorycached write CPU writes data to cachecw Cache writes data into cachewb Cache writes back data to shared memorycache flusha CPU tells cache to remove data, performing wb if dirtyuncached read CPU reads data from shared memory, bypassing cacheuncached write CPU writes data to shared memory, bypassing cachedo dma read CPU tells accelerator to read from shared memorydma r Accelerator reads data from shared memorydo dma write CPU tells accelerator to write to shared memorydma w Accelerator writes data to shared memorysync CPU waits for all DMA operations to complete121314Figure 3.3: Steps to build a graph one node at a time for the program in Fig-ure 3.1. Notice that new nodes are generated by an instruction, such asin (a), (c), (d), and (f).3.2 Examples of Basic RulesWe create a node and draw a happens-before arrow in the graph according toour rules. A node is stored as a triple (name,addrRange,neighbours), whereaddrRange is an inclusive address range [low,high].As we parse a program line by line, we are only concerned with instructionsassociated with accessing memory and case-split depending on the type of memoryoperation. In each of these cases, we always construct a new node currNode witha given name (e.g., “cached read”) which is the type of memory operation, and itsaddress range. If a previous node prevNode exists — in other words, if there is15another memory operation before currNode in the program — then we draw anarrow from prevNode to currNode. This is summarized in pseudocode as:currNode = new Node( op , addrRange ) ;i f ( prevNode ){prevNode−>neighbours . append ( currNode ) ;}Firstly, handling uncached reads and writes are the simplest. After creating thecurrent node and possibly connecting it to the previous node, the only thing to doconsists of checking if there is a race from currNode to the last relevant writeback(wb) node or to the last DMA operation node, where “relevant” means a danglingnode whose address range overlaps with that of currNode. Note that a danglingnode is one that has no children in the graph.i f ( op == “ uncached read ” | |op == “ uncached wri te ” ){hasRace ( lastWbs , currNode ) ;hasRace ( lastDmaOps , currNode ) ;}Secondly, we will explain the more complicated processing of cached readsand writes. Again, we first create the cached read/write node and draw an arrowfrom the previous node to it. 
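Before the cached-read details, it may help to see how the Node triple and the hasRace check used in the pseudocode above could look in code. This is only a sketch, not the tool's actual implementation (which is described in Chapter 4); the member and function names are illustrative.

    #include <cstdint>
    #include <string>
    #include <unordered_set>
    #include <vector>

    struct Node {
        std::string name;               // e.g. "cached_write", "wb", "dma_r"
        uint64_t lo, hi;                // inclusive address range
        std::vector<Node*> neighbours;  // happens-before successors
    };

    static bool overlaps(const Node *a, const Node *b) {
        return a->lo <= b->hi && b->lo <= a->hi;
    }

    // Depth-first search for a happens-before path from src to dst.
    static bool reachable(const Node *src, const Node *dst,
                          std::unordered_set<const Node*> &seen) {
        if (src == dst) return true;
        if (!seen.insert(src).second) return false;   // already visited
        for (const Node *n : src->neighbours)
            if (reachable(n, dst, seen)) return true;
        return false;
    }

    static bool ordered(const Node *a, const Node *b) {
        std::unordered_set<const Node*> seen;
        return reachable(a, b, seen);
    }

    // Flag a race between currNode and any conflicting node in the candidate set
    // (e.g. the dangling writebacks, or the nodes of the current DMA chain).
    static bool hasRace(const std::vector<Node*> &candidates, const Node *currNode) {
        for (const Node *c : candidates)
            if (overlaps(c, currNode) &&
                !ordered(c, currNode) && !ordered(currNode, c))
                return true;            // unordered, overlapping accesses
        return false;
    }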
For the cached read operation, we generate a newnode dubbed “cr” that symbolizes the cached read executed by the cache itself, aswell as a new predecessor node called “alloc” that represents the cache allocationdone by the cache.The CPU and the CPU cache are treated as separate entities and, therefore, twodifferent nodes (one for each entity) are created for an operation executed by theCPU that involves the cache. For instance, as seen in Table 3.1, a cached readhas a “cached read” node that is executed by the CPU, as well as a “cr” nodeand an “alloc” node that are executed by the cache. This is the reason for thedifferent coloured nodes and groups drawn in Figure 3.2, where white nodes areCPU executed and light grey nodes are cache executed.We draw an arrow from the cache-executed “cr” or “cw” node to the next CPUoperation, if it exists. This is to symbolize that control goes back to the CPU fromthe cache.16Cache allocation means fetching data from shared memory, and writebackswrite cache data into shared memory. Thus, they are both accessing a shared re-source which the accelerator also accesses via DMA. Therefore, conflicts can oc-cur between “alloc” and the last DMA operation, or between “wb” and the lastDMA operation, if there is no sync occurring between them and if they access anyoverlapping addresses. Note that although we are not explicitly considering cacheprefetching, it is safe to model a cache allocation that is triggered by a cached readthe same as a prefetched cache allocation.i f ( op == “ a l l o c ” | |op == “wb” ){/ / Assume currNode i s the “ a l l o c ” or “wb” nodehasRace ( lastDmaOp , currNode ) ;}Thirdly, cache flush nodes do not spawn any new nodes and are obviouslyonly applicable to cached writes, as cached writes may alter data values and cacheflushes forcefully write cache data into shared memory. Thus, all dangling write-backs before a cache flush that have not been flushed already and that are in thesame address range have an arrow pointing to the flush node. This means that thedata in the flushed address range are no longer dirty in the cache.Another type of node that does not generate any new nodes is the sync node.The last DMA operation that has not yet been synchronized with the CPU codehas an arrow that points from it to the first sync node that happens after it. Thisis because, upon executing a sync instruction, the CPU will wait for all previouslyissued DMA operations to complete before continuing executing any further in-structions.Finally, we will describe DMA nodes. Recall that Direct Memory Access(DMA) is the mechanism by which data transfers between shared memory andthe accelerator’s internal memory. We first create a do dma read or do dma writenode, which signifies that the CPU issues a DMA request to the accelerator. Wethen generate a new node called “dma r” or “dma w”, respectively. Unlike “cr”and “cw” nodes, “dma r” and “dma w” nodes do not have an arrow pointing tothe next CPU operation; if they did, it would indicate that there are no data racesbetween the accelerator and the CPU. Oftentimes, a program contains a burst of17several consecutive DMA requests if the processor communicates with the accel-erator and vice versa. 
In this case, if there is no sync between two DMA nodes(“dma r” and “dma w”), then there is an arrow from the earlier DMA node to thelater one, respecting the original program order.DMA operations can conflict with relevant dangling writebacks due to the samereasoning as how “wb” conflicts with the last DMA operation.i f ( op == “ dma r ” ){/ / Assume currNode i s the “ dma r ” nodehasRace ( lastWbs , currNode ) ;}i f ( op == “dma w” ){/ / Assume currNode i s the “dma w” nodehasRace ( lastWbs , currNode ) ;hasRace ( l a s tA l l o c s , currNode ) ;}For a summary of the semantics of each type of node, please see Table 3.1.Now that we have covered the basics of how to construct nodes in a graphof memory operations, consider again the simple program in Figure 3.1. Goingthrough its trace line by line means that we build the nodes of the correspond-ing graph on the fly, one by one. The process of building this graph is shown inFigure 3.3.3.3 Full Ruleset for a Non-Coherent AcceleratorThere are subtle details in our theoretical framework that are not mentioned inthe previous section (Section 3.2), which will be covered here. Below is thefull ruleset that our approach and our race detection tool follow when analyzinga program intended for a non-coherent accelerator to find potential data races.To facilitate describing the rules, let CPUop be the following set of CPU op-erations: {cached read, cached write, cache flusha, do dma read, do dma write,sync}. Also, DMAop is defined as {dma r, dma w}.Several rules below use the concept of a bloated address range. Recall thateach node has an inclusive address range [low,high]. The CPU cache reads andwrites data from and to shared memory in chunks, where the size of the chunk18depends on the cache. The size of a read from shared memory is generally a mul-tiple of the cache line size, which means that reads are cache line granularity.A cache flush is also performed at cache line granularity. The size of a write toshared memory is often a multiple of the cache line size as well but may differdepending on the cache, and is thereby called the writeback granularity. There-fore, to reflect the actual address range that the CPU cache touches in sharedmemory, the address range of “alloc” and “cache flusha” nodes are modified sothat low and high are multiples of the cache line size, where these multiples stillcover the range [low,high]. Specifically, assuming that the bloated address rangeis [bloated low,bloated high], bloated low is the largest address that is a multipleof the desired granularity equal to or below low ; bloated high is the smallest mul-tiple of the desired granularity that is larger than high, minus one. In addition, theaddress range of “wb” nodes is likewise modified so that low and high are multi-ples of the writeback granularity. For example, if an address range is [10,59] andthe granularity is 64, then the bloated address range would be [0,63].1. If “uncached read”, generate an uncached read node with the specified ad-dress range.(a) If there was a previous CPUop, draw an arrow from it to the uncached -read node.(b) If there exist any previous dangling writebacks (wb nodes) whose ad-dress range overlaps with this uncached read node’s address range,then check if there is a path from each of those wb nodes to this un-cached read node, or a path from this uncached read node to those wbnodes. 
If there is no path either way — in other words, there is nopath both from a relevant wb node to this uncached read node and viceversa1 — flag that there is a race and stop the analysis2.1With the current ruleset, a path would not exist from the uncached read node to a previous wbnode. However, to the best of our knowledge, the vice versa check is a conservative approach tocover any possible cases (perhaps, if more rules are constructed in the future) where such a pathcould exist. The reasoning behind performing the vice versa check, though it is currently trivialbecause there can be no such path, applies to all other vice versa checks in Section 3.3.2In some processors, they may check uncached reads against dirty data in the cache. In suchprocessors, there is no race. Here, we choose to be agnostic and conservative, and report a race.19(c) Check a chain of past DMA writes (dma w nodes) back to the lastknown sync or where there was a dma r or dma w that happened-before. If any of these dma w nodes have an address range that overlapswith this uncached read node’s address range, then check if there is apath from that dma w node to this uncached read node, or vice versa.If there is no path either way, flag that there is a race and stop the anal-ysis. No check is necessary if the last DMA operation is a DMA read(dma r node) because read/read is not a race.2. If “uncached write”, generate an uncached write node with the specified ad-dress range.(a) If there was a previous CPUop, draw an arrow from it to the uncached -write node.(b) If there exist any previous dangling writebacks (wb nodes) whose ad-dress range overlaps with this uncached write node’s address range,then check if there is a path from each of those wb nodes to this un-cached write node, or vice versa. If there is no path either way — inother words, there is no path both from a relevant wb node to this un-cached write node and vice versa — flag that there is a race and stopthe analysis3.(c) Check a chain of past DMAop (dma r or dma w nodes) back to thelast known sync or where there was a dma r or dma w that happened-before. If any of these DMAop nodes have an address range that over-laps with this uncached write node’s address range, then check if thereis a path from that DMAop node to this uncached write node, or viceversa. If there is no path either way, flag that there is a race and stopthe analysis.3. If “cached read”, generate a cached read node with the specified addressrange.3Similar to uncached reads, we assume that the processor will bypass the cache. This is a safeassumption because simple processors do this. Again, we choose to be conservative and report arace.20(a) If there was a previous CPUop, draw an arrow from it to the cached -read node.(b) Generate a cr node with the same address range. Draw an arrow fromthe cached read node to this cr node. When the next CPUop is knownin the future, draw an arrow from the cr node to it.(c) If the read is aligned (the start address is a multiple of the cache linesize), generate an alloc node with a bloated address range, alloc1. Drawan arrow from this alloc node to the cr node from Step 3b.(d) If the read is unaligned and the specified address range fits in one cacheline, generate an alloc node with a bloated address range, alloc1. Drawan arrow from this alloc node to the cr node from Step 3b.(e) If the read is unaligned and the specified address range does not fit inone cache line, generate two alloc nodes, each with a bloated addressrange, alloc1 and alloc2. 
Draw an arrow from each alloc node to thecr node from Step 3b. Note: Only one of {3f}, {3g, 3h}, or {3i} setsof steps can happen at one time. For whichever set of steps occurs,perform the same set of steps for both alloc1 and alloc2.(f) If an alloc’s address range is the first cached operation after a flush withwhich its address range overlaps, draw an arrow from the cr node fromStep 3b to that alloc node. Note that cache flushes, like allocs, are alsocache line granularity.(g) If there is a previous dangling wb node whose address range overlapswith that of alloc, draw an arrow from that wb to this alloc.(h) If there is a previous dangling wb node (wb) whose address range over-laps with that of alloc, generate a wb node (wb’) with the same addressrange as that of the dangling wb node (wb). Draw an arrow from the crnode from Step 3b (i.e., the parent of the alloc node) to this generatedwb node (wb’). The writeback node needs to be propagated becausethis writeback could happen before or after the alloc node associatedwith this cached read (see also Theorem 3.4.6).(i) If there is a previous alloc node whose address range overlaps with thatof alloc, draw an arrow from that alloc to this alloc.21(j) Check a chain of past DMA writes (dma w nodes) back to the lastknown sync or where there was a dma r or dma w that happened-before. If any of these dma w nodes have an address range that overlapswith the address range of the alloc node(s) (alloc1 and, if it exists fromStep 3e, alloc2), then check if there is a path from that dma w node toalloc1 (and alloc2), or vice versa. If there is no path either way, flagthat there is a race and stop the analysis.4. If “cached write”, generate a cached write node with the specified addressrange.(a) If there was a previous CPUop, draw an arrow from it to the cached -write node.(b) Generate a cw node with the same address range. Draw an arrow fromthe cached write node to this cw node. When the next CPUop is knownin the future, draw an arrow from the cw node to it.(c) If the write is aligned (the start address is a multiple of the writebackgranularity), generate a wb node with a bloated address range, wb1.Draw an arrow from the cw node from Step 4b to this wb node.(d) If the write is unaligned and the specified address range fits in onewriteback (i.e., the size of the writeback granularity), generate a wbnode with a bloated address range, wb1. Draw an arrow from the cwnode from Step 4b to this wb node.(e) If the write is unaligned and the specified address range does not fit inone writeback granularity, generate two wb nodes, each with a bloatedaddress range, wb1 and wb2. Draw an arrow from the cw node to eachwb node.(f) If there is a previous dangling wb node whose address range overlapswith that of any of the wb node(s) generated in this Step 4 (wb1 andwb2), draw an arrow from that wb to this overlapping wb.(g) Check a chain of past DMAop (dma r or dma w nodes) back to thelast known sync or where there was a dma r or dma w that happened-before. If any of these DMAop nodes have an address range that over-22laps with the address range of the wb node(s) (wb1 and, if it exists fromStep 4e, wb2), then check if there is a path from that DMAop node towb1 (and wb2), or vice versa. If there is no path either way, flag thatthere is a race and stop the analysis.5. If “do dma read”, generate a do dma read node with the specified addressrange.(a) Generate a dma r node with the same address range. Draw an arrowfrom the do dma read node to this dma r node. 
If there was a previousDMAop, draw an arrow from it to the dma r node.(b) Check if there is a path from any previous dangling writebacks (wbnodes) whose address range overlaps with that of this dma r to thisdma r node, and vice versa. If there is no path either way, flag thatthere is a race and stop the analysis.6. If “do dma write”, generate a do dma write node with the specified addressrange.(a) Generate a dma w node with the same address range. Draw an arrowfrom the do dma write node to this dma w node. If there was a previ-ous DMAop, draw an arrow from it to the dma w node.(b) If there exist any previous dangling writebacks (wb nodes) whose ad-dress range overlaps with this dma w node’s address range, then checkif there is a path from each of those wb nodes to this dma w node, orvice versa. If there is no path either way, flag that there is a race andstop the analysis.(c) If there exist any previous floating allocs (alloc nodes) whose addressrange overlaps with this dma w node’s address range, then check ifthere is a path from each of those alloc nodes to this dma w node, orvice versa. If there is no path either way, flag that there is a race andstop the analysis.7. If “sync”, generate a sync node (with no address range).23(a) If there was a previous CPUop, draw an arrow from it to the sync node.(b) Draw an arrow from the last DMAop to this sync node. This ends aDMAop chain, and none of the DMAops in this chain are to be consid-ered any longer.8. If “cache flusha”, generate a flush node with a bloated address range.(a) If there was a previous CPUop, draw an arrow from it to the flush node.(b) Draw an arrow from any dangling wb nodes whose address range over-laps with that of flush.3.4 Theorems and ProofsThe following theorems justify the correctness of our approach. Each of the fol-lowing theorems tie into the rules described in Section 3.3; which rules a theoremsupports are made explicit under each theorem in “Related rule(s)”. Assume thatthe nodes discussed in each theorem access the same or overlapping memory range.Figure 3.4: Assume each operation in the graph accesses the same part ofmemory. Adding an arrow from a “wb” node to an “alloc” node doesnot add more order, and adds more possible races. There is a missingpropagated “wb” node dangling from “alloc1” not shown here, as thisis due to another theorem (Theorem 3.4.6). For more details, please seeFigure 3.6.Theorem 3.4.1. Adding an arrow from a “wb” node to an “alloc” node does notadd more order, and adds more possible races. The number of possible races ispositively monotonically increasing.24Related rule(s): 3gProof. Note that adding an arrow from a “wb” node to an “alloc” node signifiesthat the writeback may happen before the allocation. However, it may not havehappened and could have happened after the allocation (i.e., the cached read); thiscase is handled in Theorem 3.4.6. This means that there are strictly more writesand reads, which means there are strictly more races possible. For example, inFigure 3.4, the arrow from the “wb0” node to the “alloc1” node signifies that “wb0”may happen before but certainly not after“alloc1”.The problem that this theorem is addressing is how to make the write visible,which occurs when the value written is transferred from the CPU cache to sharedmemory. There are two situations that can transpire depending on when the write-back occurs. 
First, it could be that the writeback does not happen (until later); thus,the allocation also does not happen since the value can be read directly from thecache. The generated nodes in light grey are “maybes”; they may happen. In thiscase, to make the write visible, the “wb” node must be propagated after the cachedread and is left dangling (i.e., unordered). As an aside, this is a point to considerin Theorem 3.4.6’s proof. Second, it could that the writeback does occur at thatmoment in the program execution, resulting in an allocation for the succeedingcached read operation. Then, the write has already been made visible since it hasbeen written back, and there seems to be no need to propagate the writeback in thisscenario. The proof of Theorem 3.4.6 shows otherwise.Theorem 3.4.2. Adding an arrow from an “alloc” node to another “alloc” nodedoes not add more order, and adds more possible races. The number of possibleraces is positively monotonically increasing.Related rule(s): 3iProof. Again, note that adding an arrow from an “alloc” node to another “alloc”node signifies that two allocations could occur. However, in reality, only one wouldprobably occur. Nonetheless, our model takes the conservative approach and fol-lows the rule that every cached read node has an associated “alloc” node in order tomake handling cached reads consistent. Having an arrow from an “alloc” node toanother “alloc” node does not add new order because it follows cache order. Since25there are more “alloc” nodes (reads) than necessary, there can only be strictly moreraces possible.Figure 3.5: Handling cached memory operations appropriately after a CPUcache flush, for instance, by following the rule outlined in Theo-rem 3.4.3, allows for the race detection tool to avoid false warningsof a potential race possibly after every cache flush.Theorem 3.4.3. Assume there is no spontaneous prefetching done by the CPUcache. Adding an arrow from a “cr” node to its “alloc” node after a cache flush,if a cached read is the first cached memory operation after the flush, and anotherarrow from the first cached memory operation after the flush to the next “alloc”node, if the memory range it accesses overlaps with that of the first cached memoryoperation after the flush, removes erroneous races that would otherwise be detectedby the race detection tool after a CPU cache flush.26Related rule(s): 3fProof. Since the CPU cache cannot do anything after a cache flush until a cachedoperation occurs, drawing an arrow from the first cached memory operation aftera flush to that memory operation’s “alloc” node (if it is a cached read) and to thenext “alloc” node no matter what its address range follow cache order.Figure 3.5 shows an arrow from “cr2” to “alloc2” as “cached read2” is the firstcached memory operation after a flush (“cache flusha1”). Therefore, the “alloc2”node is not a floating node. Semantically, this means that the cache does not doany work after a flush before there is an instruction to do so.Furthermore, in Figure 3.5, the second cached read operation (“cached read6”)accesses a memory region that overlaps with the first cached read. Hence, therace detection tool draws an arrow from “alloc2” to “alloc6”. Without the Theo-rem 3.4.3 rule, the tool would erroneously flag a race at the “cached read6” opera-tion because “alloc6” would be unordered with respect to “dma w4” in Figure 3.5.Theorem 3.4.4. 
Loop invariant: We have enough information to detect all possible races as memory operations are added one at a time.

Related rule(s): 1b, 1c, 2b, 2c, 3j, 4g, 5b, 6b, 6c

Proof. The data race detection tool keeps track of only the nodes in the active frontier, which provides enough information to detect all possible races. The active frontier is a set of nodes that are unordered in the graph, which implies that they are the only nodes that could be part of a data race. Removing nodes from the active frontier cannot eliminate any races, because they are removed only if they become ordered during the program analysis, as the graph is built one memory operation at a time.

Recall that a race is defined in terms of an execution. Our analysis does not create more edges than in a graph of all the memory operations in a given program, referred to henceforth as the "full graph". If there is a race in the full graph, then as long as the tool creates the two nodes that are involved in the race, there is a race in the partial graph. (A partial graph consists of only the nodes in the active frontier; it is a subgraph of the full graph.) This always happens, since the tool builds the graph one node at a time and does not add any extra edges compared to the full graph. In particular, the tool creates the two racy nodes exactly when they matter, in other words, when they can execute under our abstraction model. This works because a node follows an instruction.

The proof for the other direction — if the partial graph has a race, then the full graph has a race — is discussed in the proof of Theorem 3.4.5.

Theorem 3.4.5. An active frontier is sufficient to be able to check for races and thus, we can ignore all previous memory operations.

Proof. In order to prove Theorem 3.4.5, let us consider the following two statements:

• Any race that could happen in the full, unbounded graph can be detected with only the active frontier.

• Conversely, any race in the active frontier is a race in the full, unbounded graph.

If nodes are deleted from the analysis graph, then neither the mistake of missing a race that is in the full graph nor the mistake of flagging a race that is not present in the full graph may happen. Theorem 3.4.4 has already shown that, if a race exists in the full execution (full graph), then — since the tool is guaranteed to build nodes eventually and one at a time, creating no more edges than in the full graph — a race exists in the partial execution (partial graph).

Consequently, the only case that would pose a problem in detecting a data race with the active frontier is if one node is deleted (i.e., removed from the active frontier) before the other is built by the tool. This leads to a case analysis. Firstly, there are several scenarios in which the operation to be deleted first is a writeback (wb), as listed below:

1. wb → cache flush: wb must happen before the flush deletes wb. If the flush comes before another conflicting node, then this is not a race. On the other hand, if the flush comes after the conflicting node, the tool would detect a race because wb has not yet been deleted by the flush.

Related rule(s): 8b

2. wb0 → wb1: In this case, wb1 is deleted. A conflicting node cannot be an alloc or a wb because arrows are drawn from wb0. The conflicting node can only be a node that does not participate in cache behaviour. If the conflicting node is before wb1, then the tool would have flagged the race already.
If theconflicting node is after wb1, then there could be a race or no race with wb1.If there is a race involving wb1, then our approach would have preserved arace. If there is no race involving wb1, then there is also no race with wb0because the racy node is guaranteed to be either before or after wb1; thus, arace did not escape detection.The only way that the tool would miss catching a race in this case is if theconflicting node is after the second cached write (associated with wb1) butis also somehow before wb1; however, this scenario is impossible with ourrules, which are fully defined in Section 3.3.Related rule(s): 4f3. wb→ alloc: wb is propagated further in this case (see Theorem 3.4.6). Theracy node that conflicts with wb would not be a cached operation becausecached operations are always in order. Recall that alloc is generated with acached read. Then, if wb happens before the cached read in program order,the tool would catch the race before deleting wb. If wb happens after thecached read and the alloc conflicts with the dangling wb, then a race wouldhave been preserved. Otherwise, if the alloc does not conflict with the prop-agated, dangling wb, then it is also not a race with the deleted wb. It isimpossible for the conflicting node to occur between wb and alloc due to ourrules; there is no rule that would lead to this situation.Related rule(s): 3gTheorem 3.4.6. Propagate a new “wb” node after a “cr” node (as its child) if thereis a relevant dangling writeback before that “cr”, where “relevant” means that the“wb” node and the “alloc” node associated with “cr” access overlapping memoryaddresses. The “wb” and “alloc” nodes are used to check for relevancy because29Figure 3.6: Assume that each node accesses overlapping memory addresses.wb1 is propagated as a child of cr1 since there is dangling write-back (wb0) that accesses the same or overlapping memory region ascached read1. The tool would miss a race if this writeback was notpropagated.they both use a bloated address range, which help to detect issues similar to falsesharing (see Section 5.3).Related rule(s): 3hProof. With this rule stated in Theorem 3.4.6, there are two writeback nodes thatsignify the same writeback operation in the graph; there are two of them due topropagation of the writeback. This does not pose any problems because still, onlyone of them happens; the tool does not know when it happens — whether it happensbefore or after the alloc associated with the cached read node. Bear in mind that ifthe writeback does not happen, then the allocation also does not happen.Figure 3.6 emphasizes the need for this rule. When the tool arrives at the un-cached read node (uncached read2) during analysis, the tool should flag a potentialrace with the writeback from cached write0 because it is unknown whether thewriteback has already happened or not at the time of this uncached read. There-fore, the tool must propagate the dangling writeback by creating a new “wb” nodeafter the cr1 node. Now, with that propagated node, the tool is able to flag a poten-tial race between the uncached read (the uncached read2 node) and the writeback(the wb1 node).Observe that Theorem 3.4.1, Theorem 3.4.2, Theorem 3.4.3, and Theorem 3.4.630conservatively capture any possible race, therefore, making our approach sound.31Chapter 4ImplementationDynamic race detection aims to find data races in a program while it is executing. Itanalyzes one execution of the program at a time. Dynamic race detection can detectdata races more easily than if done statically. 
Static race detection runs before theprogram is executed. It is not as scalable as the dynamic approach, as it mustconsider all the possible paths in the control flow of the program. Additionally, itis difficult for static race detectors to perform alias analysis, let alone avoiding falsepositives in results. On the other hand, dynamic race detectors know exactly thememory locations of dynamically allocated variables and when they are accessed.Hence, the current implementation performs dynamic race detection on a givenprogram in C; the workflow is shown in Figure 4.1. The program must be instru-mented to keep track of every memory operation, compiled, and then executed.The instrumentation causes the program to print the specific memory operationsthat were done in that execution into a trace file. Then, our tool reads through thetrace file to detect whether there are any races or not in the recorded execution ofthe program.Firstly, Section 4.1 introduces the VectorBlox architecture which the imple-mentation is tailored for. Next, Section 4.2 describes what happens in the transi-tion from a program to its trace, i.e., the first arrow of the workflow in Figure 4.1.Section 4.3 details how the following step — the trace analysis — is implementedin the race detection tool.32Figure 4.1: The workflow starts from a given input program that is instru-mented to print out every memory operation and then executed so thatthe analysis is done on a single execution. The outputted memory op-erations are aggregated if necessary into a final trace file that is read byour dynamic race detection tool, which will inform the user of potentialraces in that one execution.4.1 VectorBlox ArchitectureAll devices see a single global address space. The processor possesses an instruc-tion cache and a data cache, the latter of which we will denote as cpu-cache. Weconsider only the data cache because only the values in the data cache that areflushed to main memory can conflict with values originating from the accelerator.Therefore, we are only concerned with addresses used to access external DRAM.We assume that there exist only physical addresses (no virtual addresses). Also,VectorBlox has no prefetching mechanism for the CPU cache. Nonetheless, it issafe to model a cache allocation that is triggered by a cached read the same asa prefetched cache allocation; there is no distinction between how we treat themregardless of whether or not it is a prefetch.The CPU component typically executes the following scalar instructions: load(ld), store (st), and flush (flush). For each of these instructions, the CPU op-erates on a given address, e.g., ld addr. At VectorBlox, the processor issues ldand st in-order and synchronously before the next instruction. A flush addrinstruction probes the CPU cache for a specific address and evicts/writebacks if itis present and dirty. This flush may touch other addresses as well if other addressesare in the same cache line as the address that is being flushed.DMA instructions consist of read (rd addr-range), write (wr addr-range), and sync (sync), where addr-range is a (start-addr, length) pair. TheDMA rd and wr instructions are issued and executed in-order, one at a time until33the range is exhausted, but are also executed asynchronously (i.e., they are con-current with future instructions). 
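Because rd and wr return before the transfer has finished, the CPU must not touch the affected addresses until it has issued a sync (whose semantics are described next). A hedged sketch of the resulting offload idiom follows; the names flush_line, dma_rd, dma_wr, and dma_sync are placeholder wrappers for the instructions above, not VectorBlox API calls, and the cache line size is an assumption.

    #include <cstddef>

    extern void flush_line(void *addr);                          // scalar flush of one cache line
    extern void dma_rd(void *spad, void *mem, std::size_t len);  // async: shared memory -> scratchpad
    extern void dma_wr(void *mem, void *spad, std::size_t len);  // async: scratchpad -> shared memory
    extern void dma_sync(void);                                  // block until the DMA queue drains

    enum { LINE = 64 };                                          // assumed cache line size

    int first_result(int *in, int *out, std::size_t n, void *spad) {
        for (std::size_t off = 0; off < n * sizeof(int); off += LINE)
            flush_line((char *)in + off);      // make the CPU's cached writes visible before the DMA read
        dma_rd(spad, in, n * sizeof(int));     // returns immediately
        // ... accelerator computes in its scratchpad ...
        dma_wr(out, spad, n * sizeof(int));    // returns immediately
        // note: if 'out' were already valid in the CPU cache, it would also need
        // flushing before the DMA write, as discussed in Chapter 1
        dma_sync();                            // without this, reading out[0] below races with the DMA write
        return out[0];
    }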
4.2 Instrumentation and Execution

Dynamic race detection needs a trace of all operations that could be involved in a race and that were run during an execution of the program. In this case, the potentially racy operations are all memory operations.

For the workflow demonstrated in this thesis, a user manually modifies the code so that, for every memory operation, there is a print statement that prints out the type of memory operation and the address range operated on. This instrumentation could be automated instead, but automating it is orthogonal to this thesis. Figure 4.2 illustrates an instrumented program (the same program from Figure 3.1). Compiling and executing this instrumented program produces a trace file, which records the memory address range of each memory operation.

For efficiency, a post-processing pass over the raw trace file consolidates consecutive memory accesses into a single line, producing a new version of the trace. For example, on line 10 in Figure 4.2, the program populates the entire a array through cached writes. However, the printf statement prints the address of each element one at a time, resulting in an unnecessarily verbose trace, especially if the size of the array (N) is large:

cached write 0x7ffd97898fd0-0x7ffd97898fd0
cached write 0x7ffd97898fd1-0x7ffd97898fd1
...
cached write 0x7ffd97898fd9-0x7ffd97898fd9

I wrote a Python program that condenses the trace file where possible. For example, the lines above can be combined into one line, since they are the same memory operation and the memory addresses are consecutive. This one line becomes:

cached write 0x7ffd97898fd0-0x7ffd97898fd9

Figure 4.3 shows the final, condensed trace file for our running example.

This covers the first part of the workflow. The next section describes how we use the resulting trace file to detect potential races.

Figure 4.2: A simple program (similar to Figure 3.1 but with a larger N) that has been instrumented to print out every memory operation, including the type of operation and the memory address range being manipulated. The vbx_dma_to_vector function has the destination as the first parameter and the source as the second parameter.

Figure 4.3: Consolidated trace file generated from the program in Figure 4.2, where the hexadecimal values are the memory address range of the variable being accessed.

4.3 Trace Analysis

The race detection tool, written in C++, implements the rules delineated in Section 3.3, building each node on the fly. It parses the trace file line by line to build these nodes and uses map data structures to record the nodes in the active frontier¹. When nodes in the active frontier become ordered in the graph, they can be safely removed from the active frontier, because either they cannot be racy or the same race would affect a newer node in the active frontier. Therefore, the graph that the tool builds can be pruned; in other words, the tool tracks only the nodes in the active frontier, ignoring all other operations. This reduces the amount of memory used for the analysis step.

¹ Recall from Section 3.4 that the active frontier consists of the nodes that could still be involved in a race as of the point the tool has reached in analyzing the program. For example, the newest write to a memory location is obviously part of the active frontier, whereas writebacks that have already been flushed are not. Dangling writebacks and floating allocs are also part of the active frontier: a dangling writeback is a writeback node that the tool created at the earliest point in time the writeback could occur, even though it is not yet certain to have occurred (the writeback will take place at some later point in time); similarly, a floating alloc node is created retroactively when the node that requires the allocated cache line is created.
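The following is a highly simplified, self-contained skeleton of this analysis loop, written in C for consistency with the other listings (the actual tool is written in C++). It implements only one of the checks — a later DMA operation overlapping a dangling writeback — and assumes a simplified trace format of "<op> <lo> <hi>" per line; the real tool parses the condensed trace of Section 4.2 and applies the full ruleset of Section 3.3, including pruning of ordered nodes.

#include <stdio.h>
#include <string.h>

#define MAX_WB 1024

static unsigned long wb_lo[MAX_WB], wb_hi[MAX_WB];
static int num_wb = 0;

int main(void) {
    char op[32];
    unsigned long lo, hi;
    while (scanf("%31s %lx %lx", op, &lo, &hi) == 3) {
        if (strcmp(op, "writeback") == 0) {
            /* A writeback may happen arbitrarily late, so keep it dangling. */
            if (num_wb < MAX_WB) { wb_lo[num_wb] = lo; wb_hi[num_wb] = hi; num_wb++; }
        } else if (strncmp(op, "dma_", 4) == 0) {
            /* A DMA read or write is unordered w.r.t. every dangling writeback. */
            for (int i = 0; i < num_wb; i++)
                if (wb_lo[i] <= hi && lo <= wb_hi[i])
                    printf("potential race: %s [%#lx-%#lx] vs writeback [%#lx-%#lx]\n",
                           op, lo, hi, wb_lo[i], wb_hi[i]);
        }
        /* The real tool also handles cached reads/writes, allocs, flushes, and
         * syncs, and removes nodes from the active frontier once they are ordered. */
    }
    return 0;
}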
Chapter 5
Evaluation

The evaluation consisted of analyzing real-world examples for races using the dynamic race detection tool implemented for this thesis, which applies the rules defined in Section 3.3, and validating the effectiveness of the tool.

5.1 Experimental Setup

To analyze real-world examples, I ran my dynamic race detection tool on eleven VectorBlox open-source examples¹ provided on GitHub. Examples in the GitHub repository were selected for the experiments if they were non-trivial and written in C. All the examples were executed using a VectorBlox simulator, assuming a single core with its own cache and one vector accelerator. The examples include basic math computations (such as vector addition), signal processing algorithms, and image processing algorithms (Table 5.1).

¹ Retrieved on November 21, 2019. Hash version c7a4894206. https://github.com/VectorBlox/mxp/tree/master/examples/software/bmark

Table 5.1: Benchmark examples

Test Name               Test Description
vbw_libfixmath          Square root and division
vbw_mtx_fir_t           2D Finite Impulse Response (FIR) filter using matrices
vbw_mtx_median_argb32   Median filter (using bubble sort) with 32-bit data type
vbw_mtx_median_t        Median filter (using bubble sort) with 8-bit data type
vbw_mtx_motest*         Motion estimation
vbw_mtx_sobel           Sobel filter (e.g., used in edge detection algorithms)
vbw_mtx_xp_t            Matrix transpose
vbw_vec_add_t*          Vector addition
vbw_vec_fft             Fast Fourier Transform (FFT)
vbw_vec_fir_t           FIR filter using vectors
vbw_vec_power_t         Vector power
* This example contained a race.

5.2 Time Required for Analysis

The shortest runtime was 30 milliseconds (ms) for 6,000 lines in the trace file (the vbw_libfixmath example), and the longest was approximately 815 million ms, or around 9 days, for almost 29 million lines in the trace (the vbw_mtx_sobel example). The analysis for ten of the eleven examples completed in less than 45 minutes each. Figure 5.1 gives a visual overview of the analysis runtime for the examples that contained no races, as the tool stops at the first race detected.

Figure 5.1: This semi-log graph plots the runtime in milliseconds (ms), on a logarithmic scale, of each example that completed (i.e., had no races) against the number of lines in the respective trace.

5.3 Experimental Results

The tool found subtle races in two of the eleven examples evaluated. One is the vector addition example, called "vbw_vec_add_t" in the GitHub repository, which is a test to check that the vector accelerator works correctly. Excerpts of this code from the GitHub repository are shown below. First, the test allocates various variables:

// Assume N is the number of elements in an array.
vbx_mm_t *scalar_in1 = malloc( N*sizeof(vbx_mm_t) );
vbx_mm_t *scalar_in2 = malloc( N*sizeof(vbx_mm_t) );
vbx_mm_t *scalar_out = malloc( N*sizeof(vbx_mm_t) );

The scalar_* variables are named this way because they will be manipulated by the scalar processor. They are allocated with the standard library's malloc, which means that they are CPU-cached variables; hence, any accesses to them are cached reads or writes.
Next, the vector_* variables are allocated using the VectorBlox Application Programming Interface (API) call vbx_shared_malloc, whose semantics are that they are treated as uncached variables; they reside in shared memory.

vbx_mm_t *vector_in1 = vbx_shared_malloc( N*sizeof(vbx_mm_t) );
vbx_mm_t *vector_in2 = vbx_shared_malloc( N*sizeof(vbx_mm_t) );
vbx_mm_t *vector_out = vbx_shared_malloc( N*sizeof(vbx_mm_t) );

The vector_* variables are manipulated mainly by the vector accelerator. Finally, the v_* variables are allocated in the scratchpad, the accelerator's internal memory, but they are not considered by the tool because they are internal to the accelerator.

vbx_sp_t *v_in1 = vbx_sp_malloc( N*sizeof(vbx_sp_t) );
vbx_sp_t *v_in2 = vbx_sp_malloc( N*sizeof(vbx_sp_t) );
vbx_sp_t *v_out = vbx_sp_malloc( N*sizeof(vbx_sp_t) );

The vector addition test consists of two sub-tests that perform the same vector addition computation twice, first on the scalar processor and then on the vector accelerator. The first sub-test, done by the processor, has two input arrays, scalar_in1 and scalar_in2, whose values are added together and placed into an output array, scalar_out. After this sub-test completes, the next one executes, this time carried out by the vector accelerator, where vector_in1 and vector_in2 are the two input arrays and vector_out is the output array that receives the results.

The vector addition test starts by zeroing out the output arrays, scalar_out and vector_out.

VBX_T(test_zero_array)( scalar_out, N );
VBX_T(test_zero_array)( vector_out, N );

Then, it initializes the scalar_in1 array. Afterwards, the values in the cached variable, scalar_in1, are copied into the uncached variable, vector_in1, in a loop.

VBX_T(test_init_array)( scalar_in1, N, 1 );
VBX_T(test_copy_array)( vector_in1, scalar_in1, N );

Similarly, the test initializes scalar_in2 before copying its values into vector_in2.

VBX_T(test_init_array)( scalar_in2, N, 1 );
VBX_T(test_copy_array)( vector_in2, scalar_in2, N );

The program calls the helper function test_scalar to perform the test on the CPU and then prints out the output array.

scalar_time = test_scalar( scalar_out, scalar_in1, scalar_in2, N );
VBX_T(test_print_array)( scalar_out, PRINT_LENGTH );

To check that the vector accelerator executes instructions correctly, the accelerator first needs to obtain the input values via DMA. The CPU tells the accelerator to do a DMA read through the VectorBlox API call vbx_dma_to_vector. The code below shows DMA reads of vector_in1 and vector_in2 from main memory.

vbx_dma_to_vector( v_in1, (void *)vector_in1, N*sizeof(vbx_sp_t) );
vbx_dma_to_vector( v_in2, (void *)vector_in2, N*sizeof(vbx_sp_t) );

The accelerator executes the vector addition in its scratchpad in the call test_vector, and then DMA-writes the results into the uncached variable, vector_out, with a call to vbx_dma_to_host.

test_vector( v_out, v_in1, v_in2, N, scalar_time );
vbx_dma_to_host( (void *)vector_out, v_out, N*sizeof(vbx_sp_t) );

On the next line, the VectorBlox synchronization function, vbx_sync, instructs the CPU to wait until all previously issued DMA requests have completed.
Due to this sync, there is no race at the next CPU operation, which is to print the uncached array, vector_out.

vbx_sync();
VBX_T(test_print_array)( vector_out, PRINT_LENGTH );

The last line in the vector addition test verifies that the results computed by the scalar processor and the vector accelerator are equal, thereby confirming that the accelerator works.

errors += VBX_T(test_verify_array)( scalar_out, vector_out, N );

The tool detects a potential race at the first DMA read instruction in the "vbw_vec_add_t" example. Specifically, the race occurs between a cached write and a DMA read. This is apparent in the graph generated by the tool (Figure 5.2), where the race is between the first writeback node (labelled "wb0") and the first DMA read ("dma_r14"). In the graph, there is a dangling writeback node (wb0 0x11ff080 0x120f07f) that is unordered with respect to the first DMA read node (dma_r14 0x120f070 0x121f06f). Their memory address ranges overlap — in particular, from 0x120f070 to 0x120f07f. As only one of these accesses is a write (wb0), this is a write-read case. Hence, this situation fulfills the criteria of a data race stated in Definition 2.1.1.

Figure 5.2: Part of the graph generated by our tool as it analyzes the vector addition example. The large black oval in the middle of the graph signifies that several other nodes lie between the nodes shown but are not important for illustrating the race.

In the code, the dangling writeback node and the DMA read node correspond to the following two lines:

VBX_T(test_zero_array)( scalar_out, N );
...
vbx_dma_to_vector( v_in1, (void *)vector_in1, N*sizeof(vbx_sp_t) );

The conflicting variables are scalar_out, a cached variable whose values are eventually written back after the test zeroes this array, and vector_in1, an uncached variable in main memory that is accessed by the accelerator. Although they are two different data structures, our dynamic tool saw that they map to the same cache line because they are allocated at consecutive memory locations. Therefore, due to writeback granularity, the scalar_out writebacks can overwrite the first few bytes of vector_in1 in main memory. The DMA read of vector_in1 may thus read wrong data values. This is similar to false sharing (see Theorem 3.4.6); however, it is a coherence error rather than a performance issue.

The "vbw_mtx_motest" example contained the second race. The tool flagged a potential race between a dangling writeback node from a cached write to scalar_x_input and a DMA write to the uncached variable, vector_result. Specifically, the race is between wb160616 0x1a29080 0x1a290bf and dma_w160722 0x1a25070 0x1a3506c. The guilty lines in the program are below; the vbw_mtx_motest_byte function does a DMA write to vector_result at the end.

init_motest( scalar_x_input, scalar_result );
...
error_rc = vbw_mtx_motest_byte( vector_result, vector_x_input, vector_x_input, &m );

Similar to the first race, vector_result and scalar_x_input are two different data structures but overlap the same cache line, because scalar_x_input is the next variable allocated after vector_result. Again, this condition satisfies the data race definition, as it is a write-write case in which both operations may touch the overlapping memory address range, from 0x1a29080 to 0x1a290bf, with no ordering between them.
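The address arithmetic behind these reports is simple range intersection: two inclusive ranges [lo1, hi1] and [lo2, hi2] overlap exactly when lo1 <= hi2 and lo2 <= hi1, and the overlapping region is [max(lo1, lo2), min(hi1, hi2)]. The following standalone snippet (an illustration only, not part of the tool) plugs in the ranges from the first report:

#include <stdio.h>

int main(void) {
    unsigned long wb_lo = 0x11ff080, wb_hi = 0x120f07f;   /* wb0     */
    unsigned long rd_lo = 0x120f070, rd_hi = 0x121f06f;   /* dma_r14 */
    if (wb_lo <= rd_hi && rd_lo <= wb_hi) {
        unsigned long lo = wb_lo > rd_lo ? wb_lo : rd_lo;
        unsigned long hi = wb_hi < rd_hi ? wb_hi : rd_hi;
        printf("overlap: %#lx-%#lx\n", lo, hi);   /* prints overlap: 0x120f070-0x120f07f */
    }
    return 0;
}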
5.4 Injection of Bugs in Examples

To verify that the implemented tool can catch more kinds of races than those discovered in the real-world examples, I also performed experiments in which I introduced bugs into the VectorBlox code examples by removing necessary synchronization. Not surprisingly, my race detection tool was able to detect these bugs easily in all cases.

Interestingly, in at least one case, the VectorBlox simulator still showed the test as passing, even though I had eliminated necessary synchronization. Note that there exist other sync calls in the program, but the particular sync that was removed turned out to be essential to avoiding potential future problems. This highlights how data races can produce elusive bugs that are not caught in debugging but emerge only late (and sporadically) in a production system — and hence the importance of data race detection tools such as the one I propose.

To demonstrate the impact of these bugs, I will go through one example in detail, specifically the VectorBlox example that calculates vector power. Figure 5.3 shows the operation immediately before the sync (a DMA write) in the program and the operation immediately after the sync (an uncached read). There is no race here, as the dma_w node is ordered with respect to the sync node, and the sync node is ordered with respect to the uncached_read node. Since these orderings are transitive, the dma_w node is guaranteed to happen before the uncached_read node. On the other hand, in Figure 5.4, where the sync has been removed, the dma_w node is unordered with respect to the uncached_read node, because there is no happens-before relation (no path) from the dma_w node to the uncached_read node. Because the DMA write and the uncached read access overlapping memory regions, there is a potential race.

Figure 5.3: The vbw_vec_power_t VectorBlox example's original program order, which includes synchronization.

Figure 5.4: A sync operation has been removed from the vbw_vec_power_t VectorBlox example, thereby introducing a bug into the program. On the VectorBlox simulator, the race does not happen, so the code behaves correctly and the test "passes". However, there is no guarantee that any given implementation will happen to order the two operations in the same way the simulator did. Most importantly, the race detection tool is able to detect this data race.
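In code, the injected bug corresponds to a pattern like the following. This is a hypothetical reconstruction (the actual vbw_vec_power_t source is not reproduced here, and the buffer names are made up); it only illustrates which operation loses its ordering when the sync is deleted.

vbx_dma_to_host( (void *)vector_out, v_out, N * sizeof(vbx_sp_t) );  /* DMA write to shared memory */
/* vbx_sync();   <-- removed to inject the bug */
result = vector_out[0];   /* uncached read: now unordered with respect to the DMA write */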
Chapter 6
Related Work

As we use Lamport's happens-before relation, obvious related work includes the Lamport timestamp algorithm [9] and the more advanced vector clock method [5], which builds on Lamport's work. These two algorithms are used to determine the ordering of events in distributed systems.

Lamport's logical clocks are functions that assign a number to each event in a process and do not necessarily have any relation to physical time. Essentially, each process has a counter and increments it before any event occurs in that process. This counter value is included in any message sent by the process. A receiving process updates its counter to the maximum of its current counter and the timestamp received in the message, and then increments that number by one to account for the message being received. Lamport timestamps satisfy the condition: if event a → event b, then C(a) < C(b), where C(x) is the logical clock of event x. However, the converse is not true. The vector clock method addresses this limitation and satisfies the condition in both directions.

Our solution differs from the Lamport timestamp and vector clock algorithms in that we rely on program order to create the partial ordering in the graph we produce. Causality between events is clear in a program because the CPU is the "master" of the system, telling itself, its cache, and the accelerator what to do.

Multithreaded and concurrent programs have generated much research on both static and dynamic race detection tools. For example, RELAY [18], Warlock [16], RacerX [4], and Locksmith [12] statically check for races by performing lockset analysis. rccjava [6], [2] (which uses extended static checking [3]) is another static tool, but it relies on theorem provers and type checkers. Although static detection tools offer scalable solutions, they generally work with abstracted versions of a program and can therefore produce a large number of false alarms.

Eraser [13] and FastTrack [7] are examples of dynamic race detectors. Eraser tracks, for each shared variable, the set of locks held while threads access it, but it is prone to false alarms. FastTrack exploits lightweight vector clocks and performs comparably to Eraser but is more precise and never reports false alarms. FastTrack observes that the majority of data in multithreaded programs is either thread-local, lock-protected, or read-shared. Thus, it uses an adaptive representation of the happens-before relation that requires only constant space for these common cases, without any loss of precision or correctness. In contrast to a vector clock-based race detector, which records the clock of the most recent write to each variable x by each thread t, FastTrack records the clock and thread identifier of only the very last write to x; this information is dubbed an epoch. The authors of FastTrack assert that all writes to x are totally ordered by the happens-before relation, assuming no races have been detected so far, so the full generality of vector clocks is not needed in this case. Similarly, since reads of thread-local and lock-protected data are totally ordered — again assuming no races have been detected — FastTrack records only the epoch of the last read of such data. FastTrack adaptively switches from epochs to vector clocks (and vice versa) in order to guarantee no loss of precision.

To the best of our knowledge, there is no prior race detection approach for the non-coherent, heterogeneous context that our work targets.

Chapter 7
Conclusions

In conclusion, this thesis presents a novel approach to detecting data races in software for non-coherent accelerators in heterogeneous systems, and proves its correctness and soundness. The approach generalizes to other hardware architectures and code examples, not only those from VectorBlox, because it is mostly architecture-independent. It is mostly but not completely architecture-independent because the model still includes a CPU cache and needs to know the cache line size and the writeback granularity. Nonetheless, the model does not require any details of how the cache works, which protocol it uses, and so on, as our solution abstracts the hardware's behaviour. The approach is also generalizable in the sense that the simple rules we derived work for any non-coherent accelerator.

Furthermore, I have built a dynamic race detection tool that successfully found two subtle races in real-world examples.
The current tool's limitations stem from the fact that it performs dynamic (as opposed to static) race detection, and from the fact that it works with an abstraction of the hardware. Dynamic race detection is less sound than static race detection, meaning it can miss races (false negatives), because it does not explore all possible paths through a program. However, our analysis is sound with respect to the given execution, in that it finds all potential races in that execution. On the other hand, this soundness comes with a loss of precision. We conservatively abstract away some details of the hardware's behaviour, which can lead to false alarms (false positives), since the abstraction is an over-approximation and rests on several assumptions. For instance, one assumption is that a writeback can happen arbitrarily late, but this may not be true given the details of the hardware.

Future work includes implementing a static version of the race detection tool and automatically instrumenting an input program. For example, a static race detection tool could use the LLVM framework [10] to help with the instrumentation and analysis of a program. It would also be interesting to examine ThreadSanitizer [14], a data race detector for C/C++, to see whether our solution can work alongside or within it, or to compare against it.

Bibliography

[1] VectorBlox MXP Programming Guide for Xilinx. URL http://vectorblox.github.io/mxp/mxp_guide_xilinx.html. Accessed 2019-09-27. → page 6

[2] M. Abadi, C. Flanagan, and S. N. Freund. Types for safe locking: Static race detection for Java. ACM Transactions on Programming Languages and Systems (TOPLAS), 28(2):207–255, 2006. → page 47

[3] D. L. Detlefs, K. R. M. Leino, G. Nelson, and J. B. Saxe. Extended static checking. Research Report 159, Compaq SRC, 1998. → page 47

[4] D. Engler and K. Ashcraft. RacerX: effective, static detection of race conditions and deadlocks. In ACM SIGOPS Operating Systems Review, volume 37, pages 237–252. ACM, 2003. → page 46

[5] C. Fidge. Logical time in distributed computing systems. Computer, 24(8):28–33, 1991. → page 46

[6] C. Flanagan and S. N. Freund. Type-based race detection for Java. In ACM SIGPLAN Notices, volume 35, pages 219–232. ACM, 2000. → page 47

[7] C. Flanagan and S. N. Freund. FastTrack: efficient and precise dynamic race detection. In ACM SIGPLAN Notices, volume 44, pages 121–133. ACM, 2009. → page 47

[8] D. Giri, P. Mantovani, and L. P. Carloni. Accelerators and coherence: An SoC perspective. IEEE Micro, 38(6):36–45, 2018. → page 7

[9] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978. → pages 4, 46

[10] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization (CGO), pages 75–86. IEEE, 2004. → page 49

[11] K. Li and P. Hudak. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems (TOCS), 7(4):321–359, 1989. → page 6

[12] P. Pratikakis, J. S. Foster, and M. Hicks. LOCKSMITH: Practical static race detection for C. ACM Transactions on Programming Languages and Systems (TOPLAS), 33(1):3, 2011. → page 46

[13] S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson. Eraser: A dynamic data race detector for multithreaded programs. ACM Transactions on Computer Systems (TOCS), 15(4):391–411, 1997. → page 47
[14] K. Serebryany and T. Iskhodzhanov. ThreadSanitizer: data race detection in practice. In Workshop on Binary Instrumentation and Applications (WBIA), pages 62–71. ACM, 2009. → page 49

[15] A. Severance and G. G. F. Lemieux. Embedded supercomputing in FPGAs with the VectorBlox MXP matrix processor. In 2013 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 1–10. IEEE, 2013. → page 6

[16] N. Sterling. WARLOCK – a static data race analysis tool. In USENIX Winter, pages 97–106, 1993. → page 46

[17] A. L. Varbanescu and J. Shen. Heterogeneous computing with accelerators: an overview with examples. In 2016 Forum on Specification and Design Languages (FDL), pages 1–8. IEEE, 2016. → page 1

[18] J. W. Voung, R. Jhala, and S. Lerner. RELAY: static race detection on millions of lines of code. In ESEC/FSE, pages 205–214. ACM, 2007. → page 46