
UBC Theses and Dissertations

Optimizing network-on-chips for FPGAs Kwa, Jimmy Williamchingyuan 2013

Optimizing Network-on-Chips for FPGAs

by

Jimmy Williamchingyuan Kwa

B.A.Sc., The University of British Columbia, 2009

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in The Faculty of Graduate Studies (Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

April 2013

© Jimmy Williamchingyuan Kwa 2013

Abstract

As larger System-on-Chip (SoC) designs are attempted on Field Programmable Gate Arrays (FPGAs), the need for a low cost and high performance Network-on-Chip (NoC) grows. Virtual Channel (VC) routers provide desirable traits for an NoC such as higher throughput and deadlock prevention, but at significant resource cost when implemented on an FPGA. This thesis presents an FPGA specific optimization to reduce resource utilization. We propose sharing Block RAMs between multiple router ports to store the logic-resource-hungry VC buffers and present the Block RAM Split (BRS) router architecture that implements the proposed optimization. We evaluate the performance of the modifications using synthetic traffic patterns on mesh and torus networks and synthesize the NoCs to determine overall resource usage and maximum clock frequency. We find that the additional logic to support sharing Block RAMs has little impact on Adaptive Logic Module (ALM) usage in designs that currently use Block RAMs while at the same time decreasing Block RAM usage by as much as 40%. In comparison to CONNECT, a router design that does not use Block RAMs, a 71% reduction in ALM usage is shown to be possible. This resource reduction comes at the cost of a 15% reduction in the saturation throughput for uniform random traffic and a 50% decrease in the worst case neighbour traffic pattern on a mesh network. The throughput penalty from the neighbour traffic pattern can be reduced to 3% if a torus network is used. In all cases, there is little change in network latency at low load.
BRS is capable of running at 161.71 MHz, a decrease of only 4% from the base VC router design. Determining the optimum NoC topology is a challenging task. This thesis also proposes initial work towards the creation of an analytical model to assist with finding the best topology to use in an FPGA NoC.

Preface

Parts of Chapters 1, 2, 3, 5, 6, 7, and 8 have been published [19]: Jimmy Kwa, Tor M. Aamodt. Small Virtual Channel Routers on FPGAs Through Block RAM Sharing. In Proceedings of the IEEE International Conference on Field-Programmable Technology (FPT), pages 71-79, December 2012. Dr. Tor Aamodt assisted with the design of the research program. I created the analytical model and modified an existing router design [4] to implement the proposed router architecture. I also analyzed the experimental results. The manuscript was prepared with the assistance of Dr. Tor Aamodt.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgements
1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Thesis Organization
2 Background
  2.1 FPGA Architecture
  2.2 Network-on-chips
    2.2.1 Performance
    2.2.2 Topology
    2.2.3 Flow Control
    2.2.4 Routing Algorithm
    2.2.5 Traffic Patterns
  2.3 Virtual Channel Router Architecture
    2.3.1 Virtual Channel Router Pipeline
3 Small Virtual Channel Routers on FPGA Through Block RAM Sharing
4 FPGA NoC Topology Analytical Model
5 Experimental Methodology
  5.1 Router RTL Modifications
  5.2 Testing Procedure
  5.3 Test Modules
6 Experimental Results
  6.1 Resource Utilization
  6.2 Timing
  6.3 Performance Analysis
7 Related Work
8 Conclusion
Bibliography

List of Tables

2.1 Synthetic Traffic Patterns
4.1 NoC Topology Model Variables
4.2 Bisection bandwidth and average hop count
6.1 4x4 Mesh Resource Utilization
6.2 4x4 Torus Resource Utilization
6.3 BRS vs CONNECT: A 4x4 Mesh ALM Utilization comparison
6.4 VC buffers on CONNECT
6.5 Maximum frequency for 4x4 networks

List of Figures

1.1 BRS vs CONNECT Contour plot
2.1 Configurable 3 input LUT
2.2 Configurable 3 input LUT implementing a 3 input AND gate
2.3 FPGA tile
2.4 Message composition
2.5 NoC topology examples
2.6 Head-of-Line blocking example
2.7 XY dimension ordered routing example
2.8 Traditional VC router architecture
2.9 Three stage VC router pipeline
3.1 BRS architecture
3.2 BRS Stalling conditions
6.1 ALM Usage vs Port Count
6.2 ALM usage of BRS with varying VC buffer sizes
6.3 Network Latency vs Offered Traffic
6.4 Accepted Traffic vs Offered Traffic
Glossary

ALM: Adaptive Logic Module
ASIC: Application Specific Integrated Circuit
BRS: Block RAM Split Router
CLB: Configurable Logic Block
downstream router: any router that directly receives data from the router being examined
flit: packets are divided into small chunks called flits
FPGA: Field Programmable Gate Array
HDL: Hardware Description Language
head flit: first flit in a packet
LAB: Logic Array Block
latency: the time it takes for a packet to go from source to destination
LUT: Lookup Table
message: the data to be sent from source to destination
NoC: Network-on-Chip
offered traffic: number of flits per cycle per node that are being injected into the network
OSNoC: Open Source Network-on-Chip Router
packet: messages are divided into large chunks called packets
RTL: Register Transfer Level
SoC: System-on-Chip
tail flit: last flit in a packet
throughput: number of flits per cycle per node that are ejected from the network
upstream router: any router that directly sends data to the router being examined
VC: Virtual Channel

Acknowledgements

I would like to thank my supervisor, Professor Tor Aamodt, for all the help he has given me over the years. Without him, this work would not have been possible. I would also like to thank the other members of the computer architecture group for their support and for helping me out when I get stuck. Finally, I'd like to thank my family for the innumerable things they have helped me out with throughout my life.

Chapter 1

Introduction

As computing technology advances, so does the field of reconfigurable computing. The concept of reconfigurable computing hardware has been around since the 1960s. Estrin proposed a reconfigurable architecture with a fixed general-purpose computer attached to a variable structure composed of reconfigurable building blocks [14, 15]. The variable structure could be configured to perform a task as quickly as dedicated hardware.
Afterwards, the exact same hardware could be reconfigured to perform a different task. Decades later, Xilinx would invent the reconfigurable hardware device known as the Field Programmable Gate Array (FPGA) [25]. An FPGA is a configurable semiconductor device that can be programmed to perform different hardware functions [34]. FPGAs are composed of several programmable logic elements connected together by a configurable interconnection network. A logic element can be set to act like a simple logic function such as an AND gate. The configurable network is set up to connect the logic elements together and also to connect them to storage resources or I/O pins on the FPGA. By configuring the logic elements and interconnect, large and complicated circuits can be implemented on an FPGA.

FPGAs provide several benefits relative to Application Specific Integrated Circuits (ASICs), which are integrated circuits that have been custom designed for a specific purpose. First, several of the problems associated with manufacturing custom hardware no longer exist, which leads to reduced costs. The physical FPGA hardware has already been designed and tested without requiring the chip designer to get involved. ASIC designs normally have a significant up-front cost due to the custom manufacturing process, so using existing hardware allows FPGA designers to bypass this cost. The savings are not only monetary but also in time: FPGA designers do not need to wait for their design to be processed by a silicon foundry as ASIC designers do, which can reduce time to market by months [35]. A second advantage of FPGAs over ASICs results from their reprogrammability. This feature makes FPGAs well suited to rapid prototyping due to the short turnaround between modifying the design and pushing the changes to the physical FPGA device. Another benefit is the ability to patch bugs in released hardware, since FPGAs can be reprogrammed even after they have been deployed in the field. Thirdly, FPGA design software is capable of automating much of the FPGA design flow. A chip designer can focus on making changes at the hardware description language (HDL) level and mostly rely on the tools to handle the rest. With these advantages, FPGAs have gained popularity in several markets including video processing, consumer electronics, automotive, security, high performance computing and many other areas [2, 33].

This thesis studies the implementation of Network-on-Chips (NoCs) on FPGAs and how they affect the performance of a circuit implemented on an FPGA. It proposes a method of optimizing the routers in the NoC to improve performance while minimizing the FPGA resources consumed by the NoC. Furthermore, this thesis looks into the issue of analytically determining the optimal NoC topology to use for a given chip design on an FPGA. The remainder of this chapter addresses the motivations for this thesis, its contributions, and the organization of the remaining chapters.

1.1 Motivation

As the size of FPGAs grows, so does the desire and ability to implement larger FPGA based System-on-Chip (SoC) designs. This in turn has resulted in an increasingly large number of components that need a low cost and high performance method of communicating with each other. Previously proposed methods of connecting the various components included using a shared bus or directly connecting each component to every component it communicates with [12]. The desire for better scalability and performance has led researchers to propose using an NoC to provide the communication network [10]. An NoC provides a highly scalable structure that facilitates simultaneous communication among network nodes [10]. In this thesis, we study NoCs composed of routers that are used to pass messages between different nodes.
NoC designers want NoCs that use a minimal number of resources while providing a high performance communication network. The performance of an NoC is defined by how low its network latency is and how high its throughput is. Previous studies have examined using FPGAs to simulate ASIC NoCs [23] or have attempted to directly use an NoC designed for ASICs on an FPGA [24]. However, designing an NoC for use on FPGAs presents its own unique issues and is a far less studied area compared to ASIC NoCs. An NoC optimized for an ASIC system will not necessarily be an efficient design on an FPGA, and ASIC optimizations will not necessarily improve an FPGA NoC [24].

One particular NoC component that is problematic to implement on an FPGA is a Virtual Channel (VC) router. A simple router without VCs can suffer from head-of-line blocking, which harms the NoC's throughput. Head-of-line blocking occurs when the packet at the head of a queue stalls and the packets behind it are also blocked as a result. VCs allow packets to go around other packets that have stalled [8]. Additionally, VCs are required to prevent deadlock in some routing algorithms [9]. The class of routing algorithms that require VCs to prevent deadlock includes the well known and popular torus network based dimension-ordered routing [11]. While VCs are a common feature in routers for ASIC NoCs, efficient implementation in an FPGA NoC has been a challenge. A VC router can use up to five times as many resources as a router without VCs [29]. A significant part of the resource usage problem with VCs comes from the VC buffers [24]. VC buffers cannot use an FPGA's Distributed RAMs or Block RAMs in an efficient manner [24]. More details regarding this problem are provided in Chapter 2. The inefficiency in Block RAM usage for VC buffers is one of the key problems targeted by this work.

We present the Block RAM Split (BRS) router architecture, which addresses this problem by customizing the router design so the VC buffers can be implemented using a minimal number of Block RAMs while simultaneously freeing a large quantity of Adaptive Logic Modules (ALMs) for use by user logic. Primarily this is achieved by sharing a single Block RAM between the VC buffers of two router ports. VC buffer management is also modified to accommodate the additional limitations resulting from sharing a Block RAM. Fig. 1.1 shows how many more nodes can be synthesized when using a router architecture that uses Block RAMs compared to an architecture that does not. The graph was produced by extrapolating the resource data from Table 6.1 and Table 6.3, which will be discussed later in this thesis. It shows how the BRS router compares against CONNECT [24], a router design that does not use Block RAMs, under different combinations of resources. The bounds of the graph are set to match the number of M9Ks and ALMs on the FPGA board we were targeting, which are 1280 and 212480 respectively. CONNECT does better when there are only a few Block RAMs available for use in the NoC. However, the minimum number of Block RAMs required before a larger number of BRS routers can be synthesized is low compared to the total number of Block RAMs, and beyond this point many more BRS routers can be synthesized relative to CONNECT.

Figure 1.1: Contour plot of the difference in maximum node count between BRS and CONNECT across different combinations of free M9Ks and ALMs. Positive values mean more BRS nodes.

Our results show that compared to a router that uses Block RAMs we achieve a 40% reduction in Block RAM usage for a torus NoC.
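The resource-budget extrapolation behind a comparison like Fig. 1.1 can be sketched as a simple calculation: the achievable node count is limited by whichever free resource runs out first. The per-node costs below are hypothetical placeholders for illustration, not the measured values from Table 6.1 and Table 6.3.

```python
# Sketch of the node-count extrapolation. Per-node ALM/M9K costs are
# assumed example values, NOT the thesis's synthesis results.

def max_nodes(free_alms, free_m9ks, alms_per_node, m9ks_per_node):
    """Node count is bounded by whichever resource is exhausted first."""
    limits = []
    if alms_per_node > 0:
        limits.append(free_alms // alms_per_node)
    if m9ks_per_node > 0:
        limits.append(free_m9ks // m9ks_per_node)
    return min(limits)

# Illustrative comparison: a BRS-style router spends a few M9Ks per node to
# save ALMs, while a CONNECT-style router consumes ALMs only.
brs = max_nodes(free_alms=212480, free_m9ks=1280, alms_per_node=1500, m9ks_per_node=3)
connect = max_nodes(free_alms=212480, free_m9ks=1280, alms_per_node=5000, m9ks_per_node=0)
```

Sweeping `free_alms` and `free_m9ks` over the ranges of the target device and plotting `brs - connect` would reproduce the shape of a contour plot like Fig. 1.1.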
At the same time, performance is unaffected at lower loads. When compared to a router that does not implement VC buffers in Block RAMs, a 71% reduction in ALM usage is shown. Also, our results show only a 4% decrease in clock frequency relative to the base VC router design.

The second key problem targeted by this work is the selection of an NoC topology for an FPGA based SoC. It is possible to implement heterogeneous systems on an FPGA that require communication among a variety of components [27], and currently there is no analytical method to determine the best way to connect everything together. NoC designers need to guess at what they think may be the best NoC topology to use for a given design, and they will not know how close to optimal their guess was. Although it is not yet complete, the work in this thesis attempts to create an analytical model to determine the best balance between spending resources on user logic and spending resources on NoC logic, along with selecting the best NoC parameters to minimize execution time.

1.2 Contributions

This work provides the following contributions:

1. Proposes a new VC buffer management system for FPGA routers that allows Block RAMs to be shared between multiple router input ports as storage for their VC buffers. This greatly reduces the number of Block RAMs needed to implement a VC router that uses Block RAM based VC buffers.

2. Implements and evaluates a VC router architecture on an FPGA that takes advantage of the new VC buffer management system and demonstrates the effects on performance and the significant savings in resource usage.

3. Highlights the negative effects of larger router port counts on resource usage for FPGA routers.

4. Quantifies the resource usage impact of deeper VC buffers when using the proposed VC buffer system.

5. Proposes the use of an analytical model for FPGA NoC performance and provides an initial framework for this goal. This includes a method of estimating the resource usage of the routers in the NoC and calculating the resource budget for the NoC.

Our VC buffer modifications were built on an existing Open Source Network-on-Chip Router RTL (Register Transfer Level) developed at Stanford [4]. We also used the Stanford Router RTL as the basis for the resource usage equations in our proposed analytical model.

1.3 Thesis Organization

The remainder of the thesis is organized as follows:

• Chapter 2 goes over background information to assist with understanding our work, including information regarding FPGAs, NoCs and routers.
• Chapter 3 presents our proposed VC router architecture for FPGAs that uses our new VC buffer management system. This work was published previously at the International Conference on Field-Programmable Technology 2012 and is one of the key contributions of this thesis.
• Chapter 4 describes the work that was done towards creating an analytical model for FPGA NoCs and is the other key contribution of this thesis.
• Chapter 5 covers our methodology and how our tests were set up.
• Chapter 6 goes over the results of our work.
• Chapter 7 covers related work in the area and how our work fits into the field.
• Finally, Chapter 8 concludes the thesis.

Chapter 2

Background

This chapter covers background information related to the contributions of this thesis. It is split into three sections and provides information on FPGA architecture, NoC design and NoC router architecture.

2.1 FPGA Architecture

An FPGA is composed of several different parts including logic blocks, I/O blocks, connection blocks, switch blocks, wiring and potentially many other components. Unlike an ASIC design, the hardware in an FPGA is set before the circuit to be implemented has been designed. As a result, the FPGA hardware needs to be as flexible as possible to accommodate the countless different designs that users may wish to implement on it.
To make this possible, the components on an FPGA are largely configurable.

Logic blocks on an FPGA can be implemented in different ways. Two examples are the Configurable Logic Blocks (CLBs) on Xilinx FPGA devices and the Logic Array Blocks (LABs) on Altera FPGA devices. CLBs hold blocks called Slices, which in turn hold configurable Look-Up Tables (LUTs), arithmetic logic and registers to implement the design [32]. For Altera devices, LABs consist of multiple ALMs along with communication logic [3]. ALMs are made of multiple different types of LUTs along with arithmetic logic and flip-flops.

Fig. 2.1 is an example of a configurable 3-input LUT used in an FPGA. An N-input LUT is capable of implementing any logic function that uses up to N inputs. The configuration bits on the left can be set so that for every possible combination of valid inputs, a specific value propagates to the output. To configure a LUT on an FPGA, a stream of bits is shifted into the LUT's configuration bits. For example, by shifting the binary value 10000000 into the LUT's configuration bits, the 3-input LUT acts as a 3-input AND gate, as shown in Fig. 2.2. If all three inputs are high, the tree of transistors creates a path from the '1' to the output. For all other valid inputs a '0' value propagates to the output. This is the exact same functionality as a 3-input AND gate.

Figure 2.1: Configurable 3 input LUT. [1]

Figure 2.2: Configurable 3 input LUT implementing a 3 input AND gate. [1]

Connection blocks, switch blocks and routing wires are used to connect together the logic blocks in an FPGA to bring everything together. Fig. 2.3 shows a tile in an FPGA that contains these components along with a logic block.

Figure 2.3: FPGA tile. [1]

The horizontal and vertical routing tracks send signals in and out of the FPGA tile. The connection block connects the logic block to the routing tracks, while the switch block connects together the vertical and horizontal routing tracks. The configuration of these connections is done in the connection and switch blocks. They act as a crossbar, and many of the crossover points have a configuration bit that indicates whether a connection should be made at that point. Like the logic blocks, these crossover point configuration bits are set when the design is loaded onto the FPGA. When all of the proper configuration bits are set, a connection from the output of one logic block to the input of another can be created. To reduce the size and increase the speed of a switch block, not every possible connection between horizontal and vertical tracks is allowed: only a subset of the tracks can be connected, and many switch block designs are possible. Also to save area and improve speed, the routing tracks themselves vary in length. Some routing tracks are short and only go from one tile to another, while other tracks are long and span the entire chip.

An FPGA can also contain blocks of non-reconfigurable logic that create specific structures. Embedded memories, multipliers or even a CPU can be implemented using these blocks. Embedded memories such as Block RAMs provide a form of high density data storage on the FPGA. These structures are known as hard structures and can achieve higher speed and density. However, this comes at the cost of being unable to use the area for something different; these hard blocks simply go to waste if they cannot be used by the desired design. The routing tracks can also be used to connect these hard structures to the reconfigurable logic blocks.

The VC buffers in a VC router are problematic to implement on an FPGA due to inefficient mapping to the storage resources [24]. On an FPGA, storage comes from using logic resources or Block RAMs. Xilinx's Distributed RAM uses LUTs as storage instead of for logic, while Altera uses register bits inside of ALMs. Block RAMs are larger storage units that cannot be subdivided and are far less abundant than LUTs or ALMs on an FPGA. Implementing the VC buffers using logic resources will consume a significant number of LUTs or ALMs, with the problem getting worse at larger VC counts or buffer depths. On the other hand, VC buffers are small relative to the storage capacity of a Block RAM. Using Block RAMs to store the VC buffers will use up a highly limited resource, and a large part of each Block RAM will remain unused due to the inability to subdivide them.

2.2 Network-on-chips

An NoC is a transportation system that can be used to move data from one location on a chip to another. It is composed of several router nodes that are connected to each other as well as to components outside of the network known as terminals. Common types of terminals include processors and memories, and they use the network to send and receive data. The data to be sent from one terminal to another is known as a message. Using its connection to a router, a terminal can send a message into the network. This act is known as injecting a message into the network. The message is then passed from router to router through connections called channels, and each instance of this is called a router hop. The router one hop before the current router is the upstream router, and the next router the message will be sent to is the downstream router.
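The Block RAM under-utilization described in Section 2.1 can be made concrete with a back-of-the-envelope calculation. The VC count, buffer depth and flit width below are assumed example values rather than the thesis's exact configuration; the 9216-bit capacity corresponds to an Altera M9K block (9 Kbits including parity).

```python
# Rough sketch of why one port's VC buffers under-use a Block RAM.
# Parameter values are illustrative assumptions, not the evaluated design.

M9K_BITS = 9216  # capacity of one Altera M9K Block RAM

def bram_utilization(num_vcs, buffer_depth, flit_bits):
    """Fraction of one Block RAM used if it stores a single port's VC buffers."""
    used = num_vcs * buffer_depth * flit_bits
    return used / M9K_BITS

# Two 8-flit-deep VCs of 36-bit flits occupy only 576 of 9216 bits; the
# remainder is wasted because a Block RAM cannot be subdivided among
# unrelated logic.
util = bram_utilization(num_vcs=2, buffer_depth=8, flit_bits=36)
```

Under these assumptions a single port's buffers fill only about 6% of the Block RAM, which is the kind of waste that motivates sharing one Block RAM between the buffers of multiple ports.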
The specifics of how the message reaches its goal depend on the configuration of the network, but at a high level the message is either passed from router to router until it reaches a router connected to the target terminal, or the message is dropped from the network. If the message reaches a router connected to the target terminal, the message is sent out of the NoC to the terminal. The act of leaving the NoC is called being ejected from the network.

Before a message is injected into an NoC, it is usually broken into several chunks known as packets, and packets are further broken into flits. An example of this is shown in Fig. 2.4.

Figure 2.4: Messages are split into packets which are split into flits. Each packet and flit has additional header bits. [11]

Each packet has additional header bits to relay control information. These extra bits are indicated by the marked area on the far left of the packet. This information commonly includes the intended destination of the packet. Similarly, each flit has header information. The information stored in the far left bits of the flit commonly includes a valid flag bit that indicates whether the flit contains valid information, a head flit flag bit that marks the first flit in a packet, a tail flit flag bit that marks the last flit in a packet and, in the case of VC routers, an indicator showing which VC the flit will be stored in at the next router.

Prior to the use of NoCs, two other forms of communication were used [12]. The first is point-to-point, where each terminal has a direct channel connecting it to every other terminal it needs to communicate with. The other is a shared bus, where data transfer between one source and any number of destination terminals connected to the same bus is possible, but only one source can send data at a time.
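The flit fields described above (valid, head, tail, and a VC identifier alongside the payload) can be sketched in a few lines. Field names and the packetization helper are illustrative assumptions, not the encoding used by the router RTL.

```python
# Minimal sketch of a flit and of splitting a packet's payload into flits.
# Field names/widths are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class Flit:
    valid: bool   # flit carries real data
    head: bool    # first flit of a packet (carries routing information)
    tail: bool    # last flit of a packet
    vc: int       # VC this flit will be stored in at the next router
    payload: int

def split_packet(words, vc):
    """Turn a list of payload words into flits, marking head and tail flits."""
    flits = []
    for i, w in enumerate(words):
        flits.append(Flit(valid=True,
                          head=(i == 0),
                          tail=(i == len(words) - 1),
                          vc=vc,
                          payload=w))
    return flits
```

Note that a single-word packet would produce one flit with both `head` and `tail` set, which is how many router designs represent single-flit packets.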
For larger terminal counts, point-to-point communication requires a large amount of area to implement because it scales quadratically with node count, and the problem gets worse as terminal counts grow. A shared bus is small, but performance is low for higher terminal counts since all terminals must take turns using the bus. An NoC provides a balance between the two: it can provide much higher performance than a shared bus while using far fewer resources than point-to-point [12].

2.2.1 Performance

There is no one single number that indicates how good an NoC is. Rather, there are several different metrics that can be used to measure the effectiveness of an NoC, and different applications place different weights on the importance of each metric. These metrics include resource usage, throughput, bisection bandwidth, latency and others.

On an FPGA, circuits can be implemented using a variety of logic resources or hard blocks. The primary resources consumed by the NoC designs in this thesis are ALMs and Block RAMs. The resource usage metric of an NoC looks at the amount of each FPGA resource that is used by the NoC. Using less of each resource is better, but the best balance of two or more different resources can depend on the target design. For example, if the user logic needs a large number of Block RAMs, an NoC that avoids Block RAM usage, like the ones generated by CONNECT [24], may be better even if it consumes a larger number of ALMs.

Throughput is the number of flits per cycle per node that are ejected from the NoC. The throughput of the NoC depends not only on the parameters of the NoC but also on the traffic pattern and the rate at which flits are injected into the NoC. The distribution of target destinations for packets injected at each router forms the traffic pattern. Traffic patterns are described in more detail later in this section.
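The quadratic-scaling argument above can be illustrated by counting links under each interconnect scheme. The mesh channel count assumes a square k x k layout with bidirectional links (two unidirectional channels each); these formulas are a sketch, not a statement about any particular device.

```python
# Sketch of interconnect link counts for n terminals. Point-to-point needs a
# dedicated unidirectional link per ordered terminal pair; a bus is a single
# shared medium; a square mesh NoC grows roughly linearly with node count.

def num_links(n, scheme):
    if scheme == "point-to-point":
        return n * (n - 1)          # one unidirectional link per ordered pair
    if scheme == "bus":
        return 1                    # single shared medium
    if scheme == "mesh":
        k = int(n ** 0.5)           # assume n = k * k routers in a square mesh
        # k*(k-1) bidirectional links per dimension, 2 dimensions,
        # 2 unidirectional channels per link
        return 2 * 2 * k * (k - 1)
    raise ValueError(scheme)
```

For 16 terminals this gives 240 point-to-point links versus 48 mesh channels, which is the scalability gap the text describes.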
Normally, as the packet injection rate increases so does the NoC's throughput. However, when the packet injection rate rises above a certain threshold, throughput levels off or decreases. The throughput at this point is known as the saturation throughput. If throughput levels off at the saturation throughput point, the network is considered stable. If the throughput instead drops off, the network is considered unstable, which is an undesirable trait. Instability is frequently due to fairness issues in the network that cause packets from one or more terminals to be disproportionately delayed compared to other terminals; this is known as a starvation problem. One target for an NoC is to have as high a saturation throughput as possible, and another is for the throughput to level off rather than decrease when the injection rate is above the saturation throughput point.

A bisection of a network divides the routers in an NoC into two groups that are as close to equal in number as possible. The bandwidth of a particular bisection is the total bandwidth of all the connections that cross from one group to the other. The bisection bandwidth of an NoC is the minimum bandwidth value over all possible network bisections. Bisection bandwidth depends on the topology of the NoC, the number of nodes in the NoC and the width of the channels in the NoC. An NoC should aim to have as high a bisection bandwidth as possible.

Latency refers to network latency, which is the time it takes for a packet to go from source to destination in the NoC. Similar to throughput, latency also depends on the traffic pattern and the rate at which flits are injected into the NoC. As the injection rate increases, so does the latency of the network. Roughly around the saturation throughput point, latency begins to rapidly approach infinity. Latency is affected by the topology of the NoC and the time it takes for a router to process and pass on a flit.
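The bisection bandwidth definition above can be sketched for the square mesh and torus topologies discussed in the next section. Cutting a k x k mesh (even k assumed) down the middle severs k bidirectional links; a torus adds k wraparound links across the same cut, doubling the figure. Each bidirectional link counts as two unidirectional channels of `channel_bits` each; this is a simplified model, not a synthesis result.

```python
# Hedged sketch of bisection bandwidth for a k x k mesh or torus (even k
# assumed so the network bisects into equal halves).

def bisection_bandwidth(k, channel_bits, topology="mesh"):
    links_cut = k if topology == "mesh" else 2 * k  # torus adds wraparounds
    return links_cut * 2 * channel_bits  # bits/cycle crossing the cut

# For a 4x4 network with 32-bit channels:
mesh_bw = bisection_bandwidth(4, 32, "mesh")
torus_bw = bisection_bandwidth(4, 32, "torus")
```

The factor-of-two advantage of the torus over the mesh under this model matches the qualitative claim in the topology discussion that wraparound links increase bisection bandwidth.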
NoCs should be designed to have as low a latency as possible. Throughput vs. offered traffic and latency vs. offered traffic graphs are two common ways of graphically representing the performance of a network. Offered traffic is the number of flits per cycle per node that are being injected into the network. The throughput graph highlights the point where throughput no longer increases with increasing offered traffic and whether the NoC remains stable after that point. The latency graph highlights the zero load latency and the amount of offered traffic that the network can handle before latency starts to rapidly approach infinity.

2.2.2 Topology

The topology of an NoC refers to the layout of the routers and channels that make up the NoC. The topology chosen for an NoC can affect the size of the network and performance metrics such as network latency or bisection bandwidth. Different topologies are suited to different applications. There are countless possible NoC topologies, and this section describes several of the more common ones.

In the mesh topology shown in Fig. 2.5(a), the squares represent the different routers in the network and the lines show which routers are connected by a channel. In reality, a channel only allows data to travel in one direction, so each channel line in the picture actually indicates two channels that work in opposite directions. In this topology, the routers are connected together as if they made up a rectangular grid. Each router is connected to all of its horizontal and vertical neighbours. A special case of the mesh topology is the square mesh, in which the horizontal and vertical dimensions of the topology are the same. The physical locations of the routers do not have to form a grid for the network to be considered a mesh; it is a mesh network as long as the channel connections are the same as if the routers were in a grid.

Figure 2.5: NoC topology examples. (a) mesh, (b) torus, (c) fully connected, (d) ring, (e) hypercube, (f) star.

A torus topology is shown in Fig. 2.5(b). A torus topology is very similar to a mesh topology. The key difference is that the routers at the edge of the network have an additional wraparound link to the opposite edge of the network. This reduces the average latency of the network and increases the bisection bandwidth. However, it comes at the cost of a potentially longer clock period due to the increase in channel length, and an increase in the amount of FPGA routing resources being used.

A fully connected topology is shown in Fig. 2.5(c). In a fully connected topology, every router has a direct connection to every other router. The result is that a message can go from any source to any destination in just a single hop. This configuration has the lowest average router hop count and the highest bisection bandwidth. Unfortunately, fully connected networks are extremely expensive in terms of resource usage and scale poorly.

On the other end of the spectrum is the ring topology shown in Fig. 2.5(d). In a ring topology, each router is connected to two other routers to create a ring of routers. This configuration is one of the smallest possible NoC topologies. The cost of this configuration is a low bisection bandwidth that does not grow as the number of routers increases, and a high average router hop count.

The hypercube network topology shown in Fig. 2.5(e) and the star network topology in Fig. 2.5(f) are two other attempts at striking a balance between performance and resource usage. As its name suggests, a hypercube topology connects routers as if they were the vertices of a hypercube. The configuration shown is only a 4-dimensional hypercube, but the number of dimensions can be scaled up indefinitely.
A star topology places a single router at the centre of the network, and every other router is connected to this central router. Relative to a mesh topology, hypercube and star topologies can potentially have higher performance, but this comes at the cost of larger routers and poor scaling to larger node counts.

2.2.3 Flow Control

Flow control is needed to manage the packet or flit storage space in each router. Storage space for temporarily holding multiple flits at each router in an NoC provides several benefits, such as reducing or eliminating the need for misrouting and dropping packets. Misrouting is when a flit is sent in a direction away from its destination, which increases the number of hops the flit must take to reach that destination. Dropping a packet means the packet is completely discarded; if the packet contained information that must reach the destination, the packet must be resent by the source node. The storage space is divided into multiple independent queues to separate the flits that arrive from different input ports.

To prevent the buffers from overflowing, there needs to be a way to ensure that a packet or flit is only sent to a router that has sufficient open buffer space. If the downstream router does not have enough space, the packet or flit should be held in the buffer space it already has reserved in the sending router. There are many forms of flow control; four of the most well-known types are described below: store-and-forward, cut-through, wormhole flow control and VC flow control.

Store-and-forward flow control manages the router buffers at the packet level. Using this flow control method, a packet must be fully transferred from one router to another before any flit in the packet can be sent onward to the next router in the route. After a packet is fully received by a router, it initially waits inside the router's packet buffer.
The packet must wait for two things before it can be transferred to the next router. First, the next router must have enough buffer space available to hold the entire packet. Second, the channel leading to the next router must not be reserved for sending a different packet. When both of these conditions are met, the channel to the next router and buffer space on the downstream router are reserved. Afterward, the flits in the packet are sent to the downstream router. When this is complete, the buffer space on the previous router is freed and the reservation on the channel is cleared. The problem with store-and-forward flow control is high latency, since an entire packet must be received at each router hop before any data is sent on the next hop. Also, each router requires enough buffer space to hold multiple packets, which can be a problem if the NoC needs to be able to accept large packets.

Cut-through flow control is similar to store-and-forward but no longer requires a packet to be fully received before starting the process of sending data to the next router. The router still waits until the next router has enough buffer space to hold an entire packet and until the channel to the next router is not reserved by a different router port. However, as long as the head flit of the packet has already been received, the flits in the packet will start being sent to the next router. This helps reduce the latency of each router hop. However, requiring buffer space to hold an entire packet can still be a problem.

The following two flow control methods use VCs. A VC is composed of a set of channel state fields and potentially a flit buffer, which is also called a VC buffer. Each VC is associated with a port on the router, and both input ports and output ports can have one or more VCs. For an input-port-buffered router, only the input port VCs have a flit buffer, and every input port VC has one.
The channel state fields store information regarding the transmission of flits over a channel. The input VC tracks the current stage in the router pipeline that the packet has reached, along with the target output port and output VC for the flits at the head of the flit buffer. The input VC may also track the amount of free buffer space in the target flit buffer at the downstream router. Each output VC is also associated with the input VC at the downstream router and potentially keeps track of the amount of free buffer space in that flit buffer. The output VC also keeps track of the input VC at the current router that it was assigned to, as well as global state information.

Wormhole flow control acts like cut-through flow control but works at the level of individual flits. Wormhole flow control uses VCs to store state information. When a head flit reaches the front of a flit buffer at a router, it sets up the state information in the VC and reserves a channel for its next router hop. When the tail flit of the same packet is sent, the VC is freed for use by a different packet, and at the same time the reservation on the channel is removed. Since this flow control operates at the flit level, a flit no longer needs to wait until the next router has enough space to hold an entire packet. The head flit can be sent to the next router as long as there is buffer space for one flit and the channel to the next router is not currently reserved by a different packet. Wormhole flow control does not provide as much throughput as cut-through flow control but significantly reduces the amount of buffer space required in the router: buffers now only need to hold a handful of flits instead of entire packets. A problem with wormhole flow control is that channels remain reserved by a packet until all flits from the packet have been sent to the downstream router.
If a packet stalls in the middle of being sent to the next router, the channel will remain idle even if the next router has buffer space and the current router has flits from a different packet to send to the same downstream router.

VC flow control is an extension of wormhole flow control. VC flow control assigns multiple input and output VCs to each physical channel. Each VC acts similarly to the VCs in wormhole flow control and stores almost the same information. The difference is that the VCs associated with the same input port must take turns sending their flits to the next router, and each input port VC must keep track of the output VC that the VC allocator assigned to it. The advantage of having multiple VCs per port is that each channel is no longer assigned to a single packet at a time; multiple packets can now interleave their turns sending flits over a single physical channel. Head flits from different packets can be received until all available VCs have been reserved, and sending a tail flit still frees up a reserved VC. A head flit can be sent to the next router as long as there is a free VC and a free flit buffer slot at the destination and the flit is allowed to use the connecting channel during that cycle. Non-head flits are assigned to the same VC as the head flit and can be sent as long as there is buffer space and the flit has been given a time slot on the channel.

VC flow control provides many benefits, including increasing throughput and preventing deadlock in certain routing algorithms such as the well-known XY Dimension Order routing on a torus NoC [11]. The different VCs can also be used to create multiple traffic classes that assign different priority levels to different packets. VCs can also be used to prevent the problem of head-of-line blocking.

Figure 2.6: Head-of-Line blocking example.
Head-of-line blocking occurs when the flit at the head of the buffer queue cannot be sent while another flit in the queue could have been sent if it were at the front. An example of this is shown in Fig. 2.6. In this figure, the small squares represent flits in the NoC. On the left side of the figure, the squares with a cross on them are trying to head to the right but are currently unable to make progress, and the queues are backing up. At the same time, the square with the single slash on it wants to take the open path upward but cannot because of the cross flit in front of it. This effect is known as head-of-line blocking. VCs can create an effect similar to a street having a left turn lane. By adding an additional turn lane you get the situation on the right side of the figure: the single slash flit is able to use the second VC as a turn lane to go around the stalled cross flit and take the open channel upward.

In all four of these flow control methods, the router needs to know the amount of free buffer space at the downstream router before sending a flit. To support this, a credit based system can be used. Each router keeps a counter for each of its neighbours' input ports that it is connected to. There is an individual counter for each VC, and it tracks the number of flits that can be sent to the buffer associated with that VC before it is full. As long as the value is not zero, a flit can be sent to the neighbouring router and the counter is decremented by one. When a router sends a flit onward, a space opens up in one of its flit buffers, and this information needs to be relayed back to the upstream router. This is done by sending a short stream of data called a credit via a separate connection that goes in the opposite direction of the channel. The credit indicates which VC buffer has a new open slot. When this credit is received, the associated counter is incremented by one.
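The credit bookkeeping just described can be modelled in a few lines (an illustrative software sketch of the mechanism, not the router RTL; the class name and depth are our own):

```python
class CreditCounter:
    """Tracks free slots in one downstream VC buffer, from the sender's side."""

    def __init__(self, buffer_depth):
        self.credits = buffer_depth  # downstream buffer starts empty, all slots free

    def can_send(self):
        return self.credits > 0

    def send_flit(self):
        # Sending a flit consumes one presumed-free downstream slot.
        assert self.can_send(), "must not send with zero credits"
        self.credits -= 1

    def receive_credit(self):
        # A returning credit means the downstream router freed one slot.
        self.credits += 1

vc = CreditCounter(buffer_depth=4)
for _ in range(4):
    vc.send_flit()
print(vc.can_send())   # False: downstream buffer presumed full
vc.receive_credit()
print(vc.can_send())   # True: one slot reopened
```

Note that the sender decrements optimistically at send time; the counter is only pessimistic, never wrong, because a credit is sent for every forwarded flit.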
The minimum number of cycles between sending a flit and receiving a credit back is known as the credit round-trip delay, and it is determined by the time to transmit data between two routers and the time for the routers to process the credit. To avoid limiting throughput, each VC buffer should be able to hold a number of flits that is greater than the credit round-trip delay.

2.2.4 Routing Algorithm

The routing algorithm employed in the NoC determines the path a packet or flit takes through the NoC to go from source to destination. There are a limitless number of possible algorithms, and the performance of the NoC depends on which one is used. A routing algorithm's quality is based on a few factors such as load balancing, average hop count, and complexity of implementation.

Load balancing looks at how evenly traffic is distributed in the network under multiple different traffic patterns that stress the network in different ways. If the load is balanced well, the channels in the network can be used more efficiently, which leads to higher throughput. To create a balanced network, a packet is sometimes given a path that is not the shortest possible route but uses channels that are not as busy. A routing algorithm that can only assign the shortest possible path, or minimal path, is called a minimal routing algorithm. A routing algorithm that is able to assign paths other than the shortest possible is called a nonminimal routing algorithm. The advantages of a balanced network need to be weighed against the effect on the number of hops each flit needs to take to reach its destination. Reducing the number of hops required to go from source to destination helps reduce the latency of the network. Tying everything together is the implementation of the algorithm: more complex routing algorithms may require additional resources to implement or adversely affect clock speed in exchange for the benefits they provide.
This is another factor that must be balanced against everything else.

Routing algorithms can be divided into three categories: deterministic, oblivious and adaptive. When using a deterministic routing algorithm, all packets and flits with the same source and destination always take the exact same path through the network. These algorithms are typically the simplest to implement but are not very good at balancing load on the network. Under an oblivious routing algorithm, a packet can take multiple different paths from a source to a destination; however, the path is chosen without using information regarding the current state of the network. The third category of routing algorithms is adaptive routing. Adaptive routing algorithms use information from the network to try to create better paths for a packet to follow. For example, an adaptive routing algorithm may recognize that a part of the network is currently congested and change the route of packets that would normally go through that area. Adaptive routing algorithms are better able to handle load balancing, but they can be complex and costly to implement.

Dimension ordered routing is one of the most well-known routing algorithms [11]. It is a deterministic routing algorithm that uses minimal length paths and can be used in mesh or torus topologies. Under dimension ordered routing, each flit can only travel along a dimension once, and the flit can only make hops in the direction of its destination. Furthermore, all flits traverse the dimensions in the same order. Along a dimension, only one direction takes a flit closer to its destination, and since the dimensions are always traversed in the same order, the path from a source to a destination is always the same. The exception to this is in torus networks when both directions along a dimension take the same number of hops to reach the target.
As long as the same direction is chosen every time, the algorithm is still deterministic. Since a flit is only allowed to move towards the destination, dimension ordered routing is minimal. For a mesh topology, forcing dimension traversal to always follow the same order prevents dependency loops from forming and creating a deadlock. However, a torus topology using dimension ordered routing can still deadlock if it does not use VCs.

Fig. 2.7 shows an example of XY dimension ordered routing on a mesh network. Under XY routing, travel along the X axis occurs first. A packet traveling from A to B must first make one hop to the right before taking two hops upward, and all packets and flits from A to B will take this route. In the second example, a packet wants to go from C to D. In this case no travel in the X dimension is required, so packets from C can immediately travel upward along the Y axis until they reach D.

Figure 2.7: XY dimension ordered routing example.

2.2.5 Traffic Patterns

The distribution of target destinations for packets being sent by each source is the traffic pattern being run on the network. To test the performance of a network, the sources injecting packets into the network can be configured to generate a specific traffic pattern. In an NoC, each router has an address associated with it, and different topologies can assign the addresses in different ways. In a mesh or torus network, the address of a router is what its Cartesian coordinate would be if the routers were physically arranged in a grid. For both the mesh and torus networks, the bottom left router is chosen as the origin and has an address of 0. Table 2.1 lists several possible synthetic traffic patterns and how the destination address for each packet source is determined. di and si refer to individual bits in the b-bit long source and destination addresses.
The address can also be broken into k different sections, indicated by dx and sx, which are the coordinates in each of the k dimensions of the network. Synthetic patterns are simple test patterns intended to represent traffic patterns that may occur in real applications.

Table 2.1: Synthetic Traffic Patterns [11].

  Name             Pattern
  Uniform Random   uniform random destination distribution
  Bit Complement   di = ¬si
  Bit Reverse      di = s(b−i−1)
  Bit Rotation     di = s((i+1) mod b)
  Shuffle          di = s((i−1) mod b)
  Transpose        di = s((i+b/2) mod b)
  Tornado          dx = (sx + (⌈k/2⌉ − 1)) mod k
  Neighbour        dx = (sx + 1) mod k

The first listed pattern is uniform random traffic. Under this traffic pattern, each source sends packets to every destination except itself with equal probability. This is the most commonly used traffic pattern, but it does not do a very good job of stressing a network and can hide problems with load balancing. For every other listed traffic pattern, each source sends all of its packets to just one destination; if the target destination for a source is itself, the source does not inject any packets at all. Bit complement traffic inverts each bit in the source address to get the destination address. Bit Reverse, Bit Rotation, Shuffle and Transpose rearrange the bits in the source address to get the destination. Tornado traffic has each source send its packets halfway across the network in each dimension, which stresses torus networks in particular. Neighbour traffic has each source send its packets to the address that is one higher in each dimension and is useful for testing the locality property of the network. Running different synthetic traffic patterns on a network generates multiple different throughput and latency graphs that illustrate the performance of the network under multiple different conditions.
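Several of the bit-permutation patterns in Table 2.1 can be expressed directly from their definitions. The sketch below is our own illustration (the function names are ours), with b the address width in bits and s the source address:

```python
def bit_complement(s, b):
    # d_i = NOT s_i: invert every bit of the b-bit address.
    return s ^ ((1 << b) - 1)

def bit_reverse(s, b):
    # d_i = s_(b-i-1): mirror the bit order.
    return sum(((s >> i) & 1) << (b - i - 1) for i in range(b))

def bit_rotation(s, b):
    # d_i = s_((i+1) mod b): each destination bit i takes source bit (i+1) mod b.
    return sum(((s >> ((i + 1) % b)) & 1) << i for i in range(b))

def transpose(s, b):
    # d_i = s_((i+b/2) mod b): swap the upper and lower halves of the address.
    return sum(((s >> ((i + b // 2) % b)) & 1) << i for i in range(b))

def neighbour(coords, k):
    # d_x = (s_x + 1) mod k in every dimension, coordinates given as a tuple.
    return tuple((c + 1) % k for c in coords)
```

For example, with b = 4, `bit_complement(0b0011, 4)` yields `0b1100`, and `neighbour((3, 1), 4)` wraps around to `(0, 2)` on a 4-wide torus.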
2.3 Virtual Channel Router Architecture

VC routers were introduced to improve the throughput of an NoC [8] and to allow a greater variety of routing algorithms to run deadlock-free [9]. However, the large amount of FPGA resources consumed by the traditional VC router architecture [29] is a barrier to more widespread adoption of VC routers on FPGAs. The router architecture is similar to the one shown in Fig. 2.8.

Figure 2.8: Traditional VC router architecture. Figure drawn with only two input and output ports for clarity – we use 4 port routers in our evaluation in Chapter 6.

Each input port of the router has an associated route computation unit, a set of VC buffers and storage space for the VC state information. The router also contains a VC allocator and switch allocator to manage access to the output VCs and reserve cycles on the switch. The allocators and switch are shared among all the ports on the router.

The route computation unit calculates the output port id number a flit should use to head towards its intended destination; it needs to be able to calculate this for any destination in the network. For example, in a mesh network there are five possible output ports on a router. The route computation unit knows the address of the current router and learns the destination address from data stored in the head flit of the packet it is currently examining. The route computation unit knows that the router addresses encode the source and destination's coordinates in the mesh and that a hop to another router will increment or decrement the current coordinates in one dimension.
By taking the difference between the current coordinates and the target coordinates, the route computation unit knows whether positive or negative travel in each dimension is required. Under XY Dimension Order routing, the route computation unit will send the flit to an output port that causes the flit to align its X coordinate with the target before lining it up in the Y dimension. If the address of the destination and the current router match, the route computation block indicates that the flit needs to take the ejection port out of the network next.

An allocator matches a group of resources to a group of requesters. Allocators are composed of multiple arbiters. An arbiter is a hardware structure that can be used to manage requests for a shared resource. All requests for the same resource are sent to the same arbiter, and the arbiter decides which request will be accepted and which will be rejected. Desirable attributes in an arbiter include fairness, size and speed. Fairness is a measure of how equally requesters of the same priority are treated. Under a round robin arbiter, a requester that just had its request granted is given the lowest priority the next time an arbitration decision needs to be made. This provides all requesters an equal chance at winning arbitration: if a request has been made, every other requester can win no more than once before that request is granted.

There are multiple ways to implement the VC and switch allocators, and the different methods vary in size, performance and maximum clock speed. We choose to use separable input first allocators due to their small size and high clock speed. The separable input first allocators used in this router architecture are composed of multiple round robin arbiters in two stages.
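A round-robin arbiter of the kind these stages are built from can be sketched behaviourally as follows (a software model for illustration only; the actual arbiters are combinational hardware, and the class name is ours):

```python
class RoundRobinArbiter:
    """Grants at most one of n requests; the winner drops to lowest priority."""

    def __init__(self, n):
        self.n = n
        self.next_priority = 0  # requester examined first on the next decision

    def arbitrate(self, requests):
        """requests: list of n booleans. Returns the granted index, or None."""
        for offset in range(self.n):
            idx = (self.next_priority + offset) % self.n
            if requests[idx]:
                # The winner becomes lowest priority for the next decision,
                # giving every other requester a turn before it can win again.
                self.next_priority = (idx + 1) % self.n
                return idx
        return None

arb = RoundRobinArbiter(3)
print([arb.arbitrate([True, True, True]) for _ in range(4)])  # [0, 1, 2, 0]
```

With all three requesters asserting every cycle, the grants rotate 0, 1, 2, 0, which is exactly the fairness property described above: no requester wins twice before every other pending requester has won once.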
In the first stage of the switch allocator, the switch requests from all VCs under the same input port compete against each other to ensure that only one request per input port is granted. The winners are sent to the second stage of allocation, where requests from VCs that target the same output port compete against each other. This ensures that only one request per output port is granted. The resulting grants are not guaranteed to be maximal, but providing such a guarantee would require a larger or slower allocator. In the first stage of the VC allocator, arbitration is performed for each input VC such that each VC now requests at most one output VC, and that output VC is not already reserved. During the second stage of VC allocation, each output VC is assigned to at most one of the input VCs requesting it.

The switch allows flits to travel from the input port of a router to an output port. This is another part of the router that consumes a lot of ALMs, especially for routers with larger port counts. In network designs where a flit is not allowed to go back the way it came, the switch can be made smaller by not including a connection between an input port and the output port in the reverse direction. Although other works, such as that of Leary et al. [20], examine switch-specific optimizations, they are outside the scope of this thesis.

The input port block holds the VC state information required to process flits through the router. Each VC has its own set of state information, and this information includes a global state value, the output port the flit at the head of the buffer wishes to use, the output VC it was assigned and the number of credits remaining at the downstream router buffer. The VC buffers are a key area of interest and are directly targeted by an optimization proposed in this thesis. The VC buffers are implemented on an FPGA using either ALMs or Block RAMs, and they store the data for each valid flit received by the router until the flit is sent to the next location. Each input VC on each port has its own VC buffer.

2.3.1 Virtual Channel Router Pipeline

The traditional VC routing process is broken into four pipelined stages [11]: Route Computation, VC Allocation, Switch Allocation and Switch Traversal. To reduce the number of clock cycles needed to traverse the router, the number of pipeline stages is reduced to three through lookahead routing computation [17]. Lookahead routing computation means the current router performs the route computation needed by the next router, so the calculation is always complete one router hop in advance. Route Computation for the next router and VC Allocation for the current router are independent of each other and can be performed in parallel. An example of the resulting three stage pipeline can be seen in Fig. 2.9. While other well-known speculation based optimizations exist that allow further stages to be performed in parallel [26], our router architecture only uses lookahead routing computation.

Figure 2.9: Three stage VC router pipeline.

When a flit is sent by an upstream router or a terminal attached to the local input port, it travels along a channel until it reaches the flip flop at the start of the router input port. At this flip flop, the flit is stored for a cycle to pipeline the transport of the flit's data. During the next cycle, the flit is passed to the VC buffer.
If the flit is a head flit, the header information is also passed on to the Route Computation block and VC Allocator. The Route Computation block performs the lookahead routing computation [17] to calculate the flit's desired output port at the next router. This is done so the route computation can proceed in parallel with VC allocation, which needs to know the flit's target output port at the current router. Due to the lookahead route computation done at the previous hop, the flit already contains the target output port of the current router. This information is used to determine where the flit will be after it hops from the current router. The flit also contains the router address of its final destination. Using the knowledge of what the next router is and what the target destination is, along with the rules of the routing algorithm being employed, the route computation block determines which output port should be targeted at the next router. This information is then written to the head flit to be used by the next router.

During the VC allocation stage, the VC allocator attempts to allocate an output VC to each input VC that contains a flit and has not yet been assigned an output VC. Only head flits can request an output VC; all non-head flits use the same VC as the head flit of their packet. When a head flit arrives at the router, it already contains a tag that indicates the output port it wants to use. When the flit reaches the front of its VC buffer, the input VC sends a request signal to request a VC associated with the target output port. The VC allocator replies with a grant signal that indicates whether the allocation was successful. If the input VC is unable to reserve an output VC on the target output port, it will try again next cycle. Upon success, the allocation result is recorded and used to direct the remaining flits in the packet to the same output VC.
Once an output VC has been reserved, it cannot be assigned to another input VC until the reservation ends. When a tail flit is sent out, the output VC reservation is removed and the output VC can be reassigned to a different input VC.

Each input VC that was assigned an output VC and currently contains a valid flit sends a request signal to the switch allocator to reserve cycles on the switch. Each output port can only accept a flit from one VC across all input VCs per cycle, and only one VC from each input port can send a flit to an output port per cycle. The switch allocator enforces these constraints by giving a grant to at most one VC for each input port and granting each output port to at most one input port. If an input VC does not receive a positive grant signal from the switch allocator, it will not send a flit to the switch during the next cycle. When a VC buffer is given a positive grant signal by the switch allocator, during the next cycle the flit at the front of the buffer is removed and sent into the switch to be moved to a flip flop at the output port. One cycle after being received by the output port flip flop, the flit is sent across the link towards the next router.

The switch traversal stage is also when credits are sent to the upstream router to indicate that a new space has opened up in one of the current router's VC buffers. The upstream routers keep track of the free space in each of the current router's VC buffers that they are attached to. When they send a flit, they assume it will be received and decrement the credit count of the VC they sent the flit to. When a credit is received, they increment the credit count of the VC the credit is associated with.

This is the last section covering background material. The next chapter covers our proposed VC router architecture modifications, which are one of the key contributions of this thesis.
Chapter 3

Small Virtual Channel Routers on FPGA Through Block RAM Sharing

Our work presents a new and simple method of implementing and managing the VC buffers to reduce the amount of resources consumed while limiting the impact on performance. The key to our work is a method of sharing the space inside a Block RAM between the flit data portion of the VC buffers associated with two of the router input ports. This change can be seen in Fig. 3.1 where the VC buffers for two input ports are now held within a single Block RAM as opposed to Fig. 2.8 where they are separate memory blocks that could be implemented in either Block RAMs or logic resources. Flit header information is small and must be accessed more often than the flit data so it is still stored in logic resources. We designed the BRS architecture to target an FPGA that uses Altera M9K Block RAMs but the idea can be extended to other Block RAMs that support true dual port access. The M9K memory blocks come with several limitations that must be addressed before they can be used to store the desired VC buffers. The primary issue is the number of ports on the Block RAMs. The M9K Block RAMs have two read/write ports. On any given cycle, a port can be used to either read or write to a memory location in the Block RAM. A single port can also be used to read and write to the same memory location during the same cycle. In our work we configure the Block RAM such that a read and write during the same clock cycle will return the new data if the same Block RAM port is used and the old data if different Block RAM ports are used. For Block RAMs that do not support this configuration, additional logic will be required to read the new data during a simultaneous read and write. In the baseline design, a VC buffer can both be written to and read from during the same cycle if a flit arrives at the input port at the same time the VC is sending a flit through the switch to an output port.
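The read-during-write semantics we rely on can be captured in a small behavioural model. This is a software sketch for illustration only (the class and method names are ours), not the vendor's simulation model of the M9K.

```python
# Behavioural model of the read-during-write configuration described above:
# a same-port read+write returns the new data, a cross-port read returns the
# old data.
class TrueDualPortRAM:
    def __init__(self, depth, init=0):
        self.mem = [init] * depth

    def cycle(self, ops):
        """ops: one entry per RAM port (A then B); each is None, ('r', addr),
        ('w', addr, data) or ('rw', addr, data). Returns per-port read data."""
        reads = [None, None]
        # Reads sample the array before writes commit, so a read on one port
        # concurrent with a write on the other port sees the old data.
        for p, op in enumerate(ops):
            if op and op[0] in ('r', 'rw'):
                reads[p] = self.mem[op[1]]
        for p, op in enumerate(ops):
            if op and op[0] in ('w', 'rw'):
                self.mem[op[1]] = op[2]
                if op[0] == 'rw':
                    reads[p] = op[2]  # same-port read-during-write: new data
        return reads

ram = TrueDualPortRAM(depth=512)             # M9K: 512 entries in this mode
ram.cycle([('w', 7, 111), None])
out = ram.cycle([('rw', 7, 222), ('r', 7)])  # out == [222, 111]
```

Port A's simultaneous read and write of location 7 returns the new value, while port B's concurrent read of the same location returns the old one, matching the configuration described above.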
Figure 3.1: BRS architecture. Figure drawn with only two input and output ports for clarity – we use 4 port routers in our evaluation in Chapter 6.

Since two router ports will be sharing a single Block RAM, this can result in accesses to as many as four different memory locations during the same cycle. This behaviour is not compatible with using a single Block RAM as storage since access is limited to at most two different locations at the same time. The second issue with the Block RAMs is the limitation on data width and on VC count and depth. While operating as a true dual port memory, the M9K Block RAMs have a maximum data width of 18 bits and can hold 512 entries of data of this size. To address the issue of a limited number of Block RAM ports, modifications to the VC buffer management were required. Situations where more than two locations in a Block RAM were accessed at the same time needed to be avoided. This required implementing a form of stalling when a write or read to the Block RAM could not be performed during the current cycle and ensuring that the write or read is performed in the future. The router only writes to a Block RAM when a new valid flit arrives at an input port. Reads from a Block RAM only occur when a VC on one of the router ports sharing the Block RAM is able to win switch allocation. The lookahead route information is read and recorded when a head flit initially enters the VC buffer and does not need to be read out of a Block RAM. This reduces resource usage but comes at the expense of preventing VC reassignment
until an entire packet has been sent from the VC. To stall writes to a Block RAM, information would need to be passed to the upstream router to indicate a flit was received but dropped and needs to be resent, or the flit would have to be blocked from being sent in the first place. In the interest of keeping the credit logic simple, we opted to give priority to Block RAM write attempts so they are never stalled. The potential downside of this is an increase in network latency from congestion. A router with several flits backed up in its VC buffers may have to hold a normally sendable flit for an extra cycle to receive a flit from an upstream router that currently has nearly empty VC buffers. At most two new flits can arrive on the two input ports sharing a Block RAM during a single cycle. The router ports can use the two ports on the Block RAM to write both flits into their respective VC buffers. This way if an upstream router has credits available, it is able to send a flit with a guarantee that the downstream router has buffer space and will accept the flit. Since Block RAM writes are given priority, the VC buffer management logic needs to stall VC buffer reads when Block RAM ports are not available to service one or both of the reads. This stalling is performed by canceling requests to the Switch Allocator, which prevents a VC from reserving time on the switch that it cannot use and thereby needlessly blocking another VC from the switch during the next cycle. There are several different combinations of potential read and write requests to a Block RAM that need to be handled in different ways. The straightforward cases are when either both or neither of the router ports sharing a Block RAM receive a new flit. When both router input ports receive a new flit, two memory locations need to be accessed inside the Block RAM. This automatically ties up both ports on the Block RAM so all switch requests from any of the VCs from either router port need to be squashed.
The case where neither router port receives a new flit leaves both Block RAM ports free. This allows the VCs for both router ports to send their requests to the switch allocator. The case where only one of the router input ports receives a new flit is a bit more complex. The incoming flit will tie up one of the Block RAM’s ports which means only one of the two router ports can read their flit data from the Block RAM. Priority for sending requests to the switch allocator is given to the router port that is not receiving a new flit. The switch requests of the router port receiving a new flit are masked off unless the other router port sharing the Block RAM does not have any switch requests to make. This router port selection can sometimes lead to unnecessary stalling when the router port selected loses switch allocation and the router port that has its requests canceled would have won. However, this avoids the need for additional Switch Allocators that would be required to redo switch allocation for the other router port during the same cycle. Other options such as using a round robin arbiter to select which port will have its requests masked off are also possible but we opted to use the current method due to its simplicity and because it uses few FPGA resources to implement. Fig. 3.2 shows examples of each of the stalling conditions. The Block RAMs each hold 4 VC buffers, 2 for each router port. The top two bars represent the VC buffers for one router port and the bottom two bars represent the VC buffers for the second router port sharing the Block RAM. Each of the small grey blocks represents a flit as it enters, passes through or leaves the VC buffers. To keep the examples simple, they are designed so the switch allocator never has to resolve any conflicts.
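The request-masking policy described above reduces to a small decision function for the two ports sharing one Block RAM. The sketch below is our own restatement of that policy (the names are ours, not RTL signal names).

```python
# Which ports may forward switch requests, given that Block RAM writes always
# have priority. incoming[p]: a new flit must be written for port p next
# cycle; has_requests[p]: port p has at least one VC requesting the switch.
def mask_switch_requests(incoming, has_requests):
    allow = [False, False]
    if incoming[0] and incoming[1]:
        return allow                    # full stall: both RAM ports write
    if not (incoming[0] or incoming[1]):
        return list(has_requests)       # no stall: both RAM ports can read
    receiver = 0 if incoming[0] else 1  # half stall: one write, one read left
    other = 1 - receiver
    if has_requests[other]:
        allow[other] = True             # non-receiving port gets priority
    else:
        allow[receiver] = has_requests[receiver]
    return allow
```

For example, with one incoming flit on port 0 and both ports holding sendable flits, only port 1's requests proceed; had port 1 nothing to send, port 0 would be allowed its single read alongside its write, staying within the two-access limit.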
Only the top VC buffer for each router port will ever have a flit to send so there will not be competition between VCs of the same router port. All packets are assumed to request different router output ports so there is no competition for the same output port. The result of this is whenever a VC makes a request to the switch allocator, it will be granted time on the switch during the next cycle. Although not shown, the BRS router architecture is fully capable of handling the more complex cases where stalling can occur due to requests being denied by the VC allocator or switch allocator. Conflicts between requests that make it to the VC allocator or switch allocator are handled in the same way as the original router architecture and result in a subset of the flits in conflict being sent to the switch. In Fig. 3.2(a), the top VC buffer for the first router port currently stores two flits and so does the top VC buffer for the second router port. This cycle there are no new incoming flits that need to be written to the VC buffer next cycle. No new incoming flits means no additional stalling will occur. Both Block RAM ports can be used to read flit data from the VC buffers next cycle. The result of this is switch requests from both the top VC of the first router port and the top VC of the second router port will be sent to the switch allocator. The switch allocator will grant the requests so both VCs will be able to send a flit to the switch next cycle. This only requires two Block RAM reads so there are no problems with Block RAM access in this case. In Fig. 3.2(b), the top VC for each router port currently holds one flit each. Unlike the previous example, there are now two new incoming flits that need to be written to the VC buffers next cycle. This is the full stall case. In this case, both memory ports on the Block RAM are tied up next cycle to service the two writes to the VC buffers. This means it will not
Figure 3.2: BRS stalling conditions. (a) 0 incoming flits, no stall; (b) 2 incoming flits, full stall; (c) 1 incoming flit, half stall; (d) 1 incoming flit, half stall.

be possible to read any flits from the Block RAM next cycle and the flits at the front of the VC buffers need to be stalled. To accomplish this, all requests to the switch allocator from the VCs using this Block RAM are canceled. Since the requests never reach the switch allocator, grants are never returned. Without a grant, attempts to read flits from the VC buffers will not be made. The end result is two writes to the Block RAM and zero reads which is supported by the Block RAM. Fig. 3.2(c) covers the first of two cases involving a new flit at only one of the two router ports. In this example, once again the top VC for each router port currently holds one flit each. Only the top router port needs to write a flit to the VC buffers this time. This is a half stall scenario. Only one memory port will be used to write the new flit into a VC buffer so the other memory port can be used to read a flit from the Block RAM. Currently both router ports have a flit waiting in a VC buffer. The router port that is not receiving a new flit is given priority so the bottom router port sends its requests to the switch allocator while the requests from the top port are canceled. Next cycle, the new flit is written to one of the VCs for the top router port and a flit is read out from the bottom port’s VCs and sent to the switch. This case has one write and one read to the Block RAM which falls under the two access limit. Fig. 3.2(d) covers the second case involving a new flit at only one of the two router ports. Unlike the previous examples, only a VC in the top router port is currently holding flits in its VC buffer.
Once again one incoming flit means flits can only be read from one of the two router ports next cycle. Normally, it would be the router port that is not receiving a new flit this cycle. However, this time the bottom router port VCs have no flits to send. As a result, the top router port will be allowed to send a request to the switch allocator. Next cycle the new flit is written to the top router port’s VC buffers and at the same time the data for one flit is read out. Once again the maximum Block RAM access limit is followed. The limited data width on the Block RAMs creates a limitation on the channel width of the VC router. The data width of flits passing through the network is limited to the maximum data width of a Block RAM which is 18 bits for the M9K Block RAM. Increasing the data width beyond this would force the need for multiple Block RAMs to hold the VC buffers of two router ports. The limited capacity of a single Block RAM also creates a limitation on the number of VCs and the depth of their buffers. If the number of VCs multiplied by the depth of the buffer and doubled to account for sharing between two ports exceeds 512, additional Block RAMs are once again required to hold the VC buffers. A variable factor in the BRS router is the selection of router ports to be paired together to share a Block RAM. Different pairings can lead to different effects on the performance of the router. Ports that frequently receive or send flits together should be separated into different pairs. Different routers in the same NoC can use different router port pairings since the pairings are individual to each router.

Chapter 4

FPGA NoC Topology Analytical Model

We attempted to create an analytical model to help determine the best NoC topology to use in an FPGA SoC. Resynthesizing an entire system to manually test different NoC topologies to find the best one can be a time-consuming process.
An analytical model would provide a guide to help quickly determine the best topology and other parameters, such as data width, to use in the NoC. Our model assumes that the NoC is the final piece of the system and the other pieces such as processors, accelerators or memory components have already been designed along with the software that will be run on the processors. Our model is composed of four stages. The stages are Component Synthesis, Router Synthesis, Performance Estimation and Topology Optimization. The parameters used in our model are shown in Table 4.1. Component Synthesis is performed to estimate the resource usage impact of the non-NoC portions of the SoC. The individual non-NoC components are synthesized to estimate the number of ALMs and Block RAMs they consume. To reduce the number of free variables and reduce complexity of the model we assume each type of component falls into one of two categories and, while quantity can vary, copies of the same component consume roughly the same amount of resources. The first category is the constant category. The constant category contains the components that should not change in terms of total resources consumed when SoC parameters change. Any SoC parameter that affects the resource usage of components in the constant category must also remain constant. The total amount of resources consumed by the components in the constant category is determined by summing up the resource usage of the individual components. The end result of this calculation should be a constant. The second category is the variable category. The total resource usage of the components in the variable category should scale relative to a single global scaling factor variable (g_s). Components such as soft processors that scale in number relative to the size of the NoC will fall into the variable category. The result of this is the total resource

Table 4.1: NoC Topology Model Variables.
Name            Description
R_NoCmax        # of Resource R available to implement the NoC
ALM_NoCmax      # of ALMs available to implement the NoC
BRAM_NoCmax     # of Block RAMs available to implement the NoC
R_total         total quantity of Resource R
R_const         total quantity of Resource R consumed by the constant category components
R_var           total quantity of Resource R consumed by the variable category components
g_s             global scaling factor
ALM_router      # of ALMs consumed by the router
BRAM_router     # of Block RAMs consumed by the router
C_x             model coefficients
P               # of ports in the router
P_x             # of ports in router x
W_router        data width of the routers
W_BRAM          data width of a Block RAM
W_routerALM     maximum router data width due to ALMs
W_routerBRAM    maximum router data width due to Block RAMs
N               # of routers
T_exec          execution time
B_b             bisection bandwidth
T_L             packet latency

usage of the variable category components is a function of g_s. This is shown in Equation 4.2 where the amount of a generic resource R consumed by the variable category (R_var) is calculated as a function of g_s. Similar to the constant category, parameters that cause the resource usage of the variable category to change independently of g_s must remain constant. Once the resource usage of the two categories has been calculated, the amount of resources left over for the NoC can be estimated by subtracting the constant and variable category resources from the total amount of available resources as shown in Equation 4.1. By replacing generic resource R with the name of a real resource, these equations can be used to model the usage of ALMs, Block RAMs or another FPGA resource.

R_NoCmax = R_total − R_const − R_var    (4.1)

R_var = f(g_s)    (4.2)

The Router Synthesis stage is performed to determine the effect on resource usage the routers in the NoC have when NoC parameters are varied. Ideally, the router is configurable and supports different port counts and data widths but it is not required.
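Equations 4.1 and 4.2 can be sketched directly in code. The linear form of f(g_s) below is only an assumed example (the model deliberately leaves f unspecified), and every number used is illustrative rather than measured.

```python
# R_var = f(g_s)                                                     (4.2)
# Assumed example: g_s soft-processor tiles, each costing a fixed ALM count.
def variable_usage(gs, alms_per_tile):
    return gs * alms_per_tile

# R_NoCmax = R_total - R_const - R_var                               (4.1)
def noc_budget(r_total, r_const, gs, alms_per_tile):
    return r_total - r_const - variable_usage(gs, alms_per_tile)

# Illustrative numbers only: a device with 212,480 ALMs, 20,000 ALMs of
# fixed (constant category) logic, and 16 tiles of 5,000 ALMs each.
free_alms = noc_budget(212480, 20000, gs=16, alms_per_tile=5000)
```

Substituting the Block RAM counts for the ALM counts gives the corresponding BRAM_NoCmax budget, as noted above for generic resource R.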
If the router is not configurable in the desired way, its resource usage is estimated based on only one of the factors, or assumed to always be the same. The topology choice must be restricted to ones that can be implemented by the router. Our analytical model focuses on two particular resources that may be consumed by a router: ALMs and Block RAMs. The router used in the NoC is synthesized under several combinations of ports and data widths and the total number of ALMs and Block RAMs used in each configuration is recorded. In the case of ALM usage estimation, the equation is derived through two stages of curve fitting. During the first stage, the number of ports in the router is held constant and ALM usage data is recorded while varying the router’s data width. Curve fitting is performed to generate a polynomial that models the ALM usage of the router based on the data width of the router (W_router) being variable and the router port count being constant. The resulting equation is shown in Equation 4.3. The first stage is repeated multiple times for different port counts to get a collection of polynomials. During the second stage, curve fitting is performed on the coefficients of these equations to model them based on the router’s port count (P) being variable. The polynomials that model the coefficients C_0 and C_1 are shown in Equation 4.4 and Equation 4.5. We opt to use quadratic and linear equations because initial tests have shown they can estimate the resource usage of the router to within 7%.

ALM_router = C_0 W_router + C_1    (4.3)

C_0 = C_2 P^2 + C_3 P + C_4    (4.4)

C_1 = C_5 P + C_6    (4.5)

A different equation is required to estimate the number of Block RAMs used to implement a router. In the router design we targeted, Block RAM usage could be modeled by Equation 4.6. The number of Block RAMs per port must cover the data width of the router and this particular router does not share Block RAMs between multiple ports.
BRAM_router = P × ⌈W_router / W_BRAM⌉    (4.6)

Table 4.2: Bisection bandwidth and average hop count formulae for several common topologies. [11]

Topology          Bisection Bandwidth    Average Hop Count
Square Mesh       2√N W_router           2(√N/3 − 1/(3√N))
Square Torus      4√N W_router           √N/2 if √N even; 2(√N/4 − 1/(4√N)) if √N odd
Fully Connected   (N^2/2) W_router       1
Ring              4 W_router             N/4 if N even; N/4 − 1/(4N) if N odd
Hypercube         N W_router             log2(N)/2

The next stage is the Performance Estimation stage. The design of the Performance Estimation stage has not been completed. The intent of this stage was to analytically determine the performance of a system based on g_s, bisection bandwidth (B_b) and latency (T_L) of the NoC. We intended to use the Gem5 simulator [5] to find this for systems composed of soft processors. Other multiprocessor simulators that support a configurable interconnection network may also be usable. We would perform several simulations of the target program. These simulations include holding the number of processors constant, setting NoC latency to zero and measuring the effect of varying bisection bandwidth on execution time. We would also run simulations with bisection bandwidth set to unlimited and check the effect on runtime of different amounts of network latency. Furthermore, we would repeat these simulations under different values of g_s. Once the data had been collected, we expected to be able to use curve fitting to find an equation that models execution time based on g_s and B_b along with an equation that models performance based on g_s and T_L. If completed, we would get Equation 4.7 and Equation 4.8.

T_exec = f(g_s, B_b)    (4.7)

T_exec = f(g_s, T_L)    (4.8)

The final stage is Topology Optimization. The scaling factor g_s sets the number of terminal nodes required. Also, when g_s is used with Equation 4.1, the number of ALMs and Block RAMs available can be determined.
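The Table 4.2 formulae for the two topologies we evaluate most (mesh and torus) can be written as small helpers for the optimization sweep. This is an illustrative sketch with our own function names; n is the node count and w the router data width.

```python
import math

# Bisection bandwidth and average hop count for a square mesh and torus,
# following Table 4.2 (helper names are ours, chosen for the sweep).
def mesh_bisection_bw(n, w):
    return 2 * math.sqrt(n) * w

def mesh_avg_hops(n):
    k = math.sqrt(n)
    return 2 * (k / 3 - 1 / (3 * k))

def torus_bisection_bw(n, w):
    return 4 * math.sqrt(n) * w

def torus_avg_hops(n):
    k = math.sqrt(n)
    return k / 2 if k % 2 == 0 else 2 * (k / 4 - 1 / (4 * k))
```

For the 4x4 networks used in our evaluation with 18-bit channels, these give a mesh bisection bandwidth of 144 bits per cycle and an average of 2.5 hops, versus 288 bits per cycle and 2.0 hops for the torus.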
This information is further used to determine the number of routers needed in the network and the number of ports on each router for different topologies. The output of Equation 4.3 and Equation 4.6 are set to match the maximum number of ALMs and Block RAMs available to the NoC and are rearranged into Equation 4.10 and Equation 4.11 which determine the maximum data width for a given amount of free resources and port counts on the routers. Equation 4.9 takes the minimum of these two values to find the maximum data width for the current NoC parameters. Data width and topology are used to find the bisection bandwidth. Different topologies have a different formula to determine the bisection bandwidth. The formulae for several common topologies are shown in Table 4.2. By sweeping across different topologies and scaling channel width to use up all of the remaining resources, the maximum bisection bandwidth and minimum zero load latency can be determined for a given g_s. This is combined with the result from the Performance Estimation stage to determine T_exec as a function of g_s, bisection bandwidth and latency. Finally, by sweeping across values of g_s, a configuration to maximize performance can be found. This configuration will include the value of g_s, the NoC topology to use, the number of routers and the data width for the routers.

W_router = min(W_routerALM, W_routerBRAM)    (4.9)

W_routerALM = ⌊(ALM_NoCmax − Σ_{i=1..N} (C_5 P_i + C_6)) / Σ_{i=1..N} (C_2 P_i^2 + C_3 P_i + C_4)⌋    (4.10)

W_routerBRAM = ⌊(BRAM_NoCmax × W_BRAM) / Σ_{i=1..N} P_i⌋    (4.11)

Chapter 5

Experimental Methodology

Our work builds on top of the pre-existing Open Source Network-on-Chip Router (OSNoC) RTL [4]. We modified the OSNoC router architecture to implement the proposed BRS router architecture modifications. Our modifications are described below along with our procedure for characterizing the networks.
5.1  Router RTL Modifications

In the original OSNoC design, simple dual-port Block RAMs are used to implement the register file which makes up the VC buffers. This implementation of the VC buffers is incompatible with the BRS router architecture since only one read and one write to the VC buffers can be done each cycle. To address this issue a new Verilog module was created that synthesizes to a register file made up of true dual-port RAMs. The module has two address ports, two input data ports and two output data ports. Any combination of reads and writes can be performed to two different locations each cycle. The module also includes a clock input and write enable signals. The depth and width of the register file are input parameters. If a read and a write to the same location occur on the same cycle, there are two possible outcomes. If the read uses the output data port on the same side as the input data port performing the write, the new data is returned. If the read uses the other output data port, the data in the location prior to the write is retrieved. For each router input port there is an input port controller module that contains the VC buffers for the port along with logic for managing the buffers. In the conversion to the BRS architecture, each input port controller needs to store its VC buffers in a register file it shares with another port instead of a private register file. To accommodate this, each input port controller module no longer has its own register file. Instead the wires that normally would go to the register file leave the controller module to access a register file that is now outside the module. The top level VC router module now holds the VC buffer register files and connects them to the input port controllers that share them. The input controllers send the address of the locations they want to write to or read from to the register file and potentially the data to be written as well.
The register file can send back data stored in a VC buffer. The address spaces for the VC buffers of the two different ports are disjoint so accesses from the two different input controllers will never conflict with each other. The switch allocator module was modified to accommodate the new limitations to VC buffer access caused by sharing a Block RAM between two router input ports. Additional logic is placed between the switch request signals and the switch allocator. This additional logic also receives signals from the flip flop at each router input port that indicates whether a new flit needs to be written in the VC buffers or not. For each port, the new logic first checks to see if a new flit arrived at that port and needs to be written to the VC buffers. If not, the switch requests from that port are allowed to proceed to the switch allocator. If there is a flit that needs to be stored, a second check is performed which looks at the other port that shares a Block RAM with the port currently being examined. This second check determines whether the other port has a flit waiting to be stored in a VC buffer or has requests for the switch allocator. If either of these is true, the requests for the port being examined are masked off so that it cannot win switch allocation. The exception to all of this is the injection port. The injection port does not share a Block RAM with another port so its switch requests are never masked off. The top level VC router module was modified to include logic to manage the shared Block RAM storage for the VC buffers. A new combinational logic block was added to select which address locations inside the Block RAM would be accessed. The decision is based on which ports have a new flit that needs to be written to the VC buffers and which ports have a VC that won time on the switch to send out a flit from the VC buffers. If both ports have a new incoming flit, the two write location addresses are selected.
If neither port has a new incoming flit, both read addresses are selected. If there is one new incoming flit and one port won a grant, the corresponding read and write addresses are selected and passed on to the Block RAM. If the read and write were done to the same address, a flag is set and stored in a flip flop. The flag indicates that the read must be done from the same side of the Block RAM as the write was so the new data is read out. The flag is cleared next cycle unless it is triggered again. The switch value used to make the decision is stored in a flip flop because next cycle it is used to interpret the output from the Block RAM. The switch value indicates which output data port from the Block RAM goes to which router input port. There is one set of this additional logic for each shared Block RAM.

5.2  Testing Procedure

We generate mesh and torus networks with OSNoC, BRS and CONNECT [24] routers to demonstrate the effect on performance and synthesize the designs to determine the effect on resource usage. Synthesis is done using the commercial tool Quartus II and targets an Altera DE4 FPGA board (EP4SGX530KH40C2). The routers are set to have 2 VCs per port with a VC buffer depth of 16. For the routing algorithm, XY dimension ordered routing is used. Each router has 5 ports: North, South, East, West and a local port. For the BRS routers, the East and West ports on the router share a single Block RAM and the North and South ports share a Block RAM. The local port has its own non-shared Block RAM for its VC buffers. In the interest of reducing synthesis and simulation runtime length, our experiments are performed on 4x4 networks. However, our modifications can also be applied to routers in larger networks and we expect our results to hold for larger networks as well unless indicated otherwise. Simulation of the design is performed using ModelSim-Altera.
The synthetic traffic patterns used are uniform random, neighbour, bit complement and transpose. The packets generated are 4 flits long. The local port is attached to a packet generator and a flit sink and they manage the flow of flits to and from the network. Synthetic traffic patterns are used to show how a network responds to different packet injection patterns. Many of these synthetic patterns represent communication patterns found in real applications [11]. We initially run the traffic pattern on the NoC for a 10000 cycle warmup period. After the warmup period is the measurement period where all packets created for the next 10000 cycles are tagged with a flag that indicates that it is one of the packets injected during the measurement period. The flits in these packets are marked with the cycle number the head flit of the packet entered the source queue. Network latency is measured as the difference between the cycle number the head flit entered the source queue and the cycle number the tail flit of the same packet leaves the network. After the measurement period is the drain period where the simulation is run until all packets that were created during the measurement period have left the network. The average latency is calculated as the average network latency of all packets created during the measurement period. To account for the different cycle length of the different designs, the data is normalized to a nanosecond scale instead of cycles using the clock frequency of each router. We sweep across a range of offered traffic to compare the latency of each network and their saturation throughput.

5.3  Test Modules

To assist with simulating the NoC, there is a packet source module for every injection port on every router in the NoC. A packet source is given the address of the router it is associated with, a target rate for generating packets and the length of packets to generate.
A packet source has two different modes for determining the destination of the packets it generates. The first mode is random destination. Under this mode, each packet generated is given a target address at random. The addresses are uniformly distributed among the routers in the network except the destination can not be the same as the router the packet source is associated with. The second mode is single destination mode. Under this mode, the packet source is given another parameter which is a target destination address. Every packet generated by the packet source will be sent to the indicated destination and different packet sources can have a different destination address. Every cycle, the packet source generates a random number and compares it against the target packet rate. If the random number is lower than the packet rate, a packet counter is incremented to indicate a new packet that needs to be created and passed on to be injected into the NoC. As long as the packet counter is greater than zero, the packet source will generate a new packet and inject it into the network one flit per cycle. After the head flit of a packet has been generated the packet counter is decremented but the packet source will continue to generate the remaining flits in the packet. The flits in the generated packets are sent to the testbench module and are tagged with the cycle the head flit of the packet was created, and a flag is set if the packet was generated during the measurement period. A counter keeps track of the number of packets generated during the measurement period. Afterwards the flits are sent to the source queue module. The source queue module sits between the packet source and the injection port of the router and is also part of the testing system. Each source queue can hold over 100,000 flits and is intended to be a representation of a buffer with infinite depth. The purpose of the source queue is to allow the packet source to function independently of the network. 
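The injection process described above amounts to a per-cycle Bernoulli trial plus a counter of packets still owed to the network. The sketch below is our own illustration of that process (names and structure are ours), omitting destinations and flow control.

```python
import random

# Per-cycle Bernoulli packet generation with a pending-packet counter; at
# most one flit leaves the source per cycle, as in the packet source module.
def run_source(cycles, packet_rate, packet_len, rng):
    owed, flits, body_left = 0, [], 0
    for cycle in range(cycles):
        if rng.random() < packet_rate:
            owed += 1                      # a new packet must be injected
        if body_left:
            flits.append(("body", cycle))  # keep emitting the current packet
            body_left -= 1
        elif owed:
            flits.append(("head", cycle))  # head flit: decrement the counter
            owed -= 1
            body_left = packet_len - 1
    return flits

# 4-flit packets, with an illustrative 0.1 packets/cycle target rate.
flits = run_source(10000, packet_rate=0.1, packet_len=4, rng=random.Random(1))
```

Because the counter is decremented when the head flit is emitted while the remaining flits continue to drain, bursts of Bernoulli successes queue up as owed packets rather than being lost, preserving the offered traffic rate.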
If the packet source were directly connected to the network, back pressure from the network's flow control could prevent a packet from being generated, which would alter the traffic pattern. The infinite source queue provides a place to store generated flits for as long as necessary until they are accepted by the network. Time spent in the queue is counted as part of a packet's latency. The source queue receives flow control information from the injection port and tracks the amount of free buffer space in each VC. When there is room in an appropriate VC buffer for a new flit, the source queue sends the flit at the front of the queue to the router, and the flit is tagged with the VC it will be stored in.

The flit sink module is the final module used to characterize a network through simulation. The ejection port of every router is attached to a flit sink that reads flits from the network. The flit sink accepts a parameter indicating how often it should attempt to read a flit from the network and manages flow control accordingly. In our tests, the flit sink was set to always read flits from the network, so there is no additional delay for flits that have reached their destination and are attempting to leave the network. The latency of each packet is the difference between the time the head flit was generated and the time the tail flit was accepted by the flit sink. The average latency is calculated over all packets generated during the measurement period, including packets that are ejected during the drain period. For throughput data, a counter counts the total number of tail flits received by the flit sinks during the measurement period.

Chapter 6

Experimental Results

We performed synthesis and simulations to examine the different router and NoC designs from several angles. In particular, we examined the effects on resource usage, clock speed and NoC performance.
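The latency and throughput accounting just described can be sketched as follows; the flit dictionary fields and class structure are assumptions for illustration.

```python
class FlitSink:
    """Sketch of the per-node flit sink's latency/throughput accounting."""

    def __init__(self):
        self.latencies = []   # per-packet latency samples
        self.tail_count = 0   # tail flits received during measurement

    def accept(self, flit, now, in_measurement_period):
        # The sink is configured to always read, so arriving flits are
        # never delayed at the ejection port.
        if flit['kind'] != 'tail':
            return
        if in_measurement_period:
            # Throughput counts tail flits *received* in the window.
            self.tail_count += 1
        if flit['measured']:
            # Latency covers packets *generated* in the window, even if
            # they drain out afterwards; time spent in the infinite
            # source queue is included via the head-flit timestamp.
            self.latencies.append(now - flit['created'])

    def average_latency(self):
        return sum(self.latencies) / len(self.latencies)
```

Note the asymmetry: the throughput counter is gated on when the tail flit arrives, while the latency sample is gated on when the packet was generated, matching the measurement/drain distinction above.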
6.1 Resource Utilization

Higher degree topologies are possible on FPGAs and can result in a lower NoC diameter and less routing delay [28]. However, this benefit must be weighed against the cost of the higher radix routers they require. Fig. 6.1 shows how the components of the OSNoC router on which we base our work grow in size as the number of ports increases; the VC and switch allocator components grow quadratically.

[Figure 6.1: ALM usage as router port count increases.]

In the rest of this study we focus on mesh and torus topologies, which use lower area cost, low radix routers at the potential expense of higher network latency.

The synthesis results from Quartus II in Table 6.1 show the average resource usage of the routers in a 4x4 mesh NoC based on their location, and the total resources used by the entire NoC. Corner routers are the four routers at each corner of the mesh; they have only two neighbours each. Edge routers in a 4x4 mesh are the eight routers on the perimeter of the mesh that are not at a corner; they have three neighbours each. Centre routers are the four routers in the middle of the mesh; they have four neighbours each.

Table 6.1: 4x4 Mesh Resource Utilization.

                        OSNoC [4]   BRS      % Diff
  Corner Router (ALM)   608         605      -0.58
  Edge Router (ALM)     814         831       2.18
  Centre Router (ALM)   1042        1121      7.59
  Total NoC (ALM)       12877       13237     2.80
  Corner Router (M9K)   3           3         0.00
  Edge Router (M9K)     4           3       -25.00
  Centre Router (M9K)   5           3       -40.00
  Total NoC (M9K)       64          48      -25.00

This data indicates that the additional logic to support sharing a Block RAM between two ports has little impact on the number of ALMs used to implement the NoC: the NoC with our modified routers uses only 2.80% more ALMs than the unmodified NoC. The difference in M9K Block RAM usage is much more significant, with decreases of up to 40%. In the original OSNoC router, edge and corner routers use fewer Block RAMs because unused ports are optimized out, so these routers benefit less from our modification. However, edge and corner routers become a smaller fraction of the total as the mesh NoC is scaled to higher core counts. The net result on the 4x4 mesh NoC is a 25% reduction in Block RAM usage with little change in ALM usage. Extrapolating these results shows an estimated 30% reduction in Block RAM usage for mesh NoCs as small as 6x6.

We synthesized a 4x4 torus NoC with the same parameters as the 4x4 mesh NoC; the results are shown in Table 6.2. They are strongly similar to the synthesis results for the centre routers in the 4x4 mesh. As a torus network has no edges or corners, every router achieves the 40% decrease in M9K usage that only the centre routers in the mesh NoC achieved. Although this comes with the same ALM increase the centre routers in the mesh NoC had, it amounts to only a slightly larger than 5% increase in ALM usage.

Table 6.2: 4x4 Torus Resource Utilization.

                      OSNoC [4]   BRS      % Diff
  ALM Usage (Avg)     1050        1105      5.29
  ALM Usage (Total)   16504       17383     5.33
  M9K Usage (Avg)     5           3       -40
  M9K Usage (Total)   80          48      -40

Table 6.3 compares BRS against an FPGA optimized NoC generated by CONNECT [24]. Unlike BRS, which uses Block RAMs, CONNECT generates an NoC that avoids Block RAMs entirely in the interest of leaving them for the user logic. To follow our implemented 4x4 mesh design as closely as possible, the generated CONNECT NoC is also a 4x4 mesh. The CONNECT NoC is set to have 2 VCs per input port, and each VC has a VC buffer with a depth of 16.

Table 6.3: BRS vs CONNECT. A 4x4 Mesh ALM utilization comparison.

  ALM Usage       CONNECT [24]   BRS     % Difference
  Corner Router   2029           605     -70.20
  Edge Router     3013           831     -72.41
  Centre Router   3965           1121    -71.74
  Total NoC       46091          13237   -71.28
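The extrapolation to larger meshes can be reproduced from the per-router M9K counts in Table 6.1: an n-by-n mesh has 4 corner routers, 4(n-2) edge routers and (n-2)^2 centre routers. A small sketch (the function name is ours, but the input numbers come straight from Table 6.1):

```python
def total_m9k(n, corner, edge, centre):
    """Total M9K blocks in an n-by-n mesh from per-router usage."""
    return 4 * corner + 4 * (n - 2) * edge + (n - 2) ** 2 * centre

# Per-router M9K counts from Table 6.1: OSNoC (3, 4, 5), BRS (3, 3, 3).
for n in (4, 6):
    base = total_m9k(n, 3, 4, 5)   # original OSNoC routers
    opt = total_m9k(n, 3, 3, 3)    # BRS routers
    print(n, base, opt, round(100 * (base - opt) / base, 1))
# 4x4: 64 vs 48 M9Ks (25.0% saving); 6x6: 156 vs 108 (30.8% saving)
```

This reproduces both the 25% figure for the 4x4 mesh and the roughly 30% figure quoted for a 6x6 mesh, since the centre routers (which save the most) dominate as n grows.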
For the allocator, a separable input-first round-robin allocator is used, and the flit data width is set to 18. The data in Table 6.3 and Table 6.4 is taken from synthesizing the CONNECT NoC in Quartus II. The CONNECT VC Queues column of Table 6.4 gives the number of ALMs used to implement the VC buffer storage for routers at the corners, edges and centre of the mesh NoC. Comparing this against the total number of ALMs used to implement the complete router shows that a large fraction of the ALMs used by CONNECT implement VC buffer storage. At the cost of the additional Block RAMs and any effects on performance, our router's ALM usage is lower by 71%. The ALM usage decrease in the centre routers is almost exactly in line with the overall decrease, which shows that larger networks, which have a larger percentage of centre routers, will maintain the 71% reduction in ALM usage.

Table 6.4: VC buffers on CONNECT.

                  CONNECT VC Queues [24]   Router % ALM Usage
  Corner Router   1563                     77.05
  Edge Router     2106                     69.90
  Centre Router   2665                     67.22
  Total NoC       33761                    73.25

The number of bits used in a Block RAM can be calculated by multiplying together the number of ports sharing the Block RAM, the VCs per port, the depth of the VC buffers and the width of the VC buffers. Using our presented configuration, at most 1152 of the 9216 bits in an M9K Block RAM are occupied at any one time; the rest of the space is unused. The fraction of the Block RAM that is used can be increased if the desired NoC can take advantage of deeper VC buffers or a larger number of VCs. Fig. 6.2 shows how VC buffer sizes affect ALM usage in BRS. Because the VC queues are stored in Block RAMs, significantly larger VC queues can be used in exchange for a small increase in ALM usage and no increase in Block RAM usage.

[Figure 6.2: ALM usage of BRS with varying VC buffer sizes. All shown configurations also use 3 M9K Block RAMs.]
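The bit-occupancy calculation above, for the configuration used here (two ports sharing an M9K, 2 VCs per port, 16-flit-deep buffers, 18-bit flits), works out as:

```python
def bram_bits(ports_sharing, vcs_per_port, vc_depth, flit_width):
    # Bits occupied in one shared Block RAM by all of its VC buffers.
    return ports_sharing * vcs_per_port * vc_depth * flit_width

M9K_BITS = 9216  # capacity of one Stratix M9K block

used = bram_bits(2, 2, 16, 18)
print(used, used / M9K_BITS)  # 1152 bits used, 0.125 of the block
```

So only one eighth of each shared M9K is occupied, which is why the text suggests deeper buffers, more VCs, or smaller Block RAMs as ways to improve the ratio.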
However, in cases where larger VC buffers are not useful, it would be beneficial for an NoC to have access to smaller Block RAMs on the FPGA. This would increase the ratio of I/O bandwidth to memory capacity to better match the demands of the VC buffers.

6.2 Timing

We used Altera's TimeQuest Timing Analyzer to find the maximum clock frequency for BRS and compared it against OSNoC and CONNECT. The results are shown in Table 6.5.

Table 6.5: Maximum frequency for 4x4 networks.

  Max freq (MHz)   4x4 Mesh   4x4 Torus
  CONNECT [24]     72.37      65.58
  BRS              161.71     149.43
  OSNoC [4]        167.31     154.20

CONNECT trades off a higher clock speed for fewer router pipeline stages, so it has the lowest clock frequency. OSNoC can run at 2.3 times the clock frequency of the CONNECT NoC. BRS has a maximum clock frequency only 4% below OSNoC and remains 2.2 times higher than CONNECT. The critical path in the BRS router is in the switch allocation stage: the requests from the VCs go through the new masking logic to the switch allocator, and the returned grants set the addresses for accesses to the VC buffers. Since the input addresses to the VC buffers are not known until the end of the router clock cycle, it is not possible to run the Block RAMs at double the clock speed to increase the number of accesses per router clock cycle.

6.3 Performance Analysis

The graphs in Fig. 6.3 show how our router modifications affect network latency under different synthetic traffic patterns. From Fig. 6.3(a) it can be seen that the proposed modifications result in a 15% decrease in the saturation throughput for a 4x4 mesh NoC running uniform random traffic. The difference in performance comes from the new stalling conditions introduced by the limited port count on the shared Block RAMs. At lower injection rates, the effect on packet latency is nearly nonexistent.
Lower injection rates lead to lower traffic on the network, which in turn reduces the likelihood of two ports sharing a Block RAM both receiving a flit during the same cycle, which would stall reading flit data out of the VC buffers. Despite the decrease in saturation throughput, the network with our routers still beats the network generated by CONNECT by 10%. On a 4x4 torus network, BRS performs even better against OSNoC and CONNECT. As shown in Fig. 6.3(b), the modified routers reduce saturation throughput by only 10% relative to OSNoC and do 40% better than CONNECT. Using a torus network instead of a mesh network reduces the average number of hops from source to destination and increases the number of links the traffic is spread over. These factors reduce link utilization for equal levels of offered traffic, which in turn reduces the likelihood of the new stalling conditions being triggered.

[Figure 6.3: Network Latency vs Offered Traffic for multiple topologies and traffic patterns (OSNoC, BRS, CONNECT). Lower latency means better performance. Panels: (a) 4x4 mesh, uniform random traffic; (b) 4x4 torus, uniform random traffic; (c) 4x4 mesh, neighbour traffic; (d) 4x4 torus, neighbour traffic; (e) 4x4 mesh, bit complement traffic; (f) 4x4 torus, bit complement traffic; (g) 4x4 mesh, transpose traffic; (h) 4x4 torus, transpose traffic.]

Traffic patterns other than uniform random can have different effects on performance. In particular, the neighbour traffic pattern leads to two different extremes depending on whether the network is a mesh or a torus. This effect can be seen in Fig. 6.3(c), which shows the mesh case, and Fig. 6.3(d), which shows the torus case. Neighbour traffic on a mesh network is a worst case for our routers, while neighbour traffic on a torus network is one of the best cases. Under the neighbour traffic pattern, each node sends a message to the node one hop to the right and one hop up. On both mesh and torus topologies, the input ports on a router never compete for the same output port. This is a highly favourable traffic pattern for OSNoC and the routers generated by CONNECT, allowing these routers to consistently receive and send flits on every port each cycle. Even at a 100% injection rate, represented by the far right point on CONNECT's lines in the graphs, the CONNECT NoC's latency remains constant. For the OSNoC router, constant latency all the way to 100% is not quite achievable because a VC buffer must be empty before it can be reused. While our routers also benefit from the neighbour traffic pattern relative to uniform random traffic, sharing a Block RAM between the VCs of two ports means our router cannot consistently send and receive flits on all ports. This prevents our routers from improving as much as the other two designs. However, at lower levels of offered traffic, BRS has similar latency to the other two designs, and a near 100% injection rate is an extreme situation that is unlikely to occur.
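For concreteness, the synthetic patterns used here map a source node (x, y) in a k-by-k network to a destination as follows. Only the neighbour pattern is spelled out in the text; the transpose and bit complement definitions below are the standard ones from the interconnection-network literature [11] and are assumptions here.

```python
def neighbour(x, y, k):
    # Each node sends to the node one hop right and one hop up
    # (wrapping around, which torus links provide directly).
    return ((x + 1) % k, (y + 1) % k)

def transpose(x, y, k):
    # Mirror the coordinates across the diagonal.
    return (y, x)

def bit_complement(x, y, k):
    # Complement every bit of each coordinate (k must be a power of 2).
    mask = k - 1
    return (~x & mask, ~y & mask)
```

For example, in a 4x4 network, node (3, 3) under neighbour traffic wraps to (0, 0), which on a mesh must be routed the long way across the network but on a torus is a single hop.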
In contrast to the mesh case, neighbour traffic run on a torus network turns out to be highly beneficial to BRS. When neighbour traffic is run on a torus, only the West, South and Local input ports, which use three different Block RAMs, receive flits each cycle, so the new stalling conditions are never triggered. Cycle for cycle, our modified router matches the OSNoC design, but when normalized to nanoseconds the performance is around 3% worse due to the lower clock frequency. For the neighbour traffic pattern on a torus, the CONNECT NoC is able to achieve a 100% offered traffic rate; however, due to the difference in clock frequency, it has a saturation throughput around 40% worse than our modified routers.

Fig. 6.3(e) and Fig. 6.3(f) show how the different networks respond to a bit complement traffic pattern, while Fig. 6.3(g) and Fig. 6.3(h) show the response to a transpose traffic pattern. On a mesh running bit complement traffic, our routers have a 30% lower saturation throughput than OSNoC but are nearly 15% better than CONNECT. On a torus running bit complement traffic, the different networks perform similarly to how they do under neighbour traffic. In the case of transpose traffic, our router nearly matches OSNoC: in the transpose traffic pattern, no router receives data from more than two input ports, and in only 2 of the 16 routers in the 4x4 mesh do those two input ports share a single Block RAM, so the new stalling conditions are rarely triggered. The saturation throughput of our router network under this traffic pattern is only 3% worse than the OSNoC network and 30% higher than the CONNECT network.

We also examined the effects on throughput under the different router architectures, traffic patterns and NoC topologies. The results are shown in Fig. 6.4.
Similar to the latency graphs, the throughput graphs are normalized to a nanosecond time scale instead of cycles because the different router types have different cycle lengths. Since the CONNECT routers have a much lower clock frequency, the maximum amount of traffic that can be offered to the network is much lower than for the other two router designs. For this reason, the far right CONNECT data point in each graph indicates the throughput at a 100% injection rate, and the line cuts off there since an injection rate above 100% is not possible.

Fig. 6.4(a) shows how a 4x4 mesh made up of the different routers responds to uniform random traffic. This data also shows that the BRS router is comparable to CONNECT but trails the original OSNoC router design. However, when the same uniform random traffic is run on a torus network, we get the results shown in Fig. 6.4(b): BRS is much closer to OSNoC in terms of throughput and significantly ahead of CONNECT. Beyond the saturation point, the throughput of all of the networks running uniform random traffic remains relatively constant. This indicates that the networks are stable for this traffic pattern and are not exhibiting symptoms of a starvation problem.

Fig. 6.4(c) and Fig. 6.4(d) show the effects on throughput when running the neighbour traffic pattern on a mesh and torus network respectively. In a result similar to the latency graphs, the neighbour traffic pattern can be either friendly or unfriendly to the BRS router. On a mesh, the saturation throughput of the BRS network running neighbour traffic is much worse than OSNoC and even a little worse than CONNECT. However, on a torus, BRS nearly matches OSNoC and is far ahead of CONNECT. The reasons are the same as for the latency graphs.
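The cycle-to-nanosecond normalization mentioned above works directly from the Fmax values in Table 6.5. As a sketch (the function name is ours): at a 100% injection rate of one flit per node per cycle, each network's maximum offered traffic in flits/node/ns is simply its clock frequency expressed in cycles per nanosecond.

```python
def to_flits_per_ns(rate_per_cycle, fmax_mhz):
    """Convert flits/node/cycle to flits/node/ns (1 MHz = 1e-3 cycles/ns)."""
    return rate_per_cycle * fmax_mhz / 1000.0

# 4x4 mesh Fmax values from Table 6.5:
for name, fmax in [('CONNECT', 72.37), ('BRS', 161.71), ('OSNoC', 167.31)]:
    print(name, to_flits_per_ns(1.0, fmax))
# CONNECT tops out near 0.072 flits/node/ns, which is why its curves
# end far to the left of the other two designs.
```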
The additional flit stalling conditions in BRS are especially harmful relative to OSNoC when neighbour traffic is run on a mesh, and they are never triggered when it is run on a torus. As with uniform random traffic, throughput levels off after the saturation throughput point, which shows that the networks are stable under neighbour traffic.

[Figure 6.4: Accepted Traffic vs Offered Traffic for multiple topologies and traffic patterns (OSNoC, BRS, CONNECT). Higher throughput means better performance. Panels: (a) 4x4 mesh, uniform random traffic; (b) 4x4 torus, uniform random traffic; (c) 4x4 mesh, neighbour traffic; (d) 4x4 torus, neighbour traffic; (e) 4x4 mesh, bit complement traffic; (f) 4x4 torus, bit complement traffic; (g) 4x4 mesh, transpose traffic; (h) 4x4 torus, transpose traffic.]

Fig. 6.4(e) and Fig. 6.4(f) show how bit complement traffic affects throughput, while Fig. 6.4(g) and Fig. 6.4(h) show the effects of transpose traffic. In all of these cases OSNoC does better than BRS, and in three cases BRS beats CONNECT. These graphs once again demonstrate the stability of each of the networks: for bit complement and transpose traffic, throughput remains mostly constant beyond the saturation throughput point for the different router types and topologies.

Chapter 7

Related Work

Previous works cover several topics related to our study, primarily alternate FPGA router optimizations and methods of predicting NoC performance.

CONNECT [24] is one of the few previous works that presents an FPGA optimized NoC that can use VC routers. CONNECT is an NoC generator that produces NoCs with highly parameterized single stage routers. It identifies several differences between ASIC and FPGA NoC design, such as the limitation in data storage, and highlights Block RAMs as a scarce resource that is hard to use efficiently; CONNECT avoids them completely in favour of Distributed RAMs for the VC buffers. BRS takes a different path and explicitly targets Block RAMs to significantly reduce usage of logic resources.

RASoC [36] also presents a router architecture optimized for FPGAs. RASoC is a parameterized VHDL model that can be tuned to fit the requirements of the target application and is intended for low overhead NoCs. Like BRS, RASoC uses deterministic XY routing, but RASoC uses wormhole switching while BRS uses the larger but higher performance VC routing. Similarly, LiPaR [30] is another router architecture designed for FPGAs. Unlike both RASoC and BRS, LiPaR opts for store-and-forward flow control, though it still uses low overhead deterministic XY routing. Liu et al.
[13] highlight the lack of NoC publications targeting FPGAs and propose three different networks designed for FPGAs that beat previous FPGA NoCs in one of area, frequency or latency. The first two designs are a packet switched network and a circuit switched network. The third is a minimalistic network without arbitration, designed to show the upper limit on FPGA NoC performance. Like the other designs, these networks use deterministic XY routing and do not use VCs.

Lu et al. [22] optimize a VC router through changes to the VC buffer organization and allocator design. They use one memory block per router port to hold the VC buffers associated with that port. Normally the wide data lines to a memory block must run through a multiplexer, and multiplexers do not map to FPGA logic blocks very well; instead, they devise a system in which the narrower memory block address lines are multiplexed. Our proposed optimization differs in that we combine the VC buffers from two ports into a single memory block implemented using a single Block RAM. Lu et al.'s parallel VC/switch allocator optimization is orthogonal to our proposed optimizations and can be used to further reduce resource usage.

GCQ [7] proposes an FPGA based switch implementation that uses an alternate method of sharing Block RAMs among multiple data queues. That work targets a Combined Input and Crosspoint Queued switch architecture and shares a Block RAM among multiple crosspoint buffers in a large crossbar, running the Block RAM at a higher clock frequency than the rest of the switch and time multiplexing accesses to it. While their work targets the crosspoint buffers, ours focuses on reducing the size of the input buffers; both optimizations can potentially be used in the same design.

Francis et al. [16] propose an FPGA architecture with Time Division Multiplexed wiring and hard routers to reduce the size and improve the performance of the NoC.
BRS does not require either of these resources to function. Kapre et al. [18] study a time multiplexed FPGA NoC, while our work targets improving a packet-switched NoC.

Targeting specific communication patterns is another approach to optimizing a router. By targeting a master-slave communication pattern, Leary et al. [20] are able to decrease resource usage and network latency while improving throughput. Singh et al. [31] generate multiple NoC configurations that are loaded at runtime to match the application, taking advantage of the reconfigurability of the FPGA.

There is also work on modelling the performance of NoCs implemented on an FPGA. Lee and Shannon [21] create a model for predicting the performance of NoCs implemented on an FPGA over a variety of topologies and FPGAs. They use the node count, node degree and link widths to estimate local and global routing demand, focusing on estimating the maximum frequency of the NoC rather than the effect on application performance. Saldana et al. [28] evaluate several different FPGA NoC topologies in terms of resource usage and speed. They conclude that for networks with as few as 16 nodes, network topology usually has little effect on resource usage and that, for a resource and clock speed penalty, a fully connected network is possible. However, their work shows that as the number of nodes increases and the NoC approaches the limits of the routing capacity of the FPGA, mesh and hypercube topologies perform best. Our proposed modifications do not affect the wiring between routers and can be used to reduce Block RAM usage for larger mesh networks.

Although not targeting FPGAs, there has been work on finding the best topology for an NoC. Teuscher et al. [6] use an evolutionary algorithm to generate an optimal NoC.
Their algorithm supports links with different costs, and they examine the distribution of wire length. Our work does not use an evolutionary algorithm, and our model is intended for use with FPGAs.

Chapter 8

Conclusion

This thesis presented a novel approach to improving the efficiency of using Block RAMs to reduce the size of a VC router implementation on FPGAs, along with the BRS router architecture that uses the proposed optimization. We discussed the problem of VC buffer implementation, the large impact VC buffers have on the ALM usage of a router, and how a Block RAM implementation can reduce this impact. We presented a method of modifying the router to share Block RAMs between the VC buffers of two ports to minimize the Block RAM usage of BRS. The BRS routers were both synthesized and simulated using 4x4 mesh and 4x4 torus NoCs. The resource usage results showed a 25% decrease in Block RAM usage and comparable ALM usage relative to a base router design that implements its VC buffers in Block RAMs. In comparison to a router that does not use Block RAMs, a 71% reduction in ALM usage was shown. In terms of performance, there was only a 15% decrease in saturation throughput when running uniform random traffic, and a 50% decrease for neighbour traffic on a mesh, which can be reduced to 3% if run on a torus. The effect on maximum clock frequency was a minor 4% decrease relative to the original design. This thesis also proposed the use of an analytical model to predict the performance of an FPGA NoC and assist with determining the best topology to use. An initial framework for achieving this goal was described, showing a method of estimating the resource usage of the components in an NoC.

Bibliography

[1] T. M. Aamodt. FPGA architecture and digital design flows [PowerPoint slides].
[2] Altera. FPGAs.
[3] Altera. Logic array blocks and adaptive logic modules in Stratix IV devices.
[4] D. U. Becker. Efficient microarchitecture for network-on-chip routers.
PhD thesis, Stanford University, August 2012.
[5] N. Binkert, B. Beckmann, G. Black, et al. The gem5 simulator. In ACM SIGARCH Computer Architecture News, volume 39, pages 1–7, 2011.
[6] H. Chung, A. P. Asnodkar, and C. Teuscher. A structural analysis of evolved complex networks-on-chip. In Proceedings of the Fifth International Workshop on Network on Chip Architectures (NoCArc), pages 17–22, December 2012.
[7] Z. Dai and J. Zhu. Saturating the transceiver bandwidth: switch fabric design on FPGAs. In Int'l Symposium on Field Programmable Gate Arrays, pages 67–76, February 2012.
[8] W. J. Dally. Virtual-channel flow control. In 17th Int'l Symposium on Computer Architecture, pages 60–68, June 1990.
[9] W. J. Dally and C. L. Seitz. Deadlock-free message routing in multiprocessor interconnection networks. In IEEE Transactions on Computers, pages 547–553, May 1987.
[10] W. J. Dally and B. Towles. Route packets, not wires: On-chip interconnection networks. In Proceedings of the 38th Annual Design Automation Conference, pages 684–689, June 2001.
[11] W. J. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers, 2003.
[12] T. Dorta, J. Jiménez, J. L. Martín, U. Bidarte, and A. Astarloa. Reconfigurable multiprocessor systems: A review. In International Journal of Reconfigurable Computing, October 2010.
[13] A. Ehliar, J. Eilert, and D. Liu. A comparison of three FPGA optimized NoC architectures. In Swedish System-on-Chip Conference (SSoCC), 2007.
[14] G. Estrin. Organization of computer systems: The fixed plus variable structure computer. In Proceedings of IRE-AIEE-ACM '60 (Western) Papers, pages 33–40, May 1960.
[15] G. Estrin. Reconfigurable computer origins: the UCLA fixed-plus-variable (F+V) structure computer. In IEEE Annals of the History of Computing, pages 3–9, October 2002.
[16] R. Francis and S. Moore. Exploring hard and soft networks-on-chip for FPGAs. In Int'l Conference on Field-Programmable Technology (FPT), pages 261–264, December 2008.
[17] M. Galles. Scalable pipelined interconnect for distributed endpoint routing: The SGI SPIDER chip. In Symposium on Hot Interconnects, pages 141–146, August 1996.
[18] N. Kapre, N. Mehta, M. deLorimier, R. Rubin, H. Barnor, M. J. Wilson, M. Wrighton, and A. DeHon. Packet switched vs. time multiplexed FPGA overlay networks. In IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 205–216, April 2006.
[19] J. Kwa and T. M. Aamodt. Small virtual channel routers on FPGAs through block RAM sharing. © 2012 IEEE, reprinted with permission. In Proceedings of the IEEE International Conference on Field Programmable Technology (FPT), pages 71–79, December 2012.
[20] G. Leary, K. Mehta, and K. S. Chatha. Performance and resource optimization of NoC router architecture for master and slave IP cores. In 5th IEEE/ACM/IFIP Int'l Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pages 155–160, 2007.
[21] J. Lee and L. Shannon. Predicting the performance of application-specific NoCs implemented on FPGAs. In Int'l Symposium on Field Programmable Gate Arrays, pages 23–32, February 2009.
[22] Y. Lu, J. McCanny, and S. Sezer. Exploring virtual-channel architecture in FPGA based networks-on-chip. In Int'l SOC Conference (SOCC), pages 302–307, September 2011.
[23] B. Neji, Y. Aydi, R. Ben-atitallah, S. Meftaly, M. Abid, and J-L. Dykeyser. Multistage interconnection network for MPSoC: Performances study and prototyping on FPGA. In 3rd Int'l Design and Test Workshop, pages 11–16, December 2008.
[24] M. K. Papamichael and J. C. Hoe. CONNECT: Re-examining conventional wisdom for designing NoCs in the context of FPGAs. In Int'l Symposium on Field Programmable Gate Arrays, pages 37–46, February 2012.
[25] K. Parnell and R. Bryner. Comparing and contrasting FPGA and microprocessor system design and development. May 2005.
[26] L. Peh and W. J. Dally. A delay model and speculative architecture for pipelined routers. In Int'l Symposium on High-Performance Computer Architecture, pages 255–266, January 2001.
[27] A. Roca, J. Flich, and G. Dimitrakopoulos. DESA: Distributed elastic switch architecture for efficient networks-on-FPGAs. In 22nd International Conference on Field Programmable Logic and Applications (FPL), pages 394–399, August 2012.
[28] M. Saldana, L. Shannon, and P. Chow. The routability of multiprocessor network topologies in FPGAs. In Int'l Workshop on System-Level Interconnect Prediction, pages 49–56, March 2006.
[29] G. Schelle and D. Grunwald. Exploring FPGA network on chip implementations across various application and network loads. In Int'l Conference on Field Programmable Logic and Applications, pages 41–46, September 2008.
[30] B. Sethuraman, P. Bhattacharya, J. Khan, and R. Vemuri. LiPaR: A light-weight parallel router for FPGA-based networks-on-chip. In Proceedings of the 15th ACM Great Lakes Symposium on VLSI, pages 452–457, April 2005.
[31] A. Singh, A. Kumar, T. Srikanthan, and Y. Ha. Mapping real-life applications on run-time reconfigurable NoC-based MPSoC on FPGA. In Int'l Conference on Field-Programmable Technology (FPT), pages 365–368, December 2010.
[32] 1-CORE Technologies. FPGA logic cells comparison.
[33] Xilinx. Field programmable gate array (FPGA).
[34] Xilinx. FPGA.
[35] Xilinx. FPGA vs. ASIC.
[36] C. A. Zeferino, M. E. Kreutz, and A. A. Susin. RASoC: A router soft-core for networks-on-chip. In Design, Automation and Test in Europe Conference and Exhibition, pages 198–203, February 2004.

