UBC Theses and Dissertations


Architecture of block-RAM-based massively parallel memory structures : multi-ported memories and content-addressable… Abdelhadi, Ameer M. S. 2016


Architecture of Block-RAM-Based Massively Parallel Memory Structures:
Multi-Ported Memories and Content-Addressable Memories

by

Ameer M. S. Abdelhadi

B.Sc. Computer Engineering, Technion, 2007
M.Sc. Electrical Engineering, Technion, 2010

A THESIS SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Electrical and Computer Engineering)

The University of British Columbia
(Vancouver)

September 2016

© Ameer M. S. Abdelhadi, 2016

Abstract

Since they were first introduced three decades ago, Field-Programmable Gate Arrays (FPGAs) have evolved from being merely used as glue-logic to implementing entire compute accelerators. These massively parallel systems demand highly parallel memory structures to keep pace with their concurrent nature, since memories are usually the bottleneck of computation performance. However, the vast majority of FPGA devices provide dual-ported SRAM blocks only. In this dissertation, we propose new ways to build area-efficient, high-performance SRAM-based parallel memory structures in FPGAs, specifically Multi-Ported Random Access Memories and Content-Addressable Memories (CAMs).

While parallel computation demands more RAM ports, leading multi-ported RAM techniques in FPGAs have relatively large overhead in resource usage. As a result, we have produced new design techniques that are near-optimal in resource overhead and have several practical advantages. The suggested method reduces RAM usage by over 44% and improves clock speed by over 76% compared to the best of previous approaches. Furthermore, we propose a novel switched-ports technique that allows further area reduction if some RAM ports are not simultaneously active.
A memory compiler is proposed to generalize the previous approach and allow generating Multi-Switched-Ports Random Access Memories.

Content-Addressable Memories (CAMs), the hardware implementation of associative arrays, are capable of searching the entire memory space for a specific value within a single clock cycle. CAMs are massively parallel search engines accessing all memory content to compare with the searched pattern simultaneously. CAMs are used in a variety of scientific fields requiring high-speed associative searches. Despite their importance, FPGAs lack an area-efficient CAM implementation. We propose a series of scalable, area-efficient, and high-performance Binary Content-Addressable Memories (BCAMs) based on hierarchical search and data compression methods. Compared to current RAM-based BCAM architectures, our BCAMs require at most 18% of the RAM storage while enhancing clock speed by 45% on average, hence exhibiting a superior single-cycle search rate. As a result, we can build faster and more cost-effective accelerators to solve some of the most important computational problems.

Preface

The major contributions of this dissertation have also been published in journal papers and conference proceedings [1–5] as outlined below.

For all of these publications, I proposed the ideas, carried out the research, performed all of the design methodology and implementation, and conducted the experiments, data generation, and analysis of the results. Furthermore, I prepared the manuscripts for these publications under the supervision of Prof. Lemieux, who also provided editorial support for all of these manuscripts and provided advice on the research design methodology.

• Multi-ported memories

  – Modular Multi-Ported SRAM-based Memories [1]
    Published in the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA 2014).
Parts of this publication appear in Chapter 3.

  – Modular Switched Multi-ported SRAM-based Memories [2]
    Accepted for publication in ACM Transactions on Reconfigurable Technology and Systems (TRETS) Special Issue on Reconfigurable Components with Source Code. Parts of this publication appear in Chapter 4.

  – A Multi-Ported Memory Compiler Utilizing True Dual-port BRAMs [3]
    To be published in the 2016 IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM 2016). Parts of this publication appear in Chapter 4.

• Content-addressable memories

  – Deep and Narrow Binary Content-Addressable Memories using FPGA-based BRAMs [4]
    Published in the 2014 International Conference on Field-Programmable Technology (ICFPT 2014). Parts of this publication appear in Chapter 5.

  – Modular SRAM-based Binary Content-Addressable Memories [5]
    Published in the 2015 IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM 2015). Parts of this publication appear in Chapter 6.

[1] A. M. S. Abdelhadi and G. G. F. Lemieux. Modular Multi-Ported SRAM-Based Memories. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pages 35–44, February 2014.

[2] A. M. S. Abdelhadi and G. G. F. Lemieux. Modular Switched Multi-Ported SRAM-Based Memories. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 9(3):22:1–22:26, July 2016.

[3] A. M. S. Abdelhadi and G. G. F. Lemieux. A Multi-Ported Memory Compiler Utilizing True Dual-Port BRAMs. In IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 200–207, May 2016.

[4] A. M. S. Abdelhadi and G. G. F. Lemieux. Deep and Narrow Binary Content-Addressable Memories Using FPGA-Based BRAMs. In International Conference on Field-Programmable Technology (FPT), pages 318–321, December 2014.

[5] A. M. S. Abdelhadi and G. G. F. Lemieux. Modular SRAM-Based Binary Content-Addressable Memories.
In IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 207–214, May 2015.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Abbreviations
List of Notations
Acknowledgments
Dedication

1 Introduction
  1.1 Motivation
    1.1.1 Multi-Ported Random Access Memories (MPRAMs)
    1.1.2 Content-Addressable Memories (CAMs)
  1.2 Thesis Statement and Research Goals
  1.3 Research Problem: On the Feasibility of Parallel Memory Structures in FPGAs
    1.3.1 Multi-Ported Random Access Memories (MPRAMs)
    1.3.2 Content-Addressable Memories (CAMs)
  1.4 Research Contributions
    1.4.1 Invalidation-Based Live-Value Table (I-LVT)
    1.4.2 Switched Ports
    1.4.3 2-Dimensional Hierarchical Search BCAM (2D-HS-BCAM)
    1.4.4 Indirectly Indexed Hierarchical Search BCAM (II-HS-BCAM)
  1.5 Research Methodology and Evaluation Metrics
    1.5.1 Verilog Description
    1.5.2 Simulation and Synthesis
    1.5.3 Results Collection
    1.5.4 BRAM Packing Approximation
  1.6 Dissertation Organization

2 Background and Related Work
  2.1 RAM Multi-Porting Techniques in Embedded Systems
    2.1.1 Register-based RAM
    2.1.2 RAM Multi-Pumping
    2.1.3 Multi-Read RAM: Bank Replication
    2.1.4 Multi-Write RAM: Emulation via Multi-banking
  2.2 Multi-ported SRAM-based Memories: Prior Work
    2.2.1 LVT-Based Multi-ported RAM
    2.2.2 Multi-Ported Random Access Memories with True Dual-Ports
    2.2.3 XOR-Based Multi-Ported RAM
  2.3 FPGA-Based Binary Content-Addressable Memories (BCAMs)
    2.3.1 Register-Based BCAMs
    2.3.2 Brute-Force Approach via Transposed Indicators RAM (TIRAM)
    2.3.3 BCAM Pattern Width Cascading and Scaling
    2.3.4 Reconfiguration Memory Based Content-Addressable Memories (RCAMs)
    2.3.5 Algorithmic Heuristics
    2.3.6 Vendor Support for BCAMs

3 Multi-Ported Random Access Memories via Invalidation-Based Live-Value Table (I-LVT)
  3.1 Introduction
  3.2 Invalidation Table
    3.2.1 Bank ID Embedding: Binary-Coded Bank IDs and Selectors
    3.2.2 Mutually-Exclusive Conditions: Thermometer-Coded Bank IDs with One-hot-Coded Selectors
    3.2.3 Data Dependencies and Bypassing
    3.2.4 Initializing Multi-Ported RAM Content
    3.2.5 Comparison and Discussion
      3.2.5.1 SRAM Usage based on RAM Architecture
      3.2.5.2 Register Usage based on RAM Architecture
  3.3 Experimental Results
  3.4 Conclusions

4 Multi-Ported Random Access Memories with Switched Ports
  4.1 RAM Port Classification
  4.2 Multi-Ported Memories with Single Switched-Port
    4.2.1 Single Switched-Port Support
      4.2.1.1 Data Dependencies and Bypassing
      4.2.1.2 SRAM Usage based on Port Functionality
    4.2.2 Experimental Results
  4.3 Multi-Switched-Ports
    4.3.1 Multi-Ported RAM with Multiple Switched Ports
      4.3.1.1 Motivation and Key Idea
      4.3.1.2 Port Assignment and Problem Definition
      4.3.1.3 Modeling Data Banks with Data Flow Graph (DFG)
      4.3.1.4 Multi-Switched-Ports DFG Optimization
      4.3.1.5 Solving the Cover Problem
      4.3.1.6 Data Dependencies and Bypassing
    4.3.2 Experimental Results
      4.3.2.1 Experimental Framework
      4.3.2.2 Methodology
      4.3.2.3 Test Cases
      4.3.2.4 Results
  4.4 Conclusions

5 2-Dimensional Hierarchical Search BCAMs (2D-HS-BCAMs)
  5.1 Introduction
  5.2 The 2-Dimensional Hierarchical Search BCAM (2D-HS-BCAM) Approach
  5.3 BCAM Bypassing
  5.4 Comparison and Discussion
  5.5 Experimental Results
  5.6 Conclusions

6 Indirectly Indexed Hierarchical Search BCAMs (II-HS-BCAMs)
  6.1 Introduction
  6.2 Motivation and Key Idea
  6.3 Design and Functionality
  6.4 Feasibility on Altera's Stratix Devices
  6.5 Comparison and Discussion
  6.6 Experimental Results
  6.7 Conclusions

7 Conclusions
  7.1 Dissertation Summary
  7.2 Future Directions
    7.2.1 Invalidation-Table Multi-Ported Memories
    7.2.2 BRAM-Based Content-Addressable Memories
    7.2.3 Parallel Reconfigurable Computing with Customizable Concurrent Memories

Bibliography

A Brute-Force Transposed Indicators (BF-TI) BCAM Writing Mechanism

B Wide Priority Encoders in FPGAs

C Verilog IPs User Guide
List of Tables

Table 3.1  Bypassing for XOR-based and binary/thermometer-coded I-LVT multi-ported memories.
Table 3.2  Summary of SRAM bits usage.
Table 3.3  Summary of M20K blocks usage.
Table 3.4  Summary of register usage.
Table 4.1  Single-stage and two-stage bypassing for simple and true dual-port RAM.
Table 4.2  Resource consumption for a 4W/8R multi-ported RAM test case with 8k entries of 32-bit words and new-data RDW bypassing.
Table 4.3  Register consumption for a 4W/8R multi-ported RAM test case with 8k entries of 32-bit words.
Table 4.4  Biclique patterns and their attributes.
Table 4.5  Comparison of purely fixed-ports versus multi-switched-ports implementations of the example in Figure 4.8.
Table 4.6  Bypassing of multi-switched-ports.
Table 4.7  Single-stage and two-stage BRAM bypassing.
Table 4.8  Multi-switched-ports conversion (example from Figure 4.8).
Table 4.9  Heterogeneous multi-ported RAM test cases.
Table 4.10 Experimental results.
Table 4.11 Results comparison.
Table 6.1  BRAM usage of a single-stage II2D-BCAM.
Table 6.2  Storage efficiency µs (inverted).
Table C.1  List of SMPRAM module interface ports.
Table C.2  List of SMPRAM module parameters.
List of Abbreviations

The following lists all abbreviations that have been used in the dissertation.

MSPRAM  Multi-Switched-Ports Random Access Memory
MPRAM   Multi-Ported Random Access Memory
LVT     Live-Value Table
I-LVT   Invalidation-Based Live-Value Table
ALM     Adaptive Logic Module
ESB     Embedded System Block
MFB     Multi-Function Block
ALU     Arithmetic Logic Unit
API     Application Programming Interface
ASIC    Application-Specific Integrated Circuit
BRAM    Block-RAM
CAD     Computer-Aided Design
CAM     Content-Addressable Memory
BCAM    Binary Content-Addressable Memory
TCAM    Ternary Content-Addressable Memory
RCAM    Reconfiguration Memory Based Content-Addressable Memory
PE      Priority Encoder
PE      Processing Element
LPME    Longest-Prefix Match Encoder
LPM     Longest-Prefix Match
CGRA    Coarse-Grained Reconfigurable Array
CMOS    Complementary Metal-Oxide-Semiconductor
CPU     Central Processing Unit
DFG     Data Flow Graph
DSP     Digital Signal Processing
FF      Flip-Flop
FPGA    Field-Programmable Gate Array
GUI     Graphical User Interface
IC      Integrated Circuit
LP      Linear Programming
ILP     Integer Linear Programming
IP      Intellectual Property
IP      Internet Protocol
IPv4    Internet Protocol Version 4
IPv6    Internet Protocol Version 6
LAB     Logic Array Block
MLAB    Memory Logic Array Block
LUT     Look-Up Table
MSB     Most Significant Bit
LSB     Least Significant Bit
RAM     Random Access Memory
ROM     Read Only Memory
RTL     Register Transfer Level
SRAM    Static RAM
VLIW    Very Long Instruction Word
LZD     Leading Zero Detector
LZC     Leading Zero Counter
WAW     Write-After-Write
RAW     Read-After-Write
RDW     Read-During-Write
TLB     Translation Lookaside Buffer
TIRAM   Transposed Indicators RAM
STIRAM  Set Transposed Indicators RAM
BF-TI   Brute-Force Transposed Indicators
2D-HS-BCAM  2-Dimensional Hierarchical Search BCAM
II-HS-BCAM  Indirectly Indexed Hierarchical Search BCAM
BST     Binary Search Tree
BGP     Border Gateway Protocol
MSPS    Million Searches per Second
NMOS    N-Type Metal-Oxide-Semiconductor
ACM     Association for Computing Machinery
SIGDA   Special Interest Group on Design
Automation
IEEE    The Institute of Electrical and Electronics Engineers
TRETS   Transactions on Reconfigurable Technology and Systems
FCCM    Field-Programmable Custom Computing Machines
ICFPT   International Conference on Field-Programmable Technology
NSERC   Natural Sciences and Engineering Research Council of Canada

List of Notations

The following lists all notations that have been used in the dissertation.

WAddr     Write Address
RAddr     Read Address
WData     Write Data Bus
RData     Read Data Bus
RWData    Bidirectional Read/Write Data Bus
RBankSel  Read Bank Selector
WPatt     Write Pattern
MPatt     Match Pattern
MAddr     Match Address
MIndc     Match Indicators
RmPatt    Removed Pattern
MultiPatt Multiple Patterns
RefRAM    Reference RAM
SetRAM    Sets RAM
nW        Number of Write Ports
nR        Number of Read Ports
nt        Number of True Ports
nW,f      Number of Write Simple (Fixed) Ports
nR,f      Number of Read Simple (Fixed) Ports
nW,s      Number of Write Switched Ports
nR,s      Number of Read Switched Ports
d         Memory Depth
w         Data Width
nM20K     Number of M20K Blocks
nBReg     Number of Bypass Registers
f_fb      I-LVT Feedback Function
f_out     I-LVT Output Extraction Function
P         Ports Group
W         Writes Group
R         Reads Group
Es        Switched Edges Set
RD        RAM Depth
CD        CAM Depth
DW        Data Width
PW        Pattern Width
SW        Set Width
PW,opt    Optimal Cascaded Pattern Width
Ip,a      Match Indicator
A         Address Set
S         Set Set
RW,max    Widest RAM Width
RD,min    Shallowest RAM Depth
nC        Number of Cascades

Acknowledgments

My gratitude begins with my research advisor Dr. Guy Lemieux, who has offered consistent support, guidance, dedication and patience throughout my research study and in bringing this dissertation to fruition. Without his invaluable assistance, encouragement, advice, and technical insight, this work would not have been possible. I am very grateful for having him as my mentor and research advisor!

In addition, I would like to thank Dr. Mark Greenstreet, with whom I have worked on several exciting and productive projects. His brilliance and expertise were a real inspiration. I am also thankful to Dr.
Steve Wilton for the several discussions we had on academia, research and teaching. His deep knowledge, insights, support, and encouragement were invaluable.

I would also like to thank my external examiner, Dr. Paul Chow from the University of Toronto, for providing valuable comments that helped improve the quality of this dissertation and for being kind enough to travel and attend this dissertation defence in person. I thank my university examiners, Dr. Alexandra Fedorova and Dr. Jeremy Heyl, and my supervisory committee member Dr. Mieszko Lis for their constructive feedback and suggestions during this dissertation defence.

I thank the System-on-Chip technical support staff, Roozbeh Mehrabadi and Dr. Roberto Rosales, for their help and dedication. I thank all my colleagues in the System-on-Chip research group, and other friends whom I have had the pleasure of working with. In particular, Rehan Ahmed, Usman Ahmed, Abdalrahman Arafeh, Samer Al-Kiswany, Mohammed Al-Taha, Alex Brant, Anas Bsoul, Assem Bsoul, Joydip Das, Stuart Dueck, Joseph Edwards, Abdullah Gharaibeh, Jeffrey Goeders, David Grant, Eddie Hung, Keith Lee, Zhiduo Liu, Hossein Omidian, Aaron Severance, Ahmad Sharkia, Douglas Sim, Tom Tang, Michael Yue, Chris Wang, and others.

Financial support from the Natural Sciences and Engineering Research Council of Canada (NSERC) is gratefully acknowledged. Thanks to Altera Corp. and CMC Microsystems for donating hardware and software licenses.

Last but most important, thanks and love go to my family – my parents, Mohammad and Rasmia, my wife, Soaad, and our son, Mohammad – I am grateful for your unwavering love, gracious support and continuous encouragement. Thank you Soaad for being there for me again and again when I needed you the most. I would never have gotten to this point without you. Thank you for reminding me of my ability to accomplish this goal.
I love you!

To my parents, Mohammad and Rasmia,
my wife, Soaad,
and our son, Mohammad.

Chapter 1

Introduction

The following introductory chapter is organized as follows. The need for Block-RAM-based massively parallel memory structures and the motivation behind our work are described in detail in Section 1.1. The thesis statement and research goals are provided in Section 1.2. Section 1.3 explains the feasibility problem of parallel memory structures in Field-Programmable Gate Arrays (FPGAs). Our research contributions are listed in Section 1.4. Section 1.5 describes the research methodology and evaluation metrics used to evaluate and compare our designs to other techniques. The organization of the rest of this dissertation is summarized in Section 1.6.

1.1 Motivation

Since they were first introduced three decades ago, Field-Programmable Gate Arrays (FPGAs) have evolved from being merely used as glue-logic to competing with custom-designed Application-Specific Integrated Circuits (ASICs). Modern FPGAs comprise hundreds of thousands of programmable logic gates augmented with thousands of configurable Digital Signal Processing (DSP) blocks and memory blocks, all on the same chip with a flexible routing fabric. The routing and configuration flexibility of these numerous hardware blocks grants FPGAs their inherent parallelism; hence, FPGAs are exploited for massively parallel computing and can be tailored as accelerators for specific applications.

These massively parallel systems demand highly parallel memory structures to keep pace with their concurrent nature, since memories are usually the bottleneck of computation performance. In this dissertation, we propose new ways to build two key types of parallel memory structures in FPGAs, specifically Multi-Ported Random Access Memories (MPRAMs) and Content-Addressable Memories (CAMs), using the regular logic, registers and dual-ported memory blocks found in modern FPGAs.

FPGA devices provide SRAM blocks with only one or two access ports.
This allows RAM content in one block to be accessed concurrently by one or two “users” at the same time. To allow more concurrent access, a method is required to allow potentially dozens of users to simultaneously read or write an SRAM in the same clock cycle, as depicted in Figure 1.1.

In addition, modern FPGAs do not provide hard CAM blocks. Instead, they must be built from existing SRAM blocks; existing construction techniques do not scale well, making large CAMs very inefficient.

Below, we elaborate further on MPRAMs and CAMs.

[Figure 1.1: Multi-Ported Random Access Memory (MPRAM) abstraction as shared memory allowing concurrent access for several users (e.g., ALU, DSP, network and streaming units), each through its own read or write port.]

1.1.1 Multi-Ported Random Access Memories (MPRAMs)

Multi-ported memories are the cornerstone of all high-performance Central Processing Unit (CPU) designs. They are often used in register files, but also in other shared-memory structures such as Translation Lookaside Buffers (TLBs), caches and coherence tags. Hence, high-bandwidth memories with multiple parallel reading and writing ports are required. In particular, multi-ported RAMs are often used by wide superscalar processors [6], Very Long Instruction Word (VLIW) processors [6, 7], multi-core processors [8, 9], vector processors, Coarse-Grained Reconfigurable Arrays (CGRAs) [10], and Digital Signal Processors (DSPs). For example, the second generation of the Itanium processor architecture employs a 20-port register file constructed from SRAM bit cells with 12 read ports and 8 write ports [8]. The key requirement for all of these designs is fast, concurrent, single-cycle access from multiple requesters.
These multiple requesters require concurrent access for performance reasons. While there is demand for more RAM ports, the two leading multi-ported RAM techniques in FPGAs have relatively large overhead in (1) register usage or (2) total SRAM block count [11, 12]. This dissertation introduces two new design techniques that are near-optimal in resource overhead and have several practical advantages. Furthermore, it provides a mechanism to construct and optimize memory structures with time-switched ports. A RAM compiler has also been developed to automate the construction of these switched memories.

1.1.2 Content-Addressable Memories (CAMs)

CAMs, being a hardware implementation of associative arrays, are massively parallel search engines accessing all memory content to compare with the searched pattern simultaneously. An abstraction of a CAM is shown in Figure 1.2, where CAM patterns are stored in a memory. To find a specific pattern in the CAM, all CAM patterns are read simultaneously, then compared to the match pattern. Comparing all CAM patterns to the match pattern generates match indicators (or match lines), which indicate, for each pattern in the CAM, whether this pattern matches the searched pattern. A priority encoder then detects if there is a match and generates the address of the first matching pattern.

CAMs are considered heavy power consumers due to the very wide memory bandwidth requirement and the concurrent compare. While a standard RAM returns the data located at a given memory address, a CAM returns an address containing a specific given datum. To do this, it must perform a memory-wide search for a specific value, and there
While a standard RAM returnsdata located in a given memory address, a CAM returns an address containing aspecific given datum, thus performing a memory-wide search for a specific value.To do this, it must perform a memory-wide search for a specific value, and there4Search for: “UBC”(Match Pattern)CAM Patterns“SFU”“KPU”“UBC”“UBC”01240124=?=?=?=?“UFV”33=?Match Indicators (Match Lines)Found in: “2”(Match Address)“UBC” is Found!(Match/Match)CAM AddressesCAM AddressesFigure 1.2: Content-Addressable Memory (CAM) abstraction as a massivelyparallel search engine accessing all memory content to compare withthe searched pattern simultaneously.may be multiple addresses that all match the data.Since a CAM is actually a high-performance implementation of a very basicassociative search, it can be used in many science fields [13]. CAMs are keystonesof network processors [14–17], specifically used for Internet Protocol (IP) lookupengines for packet forwarding [18–24], intrusion detection [25–29], packet filteringand classification [30–32]. In high-performance processors, CAMs are used formemory management as coherence tag arrays for highly-associative caches [33]and Translation Lookaside Buffers (TLBs) [34, 35]. CAMs are also used toimplement load and store queues in out-of-order instruction schedulers with a widescheduling window [36]. In addition, CAMs are used for pattern matching [37–39],data compression [40], DSP [41, 42], databases [43, 44], bioinformatics [45, 46],and logic minimization [47]. A variety of other scientific fields also use CAMs assingle-cycle associative search accelerators with millions of search entries.Despite their importance, the high implementation cost of CAMs means theyare used sparingly. As a result, FPGA vendors do not provide any dedicated CAM5circuitry or any special infrastructure to enable a construction of efficient CAMs.Furthermore, FPGAs lack an area-efficient soft CAM implementation. 
Current CAM approaches in vendor IP libraries achieve a maximum of 64K entries and utilize all the resources of a modern FPGA device. Instead, designers tend to use algorithmic search heuristics, causing a dramatic performance degradation.

1.2 Thesis Statement and Research Goals

This dissertation addresses the memory bottleneck of massively parallel reconfigurable systems by providing efficient, parallel and customizable embedded memory structures. Although MPRAMs and CAMs are important, their high implementation cost means they are used sparingly. As a result, FPGA vendors only provide standard dual-ported memories to handle the majority of usage patterns, and rely upon a simplistic soft CAM implementation that does not scale. This dissertation describes a novel, efficient and modular approach to construct MPRAMs and CAMs out of basic dual-ported RAM blocks, logic and registers.

The main goal is to address scaling issues with both designs, to permit use of deeper MPRAMs with more ports, as well as deeper and wider CAMs that can scale better. These new designs must be practical as well, meaning they achieve high clock rates and provide complete functionality such as initialization, bypassing, and fast updates.

[Figure 1.3: (left) A single-port SRAM cell using two cross-coupled CMOS inverters. (right) Multi-ported access using additional pass-gates, word and bit lines.]

1.3 Research Problem: On the Feasibility of Parallel Memory Structures in FPGAs

In this section, we describe the limitations of creating area-efficient and performance-oriented parallel memory structures in FPGAs, specifically multi-ported memories and content-addressable memories.

1.3.1 Multi-Ported Random Access Memories (MPRAMs)

A Static RAM (SRAM) cell is a Complementary Metal-Oxide-Semiconductor (CMOS) bi-stable circuit built out of CMOS cross-coupled inverters and access pass-gates, as illustrated in Figure 1.3 (left).
To provide more access ports, the basic SRAM bit cell can be altered to provide more bit lines, word lines, and access transistors, as described in Figure 1.3 (right); however, the area grows quadratically with the number of ports [48]. Furthermore, this requires a custom design for each unique set of parameters (e.g., number of ports, width and depth of RAM).

Since FPGAs must fix their RAM block design for generic, most common usage cases, it is too costly to provide highly specialized RAMs with a large number of ports. In FPGAs, one way of synthesizing a multi-ported RAM is to build it from registers and logic. However, this is only feasible for very small memories. Other techniques, covered in Chapter 2, are inefficient and do not scale well for deep or wide MPRAMs. Hence, a method of composing arbitrary multi-ported RAMs from simpler RAM blocks is required.

1.3.2 Content-Addressable Memories (CAMs)

CAMs are usually custom-designed at the transistor level [49–53]. As depicted in Figure 1.4, four additional transistors over the standard 6-transistor SRAM cell are required. The additional transistors form a comparison circuit, an XOR with an NMOS stack only, since its output (the match line) is pulled up. Other variations of the BCAM cell are also available, e.g., the 9-T NAND-type BCAM cell [49]. However, the custom approach requires more engineering effort and longer delays in time-to-market. This custom approach is unsuitable for FPGAs.

Older FPGA devices, including Altera's FLEX, Mercury and APEX devices [54], employed minor architectural features to directly construct small CAM blocks. However, FPGA vendors do not provide dedicated hard cores for CAMs in modern devices.
These have been replaced with soft CAM cores that employ existing Block-RAMs in a brute-force approach, described in this dissertation as the traditional or transposed-indicators RAM approach.

Figure 1.4: CMOS 10-T NOR-type BCAM cell; solid lines form a 6-T SRAM cell.

While modern databases can easily contain millions of entries, the area growth of traditional CAM techniques in FPGAs is high, currently limiting them to 64K entries. Wide and shallow RAMs are needed to efficiently implement brute-force CAMs. Shallow RAMs are required because each extra bit in the CAM pattern width doubles the required RAM depth, resulting in poor efficiency. In addition, deeper CAMs can be built by increasing RAM width. However, FPGA RAM block width is growing slowly. For example, M4K blocks in Stratix II devices have a minimal depth of 128 with a maximal width of 36; M9K blocks in Stratix III and Stratix IV devices have a minimal depth of 256 and a maximal width of 36; M20K blocks in Stratix V devices [55] have a minimal depth of 512 and a maximal width of 40. With the increasing depth of RAMs, and limited width growth, the brute-force approach is becoming less efficient.

1.4 Research Contributions

This section lists the contributions of this dissertation. The multi-ported memory contributions can be categorized into the Invalidation-Based Live-Value Table (I-LVT) work in Chapter 3 and the switched-ports work in Chapter 4, whereas our preliminary work on BCAMs is the 2-Dimensional Hierarchical Search BCAM (2D-HS-BCAM) work in Chapter 5, and the follow-up work is the Indirectly Indexed Hierarchical Search BCAM (II-HS-BCAM) in Chapter 6.

1.4.1 Invalidation-Based Live-Value Table (I-LVT)

Recently, a few FPGA-based multi-ported RAM designs have been proposed. They use a Live-Value Table (LVT) together with multi-banking for data storage [11].
While each writing port writes to a different bank, the LVT tracks the last-written bank for each memory address, allowing reading of the latest data. Since the LVT is composed of registers, it cannot build deep MPRAMs efficiently. Alternatively, the XOR-based method retrieves the latest written data by utilizing cancellation properties of the XOR operator [12]. The XOR-based method overcomes the register-based memory limitation by storing all data in BRAMs; however, this incurs excessive memory overhead, since wide memories are required to accommodate all the XOR-ed data.

In Chapter 3, an SRAM-based LVT is proposed by utilizing an invalidation-table data structure [1]. Similar to a regular LVT, the Invalidation-Based Live-Value Table (I-LVT) determines the correct bank to read from, but it differs in how updates to the table are made.

While previous approaches use registers to implement a live-value table, the breakthrough of this approach is that it is near-optimal and purely SRAM-based, allowing it to efficiently scale to large depths. Compared to other multi-ported approaches, the I-LVT also provides improved overall performance. This dissertation demonstrates the viability, area reduction, and performance benefits of the proposed approach and provides an open source library with a fully tested, generic and modular implementation that can be adopted in parallel computing systems.

1.4.2 Switched Ports

Switched ports, first introduced in [2], are a generalization of true (bidirectional) ports, where a certain number of write ports can be dynamically switched into a different number of read ports using a common read/write control signal. While a true port is a pair of read/write ports, switched ports are best described as a set of read ports and a set of write ports. Furthermore, a given application may have multiple sets, each set with a different read/write control.
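As an illustration, a switched port set behaves as a group of write ports or a group of read ports depending on a single shared control. The following is a behavioral Python sketch; the class and argument names are mine, not the thesis interface:

```python
class SwitchedPortSet:
    """Behavioral model: a port set that acts as write ports when the shared
    read/write control is high, or as read ports when it is low."""

    def __init__(self, depth: int):
        self.mem = [0] * depth

    def access(self, write_mode: bool, write_reqs=(), read_addrs=()):
        if write_mode:
            # The whole set is switched to writing: one request per write port.
            for addr, data in write_reqs:
                self.mem[addr] = data
            return []
        # The same physical ports are switched to reading.
        return [self.mem[addr] for addr in read_addrs]
```

A true (bidirectional) port is the special case of one write port switched with one read port; allowing arbitrary sets with one shared control is what exposes the extra data-bank sharing optimized in Chapter 4.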
While previous work generates multi-ported RAM solutions that contain only true ports [56], or only simple (unidirectional) ports, this research demonstrates that using only two models is too limiting and prevents optimizations from being applied. In particular, the use of switched ports can lead to a reduction in the amount of storage needed by the data banks. It does this by using the underlying true-dual-port capability of FPGA Block-RAMs.

The general problem of switched ports is optimized by solving the corresponding set-cover problem via Integer Linear Programming (ILP). This is the first time such an optimization model is used to construct multi-ported memories. A memory compiler that automates the construction of a multi-ported RAM with switched ports was released as an open source library.

1.4.3 2-Dimensional Hierarchical Search BCAM (2D-HS-BCAM)

Due to the increasing amount of processed information, modern BCAM applications demand a deep search space. However, traditional BCAM approaches in FPGAs suffer from storage inefficiency. The 2-Dimensional Hierarchical Search BCAM (2D-HS-BCAM) [4] is a novel and efficient technique for constructing BCAMs out of standard SRAM blocks in FPGAs. To achieve high storage efficiency, the suggested technique first searches for a row containing multiple patterns, then a search is done within the row for a precise match.

Using Altera's Stratix V device, the traditional design method achieves up to a 64K-entry BCAM, while the proposed technique achieves up to 4M entries. For the 64K-entry test case, the traditional method consumes 43 times more Adaptive Logic Modules (ALMs), requires 18 times longer mapping runtime, and achieves only one-third of the Fmax of the proposed method. A fully parameterized Verilog implementation is being released as an open source hardware library.
The library has been extensively tested using ModelSim and Altera's Quartus tools.

1.4.4 Indirectly Indexed Hierarchical Search BCAM (II-HS-BCAM)

The hierarchical search just described does not allow wide CAMs to be constructed; it incurs an exponential growth as pattern width increases. The II-HS-BCAM [5] approach solves this by indirectly storing match indicators. This allows II-HS-BCAM blocks to be cascaded in pattern width, allowing for linear growth. Our method exhibits high storage efficiency and is capable of implementing up to nine times wider BCAMs compared to other approaches.

1.5 Research Methodology and Evaluation Metrics

In this section, we describe the research methodology, including the design methodology, simulation and synthesis used throughout the dissertation. Furthermore, we review the evaluation metrics used to evaluate and compare our designs to other techniques.

1.5.1 Verilog Description

All architectures proposed in this dissertation, together with previous and standard approaches, have been fully developed as parameterized Verilog modules. The Verilog implementation is used to verify correctness, evaluate characteristics of the suggested architectures, and compare them to standard approaches and previous attempts.

Some modules could not be described in Verilog directly. Alternatively, a Verilog generator was developed to generate these modules based on given design parameters. For instance, the Priority Encoder (PE), which is used in Chapter 5 and Chapter 6 to generate CAM match addresses, has a recursive definition. Since Verilog is not capable of describing recursive modules, a generator was used to create the Verilog code recursively. In addition, the memory compiler in Chapter 4 requires solving a set-cover problem via Linear Programming (LP) to optimize the generated modules.
Since performing this optimization process in Verilog is not possible, a Verilog generator was used to create Verilog code based on the optimized solution.

1.5.2 Simulation and Synthesis

In each chapter, a large variety of different architectures and design parameters are swept and simulated in batch using Altera's ModelSim version 10.1e, each with over a million random cycles. All design modules were compiled, mapped and fit using Altera's Quartus II version 14.0 [57] on Altera's Stratix V 5SGXMABN1F45C2 device [55]. This is a high-end, performance-oriented, speed grade 2 device with 360K ALMs, 2640 M20K blocks, and 1064 I/O pins. Half of the ALMs can be used to construct Memory Logic Array Blocks (MLABs), where a single MLAB consists of 10 ALMs.

A run-in-batch simulation and synthesis flow manager that simulates and synthesizes various designs with various parameters in batch using Altera's ModelSim and Quartus II is also provided. The Verilog modules, the Verilog generators, the algorithmic scripts, and the flow manager are available online as an open source contribution [58].

1.5.3 Results Collection

A general sweep is performed to test all combinations. Following this, the full set of results is analyzed. In this dissertation, we omit many of the in-between settings because they behaved as one might expect via interpolation of the endpoints.

The run-in-batch flow manager was also used to collect design results of the proposed and previous techniques. Performance evaluation (e.g., Fmax) was generated by TimeQuest, the Quartus II Static Timing Analysis (STA) engine. Resource consumption, such as registers, Look-Up Tables (LUTs), ALMs, Logic Array Blocks (LABs), MLABs, and Block-RAMs (BRAMs), was collected after the design was fit to the Stratix V device by the Quartus II fitter.

1.5.4 BRAM Packing Approximation

As shown in Figure 1.5, Altera's M20K blocks can be configured into several RAM depth and data width configurations [55].
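Anticipating the estimate formalized in Equation 1.1 below, the packing rule can be sketched as a small Python helper. This is an illustrative sketch, not part of the thesis tool flow; it assumes the M20K configurations 16K×1, 8K×2, 4K×5, 2K×10, 1K×20 and ½K×40, uses ceiling division to count partially filled blocks, and adds a guard (my addition) for depths that are exact multiples of 16K lines:

```python
from math import ceil

K = 1024

def m20k_blocks(d: int, w: int) -> int:
    """Estimate the number of M20K blocks for a d-deep, w-wide simple RAM."""
    # Every full 16K-line slice is packed into 16K x 1 blocks: w blocks each.
    blocks = (d // (16 * K)) * w
    r = d % (16 * K)
    if r == 0:
        return blocks            # exact multiple of 16K lines (my guard)
    if r > 8 * K:
        return blocks + w        # remainder still needs 16K x 1 blocks
    if r > 4 * K:
        return blocks + ceil(w / 2)    # 8K x 2 configuration
    if r > 2 * K:
        return blocks + ceil(w / 5)    # 4K x 5 configuration
    if r > 1 * K:
        return blocks + ceil(w / 10)   # 2K x 10 configuration
    if r > K // 2:
        return blocks + ceil(w / 20)   # 1K x 20 configuration
    return blocks + ceil(w / 40)       # 512 x 40 configuration
```

For example, a 20K-deep, 10-bit-wide RAM packs as ten 16K×1 blocks for the first 16K lines plus two 4K×5 blocks for the 4K-line remainder, twelve blocks in total.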
The total amount of utilized SRAM bits can be either 16Kbits or 20Kbits. Assuming that the RAM packing process minimizes the number of blocks cascaded in depth to avoid additional address decoding, each 16K lines will be packed into single-bit-wide blocks, and the remainder will be packed into the minimal required configuration.

Figure 1.5: Altera M20K configurations: (left) 20Kb; (right) 16Kb.

An estimation of the number of packed M20K blocks required to construct a RAM with a specific depth, d, and data width, w, is nM20K(d,w). This value is described as follows:

nM20K(d,w) = ⌊d/16K⌋ · w +
    w,        if d mod 16K > 8K
    ⌈w/2⌉,    if 8K ≥ d mod 16K > 4K
    ⌈w/5⌉,    if 4K ≥ d mod 16K > 2K
    ⌈w/10⌉,   if 2K ≥ d mod 16K > 1K
    ⌈w/20⌉,   if 1K ≥ d mod 16K > ½K
    ⌈w/40⌉,   if ½K ≥ d mod 16K.       (1.1)

1.6 Dissertation Organization

The rest of this dissertation is organized as follows. Chapter 2 provides a background on multi-ported memories and content-addressable memories in FPGAs. Standard and previous methods are surveyed and their limitations are discussed. Chapter 3 provides a detailed description of our near-optimal Invalidation-Based Live-Value Table (I-LVT) multi-ported memory. Multi-ported memories with switched ports and the Multi-Switched-Ports Random Access Memory (MSPRAM) compiler are described in Chapter 4. The 2-Dimensional Hierarchical Search BCAM (2D-HS-BCAM), an efficient and deep BCAM for narrow patterns, is provided in Chapter 5. Chapter 6 describes the Indirectly Indexed Hierarchical Search BCAM (II-HS-BCAM), an area-efficient and high-performance BCAM architecture, based on hierarchical search and compression techniques, that supports wide patterns. Future directions and conclusions are drawn in Chapter 7.

Chapter 2

Background and Related Work

This chapter provides a background on multi-ported memories and content-addressable memories in FPGAs.
Standard and previous methods are surveyed and their limitations are discussed.

2.1 RAM Multi-Porting Techniques in Embedded Systems

This section provides a review of current methods of creating multi-ported memories in embedded systems. Creating multi-ported access to register-based memories is described in Section 2.1.1. Multi-pumping is described in Section 2.1.2. Increasing the number of read ports via bank replication, and the number of write ports via multi-banking, is described in Section 2.1.3 and Section 2.1.4, respectively.

2.1.1 Register-based RAM

Multi-ported RAM arrays can be constructed using basic Flip-Flop (FF) cells and logic. As depicted in Figure 2.1, each writing port uses a decoder to steer the relevant written data into the addressed row. Each read port uses a mux to choose the relevant register output. This method is not practical for large memories due to area inflation, fan-out increase, performance degradation, and a decline in routability.

2.1.2 RAM Multi-Pumping

A time-multiplexing approach can be applied to an SRAM block to reuse access ports and share them among several clients, each during a different time slot. As depicted in Figure 2.2, addresses and data from several clients are latched, then given round-robin access to a dual-ported RAM. The RAM must operate at a higher frequency than the rest of the circuit. If the maximum RAM frequency is similar to the pipe frequency, or a large number of access ports is required, then multi-pumping cannot be used.

In FPGAs, the process of generating multi-pumped memories can be automated and tailored for heterogeneous applications [59–62]. Furthermore, multi-pumping can be combined with other techniques to save area [63]. A number of designs utilize multi-pumping to gain additional access ports while keeping area overhead minimal [64, 65].
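Behaviorally, multi-pumping time-slices one dual-ported RAM among its clients: all requests are latched at the system-clock edge, then served one per fast internal cycle. A minimal Python sketch (illustrative only; the actual designs are RTL, and the class name is mine):

```python
class MultiPumpedRAM:
    """One 1W/1R RAM internally clocked several times per system cycle."""

    def __init__(self, depth: int, n_write: int, n_read: int):
        self.mem = [0] * depth
        self.n_write, self.n_read = n_write, n_read

    def cycle(self, writes, reads):
        """writes: one (addr, data) request per write client, or None;
        reads: one address per read client. Each list entry corresponds to
        one fast-clock slot reusing the single underlying port."""
        assert len(writes) == self.n_write and len(reads) == self.n_read
        for req in writes:                 # write port reused n_write times
            if req is not None:
                addr, data = req
                self.mem[addr] = data
        return [self.mem[a] for a in reads]  # read port reused n_read times
```

With two write and two read clients, the internal RAM must run at twice the pipeline frequency, which is exactly why multi-pumping stops working once the RAM's maximum frequency approaches the pipeline's.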
The 2.3GHz Wire-Speed POWER processor uses double-pumping to double the writing ports [66].

Figure 2.1: Register-based multi-ported memory.

Figure 2.2: Multi-pumping: RAM is overclocked, allowing multiple accesses during one pipeline cycle.

2.1.3 Multi-Read RAM: Bank Replication

To provide more reading ports, the whole memory bank can be replicated while keeping a common write address and data, as shown in Figure 2.3. Data is written to all bank replicas at the same address, hence reading from any bank is equivalent. This method incurs high SRAM area and consumes more power. However, the replication approach has two strong advantages over other multi-porting approaches. The first is the simplicity and modularity of bank replication. The second is that read access time is unaffected as the number of ports increases; only write delays increase due to fan-out, but this can be hidden by pipelining and bypassing. The bank replication technique is commonly used in state-of-the-art processors to increase parallelism. The 2.3GHz Wire-Speed POWER processor replicates a 2-read SRAM bank to achieve 4 read ports [66]. Each of the two integer clusters of the Alpha 21264 processor has a replicated 80-entry register file, thus doubling the number of read ports to support two concurrent integer units. Similarly, the 72-entry floating-point register file is duplicated, supporting two concurrent floating-point units [67].

2.1.4 Multi-Write RAM: Emulation via Multi-banking

Multi-ported memories are very expensive in terms of area, delay, and power for a large number of ports.
The overhead of multi-porting can be reduced by multi-banking if one relaxes the guaranteed access delay constraint. As depicted in Figure 2.4, the total RAM capacity can be divided into several banks, each with few ports (e.g., dual-ported). A fixed hashing scheme is used to map each address to a single bank; often, the address MSBs are used. Arbitration logic steers accesses from multiple ports to each bank. Since two ports can request access to the same bank at the same time, a conflict-resolving circuit determines which port is granted access to a specific bank. The other port misses the arbitration and is required to request access again. Not only does this approach provide unpredictable access latency due to arbitration misses, but it also increases delay due to the additional circuitry. Several approaches have been proposed to improve multi-banking [6, 68–71]. State-of-the-art memory controllers and caches are based on multi-banking due to its area and power efficiency. For example, Intel's i486 CPU has a data cache with 8 interleaved banks and two access ports [72].

Figure 2.3: (left) Replicated dual-ported banks with a common write port. (right) Multi-read RAM symbol used in this dissertation.

Figure 2.4: Multi-banking: RAM capacity is divided into several banks. Ports access banks through a fixed hashing scheme.

2.2 Multi-ported SRAM-based Memories: Prior Work

In this section, a review of three previous multi-ported SRAM-based memories is provided. The first approach is based on multi-banking with an LVT [11, 73] and is described in Section 2.2.1.
The second approach retrieves the latest written data by utilizing logical XOR properties [12, 73] and is described in Section 2.2.3. Constructing data banks with true-port support is described in Section 2.2.2. These methods are surveyed and their limitations are discussed.

2.2.1 LVT-Based Multi-ported RAM

As depicted in Figure 2.5 (left), LVT-based multi-ported memories are composed of two portions: the LVT and the data storage. In the data storage, each storage bank has a single write port and nR read ports. This bank is replicated nW times, where nW and nR are the number of write ports and read ports, respectively. When a write port writes data to a specific address, the LVT is updated to indicate the port of this latest write operation. When this address is later read, the LVT provides this write port ID to retrieve the correct data.

For each RAM address, the LVT stores the ID of the bank replica that holds the latest data. The data storage uses a different bank replica for each writing port, while each bank replica has several reading ports. All banks are accessed by all read addresses in parallel; the LVT steers the read data out of the correct bank, since it holds the ID of the last accessed (written) bank for each address.

Figure 2.5: (left) LVT-based multi-ported memory. (right) An LVT implemented using multi-ported memory.

Actually, the LVT itself is a multi-ported memory with the same depth and number of writing ports as the implemented multi-ported RAM. However, since the LVT stores only bank IDs, the data (line) width of the LVT is only ⌈log2 nW⌉ bits.
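The steering just described can be captured in a few lines of Python. This is a behavioral sketch of the LVT scheme of [11], not the thesis RTL, and the class name is illustrative: each write port owns one bank replica, and the LVT records which bank last wrote each address.

```python
class LVTMultiPortedRAM:
    """Behavioral nW-write / nR-read RAM: one bank per write port plus an LVT."""

    def __init__(self, depth: int, n_write: int):
        self.banks = [[0] * depth for _ in range(n_write)]  # data banks
        self.lvt = [0] * depth   # last-written bank ID per address

    def write(self, port: int, addr: int, data: int) -> None:
        self.banks[port][addr] = data   # each write port writes its own bank
        self.lvt[addr] = port           # LVT: bank `port` now holds the live value

    def read(self, addr: int) -> int:
        return self.banks[self.lvt[addr]][addr]  # LVT steers the read mux
```

In hardware, the LVT itself must be a multi-ported memory with the full depth of the RAM, which is exactly the register-cost problem that the I-LVT of Chapter 3 removes.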
As described in Figure 2.5 (right), the LVT doesn't have write data; instead, each port writes a fixed bank ID.

Since an LVT is a narrow multi-ported memory, it is implemented as a register-based multi-ported memory. As explained in Section 2.1.1, register-based RAM is not suitable for building deep memories. While the LVT width is only log2 of the number of writing ports, the depth of the LVT is still identical to the depth of the original RAM. This is the main cause of the area overhead. In Chapter 3, we present two methods for building the LVT out of Block-RAMs instead of registers, allowing deep and wide multi-ported memories to be constructed much more efficiently.

Assuming that bank IDs are binary encoded, the total number of registers required to implement the LVT is

d · ⌈log2 nW⌉, (2.1)

where d is the depth of the memory.

For deep memories, the large number of registers and huge read multiplexers make register-based LVTs impractical. For example, a Stratix V GX A5 device (185K ALMs) can fit at most a 16K-deep memory with four write ports.

For the data storage part, a total of nW multi-read banks is required, one for each write port. Each multi-read bank supports nR read ports, allowing the LVT to select the required data block. The total number of SRAM cells in the data storage is

d · w · nW · nR, (2.2)

where w is the data width.

Using Altera's Stratix M20K BRAMs, the total number of M20K blocks required for the data storage is

nM20K(d,w) · nW · nR. (2.3)

2.2.2 Multi-Ported Random Access Memories with True Dual-Ports

In the previous subsection, the data storage portion is built with simple read and simple write ports. Two papers [56, 73] introduced a modification to the data banks to build a multi-ported memory where all ports are true (bidirectional) ports. This is achieved by utilizing the bidirectional functionality of the underlying true-dual-port BRAMs in the FPGA.
Figure 2.6 (left) illustrates a generalization of this method. Each port either writes to a set of data banks or reads from them. Every pair of ports has one data bank in common; hence, when a port reads, it can access data written by any other port. A register-based LVT determines which bank holds the latest data. Figure 2.6 (right) describes an example of a RAM with 3 true ports.

This problem is identical to the mathematical handshake problem, where each port must connect (handshake) to all other ports via a RAM. Hence, a total of

(1/2) · nt · (nt − 1) (2.4)

data copies is required, where nt is the number of bidirectional (true) ports. In contrast, the original LVT approach [11] requires nt² data copies. Nevertheless, this true-port RAM architecture is still based on a register-based LVT, hence it suffers from the same shortcomings.

Figure 2.6: Data banks with bidirectional true-port support; each port shares a single BRAM with every other port. (left) Generalized approach. (right) A 3 true-ports example.

2.2.3 XOR-Based Multi-Ported RAM

While the two LVT-based multi-ported memories just shown implement their LVTs as register-based multi-ported memories, the XOR-based multi-ported memory is implemented using SRAM blocks [12, 73]. This makes it more efficient for deep memories. However, as will be shown, it is inefficient for wide memories.

The XOR-based method utilizes the special properties of the XOR function to retain the latest written data for each write port [74].
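Before walking through the algebra, the XOR trick can be demonstrated with a small behavioral Python sketch (illustrative only, with my own class name): each write port stores its new data XOR-ed with the other banks' contents at that address, and a read XOR-s all banks together.

```python
class XorMultiPortedRAM:
    """Behavioral XOR-based nW-write RAM: banks hold encoded, not raw, data."""

    def __init__(self, depth: int, n_write: int):
        self.banks = [[0] * depth for _ in range(n_write)]

    def write(self, port: int, addr: int, data: int) -> None:
        # Encode: XOR the new data with every *other* bank's current word.
        others = 0
        for i, bank in enumerate(self.banks):
            if i != port:
                others ^= bank[addr]
        self.banks[port][addr] = data ^ others

    def read(self, addr: int) -> int:
        # Decode: XOR-ing all banks cancels every term except the latest data.
        value = 0
        for bank in self.banks:
            value ^= bank[addr]
        return value
```

In hardware, the "read the other banks" step is a synchronous feedback read with a one-cycle delay, so the write data must be retimed to match it; this overhead is quantified later in this section.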
XOR is commutative, a⊕b = b⊕a; associative, (a⊕b)⊕c = a⊕(b⊕c); zero is the identity, a⊕0 = a; and each element is its own inverse, a⊕a = 0.

Like the previous methods, the XOR-based method contains two structures: one replaces the LVT and is used to encode written data, while the other is similar to the data storage but stores encoded data.

As illustrated in Figure 2.7, each write port has a bank with multiple read ports and a single write port. Some of the read ports are used as feedback to encode the data to be rewritten, while the remaining read ports generate the data output. To perform a write, the new data is XOR-ed together with all the old data from the other banks; this encoded value is written to the corresponding bank. Hence, if an address A is written through write port i with data WDatai, Banki will be written with

Banki[A] ← Bank0[A] ⊕ Bank1[A] ⊕ ··· ⊕ WDatai ⊕ ··· ⊕ BanknW−1[A]. (2.5)

A read is performed by XOR-ing the data for the corresponding read address from all the banks; hence,

RDatai[A] ← Bank0[A] ⊕ Bank1[A] ⊕ ··· ⊕ BanknW−1[A]. (2.6)

Substituting Banki[A] from Equation 2.5 into Equation 2.6 and applying the commutative and associative properties of XOR shows that each bank appears twice in the XOR equation, and hence is cancelled, since XOR-ing identical elements yields 0. The only remaining term is WDatai, the required data.

The XOR-based multi-ported memory requires nW multi-read banks, one for each write port. Each multi-read bank supports nW − 1 feedback read ports and nR data read ports. Each feedback read port is of width w, to match the write data, so these feedback memories can be quite large. The number of SRAM cells required for the entire multi-ported memory is

d · w · nW · (nR + nW − 1). (2.7)

Using Altera's Stratix M20K BRAMs, the total number of required M20Ks is

nM20K(d,w) · nW · (nR + nW − 1). (2.8)
Figure 2.7: XOR-based multi-ported memory.

Since FPGA Block-RAM is synchronous, data feedbacks are read with a one-cycle read delay. Hence, the written data, their addresses and write enables must be retimed to match the feedback data. This requires the following number of registers:

nW · (w + ⌈log2 d⌉ + nW). (2.9)

2.3 FPGA-Based Binary Content-Addressable Memories (BCAMs)

CAMs are usually designed at the transistor level [49–53]. Older FPGA devices, including Altera's FLEX, Mercury and APEX devices [54], employed minor architectural features to support small CAM blocks. However, FPGA vendors do not provide dedicated hard cores for CAMs in modern devices. These have been replaced with soft CAM cores that employ a brute-force approach. While the address space in modern databases can easily exceed millions of entries, traditional CAM techniques in FPGAs cannot satisfy these requirements. Wide and shallow RAMs are needed to efficiently implement brute-force BCAMs, yet BRAM width is growing slowly.

CAMs can be classified into two major classes: Binary Content-Addressable Memories (BCAMs) and Ternary Content-Addressable Memories (TCAMs). While BCAMs hold binary values only, TCAMs can also hold don't-care values (X's) that match any value of the corresponding pattern bit. In this dissertation, we focus on CAM architecture in FPGAs. As depicted in Figure 2.8, CAM architectures in FPGAs can be classified into three categories, based on the FPGA memory resources they utilize.
The first is register-based, where registers are used to store patterns and all register values are compared concurrently; however, register resources are limited, and a state-of-the-art Altera Stratix V device [55] can provide at most a 32K-entry BCAM with single-byte patterns. In the second approach, SRAM blocks are utilized to store, for each pattern individually, whether it exists at each address, hence the name transposed RAM, also known as the brute-force approach. The same Altera Stratix V device can realize a CAM with byte-wide patterns up to 64K in depth. The third method is the Reconfiguration Memory Based Content-Addressable Memory (RCAM), where LUT configuration memory is utilized. It is possible to implement a 128K-line single-byte pattern RCAM on the same Stratix V device. However, a LUT's SRAM configuration is written serially, which adversely affects writing throughput. The following subsections provide a detailed review of these categories.

Figure 2.8: CAM classification in FPGAs: register-based; SRAM-based, comprising transposed-RAM (brute-force) and set-transposed-RAM (hierarchical search); and reconfiguration-memory-based (RCAM).

This section provides a review of current BCAM architectures in FPGAs. Using registers to create BCAMs is described in Section 2.3.1. The traditional brute-force BRAM-based approach is described in Section 2.3.2. BCAM pattern width cascading is described in Section 2.3.3. Reconfiguration Memory Based Content-Addressable Memories (RCAMs) are explained in Section 2.3.4. Alternative algorithmic searches are briefly described in Section 2.3.5. A review of FPGA vendors' support for BCAMs is given in Section 2.3.6.

2.3.1 Register-Based BCAMs

The flexibility of reading and writing flip-flops makes it possible to concurrently read and compare all the patterns, as depicted in Figure 2.9. Similar to a register-based RAM, an address decoder is used to generate one-hot decoded address lines, enabling a single line for writing. Each registered pattern is compared with the Match Pattern (MPatt) simultaneously; the comparison results drive the match
Each registered pattern is compared with theMatch Pattern (MPatt) simultaneously; the comparison results drive the match31PWP=En Reg Match/MIndc⌊log2CD⌋ Address DecoderPWWAddrWPWPW==MAddrCD1D Q0EnDEnQRegRegCD-1⌊log2CD⌋MatchWPatt D Q MPattFigure 2.9: Register-based BCAM.line, also called Match Indicators (MIndc) followed by a priority-encoder to detectthe first Match Address (MAddr). The high demand for limited resources suchas registers, comparators, address decoder and priority encoder (proportional toBCAM depth), make this approach infeasible for deep BCAMs; using Altera’shigh-end Stratix V device [55], only a one byte-wide, 32K-depth BCAM can begenerated.2.3.2 Brute-Force Approach via Transposed IndicatorsRAM (TIRAM)The brute-force approach creates BCAMs out of standard BRAMs. These BRAMsare addressed with the match pattern, thus allowing a single cycle pattern match.This approach is utilized by FPGA vendor IP libraries or application notes. Asdepicted in Figure 2.10, a BRAM is addressed by the match pattern while each bitof the RAM data bits indicates the existence of the pattern. The data bit positioncorresponds to the BCAM address location. Thus, the depth of the CAM, CD, mustmatch the width of the data RAM, DW . Also, the pattern width, PW , of the CAM32Indicators RAM (Transposed) PEPWMPatt MAddr⌊log2CD⌋MIndcDW=CDAddr Data011 2 3 4 5 6 7CAM Locations (RAM Data)Match PatternRAM Width = CAM DepthRAM Depth = 2Pattern WidthFigure 2.10: (left) CAM as a Transposed-RAM followed by a Priority En-coder (PE). (right) Example of a pattern indicator for pattern ‘10’ inaddress 4.must match the address width of the data RAM, i.e., PW = dlog2RDe, where RD isRAM depth.In this dissertation, the Transposed Indicators RAM (TIRAM) structure inFigure 2.10 (left) is described as a matrix of indicatorsT IRAM =I0,0 I0,1 · · · I0,CD−1I1,0 I1,1 · · · I1,CD−1...... . . 
....I|P|−1,0 I|P|−1,1 · · · I|P|−1,CD−1∀a ∈ A, p ∈ P : Ip,a =(RAM [a] EQ P),(2.10)Where A is the address space set and P is the pattern set in the correspondingRAM structure.The complete system of the brute-force TIRAM approach is described inFigure 2.11. Reading from the TIRAM is performed by providing the MatchPattern (MPatt) as address to read the Match Indicators (MIndc) for the entireBCAM address space for this specific match pattern. A priority encoder detects33the first Match Address (MAddr) from the match indicators. However, writing(also called or updating) to the TIRAM structure requires more computation sinceit requires setting the new match indicator and clearing the old match indicator.Appendix A provides the detailed writing mechanism of the brute-force TIRAMBCAM.To implement a BCAM with CD entries and PW pattern width, namely a CD×PW BCAM, The brute-force TIRAM approach requires CD ·PW SRAM cells forthe Reference RAM (RefRAM), shown in Figure 2.11, which stores a copy of thePattern written to at each CAM location. In contrast, the TIRAM requires 2PW ·CDSRAM cells, for a total of:CD ·PW +2PW ·CD. (2.11)Assuming that RefRAM is fully utilized, and the TIRAM uses the widestBRAM configuration, the BRAM count is estimated as⌈CD ·PWRD,min ·RW,max⌉+⌈2PWRD,min⌉·⌈CDRW,max⌉, (2.12)where RW,max and RD,min are BRAM parameters indicating the maximum width,and minimum depth, respectively.This BRAM-based brute-force approach is adopted by Xilinx [75–78] andAltera [79] to create soft CAMs as described in their application notes. 
Zerbini andFinochietto [80] apply the pattern width cascaded brute-force approach to emulateTCAMs for packet classification; however updating the TCAM content is not dis-34⌊ ⌋WPattWAddrPWWDataAddrR e f R A M  C D  X  P W2 31 00 1AddressesExampleWrite Controllerlog2CDPW01PriorityEncoder⌊log2PW⌋MIndcAddrRDataW/RT I R A M  2 P w  X  C DCD7 16 15 14 23 3AddressesPatterns4×8 ExampleAddresses8×2 ExamplePWWr MPattMPattPatternsFigure 2.11: Brute-force TIRAM approach with 8 × 2 example.cussed. Jiang [81] also uses the brute-force approach to emulate TCAMs, however,a pattern update requires a sequential rewriting of all RAM addresses for eachpattern. Ullah et al. [82–85] also use the brute-force approach, but their TCAMscan be partitioned, allowing the search of specific TCAM fragments. However, therequired storage is still similar to the brute-force approach. Furthermore, rewritingthe TCAM requires a serial rewriting of all RAM locations.2.3.3 BCAM Pattern Width Cascading and ScalingAs shown in Equation 2.11, SRAM cell usage for the brute-force TIRAM approachis exponential to pattern width PW . A wide pattern width will make the SRAM re-quirements infeasible. BCAM pattern width cascading relaxes this SRAM growth35from exponential into linear. As depicted in Figure 2.12, the BCAM pattern (bothmatched and written pattern) is divided into smaller pattern segments; each seg-ment is associated with a separate BCAM. The BCAM pattern width cascadingtechnique is used by Xilinx[75–78] and Altera [79] to create soft scalable BCAMsas described in their application notes.The write operation writes every pattern segment into its corresponding BCAM,while match operation matches each pattern segment with its corresponding BCAM.A match for the entire pattern is found if a match is found for all segments individ-ually at the same BCAM location (hence the bitwise AND). 
The optimal pattern segment width is determined by the minimal depth RD,min of the BRAM, namely the shallowest and widest configuration, since choosing a wider segment incurs exponential growth of the SRAM requirements. The optimal pattern segment width is therefore

PW,opt = ⌊log2(RD,min)⌋,

and the total number of cascaded BCAM pattern segments is

nC = ⌈PW/PW,opt⌉.

One TIRAM stage consumes ⌈CD/RW,max⌉ BRAMs, and the total BRAM consumption of the cascaded BCAM is therefore

nC·⌈CD/RW,max⌉ + ⌈(CD·PW,opt)/(RD,min·RW,max)⌉.    (2.13)

The linear relation to PW and CD in Equation 2.13 is clear, in contrast to the uncascaded version in Equation 2.12, where the relation to PW is exponential.

Figure 2.12: BCAM pattern width cascading: (top) abstraction, (bottom) details.

2.3.4 Reconfiguration Memory Based Content-Addressable Memories (RCAMs)

FPGA configuration memory is an SRAM chain loaded with the configuration bit-stream via a serial link; it is used to configure the device functionality, mainly the routing and logic (LUT) functionality. A segment of this configuration chain is shown in Figure 2.13, where it is used to configure a 4-input LUT. A modern FPGA device accommodates several Mbits of configuration SRAM cells; for instance, Altera's Stratix V E device contains 22Mbits of configuration bits for LUTs alone [55].

The SRAM reconfiguration memory in FPGAs can be utilized as a wide and shallow memory to generate Reconfiguration Memory Based Content-Addressable Memories (RCAMs). As depicted in Figure 2.13, a LUT MUX chooses, for each specific input, the one SRAM cell that represents the LUT function for this input.
Alternatively, the LUT input is used as a BCAM match pattern, and the LUT configuration SRAM cells indicate, for a single CAM address, whether this specific pattern exists. Hence, a 4-input LUT fits a single-address BCAM with a 4-bit pattern, namely a 1×4 BCAM. The address space can be increased by searching several BCAMs concurrently for the same pattern, i.e., sharing LUT inputs. The pattern width can be increased by cascading BCAM blocks as described in Section 2.3.3.

Due to its inherent nature, a variety of device-dependent RCAM approaches have been proposed [86–92]. All these device-dependent methods utilize the reconfiguration capabilities of Xilinx devices via JBits [93], a Java-based Application Programming Interface (API) that reconfigures Xilinx devices by accessing and modifying the configuration bit-stream. However, RCAM writing requires rewriting the entire bit-stream, or parts of it in case partial reconfiguration is supported. In addition, using LUTs as RCAMs blocks logic resources.

On the other hand, Altera's Stratix devices provide full access to the LUT configuration memory as SRAM blocks with decoded addresses, called MLABs [55]. However, the LUT configuration memory can be used either for LUT configuration or as part of the MLABs. For example, each ALM of the Stratix V device can accommodate a 6-input LUT, hence 64 configuration bits. Every 10 ALMs (a single LAB) create a simple dual-ported 64×10 or 32×20 MLAB block. MLABs can be utilized to create BCAMs in a similar manner to the reconfiguration memory, although the writing mechanism is different.

Other approaches cascade logic LUTs to generate area-efficient CAMs [94–97]. However, writing to these CAMs is either not supported or requires rewriting the LUT content. Furthermore, logic resources are limited and incur increased delays.

The RCAM approach suffers from several drawbacks that make it impractical in many cases.
RCAM writing requires rewriting the entire bit-stream, or parts of it in case partial reconfiguration is supported. While some applications require a single-cycle writing capability, RCAMs require several cycles to rewrite the bit-stream, depending on the configuration mechanism. In addition, using LUTs as RCAMs consumes logic resources; this is not the case in BRAM-based BCAMs, where the logic resources and the SRAM blocks are separate. Finally, RCAMs are device-dependent due to the usage of internal LUT parameters and configuration; a redesign is required for different devices, which demands more engineering effort.

2.3.5 Algorithmic Heuristics

Algorithmically, BCAMs are functionally equivalent to associative array data structures, where a set of keys is provided and each key is associated with a specific value. Given a specific key, an associative array searches for the key and returns the data associated with it.

Figure 2.13: FPGA LUT as BCAM.

A linear search or scan traverses the memory space sequentially, hence a worst-case runtime of O(n). Hash tables [98] distribute entries across the memory and reduce the average lookup to O(1), while the worst case remains O(n). Self-balancing or height-balanced Binary Search Trees (BSTs), e.g., AVL trees and red-black trees [98], can also be used to algorithmically construct associative arrays, with a worst-case runtime of O(log n).

Algorithmic-heuristic CAM emulations for specific applications are widely available, often requiring multiple cycles per lookup.
Examples include Trie- and BST-based IP lookup engines [18–22], Bloom filters [99, 100], and hash-table-based associative arrays [30, 33].

2.3.6 Vendor Support for BCAMs

Modern FPGAs provide plenty of embedded hard-coded blocks, such as Block RAMs, external memory controllers, processors, DSP blocks/multipliers, and fast I/O transceivers. However, hard CAM blocks do not exist in modern FPGAs, presumably due to their high area and power consumption and their highly specialized nature. While most FPGA vendors provide simple register-based or brute-force SRAM-based CAMs, some older devices provide partial support for CAM construction. Altera's legacy FLEX, Mercury, and APEX [54] device families integrate intrinsic BCAM support into their Embedded System Blocks (ESBs). An ESB can be configured into several modes: a 2Kbit RAM/ROM with configurable width and depth, 32 product terms with 32 literal inputs, or a 32×32 BCAM. These BCAM blocks can be used in parallel to increase the address space and can be cascaded as described in the previous subsection to increase the pattern width. Since ESBs are limited to a few hundred blocks in these devices, and due to routing complexity, deep CAMs are infeasible. Furthermore, BCAMs can only be implemented as soft IP in modern Altera devices.

On the other hand, Xilinx devices do not provide native support for BCAMs. However, the partial reconfiguration capabilities of Xilinx Virtex devices can be utilized to create a CAM as described in Xilinx application notes [75, 101, 102], although this approach is very slow at writing new patterns. Other Xilinx application notes suggest utilizing the brute-force approach to create BCAMs [75, 76, 78].

Lattice ispXPLD devices [103] have integrated support for CAMs via their Multi-Function Blocks (MFBs), which can be configured into 128×48 Ternary CAM blocks (with don't-care values). Alternatively, Actel application notes [104] recommend using multi-cycle CAMs that search BRAMs in parallel.
For a single-cycle CAM, using registers is suggested.

Chapter 3

Multi-Ported Random Access Memories via Invalidation-Based Live-Value Table (I-LVT)

In this chapter, a novel and modular approach is proposed to construct multi-ported memories out of basic dual-ported RAM blocks. Like other multi-ported RAM designs, each write port uses a different RAM bank, and each read port uses replication within a bank. The main contribution of this work is an optimization that merges the previous live-value table (LVT) and XOR approaches into a common design that uses a generalized, simpler structure we call an invalidation-based live-value table (I-LVT). Like a regular LVT, the I-LVT determines the correct bank to read from, but it differs in how updates to the table are made: the LVT approach requires multiple write ports, often leading to an area-intensive register-based implementation, while the XOR approach uses wider memories to accommodate the XOR-ed data and suffers from lower clock speeds. Two specific I-LVT implementations are proposed and evaluated: binary and thermometer coding. The I-LVT approach is especially suitable for larger multi-ported RAMs because the table is implemented only in SRAM cells. The I-LVT method gives higher performance while occupying fewer block RAMs than earlier approaches: for several configurations, the suggested method reduces block RAM usage by over 44% and improves clock speed by over 76% compared to the best of previous approaches. To assist others, we are releasing our fully parameterized Verilog implementation as an open-source hardware library. The library has been extensively tested using ModelSim and Altera's Quartus tools.

3.1 Introduction

In this section, a modular and parametric multi-ported RAM is constructed out of basic dual-ported RAM blocks while keeping minimal area and performance overhead. The suggested method significantly reduces SRAM use and improves performance compared to previous attempts.
To verify correctness, the proposed architecture is fully implemented in Verilog, simulated using Altera's ModelSim, and compiled using Quartus II. A large variety of memory architectures and parameters (e.g., bypassing, memory depth, data width, number of reading and writing ports) is simulated in batch, each configuration with over one million random memory access cycles. Stratix V, Altera's high-end performance-oriented FPGA, is used to implement and compare the proposed architecture with previous approaches.

The major contributions of this chapter are:

• A novel I-LVT architecture to produce modular multi-ported SRAM-based memories. It is built out of dual-ported SRAM blocks only, without any register-based memories. To the authors' best knowledge, compared to other multi-ported approaches, the I-LVT consumes the fewest SRAM cells. It also provides improved overall performance.

• A fully parameterized Verilog implementation of the suggested methods, together with previous approaches, released as an open-source hardware library. A flow manager to simulate and synthesize various designs with various parameters in batch using Altera's ModelSim and Quartus II is also provided. The Verilog modules and the flow manager are available online [105].

The rest of this section is organized as follows. The proposed invalidation-based live-value-table method is described in detail and compared to previous methods in Section 3.2.
The experimental framework, including simulation and synthesis results, is discussed in Section 3.3, and conclusions are drawn in Section 3.4.

3.2 Invalidation Table

As described in Section 2.2.3, to build an MPRAM, the XOR-based approach requires nW·(nR+nW−1) encoded copies of the RAM content, while the LVT approach requires another register-based multi-ported memory, with the same number of read and write ports, for the bank IDs.

This work proposes to implement LVTs using SRAM blocks only, which has a major scaling advantage over register-based LVTs and a lower SRAM area compared to the XOR-based approach. Instead of requiring multiple write ports to each multi-read bank, as in the regular LVT method, we suggest a design with a single write port each, like the XOR method. This makes it feasible to implement the LVT using standard dual-ported RAMs. However, writing an ID to one bank also requires invalidating the IDs in the other banks, which is what produces the need for multiple write ports. Instead, we suggest writing an ID to only one specific bank and invalidating all the other IDs with that single write by using an invalidation table. Since the invalidation table has the same functional behavior as an LVT, we call it an invalidation-based LVT, or I-LVT.

The I-LVT does not require multiple writes to indicate the last-written bank. Instead, as shown in Figure 3.1, the I-LVT reads all other bank IDs as feedback, embeds the new bank ID into the other values through a feedback function f_fb, then rewrites the specific bank. To extract the latest written bank ID, all banks are read and the data is processed with the output function f_out to regenerate the required ID. The selection of the f_fb and f_out functions is what distinguishes different I-LVT implementations.

The I-LVT requires nW multi-read banks, each with nR read ports for output extraction. Furthermore, an additional nW−1 read ports are required in each bank for feedback rewriting.
The data width of these read ports varies depending on the feedback method and the bank ID encoding. In this section, two bank ID encoding methods are presented: binary and thermometer. The binary method employs exclusive-OR functions to embed the bank IDs, while the thermometer method uses mutually-exclusive conditions to invalidate table entries and generate one-hot-coded bank selectors. The two methods are described in Section 3.2.1 and Section 3.2.2, respectively.

3.2.1 Bank ID Embedding: Binary-Coded Bank IDs and Selectors

This approach attempts to reduce the SRAM cell count in the I-LVT by employing binary-coded bank IDs. The special properties of the exclusive-OR function are utilized to embed the latest written bank ID, hence invalidating all other IDs. The ID of the currently written bank is XOR'ed with the content of all the other banks at the same write address, as described in the following feedback function:

f_fb,k = k ⊕ ( ⊕_{0 ≤ i < nW, i ≠ k} Bank_i[WAddr_k] ),    (3.1)

where k is the ID of the currently written bank.

Similar to the XOR-based method described in Section 2.2.3, the last written bank ID is extracted by XOR'ing the content of all the banks at the same read address, as described in the following output extraction function:

f_out,k = ⊕_{0 ≤ i < nW} Bank_i[RAddr_k].
(3.2)

Without loss of generality, if address A in bank k is written with the feedback function from Equation 3.1, then

Bank_k[A] = k ⊕ ( ⊕_{0 ≤ i < nW, i ≠ k} Bank_i[A] ).    (3.3)

If one of the read ports, say read port r, reads from the same address, namely RAddr_r = A, then the read bank selector is generated using the same output extraction function from Equation 3.2; hence,

RBankSel_r = ⊕_{0 ≤ i < nW} Bank_i[A].    (3.4)

Due to the associativity of the XOR operation, RBankSel_r from Equation 3.4 can be expressed as

RBankSel_r = Bank_k[A] ⊕ ( ⊕_{0 ≤ i < nW, i ≠ k} Bank_i[A] ).    (3.5)

Substituting Bank_k[A] from Equation 3.3 into Equation 3.5 gives

RBankSel_r = k ⊕ ( ⊕_{0 ≤ i < nW, i ≠ k} Bank_i[A] ) ⊕ ( ⊕_{0 ≤ i < nW, i ≠ k} Bank_i[A] ).    (3.6)

The last two series in Equation 3.6 cancel, revealing that RBankSel_r = k, the ID of the bank last written at address A, as required.

Figure 3.1: Generalized approach for building the I-LVT.

Figure 3.2 provides an example of a 2W/2R binary-coded I-LVT. As will become apparent in the next section, in the case of only 2 write ports, the binary-coded and thermometer-coded I-LVTs are identical.
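The cancellation argument of Equations 3.3–3.6 can be exercised with a behavioral model. The following Python sketch (arbitrary port count and depth; a model of the update rule, not the thesis's Verilog) checks that after any random write sequence, XOR-ing all I-LVT banks at an address regenerates the ID of the last writer.

```python
import random

def ilvt_binary_sim(n_w, depth, n_ops, seed=1):
    """Simulate the binary-coded I-LVT update (Eq. 3.1) and extraction (Eq. 3.2):
    XOR-ing all banks at an address yields the ID of the last-writing port."""
    rng = random.Random(seed)
    banks = [[0] * depth for _ in range(n_w)]   # one I-LVT bank per write port
    last_writer = {}                            # golden reference model
    for _ in range(n_ops):
        k, addr = rng.randrange(n_w), rng.randrange(depth)
        # feedback function f_fb,k: embed k into the XOR of all other banks
        fb = k
        for i in range(n_w):
            if i != k:
                fb ^= banks[i][addr]
        banks[k][addr] = fb
        last_writer[addr] = k
    # output extraction f_out: XOR of all banks regenerates the bank ID
    for addr, k in last_writer.items():
        sel = 0
        for i in range(n_w):
            sel ^= banks[i][addr]
        assert sel == k
    return True

assert ilvt_binary_sim(n_w=4, depth=64, n_ops=1000)
```

Note that each write reads the other nW−1 banks first, which is exactly the feedback-port requirement described above.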
Figure 3.3 shows a 3W/2R binary-coded I-LVT.

Figure 3.2: A 2W/2R SRAM-based I-LVT; identical for binary-coded and thermometer-coded bank IDs.

The required data width of the I-LVT SRAM blocks is ⌈log2 nW⌉. Also, nW multi-read banks are required, each with nR output ports for ID extraction and nW−1 feedback ports for ID rewriting. Hence, the number of required SRAM cells is

d·⌈log2 nW⌉·nW·(nW+nR−1).    (3.7)

Accordingly, the number of required M20K block RAMs is

nM20K(d, ⌈log2 nW⌉)·nW·(nW+nR−1).    (3.8)

Similarly, the number of registers required for retiming is

nW·(⌈log2 d⌉+1).    (3.9)

Figure 3.3: A 3W/2R SRAM-based I-LVT with binary-coded bank IDs.
Figure 3.4: A 3W/2R SRAM-based I-LVT with thermometer-coded bank IDs.

3.2.2 Mutually-Exclusive Conditions: Thermometer-Coded Bank IDs with One-Hot-Coded Selectors

The previous binary-coded I-LVT incurs a long path delay through the feedback and output extraction functions due to the nW-wide XOR gates used to generate them, which reduces performance in structures with more ports. Employing a thermometer ID encoding, on the other hand, reduces the feedback paths to at most a single inverter, compared to the nW-wide XOR used earlier.

Mutually-exclusive conditions are used to rewrite the RAM contents. A specific bank is written with data that contradicts all the other banks; hence, only this specific bank will be valid and all the others invalid. By checking the appropriate mutually-exclusive condition for each bank, only the latest written bank will hold valid data.

Equation 3.10, Equation 3.11 and Equation 3.12 describe mutually-exclusive feedback functions for nW values 2, 3 and 4, respectively. These feedback functions are also illustrated in Figure 3.5. The angle brackets in these equations are used for bit selection and concatenation, while the square brackets in other equations are used for RAM addressing. As can be seen from these equations, writing to one bank invalidates all the other banks at the same address, since one mutually negated bit is shared between every two lines.
For example, writing to bank 1 when nW = 3 (Equation 3.11) writes Bank1⟨0⟩ ← ¬Bank0⟨0⟩, which invalidates bank 0, and Bank1⟨1⟩ ← Bank2⟨1⟩, which invalidates bank 2.

nW = 2:
f_fb,0 : Bank0⟨0⟩ ← Bank1⟨0⟩
f_fb,1 : Bank1⟨0⟩ ← ¬Bank0⟨0⟩    (3.10)

nW = 3:
f_fb,0 : Bank0⟨1:0⟩ ← ⟨Bank2⟨0⟩, Bank1⟨0⟩⟩
f_fb,1 : Bank1⟨1:0⟩ ← ⟨Bank2⟨1⟩, ¬Bank0⟨0⟩⟩
f_fb,2 : Bank2⟨1:0⟩ ← ⟨¬Bank1⟨1⟩, ¬Bank0⟨1⟩⟩    (3.11)

nW = 4:
f_fb,0 : Bank0⟨2:0⟩ ← ⟨Bank3⟨0⟩, Bank2⟨0⟩, Bank1⟨0⟩⟩
f_fb,1 : Bank1⟨2:0⟩ ← ⟨Bank3⟨1⟩, Bank2⟨1⟩, ¬Bank0⟨0⟩⟩
f_fb,2 : Bank2⟨2:0⟩ ← ⟨Bank3⟨2⟩, ¬Bank1⟨1⟩, ¬Bank0⟨1⟩⟩
f_fb,3 : Bank3⟨2:0⟩ ← ⟨¬Bank2⟨2⟩, ¬Bank1⟨2⟩, ¬Bank0⟨2⟩⟩    (3.12)

Figure 3.5: Feedback updates for nW = 2, 3 and 4.

Equation 3.13 generalizes the feedback function to

f_fb,k⟨i⟩, 0 ≤ i < nW−1 :  Bank_k[WAddr_k]⟨i⟩ ← ¬Bank_i[WAddr_k]⟨k−1⟩ if i < k;  Bank_{i+1}[WAddr_k]⟨k⟩ otherwise.    (3.13)

This equation shows that each bank uses bits from all the other banks to write its own content. To prove that every two banks are mutually exclusive, one bit of these banks should be mutually negated. Suppose 0 ≤ k0 ≤ nW−1 is a bank ID and 0 ≤ i0 < nW−1 a bit index. From Equation 3.13, if i0 ≥ k0, then another bank ID k1 and bit index i1 exist such that Bank_{k0}⟨i0⟩ ← Bank_{k1}⟨i1⟩, with k1 = i0+1 and i1 = k0. Hence, i1 < k1, and from Equation 3.13, Bank_{k1}⟨i1⟩ ← ¬Bank_{k0}⟨i0⟩, as required. The proof for the case i0 < k0 is identical.

The output extraction function checks, for each one-hot output selector, whether the data read from a specific bank matches its mutually-exclusive case; due to the exclusivity, only one case will match. The output extraction function consists of an (nW−1)-bit-wide comparator for each one-hot selector.

An example of a 2W/2R thermometer-coded I-LVT is shown in Figure 3.2, while a 3W/2R thermometer-coded I-LVT is depicted in Figure 3.4.

The thermometer-coded I-LVT requires nW−1 SRAM bits to save the mutually-exclusive cases.
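The mutual-exclusivity argument can be checked exhaustively for a small nW with a short behavioral model. The Python sketch below follows the generalized feedback rule of Equation 3.13 (the bit taken from a lower-numbered bank is written negated, so every bank pair shares one mutually negated bit) and confirms that, from any starting state, exactly the last-written bank's condition holds.

```python
from itertools import product

def therm_feedback(banks, k):
    """Thermometer-coded I-LVT feedback (Eq. 3.13) for one address:
    rebuild bank k's (nW-1)-bit word from single bits of the other banks,
    negating the bits taken from lower-numbered banks."""
    n_w = len(banks)
    word = 0
    for i in range(n_w - 1):
        if i < k:
            bit = 1 ^ ((banks[i] >> (k - 1)) & 1)   # negated Bank_i<k-1>
        else:
            bit = (banks[i + 1] >> k) & 1           # Bank_{i+1}<k>
        word |= bit << i
    return word

def selected(banks, k):
    """Bank k's mutually-exclusive condition, i.e., the (nW-1)-bit comparator:
    its stored word equals what the feedback would produce right now."""
    return banks[k] == therm_feedback(banks, k)

# Exhaustive check for nW = 3: from any state, after any bank writes last,
# exactly that bank's condition holds (one-hot selector).
n_w = 3
for state in product(range(4), repeat=n_w):       # all 2-bit bank contents
    for k in range(n_w):
        banks = list(state)
        banks[k] = therm_feedback(banks, k)       # bank k writes last
        sels = [selected(banks, j) for j in range(n_w)]
        assert sels == [j == k for j in range(n_w)]
```

The exhaustive loop is the software analogue of the proof: each pair of banks shares one bit position, and the writer always leaves that pair in a state only its own comparator accepts.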
However, the feedback read ports require only one bit, since only one bit is used by the feedback function from each bank. nW multi-read banks are required, each with nR output ports for one-hot selector extraction and nW−1 feedback ports for mutually-exclusive case rewriting. Hence, the number of required SRAM cells is

d·(nW−1)·nW·nR + d·nW·(nW−1).    (3.14)

Thus, the number of required M20K block RAMs is

nM20K(d, nW−1)·nW·nR + nM20K(d, 1)·nW·(nW−1).    (3.15)

Similarly, the number of registers required for retiming is equal to the binary-coded case and is described by Equation 3.9.

3.2.3 Data Dependencies and Bypassing

The new I-LVT structure and the previous XOR-based multi-ported RAMs incur some data dependencies due to the feedback functions and the latency of reading the I-LVT to decide on the last-written bank. Data dependencies can be handled by employing bypassing, also known as forwarding.

Figure 3.6 illustrates two types of bypassing based on write data and address pipelining. Bypassing is necessary because dual-ported block RAMs in FPGAs cannot internally forward new data when one port reads and the other port writes the same address on the same clock edge, constituting a read-during-write (RDW) hazard. Both bypassing techniques are functionally equivalent, allowing the data being written to be read on the same clock edge, similar to the functionality of a single register. However, the fully-pipelined two-stage bypassing shown in Figure 3.6 (right) can overcome an additional cycle of latency.
This capability is required if a block RAM has pipelined inputs (e.g., cascaded from another block RAM) that need to be bypassed.

Figure 3.6: RAM bypassing: (left) single-stage; (right) two-stage, fully pipelined.

The single-stage and two-stage bypass circuitry for a block RAM with a w-bit data width and a d-line depth requires w registers for data bypassing, two ⌈log2 d⌉-wide address registers, and one enable register, for a total of

nBypReg(d, w) = w + 2⌈log2 d⌉ + 1.    (3.16)

The most severe data dependency the I-LVT design suffers from is Write-After-Write (WAW), namely, writing to the same address that was written in the previous cycle. This dependency occurs because of the feedback read and write latency. A single-stage bypass on the feedback banks resolves this dependency.

Two types of reading hazards are introduced by the proposed I-LVT design: Read-After-Write (RAW) and Read-During-Write (RDW). RAW occurs when the data written on the previous clock edge are read on the current clock edge. RDW occurs when the same data are written and read on the same clock edge.

Due to the latency of the I-LVT, reading from the same address on the next clock edge after writing (RAW) provides the old data. To read the new data instead, the output banks of the I-LVT should be bypassed by a single-stage bypass to overcome the I-LVT latency.

The deepest bypassing stage is reading new data on the same writing clock edge (RDW), which is similar to a single register stage latency. This can be achieved by two-stage bypassing on the output extraction ports of the I-LVT or the XOR-based design, allowing reading on the same clock edge. The data banks, which work in parallel with the I-LVT, should also be bypassed by a single-stage bypass to provide the new data.
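Equation 3.16 and the bank counts above make the bypass overheads easy to compare numerically. In the Python sketch below (the parameter values are illustrative), the XOR-based design must bypass all nW·(nW+nR−1) w-bit-wide banks for RAW, while the binary-coded I-LVT bypasses the same number of banks at only ⌈log2 nW⌉ bits each.

```python
from math import ceil, log2

def n_byp_reg(d, w):
    """Per-bank bypass circuitry cost (Equation 3.16):
    w data registers, two ceil(log2 d)-wide address registers, one enable."""
    return w + 2 * ceil(log2(d)) + 1

def raw_bypass_regs(d, bank_width, n_w, n_r):
    """RAW bypass cost over n_W*(n_W+n_R-1) banks of the given width."""
    return n_w * (n_w + n_r - 1) * n_byp_reg(d, bank_width)

# XOR-based: every bank is w bits wide; binary I-LVT: only ceil(log2 n_W) bits.
d, w, n_w, n_r = 16384, 32, 4, 6
xor_cost  = raw_bypass_regs(d, w, n_w, n_r)
ilvt_cost = raw_bypass_regs(d, ceil(log2(n_w)), n_w, n_r)
assert n_byp_reg(d, w) == 32 + 2 * 14 + 1   # 61 registers per bypassed bank
assert ilvt_cost < xor_cost
```

The narrower I-LVT banks are why the I-LVT-based designs win on bypass registers for realistic data widths.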
Table 3.1 summarizes the required bypassing of the data banks, feedback banks, and output banks for each bypassing type of the XOR-based, binary-coded, and thermometer-coded I-LVT designs.

Since the XOR-based multi-ported RAM requires bypassing all nW·(nW+nR−1) banks to read new data on RAW or RDW, the additional registers required for bypassing are

nW·(nW+nR−1)·nBypReg(d, w).    (3.17)

RAW bypassing for the binary-coded method requires bypassing the I-LVT only. Since the I-LVT is built out of nW·(nW+nR−1) blocks, each with a ⌈log2 nW⌉-bit data width, the following number of additional registers is required:

nW·(nW+nR−1)·nBypReg(d, ⌈log2 nW⌉).    (3.18)

RAW bypassing for the thermometer-coded method requires bypassing the whole I-LVT: nW·(nW−1) feedback banks of 1-bit width and nW·nR output banks of (nW−1)-bit width, hence a total of

nW·(nW−1)·nBypReg(d, 1) + nW·nR·nBypReg(d, nW−1)    (3.19)

registers.

RDW bypassing for both the binary- and thermometer-coded methods requires bypassing the nW·nR data banks in addition to the I-LVT; hence, the following number of registers is added to the counts in Equation 3.18 and Equation 3.19:

nW·nR·nBypReg(d, w).    (3.20)

3.2.4 Initializing Multi-Ported RAM Content

Some applications require initializing the RAM content to a specific value on power-up, to enable processing of this initial data during runtime. Dual-ported BRAMs (e.g., Altera's M20K BRAMs [55]) allow initialization with specific content on power-up.
However, initializing multi-ported memories requires special handling.

For the XOR-based multi-ported RAM, the first multi-read bank should be initialized to the required initial content; all the other multi-read banks should be initialized to zero.

On the other hand, LVT-based multi-ported memories require storing the initial content in a specific data bank (e.g., bank 0), then initializing the LVT to that bank ID (zeros), which points to the location of the initial data.

Considering that the initial data are stored in data bank 0, the I-LVT-based multi-ported RAM requires initializing all the I-LVT banks with zeros. The binary-coded I-LVT will generate a selector to the first data bank (indexed zero), since XOR'ing all the initial values (zeros) generates zero. Similarly, the thermometer-coded I-LVT will be initialized to the first mutually-exclusive case; hence, the first bank will be selected. Only the first bank holds the initial data; the remaining data banks are left uninitialized. The initial values of each bank in the binary/thermometer-coded I-LVT and XOR-based designs are shown in Figure 3.7.

Figure 3.7: Initial values for (left) I-LVT-based and (right) XOR-based designs. (0: zeros, I: initial content, U: uninitialized.)

3.2.5 Comparison and Discussion

In this section, we compare the SRAM and register consumption of our proposed approaches with previous approaches based on RAM architecture and port functionality.
A usage guideline based on the required RAM parameters is also provided. These analytical results are in agreement with the experimental results in Section 3.3.

3.2.5.1 SRAM Usage Based on RAM Architecture

In this section, we compare the previous LVT and XOR approaches with the new I-LVT approaches for building multi-ported memories. Using the equations provided, we illustrate why the I-LVT approach is superior in terms of the number of BRAMs and the number of registers required. Also, between the two proposed I-LVT methods, we inspect the number of BRAMs and registers used with each bypassing method.

Table 3.2 summarizes the SRAM resource usage of each of the three multi-ported RAM approaches: the XOR-based and the binary/thermometer-coded I-LVT. Both the general SRAM cell count and the number of Altera M20K blocks are given. Comparing the SRAM cell counts, the XOR-based approach consumes fewer SRAM cells than the thermometer-coded I-LVT if

w < nR + 1.    (3.21)

This inequality is unlikely to be satisfied, since for a single-byte data width the number of reading ports nR would need to be larger than 8, which is very rare except for systems with multiple requesters requiring concurrent access to a few bits of data (e.g., a mutex or mailbox system). Hence, for typical use cases, the thermometer-coded I-LVT approach consumes fewer SRAM cells.

Comparing the XOR-based approach to the binary-coded approach, the XOR-based approach consumes fewer SRAM cells only if

w < ⌈log2 nW⌉·(nW+nR−1)/(nW−1),  for nW > 1.    (3.22)

Both Equation 3.21 and Equation 3.22 show that the XOR-based approach consumes fewer SRAM cells only for very narrow data widths, which are uncommon. Hence, the I-LVT approach is the better choice for most applications. Comparing the two I-LVT approaches, Table 3.2 shows that the thermometer-coded I-LVT consumes fewer SRAM cells than the binary-coded I-LVT if

1 < nW ≤ 3  OR  nR < ((nW−1)·(⌈log2 nW⌉−1)) / ((nW−1)−⌈log2 nW⌉),  for nW > 3.
(3.23)

Figure 3.8: Multi-ported RAM usage guideline. α is determined by the minimum of Equation 3.21 and Equation 3.22, while the trend f(nW) is determined by Equation 3.23.

Figure 3.8 illustrates a guideline for choosing a multi-ported RAM architecture based on area efficiency. For shallow memories, register-based RAM and LVT offer the best area efficiency. However, their area inflates rapidly as the RAM grows deeper. The usage of BRAMs alleviates the problem, since SRAM-based memories have higher capacities than register-based memories. The XOR-based method is suitable for memories narrower than α, which is determined by the minimum of Equation 3.21 and Equation 3.22. Choosing between the binary-coded and thermometer-coded I-LVT is based on nW and nR only and is determined by Equation 3.23.

3.2.5.2 Register Usage Based on RAM Architecture

Table 3.4 summarizes the register usage of all multi-ported RAM architectures and bypassing types. Only the register-based LVT architecture is directly proportional to the memory depth. As a consequence, it consumes far more registers than the other architectures, making register-based LVTs impractical for deep memories.

With single-stage bypassing, the XOR-based design consumes fewer registers than the binary-coded I-LVT if

w < ⌈log2 nW⌉.    (3.24)

Equation 3.24 is unlikely to be satisfied: even if the data width is just one byte (w = 8), the number of write ports nW would need to be larger than 256, which is impractical.

On the other hand, with a single-stage bypass, the XOR-based design consumes fewer registers than the thermometer-coded I-LVT design if

w < (1+nR) / (1 + nR/(nW−1)),  for nW > 1.    (3.25)

In a typical compute-oriented design, nR = 2·nW. Assuming nR = 2·(nW−1), satisfying Equation 3.25 requires 3·w−1 < nR; even for a one-byte data width, this requires 23 < nR, which is impractical.
Therefore, for a single-stage bypass, the I-LVT-based designs consume fewer registers than the XOR-based design.

Considering two-stage bypassing, the I-LVT-based designs consume nW·nR·nBypReg(d, w) more registers, as described in Equation 3.20. In this case, the XOR-based design consumes fewer registers than the binary-coded I-LVT design only if

w < ⌈log2 nW⌉·(1 + nR/(nW−1)).    (3.26)

On the other hand, the XOR-based design consumes fewer registers than the thermometer-coded I-LVT design only if

w < nR + 1.    (3.27)

Equation 3.27 is identical to Equation 3.21 and, similarly, is unlikely to be satisfied in practical designs. Hence, in the case of two-stage bypassing, the I-LVT-based designs consume fewer bypassing registers than the XOR-based method.

In the next sections, we show that these analytical results are in agreement with the experimental results.

Table 3.2: Summary of SRAM bit usage.

                        | Data banks        | LVT feedback banks     | LVT output banks
Register-based LVT      | d·w·nW·nR         | N/A                    | N/A
XOR-based               | d·w·nW·(nR+nW−1)  | N/A                    | N/A
Binary-coded I-LVT      | d·w·nW·nR         | d·⌈log2 nW⌉·nW·(nW−1)  | d·⌈log2 nW⌉·nW·nR
Thermometer-coded I-LVT | d·w·nW·nR         | d·nW·(nW−1)            | d·(nW−1)·nW·nR

Table 3.3: Summary of M20K block usage.

                        | Data banks              | LVT feedback banks            | LVT output banks
Register-based LVT      | nM20K(d,w)·nW·nR        | N/A                           | N/A
XOR-based               | nM20K(d,w)·nW·(nR+nW−1) | N/A                           | N/A
Binary-coded I-LVT      | nM20K(d,w)·nW·nR        | nM20K(d,⌈log2 nW⌉)·nW·(nW−1)  | nM20K(d,⌈log2 nW⌉)·nW·nR
Thermometer-coded I-LVT | nM20K(d,w)·nW·nR        | nM20K(d,1)·nW·(nW−1)          | nM20K(d,nW−1)·nW·nR

Table 3.4: Summary of register usage.

                        | No bypass            | Additional registers for RAW bypass           | Additional for RDW
Register-based LVT      | d·⌈log2 nW⌉          | None                                          | None
XOR-based               | nW·(w+⌈log2 d⌉+1)    | nW·(nR+nW−1)·nBypReg(d,w)                     | None
Binary-coded I-LVT      | nW·(⌈log2 d⌉+1)      | nW·(nR+nW−1)·nBypReg(d,⌈log2 nW⌉)             | nW·nR·nBypReg(d,w)
Thermometer-coded I-LVT | Same as binary-coded | nW·((nW−1)·nBypReg(d,1)+nR·nBypReg(d,nW−1))   | Same as binary-coded

3.3 Experimental Results

In order to verify and simulate the suggested approach and compare it to previous attempts, fully
parameterized Verilog modules have been developed. Both the previous XOR-based multi-ported RAM method and the proposed I-LVT method have been implemented. To simulate and synthesize these designs with various parameters in batch using Altera's ModelSim version 10.1e and Quartus II version 14.0, a run-in-batch flow manager has also been developed. The Verilog modules and the flow manager are available online [105]. To verify correctness, the proposed architecture is simulated using Altera's ModelSim. A large variety of different memory architectures and parameters (e.g., bypassing, memory depth, data width, and number of ports) are swept and simulated in batch, each with over one million random memory access cycles.

All multi-ported design modules are implemented using Altera's Quartus II on Altera's Stratix V 5SGXMA5N1F45C1 device. This is a high-performance device with 185k ALMs, 2304 M20K blocks and 1064 I/O pins. We performed a general sweep and tested all combinations of the following parameters:

• Writing ports (nW): 2, 3 and 4 writing ports.
• Reading ports (nR): 3, 4, 5 and 6 reading ports.
• Memory depth (d): 16 and 32 K-lines.
• Data width (w): 8, 16, and 32 bits.
• Bypassing: no bypassing, RAW and RDW.

Following this, we analyze the full set of results. In this section, we omit many of the in-between settings because they behaved as one might expect to see via interpolation of the endpoints.

Figure 3.9 plots the maximum frequency derived from Altera's Quartus II STA at 0.9V and a temperature of 0°C. The results show a higher Fmax for binary/thermometer-coded I-LVT compared to the XOR-based approach for all design cases. With three or more writing ports, the thermometer-coded I-LVT supports a higher frequency compared to all other design styles.
Compared to the XOR-based approach, the thermometer-coded I-LVT improves Fmax by 38% on average over all design configurations, while the maximum Fmax improvement is 76%.

Figure 3.10 (top) plots the number of Altera's M20K blocks used to implement each multi-ported RAM configuration. The proposed binary/thermometer-coded I-LVT consumes the fewest BRAM blocks in all cases. The average reduction of the best of binary/thermometer-coded I-LVT compared to the XOR-based approach is 19% over all tested design configurations, and it can reach 44% for specific configurations. The difference in consumed Altera M20Ks between binary-coded I-LVT and thermometer-coded I-LVT is less than 6%. To clarify the difference in BRAM consumption, Figure 3.10 (bottom) shows the percentage of BRAM overhead above the register-based LVT, which uses the fewest possible BRAMs overall. The XOR-based design consumes more BRAMs in all cases, up to twice the BRAMs of the register-based LVT. On the other hand, I-LVT-based methods consume only 12.5% more BRAMs in the case of 32-bit wide memories.

Figure 3.11 shows the number of ALMs consumed by each design with different bypassing methods. New-data RAW bypassing consumes more ALMs than the non-bypassed version due to address comparators and data muxes. On the other hand, new-data RDW bypassing requires an additional address comparator and a wider mux; hence it consumes more ALMs than a new-data RAW bypass. In all bypass modes, as the memory data width grows, the XOR-based method consumes more ALMs than the I-LVT methods due to wider XOR gates.

The number of registers required for the various designs and bypassing styles is shown in Figure 3.12. The I-LVT-based methods consume fewer registers than the XOR-based method for no bypassing or new-data RAW bypass. For new-data RDW bypass, the I-LVT based methods must bypass the data banks, hence the register consumption goes higher than the XOR-based method.
However, the register consumption of the register-based LVT method is the highest overall and can be three orders of magnitude higher since it is directly proportional to memory depth. Furthermore, register-based LVT memories with 4 write ports and over 16k entries failed to synthesize on our Stratix V with 185k ALMs.

Since the register-based LVT approach is not feasible with the provided deep-memory test cases, the register-based LVT trends are derived analytically from Table 3.2, Table 3.3 and Table 3.4 and not from experimental results. Hence, the register-based LVT trend was added as a reference baseline to Figure 3.10 and Figure 3.12 only.

Figure 3.9: Fmax (MHz) at T = 0°C: (top) no bypass; (bottom) new-data RDW bypass.

Figure 3.10: M20K blocks: (top) total count; (bottom) overhead percentage relative to register-based LVT.
Figure 3.11: ALMs: (top) no bypass; (bottom) new-data RDW bypass.

Figure 3.12: Registers count: (top) no bypass; (bottom) new-data RDW bypass.

3.4 Conclusions

In this chapter, we have proposed the use of an invalidation-based live-value table, or I-LVT, to build modular SRAM-based multi-ported memories. The Invalidation-Based Live-Value Table (I-LVT) is used to determine the most recently written data bank for each address. This generalizes and replaces two prior techniques, the LVT and XOR-based approaches. A general I-LVT is described, along with two specific implementations: binary-coded and thermometer-coded. Both methods are purely SRAM based; hence they scale efficiently with required memory depth. A detailed analysis and comparison of resource consumption of the suggested methods and previous methods is provided. The original LVT approach can use an infeasible number of registers. In contrast, the I-LVT register usage is not directly proportional to memory depth; hence it requires orders of magnitude fewer registers.
Furthermore, the proposed I-LVT method can reduce BRAM consumption by up to 44% and improve Fmax by up to 76% compared to the previous XOR-based approach. The thermometer-coded I-LVT method exhibits the highest Fmax, while keeping BRAM consumption within 6% of the minimal required BRAM count. Meanwhile, the binary-coded I-LVT uses fewer BRAMs than the one-hot-coded variant when there are more than 3 write ports. With this information we develop a guideline for choosing the most area-efficient approach. Generally, the past approaches of XOR and LVT are only recommended for narrow data widths or shallow depths, respectively. In all other cases, the new I-LVT approaches are superior. A fully parameterized and generic Verilog implementation of the suggested methods is provided as open-source hardware [105].

Chapter 4
Multi-Ported Random Access Memories with Switched Ports

Recent attempts to create FPGA-based multi-ported memories suffer from low storage utilization. While most approaches provide simple unidirectional ports with a fixed read or write function, others propose true bidirectional ports where each port dynamically switches between read and write. True RAM ports are useful for systems with transceivers and provide high RAM flexibility; however, this flexibility incurs high BRAM consumption. In this chapter, a novel, modular, BRAM-based switched multi-ported RAM architecture is proposed. In addition to unidirectional ports with fixed read/write, this switched architecture allows a group of write ports to switch with another group of read ports dynamically, hence altering the number of active ports. Switched ports are a generalization of true ports, where a certain number of write ports can be dynamically switched into a possibly different number of read ports using one common read/write control signal.
While a true port is made up of a single read port and a single write port sharing a single read/write control, switched ports are best described as a set. Furthermore, a given application may have multiple sets, each set with a different read/write control. While previous work generates multi-port RAM solutions that contain only true ports or only simple ports, we contend that using only these two models is too limiting and prevents optimizations from being applied. The proposed switched-ports architecture is less flexible than a true multi-ported RAM where each port is switched individually. Nevertheless, switched memories can dramatically reduce BRAM consumption compared to true ports for systems with alternating port requirements. The I-LVT can be used with fixed-ports, true-ports or the proposed switched-ports architectures.

Most previous work has focused on the design of the LVT to reduce area and improve performance. In this chapter, we instead reduce area by optimizing the design of the data banks portion. The optimization is embedded into a memory compiler that solves a set cover problem; the optimization can only reduce the area of the data banks and never inflate it. When the set cover problem is solved optimally, the data banks use minimum area. Experimental results on 10 random instances of multi-port RAMs show an 18% BRAM reduction on average compared to the best of other approaches. In our experiments, the optimization is always able to find an optimal cover, and this results in minimum area for the data banks. Formal proofs for the suggested methods, a resource consumption analysis, a usage guideline and an analytic comparison to other methods are provided. The compiler and a fully parameterized Verilog implementation are released as an open-source library. The library has been extensively tested using Altera's EDA tools.

The rest of this chapter is organized as follows. In Section 4.1, a RAM port classification is provided.
Multi-ported memories with a single switched port are proposed in Section 4.2. In Section 4.3, the switched-ports approach is generalized to support multi-switched-ports; a multi-switched-ports memory compiler that automates the generation of switched ports based on the user's requirements is also described in detail in Section 4.3. Conclusions are drawn in Section 4.4.

4.1 RAM Port Classification

As described in Figure 4.1 (top), RAM ports can be classified into two categories: fixed ports and switched ports.

A fixed port is either a simple read port or a simple write port; the activity of any write is not switched with any other read. Hence, for a set of fixed ports, all writes and reads in the set can be concurrently active.

A true dual-port, or simply a true port, is a single port that can perform either a read or a write under the control of a single read/write line. A true port is often drawn as two distinct data ports (one for read, one for write), but they share address lines in common. It is possible (but uncommon) to do a read at the same time as a write to the same address, but the resulting data are implementation-specific.

A switched port is a collection or set of read ports and write ports. The number of read and write ports may be different. The read-address lines are usually distinct from the write-address lines. However, the entire set of ports shares a single read/write line that controls whether the write ports are active or the read ports are

Figure 4.1: RAM port classification: (top) Venn diagram.
(bottom) Symbols and port assignment.

active. Reads and writes cannot be simultaneously active. Note that a true port is a special case of a switched port consisting of a single read and a single write, where the address lines are shared. A given application may have multiple switched ports, each with an independent read/write line.

Figure 4.1 (bottom) shows symbols and black-box connectivity for these different types of RAM ports. The switched-port read/write activity is controlled by a shared R/W control signal. Figure 4.2 shows the two modes of a switched port. The first is the write mode, where R/W = 0: writes are active and reads are inactive. The second is the read mode, where R/W = 1: reads are active and writes are inactive.

Figure 4.2: Switched-port modes; faint ports are inactive.

4.2 Multi-Ported Memories with Single Switched-Port

In this section, a novel, modular and parametric switched multi-ported RAM is constructed out of basic dual-ported BRAMs while keeping minimal area and performance overhead. The proposed method provides a modular architecture that supports mixed simple/switched port requirements and significantly reduces BRAM consumption and improves performance compared to previous attempts. Despite being less flexible than true RAM ports, switched ports dramatically reduce BRAM consumption if mixed ports are required. The suggested switched data banks employ an SRAM-based invalidation live-value table (I-LVT) [1] to track the latest written data bank for each RAM address; hence this architecture is purely SRAM-based. To verify correctness, the proposed architecture is fully implemented in Verilog, simulated using Altera's ModelSim, and compiled using Quartus II.
A large variety of different architectures and parameters (e.g., bypassing, memory depth, width and number of ports) are simulated in batch, each with over a million random memory cycles. Stratix V, Altera's high-end performance-oriented FPGA, is used to implement and compare the proposed approach with previous techniques. In addition to the suggested switched multi-ported RAM architecture, the major contributions of this chapter are:

• A bypassing circuitry for both simple (unidirectional) and true (bidirectional) ports. The bypassing circuit is capable of producing new data for Read-After-Write (RAW) and Read-During-Write (RDW) data dependencies.
• A detailed analytic comparison of the proposed and previous methods. A guideline for choosing the most efficient architecture based on design parameters is also provided.
• A fully parameterized Verilog implementation of the suggested methods, together with previous approaches. A flow manager to simulate and synthesize designs with various parameters in batch using Altera's ModelSim and Quartus II is also provided. The Verilog modules and the flow manager are available online [106].

The rest of this section is organized as follows. In Section 4.2.1, multi-ported memories with a single switched port are discussed. Bypassing techniques for these multi-ported memories are discussed in Section 4.2.1.1, whereas Section 4.2.1.2 provides an analysis of BRAM consumption based on port functionality. The experimental framework, including simulation and synthesis results, is discussed in Section 4.2.2.

4.2.1 Single Switched-Port Support

The I-LVT multi-ported RAM, similar to the previous register-based LVT [11] and XOR [12] cancellation methods, offers a fixed number of simple writing and reading ports. However, the vast majority of computation applications use different numbers of reading and writing ports in each computation cycle. On the other hand, the Choi et al.
multi-ported RAM architecture supports bidirectional true ports only [56]; this excessive port flexibility incurs high BRAM consumption, especially for memories with mixed simple/true-port requirements.

For instance, Figure 4.3 (left) provides an example of a CPU–RAM pairing where the CPU operations are mutually exclusive; namely, only one operation can be active in a single cycle. The mutually exclusive operations in Figure 4.3 (left) are: f with 3 operands and 3 results, and g with 6 operands and a single result. When f is active, 3 RAM writes and 3 RAM reads are required, while a single RAM write and 6 RAM reads are required when g is active. At any given cycle, a maximum of 3 writes and 6 reads are required; hence using a multi-ported RAM with fixed ports requires 3 writing ports (nW = 3) and 6 reading ports (nR = 6). On the other hand, true ports can be configured as writes or reads, hence using true ports requires the maximum of total writes and reads at any given cycle, namely 7 true ports (nt = 7). Since these RAM ports will not be active in the same cycle, a multi-ported RAM with switched write/read ports as illustrated in Figure 4.3 (right) can be used to reduce SRAM consumption.

The configurability of true dual-ported BRAMs is utilized to construct RAM ports with exchangeable read and write functionality. The proposed switched-ports architecture has two sets of ports. The first set is nR,f and nW,f simple (fixed) read and write ports, respectively. As their name suggests, these are fixed simple unidirectional ports. The second set is nR,s and nW,s switched read and write ports, respectively. The functionality of these ports alternates dynamically at runtime in two modes, read and write, as follows. If the write mode is chosen, the nW,s switched write ports are active, while the nR,s read ports are inactive. On the other hand, if the read mode is chosen, the nR,s switched read ports are active, while the nW,s write ports are inactive.
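The port-count arithmetic for mutually exclusive operations can be sketched in a few lines of Python (our own illustration; the operation list mirrors the f/g example above, and the helper name is ours):

```python
def port_requirements(ops):
    """ops: list of (writes, reads) pairs, one per mutually exclusive operation."""
    n_w = max(w for w, r in ops)      # fixed ports: worst-case writes in any cycle
    n_r = max(r for w, r in ops)      # fixed ports: worst-case reads in any cycle
    n_t = max(w + r for w, r in ops)  # true ports: worst-case total accesses
    return n_w, n_r, n_t

# f: 3 operands written back as 3 results -> (3 writes, 3 reads)
# g: 6 operands in, 1 result out         -> (1 write, 6 reads)
ops = [(3, 3), (1, 6)]
n_w, n_r, n_t = port_requirements(ops)
print(n_w, n_r, n_t)  # 3 6 7
```

As the sketch shows, a fixed-ports RAM must provision nW = 3 and nR = 6 (nine ports in total), while a true-ports RAM needs 7 true ports; a switched-ports RAM can exploit the mutual exclusion to do better.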
The suggested architecture reconfigures dual-ported BRAM write ports into reads when more reads and fewer writes are required (read mode). As depicted in Figure 4.5 (right), dual-ported BRAMs in switched banks are replicated nR,f times to provide nR,f reads in write mode. Each instance of the nR,f replicas reconfigures its write port into a read port (in addition to the other read port) in the read mode. Hence, up to nR,f additional switched reads can be generated, namely nR,s ≤ nR,f.

The key idea behind the SRAM savings is reconfiguring unused writing ports into reading ports. Figure 4.5 illustrates the suggested architecture. Only data banks whose writing ports are unused in read mode are altered, namely nW,s banks. The writing ports of each of these banks are redirected to serve as reading ports in read mode, as depicted in Figure 4.5 (lower right). The other banks, which keep their writing ports in read mode (nW,f banks), must increase the number of reading ports to nR,f + nR,s to match the read-port requirement in read mode, as described in Figure 4.5 (upper right). Compared to a fixed-ports multi-ported RAM with the maximum number of writing ports nW = nW,f + nW,s and the maximum number of reading ports nR = nR,f + nR,s, the proposed switched architecture reduces the number of data banks by nW,s · nR,s. Hence, this reduces the number of BRAMs by

nW,s · nR,s · nM20K(d,w). (4.1)

Figure 4.4 describes an example of a switched multi-ported RAM with (nW,f, nR,f) = (1,3) fixed ports and (nW,s, nR,s) = (2,3) switched ports. Therefore, in read mode, there is only one write port and double the fixed read ports. Figure 4.4 (left) shows the write-mode configuration, while Figure 4.4 (right) shows the read-mode configuration. In this example, the upper multi-read bank keeps the minimal single write operation, while the other banks sacrifice write ports to provide additional read ports.
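As a sanity check on Equation 4.1, the following Python sketch counts saved M20K blocks. It assumes M20Ks are used in a 512-deep by 32-wide configuration (one valid Stratix V mode for 32-bit data); `n_m20k` and `bram_savings` are our own helper names:

```python
from math import ceil

def n_m20k(d, w):
    # Simplified block count: M20Ks used in 512-deep x 32-wide mode.
    # A d x w memory then needs ceil(d/512) * ceil(w/32) blocks.
    return ceil(d / 512) * ceil(w / 32)

def bram_savings(n_ws, n_rs, d, w):
    # Eq. 4.1: switched data banks save nW,s * nR,s * nM20K(d, w) BRAMs
    # relative to a fixed-ports multi-ported RAM.
    return n_ws * n_rs * n_m20k(d, w)

# Figure 4.4 example: (nW,s, nR,s) = (2, 3), d = 32k lines, w = 32 bits
print(bram_savings(2, 3, 32 * 1024, 32))  # 384 blocks saved
```

With these assumptions the sketch reproduces the 384-block saving quoted for the Figure 4.4 example: nM20K(32k, 32) = 64, and 2 · 3 · 64 = 384.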
According to Equation 4.1, if the switched RAM in Figure 4.4 is 32 bits wide (w = 32) and 32 k-lines deep (d = 32k), 384 BRAM blocks are saved compared to a fixed-ports RAM.

Figure 4.3: (left) Switched-ports application example: CPU and multi-ported RAM pairing. (right) Symbol of the required switched-ports RAM.

Figure 4.4: Switched-ports example with (nW,f, nR,f) = (1,3) and (nW,s, nR,s) = (2,3): (left) write configuration; (right) read configuration.

Figure 4.5: (left) Switched-ports architecture. Data banks: (upper right) simple, and (lower right) switched.

4.2.1.1 Data Dependencies and Bypassing

The switched multi-ported RAM described in Section 4.2.1 utilizes true dual-port BRAMs to switch port functionality.
However, since writing and reading operations in true dual-ported RAMs are exchangeable, the bypassing circuitry requires special handling. As described in Table 4.1 (right), the bypass circuit of the bidirectional RAM is mirrored compared to the unidirectional RAM. Thus, it can bypass written data from either direction. However, the control logic that drives the bypassing mux selectors needs to be altered to detect the direction of writing. Since the bidirectional bypassing circuit is a mirroring of the unidirectional bypassing circuit, it requires twice the registers used for the unidirectional bypass circuit, hence

nBReg,bidirectional(d,w) = 2 · nBReg,unidirectional(d,w). (4.2)

Table 4.1: Single-stage and two-stage bypassing circuits for simple (unidirectional) and true (bidirectional) dual-port RAM.

4.2.1.2 SRAM Usage based on Port Functionality

The previous analysis provides a guideline for using XOR, LVT, binary-coded or thermometer-coded I-LVT. However, a guideline for using simple, true, or switched port architectures, as illustrated in Figure 4.1 and Figure 4.2, is also required. A multi-ported RAM with the following mixed port requirements is used for comparison: (1) nW,f write / nR,f read simple (fixed) ports, (2) nt true ports, and (3) nW,s write / nR,s read switched ports. To implement the mixed-port multi-ported RAM using the different port architectures, the following transformations are required: (1) A true port can be emulated as two simple ports, a write and a read sharing the same address. (2) A switched port can be emulated as nW,s simple write ports and nR,s simple read ports, or as max(nW,s, nR,s) true ports.
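These emulation transformations can be expressed as a small Python sketch (our own illustration; the function names and the example port mix are ours, not from the thesis):

```python
def emulate_with_simple(n_wf, n_rf, n_t, n_ws, n_rs):
    # Each true port becomes one simple write + one simple read;
    # a switched port becomes nW,s simple writes + nR,s simple reads.
    writes = n_wf + n_t + n_ws
    reads = n_rf + n_t + n_rs
    return writes, reads

def emulate_with_true(n_wf, n_rf, n_t, n_ws, n_rs):
    # Each fixed port occupies one true port; a switched port needs
    # max(nW,s, nR,s) true ports.
    return n_wf + n_rf + n_t + max(n_ws, n_rs)

# Hypothetical mixed requirement: 1W/2R fixed, 1 true port, 2W/3R switched
print(emulate_with_simple(1, 2, 1, 2, 3))  # (4, 6) simple ports
print(emulate_with_true(1, 2, 1, 2, 3))    # 7 true ports
```

The resulting equivalent port counts feed directly into the per-architecture M20K formulas given next.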
The M20K count of the mixed-port RAM for each port architecture is given by:

Simple:   nM20K(d,w) · (nR,f + nt + nR,s) · (nW,f + nt + nW,s)
True:     nM20K(d,w) · na·(na − 1)/2, where na = nW,f + nR,f + nt + max(nW,s, nR,s)
Switched: nM20K(d,w) · ((nW,f + nt + nW,s)·(nR,f + nt) + (nW,f + nt)·nR,s) (4.3)

To simplify the comparison, we assume that the number of read ports is twice the number of write ports for both simple and switched ports, hence nR,f = 2·nW,f and nR,s = 2·nW,s. Figure 4.6 shows a linear approximation of the BRAM consumption equilibrium. These plots form guidelines for choosing the most area-efficient architecture. Figure 4.6 (left) shows that the true-ports architecture is more efficient in BRAM usage only if the number of true ports is more than √5 times the simple write-ports count. On the other hand, Figure 4.6 (right) shows that the true-ports architecture is more efficient only if the number of true ports is more than √5·nW,f + (√5 − 1)·nW,s.

Figure 4.6: Port architecture usage guideline.

Table 4.2: Resource consumption for a 4W/8R multi-ported RAM test case with 8k entries of 32-bit words and new-data RDW bypassing.
(Each LVT-based column lists Simple / Switched values and the percentage change of the switched variant from the XOR-based baseline.)

Metric | XOR-based | Register-based LVT | Binary-coded I-LVT | Thermometer-coded I-LVT
M20Ks | 704 | 512 / 384 / −45.5% | 556 / 428 / −39.2% | 588 / 460 / −34.7%
Registers | 2781 | 18288 / 17816 / +540.6% | 3220 / 2748 / −1.2% | 3240 / 2768 / −0.5%
ALMs | 2010 | 61662 / 61333 / +2951.4% | 1454 / 1290 / −35.8% | 1604 / 1445 / −28.1%
Fmax (MHz) | 270 | 213 / 213 / −21.1% | 325 / 338 / +25.2% | 390 / 388 / +43.7%

Table 4.3: Register consumption for a 4W/8R multi-ported RAM test case with 8k entries of 32-bit words. (Columns as in Table 4.2.)

Bypass mode | XOR-based | Register-based LVT | Binary-coded I-LVT | Thermometer-coded I-LVT
No bypass | 184 | 16400 / 16400 / +8813.0% | 56 / 56 / −69.6% | 56 / 56 / −69.6%
Allow WAW | 892 | 16400 / 16400 / +1738.6% | 404 / 404 / −54.7% | 392 / 392 / −56.1%
New-data RAW | 2780 | 16400 / 16400 / +489.9% | 1332 / 1307 / −53.0% | 1352 / 1352 / −51.4%
New-data RDW | 2781 | 18288 / 17816 / +540.6% | 3220 / 2748 / −1.2% | 3240 / 2768 / −0.5%

4.2.2 Experimental Results

The switched multi-ported RAM is demonstrated with several design cases and nominal parameters of d = 32K and w = 32. The total number of write ports is fixed to 3 (nW = nW,s + nW,f = 3), while the number of switched write ports is swept over 1, 2 and 3 ports (nW,s = 1, 2, 3); hence the number of simple write ports is swept over 2, 1 and 0 (nW,f = 2, 1, 0), respectively. On the other hand, the number of simple read ports is set to 3 (nR,f = 3), while the number of switched read ports is swept over 1, 2 and 3 ports (nR,s = 1, 2, 3); hence the total number of read ports is swept over 4, 5 and 6 (nR = nR,s + nR,f = 4, 5, 6), respectively. Figure 4.7 (top and middle) plots the Fmax increase and ALM overhead percentages, respectively, compared to a simple-ports baseline design (without the switched-ports mechanism) with the maximum available ports, namely nW writing ports and nR reading ports.
Figure 4.7 (bottom) shows the M20K reduction relative to the equivalent design with simple ports only (nW writing ports and nR reading ports), and relative to the equivalent design with true ports only (nW,f + nR,f + max(nR,s, nW,s) true ports, as described in Section 4.2.1.2). For the given test case, using switched ports can save up to 45% of the BRAM consumption compared to the equivalent simple-port or true-port designs. The BRAM consumption can be anticipated from Equation 4.1 and Equation 4.3. The BRAM reduction relieves the routing resources, hence an Fmax increase is observed, as shown in Figure 4.7 (top). The Fmax increase is more significant for the thermometer-based I-LVT, achieving over 22% improvement. Additional data multiplexing causes an increase in ALMs. However, the increase is lower than 31% in all designs, as shown in Figure 4.7 (middle).

Figure 4.7: Switched ports compared to simple ports and true ports.

4.3 Multi-Switched-Ports

In this section, we demonstrate how to build multi-port RAMs that contain any number of switched ports. As well, the multi-port RAM may contain any number of simple read ports, simple write ports, and any number of true dual-ports (each of the latter will be treated as a switched port). Since a true port is a special case of a switched port, we are effectively generalizing the techniques used in [56] and [2]. Under this new model, we show an optimization method that minimizes the use of block RAMs in the data banks. We also note that for any given problem, there may be multiple solutions. Our solution uses a minimum set cover algorithm to map the required read/write dependences onto true dual-port block RAMs.
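The covering step can be approximated with a classic greedy set-cover heuristic. The sketch below is our own simplification, not the thesis's exact subgraph-pattern solver: each candidate BRAM is assumed to cover the edges between up to two write terminals and up to two read terminals (at most 4 edges), and port-exclusivity constraints between edges are ignored. Greedy selection repeatedly takes the candidate covering the most uncovered edges:

```python
from itertools import combinations

def greedy_bram_cover(writes, reads):
    """Cover all write->read edges with candidate dual-port BRAMs,
    each spanning up to 2 writes x 2 reads (at most 4 edges)."""
    edges = {(w, r) for w in writes for r in reads}
    # Candidate BRAM terminal groups: pairs or singletons of writes/reads
    w_groups = list(combinations(writes, 2)) + [(w,) for w in writes]
    r_groups = list(combinations(reads, 2)) + [(r,) for r in reads]
    chosen = []
    while edges:
        # Pick the candidate covering the most still-uncovered edges
        best = max(
            ((wg, rg) for wg in w_groups for rg in r_groups),
            key=lambda c: len({(w, r) for w in c[0] for r in c[1]} & edges),
        )
        edges -= {(w, r) for w in best[0] for r in best[1]}
        chosen.append(best)
    return chosen

# 2 writes x 2 reads: all 4 edges fit in one true dual-port BRAM
print(len(greedy_bram_cover(["W0", "W1"], ["R0", "R1"])))  # 1
# 2 writes x 4 reads: 8 edges -> two BRAMs suffice under this model
print(len(greedy_bram_cover(["W0", "W1"], ["R0", "R1", "R2", "R3"])))  # 2
```

Greedy covering is only an approximation of minimum set cover; the thesis's memory compiler solves the cover exactly, which is why its data banks reach minimum area.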
In all of our test cases, the set cover problem was solved optimally, leading to the use of minimal block RAMs in the data banks.

This section constructs novel, generic, modular and parametric BRAM-based multi-switched-ports RAMs. Compared to previous simple-port and true-port methods, the proposed technique significantly reduces BRAM consumption. The data banks are optimized to support mixed-port configurations by utilizing true dual-ported BRAMs. The data bank connectivity is converted into a Data Flow Graph (DFG), where vertices represent ports and edges represent data banks. Each block RAM deployed in the solution covers one or more DFG edges. An optimal set covering all of the DFG edges results in deploying a minimal number of block RAMs. Our architecture is fully implemented in Verilog, simulated using Altera's ModelSim, and compiled using Quartus II. A large variety of different architectures and parameters (bypassing, depth and width) are simulated in batch, each with over a million random memory cycles. Stratix V, Altera's high-end performance-oriented FPGA, is used to implement and compare the proposed approach with previous techniques. A run-in-batch simulation and synthesis flow manager is also provided. The Verilog modules and the flow manager are available online [107].

The rest of this section is organized as follows. The proposed multi-switched-ports synthesis method is described in detail in Section 4.3.1. The experimental framework and results are discussed in Section 4.3.2.

4.3.1 Multi-Ported RAM with Multiple Switched Ports

In this section, we describe how multiple switched ports may arise, and how to design multi-ported RAMs with them.

4.3.1.1 Motivation and Key Idea

Consider the design of a Processing Element (PE) that has three distinct (non-overlapping) states of operation:

1. Write to shared RAM using 4 write ports
2. Compute a value in shared RAM with an Arithmetic Logic Unit (ALU) using 2 read ports and 1 write port
3.
Read from shared RAM using 4 read ports

In this example, it is necessary to build a shared RAM structure that has multiple read and write ports. There are several approaches to building such a multi-port RAM, including:

A. 4 fixed read ports and 4 fixed write ports
B. 4 true dual-ports
C. Two switched ports, the first having 1 read or 1 write port (i.e., a true port), and the second having 3 read or 3 write ports
D. 2 fixed read ports, 1 fixed write port, and one switched port containing 2 read or 3 write ports

In addition, there are many other possible port-assignment solutions, where each port assignment may yield a solution requiring a different number of block RAMs. In this dissertation, we do not consider the problem of computing an optimal port assignment; that is left for future work. Instead, we must first be able to compute the block RAM requirements for a given port assignment; that is the purpose of this chapter.

The example above requires port assignment because there are multiple states. Consider simpler problems, such as those in Figure 4.8, where each user of the shared memory is under the control of a single read/write line. In such cases, there exists only one possible port assignment with switched ports.

The one valid port assignment is illustrated on the right-hand side of Figure 4.8 and can be broken down as follows. The ALU consists of two mutually exclusive functional blocks, f and g. First, the control signal f/g enables either the f or the g functional block; in either case the R0 operand must always be read and can be assigned to a fixed port R0,0. When f is selected, the R1 operand can be read from switched read port R2,0, but when g is selected we do not need R2,0 but instead need switched write ports W2,0 and W2,1. Thus, f/g is also the read/write control signal for the P2 group of switched ports. Similarly, the bus has a read/write control signal that directly controls the P1 group of switched ports with R1,0 and W1,0.
Similarly, the bus has a read/write control signal that directly controls the P1 group of switched ports with R1,0 and W1,0.

Thus, Figure 4.8 shows two possible implementations of the shared memory: one with all fixed ports on the left, and one with switched ports on the right. This leads to two possible ways to build the shared memory; we will show that the latter way is more efficient in terms of block RAM usage.

The key idea of our methodology is based on constructing a Data Flow Graph (DFG) to describe RAM port dependencies, where vertices represent ports and edges represent data banks. Two types of edges exist: regular (solid) edges for fixed ports, and dashed edges for switchable ports.

The goal is to cover all of these edges, and this cover directly describes an implementation using BRAMs. For example, a single regular edge can be covered by a simple dual-port BRAM with 1 fixed write port and 1 fixed read port (1W/1R). However, true dual-port BRAMs have two terminals on each end (2W/2R), and up to 4 edges between them. Thus, up to 4 edges in the graph can be covered by a true dual-port BRAM. The objective is to cover all edges with minimal BRAMs. This forms a set cover problem (SCP) which can be solved using special subgraph patterns to indicate possible covers.

[Figure 4.8: Simplified parallel system with shared memory. (left) Multi-ported RAM with fixed-ports connection: an ALU (mutually-exclusive blocks f and g under control signal f/g) and a shared bus (under control signal r/w) access a shared register file through fixed ports W0-W3 and R0-R2. (right) Multi-switched-ports connection: the same system accesses the memory through fixed port R0,0/W0,0, switched port group P1 (R1,0/W1,0), and switched port group P2 (R2,0/W2,0/W2,1).]

4.3.1.2 Port Assignment and Problem Definition

The problem input is a list of port requirements. Unlike switched ports, fixed ports are unrelated and operate individually; hence, they can be aggregated into a single port group (as in Figure 4.8) named P0. The superset of all port groups, P, includes nP port groups, where the first port group P0 is a fixed-port group and each of the remaining P1, ..., PnP-1 are switched port groups.
Each Pi represents an ordered pair with the number of writes nW,i and the number of reads nR,i for this specific port group, namely

P = \{ P_0, P_1, \cdots, P_{n_P-1} \mid P_i = ( n_{W,i}, n_{R,i} ) \}.   (4.4)

For instance, the port requirements for the example in Figure 4.8 are

P = \{ P_0 = (1,1),\ P_1 = (1,1),\ P_2 = (2,1) \}.   (4.5)

Writes and reads in this RAM are indexed by two indices: the first index is the port group index, ranging over 0...nP-1, while the second index is the write index within a specific port group i, ranging over 0...nW,i-1, or the read index within a specific port group i, ranging over 0...nR,i-1. The writes group W and the reads group R are defined as

W = \{ W_{i,j} \mid 0 \le i < n_P;\ 0 \le j < n_{W,i} \}
R = \{ R_{i,j} \mid 0 \le i < n_P;\ 0 \le j < n_{R,i} \}   (4.6)

For instance, the write and read sets for the example in Figure 4.8 are W = \{ W_{0,0}, W_{1,0}, W_{2,0}, W_{2,1} \} and R = \{ R_{0,0}, R_{1,0}, R_{2,0} \}, respectively.

The total number of writes and reads is denoted by nW and nR (without a second index), respectively, namely

n_W = |W| = \sum_{i=0}^{n_P-1} n_{W,i}
n_R = |R| = \sum_{i=0}^{n_P-1} n_{R,i}   (4.7)

For instance, the example in Figure 4.8 consists of 4 write ports and 3 read ports in total, hence nW = 4 and nR = 3.

Figure 4.9 generalizes the port assignment for the multi-switched-ports RAM. Given these port requirements, and using true-dual-ported BRAMs, the objective of our work is to construct the data banks to satisfy the given fixed and switched ports with the fewest BRAMs.

[Figure 4.9: Multi-switched-ports RAM port assignment. The diagram shows Port 0 (fixed) with writes W0,0...W0,nW,0-1 and reads R0,0...R0,nR,0-1, through Port nP-1 (switched) with writes WnP-1,0...WnP-1,nW,nP-1-1 and reads RnP-1,0...RnP-1,nR,nP-1-1.]

4.3.1.3 Modeling Data Banks with Data Flow Graph (DFG)

An LVT-based multi-ported RAM built using only fixed ports requires every write port to write to a dedicated data bank, allowing concurrent writes. Furthermore, every write-specific bank should be accessible by every read port, allowing all read ports to read data written by any write port.
This requirement can be modeled as a complete bipartite graph (complete bigraph).

A bigraph is a graph G consisting of two disjoint sets of vertices, say U and V. Each edge in a bigraph connects a vertex from U to a vertex in V. A complete bigraph is a special case where every vertex in U is connected to every vertex in V; in other words,

G = ( U, V, E \mid U \cap V = \emptyset,\ E = \{ \{u,v\} \mid u \in U, v \in V \} )   (4.8)

To model data bank connectivity, a bigraph DFG is constructed where source vertices are the write ports, U = W, and sink vertices are the read ports, V = R. Graph edges connect writes to reads; hence, they represent 1W/1R simple-dual-port BRAMs. Figure 4.10 (left) shows the bigraph of the fixed-port system in Figure 4.8 (left).

[Figure 4.10: Bigraph DFG representing the data bank connectivity of the shared memory in Figure 4.8. (left) Fixed-ports data banks. (right) Multi-switched-ports.]

Similarly, Figure 4.10 (right) shows a bigraph of the switched-port system in Figure 4.8 (right). However, the bigraph is slightly different from before; some edges Es \subseteq E are labeled as switched edges, drawn using dashed lines in the figure. Except for port group P0, which has only fixed ports, the other port groups give rise to switched edges that connect the write vertices to the read vertices within the same port group. Formally, the switched edge set is described as follows:

E_s = \{ \{ W_{p,i}, R_{p,j} \} \mid 1 \le p < n_P,\ 0 \le i < n_{W,p},\ 0 \le j < n_{R,p} \}.   (4.9)

For instance, in Figure 4.10 (right), the switched edges are E_s = \{ \{W_{1,0}, R_{1,0}\}, \{W_{2,0}, R_{2,0}\}, \{W_{2,1}, R_{2,0}\} \}.

4.3.1.4 Multi-Switched-Ports DFG Optimization

Using a true-dual-ported BRAM gives us the ability to cover several possible subgraphs that appear in a bigraph DFG. Since these subgraphs are also complete bigraphs, we call them biclique patterns, or BPs. All the different biclique patterns are described in Table 4.4.

For each BP in Table 4.4, we can identify several specific instances that it can cover within the biclique DFG from Figure 4.10 (right).
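As an illustrative sketch (not the thesis's actual memory compiler), the construction of the bigraph DFG of Equations 4.4-4.9 can be written in a few lines of Python; the helper name `build_dfg` is ours:

```python
def build_dfg(P):
    """Build the bigraph DFG of the data banks for port groups P.

    P is a list of (nW,i, nR,i) pairs (Equation 4.4); write and read ports
    are enumerated as (group, index) pairs (Equation 4.6). Every write must
    reach every read, so E = W x R (Equation 4.8); edges joining a write and
    a read of the same switched group p >= 1 form Es (Equation 4.9).
    """
    W = [(i, j) for i, (nw, _) in enumerate(P) for j in range(nw)]
    R = [(i, j) for i, (_, nr) in enumerate(P) for j in range(nr)]
    E = [(w, r) for w in W for r in R]
    Es = [(w, r) for (w, r) in E if w[0] == r[0] and w[0] >= 1]
    return E, Es

# Example of Figure 4.8 (right): P0=(1,1) fixed, P1=(1,1), P2=(2,1) switched.
E, Es = build_dfg([(1, 1), (1, 1), (2, 1)])
print(len(E))  # 12 edges in total (nW * nR = 4 * 3)
print(Es)      # [((1,0),(1,0)), ((2,0),(2,0)), ((2,1),(2,0))]
```

The three switched edges printed above are exactly the set Es of the Figure 4.8 example.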
Figure 4.11 enumerates all possible BP instances that occur within the original biclique DFG.

Once this full enumeration has occurred, all that is necessary is to select a subset of these BP instances such that all edges in the original biclique DFG are covered. Each BP instance requires a BRAM, so minimizing the number of BP instances in the cover will minimize BRAM usage.

Biclique patterns must cover all the edges in the switched bigraph DFG. However, different BPs may have shared edges. Efficiently covering the bigraph DFG edges is not trivial; simply choosing all of the largest 2W/2R BPs may produce inefficient results. Covering the edges of the bigraph DFG is actually equivalent to the set cover problem. The set cover problem is an NP-complete problem [108].

[Table 4.4: Biclique patterns and their attributes. The table lists the eight biclique patterns, fixed and switched variants of 1W/1R, 1W/2R, 2W/1R, and 2W/2R (F1W1R, S1W1R, F1W2R, S1W2R, F2W1R, S2W1R, F2W2R, S2W2R), each with its biclique subgraph over ports Wp,i/Wq,j and Rp,k/Rq,l and the corresponding true-dual-port BRAM connectivity; the circuit diagrams are not recoverable from this extraction.]

4.3.1.5 Solving the Cover Problem

The set cover problem takes a universe set U and another set S of subsets of U whose union covers U (\bigcup_{s \in S} s = U), namely,

SCP = ( U, S \mid \forall s \in S : s \subseteq U,\ \bigcup_{s \in S} s = U ).   (4.10)

The objective of the set cover problem is to find a cover of U with the fewest sets from S. Let T \subseteq S be such a subset of S; the set cover problem objective is therefore

\min_{T \subseteq S} |T| \ \Big|\ \bigcup_{t \in T} t = U   (4.11)

The bigraph DFG optimization solves a set cover problem where all DFG edges form the universe and all biclique pattern instances are the covering subsets, namely,

SCP = ( U = E,\ S = F1W1R \cup S1W1R \cup F1W2R \cup S1W2R \cup F2W1R \cup S2W1R \cup F2W2R \cup S2W2R ).   (4.12)

The set cover problem can be formulated and solved as the following binary linear programming (BLP) problem:

minimize \sum_{s \in S} x_s
subject to \sum_{s : e \in s} x_s \ge 1 \quad \forall e \in U,
\quad x_s \in \{0,1\} \quad \forall s \in S   (4.13)

where x_s is a binary decision variable, indicating whether s is part of the solution or not. Figure 4.11 provides a synthesis example for the bigraph DFG from Figure 4.10 (right).
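The thesis solves this BLP exactly with AMPL and GLPK (Section 4.3.2.1); as a self-contained stand-in, the brute-force solver below (our own sketch, practical only for small instances like these) finds a minimum-cardinality cover in the sense of Equations 4.10-4.13:

```python
from itertools import combinations

def min_set_cover(universe, subsets):
    """Exact minimum set cover by exhaustive search over subset selections.

    universe: the set of DFG edges; subsets: one edge set per BP instance.
    Returns indices of a smallest selection whose union equals the universe,
    or None if no cover exists.
    """
    for k in range(1, len(subsets) + 1):
        for combo in combinations(range(len(subsets)), k):
            covered = set().union(*(subsets[i] for i in combo))
            if covered == universe:
                return list(combo)
    return None

# Toy instance: 4 edges, candidate BP covers of up to 2 edges each.
U = {"e0", "e1", "e2", "e3"}
S = [{"e0", "e1"}, {"e1", "e2"}, {"e2", "e3"}, {"e0"}, {"e3"}]
print(min_set_cover(U, S))  # [0, 2]: two BPs cover all four edges
```

An ILP solver such as GLPK reaches the same minimum without the exponential enumeration, which is how the memory compiler scales to the larger test cases.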
A set cover solution uses the highlighted biclique patterns in Figure 4.11 (left), producing the final synthesized data banks shown in Figure 4.11 (right).

As a comparison, Table 4.5 compares solutions for the purely fixed-ports and multi-switched-ports implementations of Figure 4.8. For this specific example, the fixed-ports method consumes 12 BRAMs to construct the data banks while the multi-switched-ports method consumes only 8 BRAMs, yielding a 33% BRAM reduction.

[Figure 4.11: Synthesis example of the DFG from Figure 4.8. (left) All possible biclique pattern instances (of types F1W1R, S1W1R, F1W2R, F2W1R, S2W1R, F2W2R); the optimal BPs are highlighted. (right) The synthesized data banks, using eight BRAMs (BRAM0-BRAM7).]

4.3.1.6 Data Dependencies and Bypassing

Due to the pipelined nature of building a full multi-switched-port RAM, data dependencies due to internal latencies arise naturally. This requires internal forwarding and bypassing to resolve these hazards. The full multi-switched-port RAM design consists of the data banks and the I-LVT. The I-LVT itself consists of feedback banks and output extraction banks. For each of these three structures, Table 4.6 summarizes the type of bypassing required to produce a correct design that can tolerate certain hazards. Further detail is provided below.

The I-LVT-based structure [1] is used to steer the read data out of the multi-switched-ports data banks; however, the I-LVT incurs data dependencies due to the feedback functions and the latency of reading the I-LVT to decide on the last written bank [1]. Data dependencies can be handled by employing bypassing, also known as forwarding.
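The effect of single-stage bypassing can be modelled behaviourally. The Python sketch below is our own cycle-level illustration (not the thesis's Verilog): without bypassing, reading an address written on the same clock edge returns the old BRAM contents; with a single-stage bypass, the in-flight write data are forwarded to the read output.

```python
class BypassedRAM:
    """Cycle-level model of a 1W/1R RAM with optional single-stage RDW bypass."""

    def __init__(self, depth, bypass=True):
        self.mem = [0] * depth
        self.bypass = bypass

    def cycle(self, waddr, wdata, raddr):
        """One clock edge: write (waddr, wdata) and read raddr concurrently."""
        old = self.mem[raddr]
        self.mem[waddr] = wdata
        if self.bypass and raddr == waddr:
            return wdata   # forward the in-flight write: new data on RDW
        return old         # plain BRAM behaviour: old data on RDW

ram = BypassedRAM(16, bypass=True)
print(ram.cycle(waddr=5, wdata=0xAB, raddr=5))  # 171: new data forwarded
plain = BypassedRAM(16, bypass=False)
print(plain.cycle(waddr=5, wdata=0xAB, raddr=5))  # 0: old data returned
```

In hardware, the forwarding path is a comparator on the write and read addresses plus a mux on the read data, as depicted in Table 4.7.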
Bypassing is necessary since dual-port BRAMs cannot internally forward new data when one port reads and the other port writes the same address on the same clock edge, constituting a Read-During-Write (RDW) hazard.

Table 4.7 shows two types of bypassing based on write data and address pipelining. Both bypassing techniques are functionally equivalent, allowing the data that are being written to be read on the same clock edge, similar to single-register functionality. However, the fully-pipelined two-stage bypassing shown in Table 4.7 (bottom) can overcome an additional cycle of latency, namely an additional pipe stage on the write data and address (not shown in the figures). This capability is required if a BRAM has pipelined inputs (e.g., cascaded from another BRAM) that need to be bypassed.

[Table 4.5: Comparison of purely fixed-ports versus multi-switched-ports implementations of the example in Figure 4.8. The table contrasts the two port assignments, their biclique DFGs, and the synthesized data banks; the diagram content is not recoverable from this extraction.]

The proposed multi-switched-ports RAM utilizes true-dual-port BRAMs to provide switched-port functionality. However, since the writing and reading operations in true-dual-ported RAMs are exchangeable, the bypassing circuitry requires special handling. As described in Table 4.7 (right), the bypass circuit of the true/true RAM configuration is mirrored compared to the true/simple RAM configuration. Thus, it can bypass written data from any direction.
However, the control logic that drives the bypassing mux selectors needs to be altered to detect the direction of writing.

The most severe data dependency that the I-LVT design [1] suffers from is Write-After-Write (WAW), namely, writing to the same address that was written in the previous cycle. This dependency occurs because of the feedback reading and writing latency. A single-stage bypass on the feedback banks solves this dependency.

Two types of reading hazards are also introduced by the I-LVT design: Read-After-Write (RAW) and Read-During-Write (RDW). RAW occurs when the same data that were written on the previous clock edge are read on the current clock edge. RDW occurs when the same data are written and read on the same clock edge. Due to the latency of the I-LVT, reading from the same address on the next clock edge after writing (RAW) will provide the old data. To read the new data instead, the output extraction banks of the I-LVT should be bypassed by a single-stage bypass to overcome the I-LVT latency.

Table 4.6: Bypassing of multi-switched-ports.

                 Data Banks   I-LVT Feedback Banks   I-LVT Output Extract Banks
  Allow WAW      None         1-stage                None
  New Data RAW   None         1-stage                1-stage
  New Data RDW   1-stage      1-stage                2-stage

The deepest bypassing stage is reading new data on the same writing clock edge (RDW), which is similar to a single register stage of latency. This can be achieved by
The data banks, which are working in parallel with the I-LVTshould be bypassed by a single-stage to provide new data.112Table 4.7: Single-stage and two-stage BRAM bypassing.Simple/Simple Configuration True/Simple Configuration True/True ConfigurationSingle-stageDinPort ADoutAddrR/WPort BDinDoutAddrR/W=DinDoutAddrR/WDinDoutAddrR/W001x01==001x01DinPort ADoutAddrR/WPort BDinDoutAddrR/W=DinDoutAddrR/WDinDoutAddrR/W0101DinPort ADoutAddrR/WPort BDinDoutAddrR/WDoutAddrDinDoutAddrR/W001x01DinPort ADoutAddrR/WPort BDinDoutAddrR/WDoutAddrDinDoutAddrR/W01DinPort ADoutAddrR/WPort BDinDoutAddrR/W=Din01‘1’‘0’DoutAddrAddrDinPort ADoutAddrR/WPort BDinDoutAddrR/W=DoutAddr‘1’Din‘0’001x01=Addr‘1’‘1’===Single-stageTwo-stageSimple/Simple Configuration True/Simple Configuration True/True ConfigurationDinPort ADoutAddrR/WPort BDinDoutAddrR/W=DinDoutAddrR/WDinDoutAddrR/W001x01==001x01DinPort ADoutAddrR/WPort BDinDoutAddrR/W=DinDoutAddrR/WDinDoutAddrR/W0101DinPort ADoutAddrR/WPort BDinDoutAddrR/WDoutAddrDinDoutAddrR/W001x01DinPort ADoutAddrR/WPort BDinDoutAddrR/WDoutAddrDinDoutAddrR/W01DinPort ADoutAddrR/WPort BDinDoutAddrR/W=Din01‘1’‘0’DoutAddrAddrDinPort ADoutAddrR/WPort BDinDoutAddrR/W=DoutAddr‘1’Din‘0’001x01=Addr‘1’‘1’===Single-stageTwo-stageSimple/Simple Configuration True/Simple Configuration True/True ConfigurationDinPort ADoutAddrR/WPort BDinDoutAddrR/W=DinDoutAddrR/WDinDoutAddrR/W01x01==01x01DinPort ADoutAddrR/WPort BDinDoutAddrR/W=DinDoutAddrR/WDinDoutAddrR/W0101DinPort ADoutAddrR/WPort BDinDoutAddrR/WDoutAddrDinDoutAddrR/W01x01DinPort ADoutAddrR/WPort BDinDoutAddrR/WDoutAddrDinDoutAddrR/W01DinPort ADoutAddrR/WPort BDinDoutAddrR/W=Din01‘1’‘0’DoutA drA drDinPort ADoutAddrR/WPort BDinDoutAddrR/W=DoutA dr‘1’Din‘0’01x01=A dr‘1’‘1’===Single-stageTwo-stageSi ple/Si ple Configuration True/Si ple Configuration True/True ConfigurationTwo-stageDinPort ADoutAddrR/WPort BDinDoutAddrR/W=DinDoutAddrR/WinDoutAddrR/W001x01==001x01DinPort ADoutAddrR/WPort 
BDinDoutAddrR/W=DinDoutAddrR/WDinDoutAddrR/W0101DinPort ADoutAddrR/WPort BDinDoutAddrR/WDoutAddrDinDoutAddrR/W001x01DinPort ADoutAddrR/WPort BDinDoutAddrR/WDoutAddrDinDoutAddrR/W01DinPort ADoutAddrR/WPort BDinDoutAddrR/W=Din01‘1’‘0’DoutAddrAddrDinPort ADoutAddrR/WPort BDinDoutAddrR/W=DoutAddr‘1’Din‘0’001x01=Addr‘1’‘1’===Single-stageTwo-stageSimple/Simple Configuration True/Simple Configuration True/True ConfigurationDinPort ADoutAddrR/WPort BDinDoutAddrR/W=DinDoutAddrR/WinDoutAddrR/W001x01==001x01DinPort ADoutAddrR/WPort BDinDoutAddrR/W=DinDoutAddrR/WDinDoutAddrR/W0101DinPort ADoutAddrR/WPort BDinDoutAddrR/WDoutAddrDinDoutAddrR/W001x01DinPort ADoutAddrR/WPort BDinDoutAddrR/WDoutAddrDinDoutAddrR/W01DinPort ADoutAddrR/WPort BDinDoutAddrR/W=Din01‘1’‘0’DoutAddrAddrDinPort ADoutAddrR/WPort BDinDoutAddrR/W=DoutAddr‘1’Din‘0’001x01=Addr‘1’‘1’===Single-stageTwo-stageSimple/Simple Configuration True/Simple Configuration True/True ConfigurationDinPort ADoutAddrR/Port BDinDoutAddrR/=DinDoutAddrR/inDoutAddrR/01x01==01x01DinPort ADoutAddrR/Port BDinDoutAddrR/=DinDoutAddrR/DinDoutAddrR/0101DinPort ADoutAddrR/Port BDinDoutAddrR/DoutAddrDinDoutAddrR/01x01DinPort ADoutAddrR/Port BDinDoutAddrR/DoutAddrDinDoutAddrR/01DinPort ADoutAddrR/Port BDinDoutAddrR/=Din01‘1’‘0’DoutAddrAddrDinPort ADoutAddrR/Port BDinDoutAddrR/=DoutAddr‘1’Din‘0’01x01=Addr‘1’‘1’===Single-stageTwo-stageSi ple/Si ple Configuration True/Si ple Configuration True/True Configuration1134.3.2 Experimental Results4.3.2.1 Experimental FrameworkThe proposed multi-switched-port RAM approach, complete with bypassing, hasbeen fully implemented in parameterized Verilog. For a given design instance, wedeveloped a memory compiler to convert the RAM port assignment into a bicliqueDFG, enumerate all of the biclique pattern instances in the DFG, and use theseto describe a set cover problem instance. 
The set cover problem is formulated as a Binary Linear Programming (BLP) problem using AMPL (A Mathematical Programming Language) [109], an algebraic modeling language used to describe large-scale mathematical optimization problems. The BLP optimization problem is solved using GLPK (the GNU Linear Programming Kit) [110], an open-source large-scale linear programming solver. Finally, the selected biclique patterns (covers) are used to automatically construct the data banks as described in Table 4.4 and shown, for example, in Figure 4.11.

A run-in-batch flow manager has also been developed to simulate and synthesize these designs with various parameters in batch using Altera's ModelSim and Quartus II. The Verilog modules, the algorithmic scripts, and the flow manager are available online as an open source contribution [107].

To verify correctness, each design instance is simulated using Altera's ModelSim. A large variety of instances with different RAM port assignments and design parameters, e.g., bypassing, RAM depth, and data width, are swept and simulated in batch, each with over a million random cycles. These multi-switched-port RAM design modules were then compiled with Altera's Quartus II into Altera's Stratix V 5SGXEA7N1F45C1 device [55]. This is a speed grade 1 device with 234k ALMs and 2560 M20K blocks.

4.3.2.2 Methodology

The proposed multi-switched-port RAM design process is generic and can support any number of fixed and heterogeneous switched ports. This means all previous multi-port RAM methods are actually special cases of this new proposed method. For benchmarking purposes, we take a number of design instances, and for each one we find a multi-switched-port RAM using our new method. We then need to generate comparative results using multi-port RAMs designed with fixed ports [1], with true ports [56], and with a single switched port [2].

Then, we can compare results against the older fixed-port method [1] using the same tooling.
By treating switched ports as fixed ports, we thereby place all read and write ports into the first port group P0 (the fixed-port group). This generates a fixed-port solution that satisfies the same design instance requirements.

To compare against the true-ports method [56], we must avoid using the fixed port group P0. Instead, as described in Table 4.8, each fixed port must be mapped into a true port (with a fixed R/W control); hence nW,0 + nR,0 true ports are required to implement the fixed ports. In addition, every read and write pair in a switched port can be mapped into a single true port; hence, max(nW,i, nR,i) true ports are required to implement a switched port group Pi. In total, the number of required true ports is

n_t = n_{W,0} + n_{R,0} + \sum_{i=1}^{n_P-1} \max( n_{W,i}, n_{R,i} )   (4.14)

For example, the system in Figure 4.8 can be implemented using nt = 5 true ports.

To compare against the single-switched-port method [2], the largest switched port group (say Pm) is chosen to be implemented as the single switched port, and all the other ports are implemented using fixed ports.
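Equation 4.14 is easy to check in code; the following sketch (the helper name is ours) reproduces the nt = 5 result for the example of Figure 4.8:

```python
def true_ports_required(P):
    """Equation 4.14: nt = nW,0 + nR,0 + sum over switched groups of max(nW,i, nR,i).

    P is a list of (nW,i, nR,i) pairs; P[0] is the fixed port group.
    """
    nw0, nr0 = P[0]
    return nw0 + nr0 + sum(max(nw, nr) for nw, nr in P[1:])

# Figure 4.8 example: P0=(1,1) fixed, P1=(1,1) and P2=(2,1) switched.
print(true_ports_required([(1, 1), (1, 1), (2, 1)]))  # 5
```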
This becomes

P = \{ P_0 = ( n_W - n_{W,m},\ n_R - n_{R,m} ),\ P_1 = ( n_{W,m},\ n_{R,m} ) \}   (4.15)

where m is the index of the switched port group with the maximum number of writes and reads,

1 \le m < n_P \ \Big|\ n_{W,m} + n_{R,m} = \max_{1 \le i < n_P} ( n_{W,i} + n_{R,i} )   (4.16)

[Table 4.8: Multi-switched-ports conversion (example from Figure 4.8). The table shows how the multi-switched port assignment of Figure 4.8 (fixed group P0 and switched groups P1 and P2) maps onto the fixed [1, 11], true [56], and single-switched [2] methods; the diagram content is not recoverable from this extraction.]

4.3.2.3 Test Cases

A number of random multi-switched-ports test cases are listed in Table 4.9. Ten random test cases, TC1 to TC10, have been generated for illustrative purposes; real cases are difficult to extract from applications without a precise understanding of their use of multi-port RAMs and how their FSMs specify (possibly mutually exclusive cases of) read/write behaviour. While our method can synthesize any number of ports, the test cases are limited to 8 switched ports to avoid accidentally exceeding device resources. These random test cases use 4 to 8 switched ports, where each switched port has 1 to 4 writes or reads.
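Equations 4.15 and 4.16 can likewise be sketched in code (the helper name is ours); for the Figure 4.8 example, the largest switched group is P2, giving P = {P0 = (2,2), P1 = (2,1)}:

```python
def to_single_switched(P):
    """Map a multi-switched port assignment onto the single-switched-port method.

    Per Equation 4.16, choose the switched group m (m >= 1) maximizing
    nW,m + nR,m; per Equation 4.15, that group becomes the single switched
    port and all remaining ports are folded into the fixed group P0.
    """
    nW = sum(nw for nw, _ in P)
    nR = sum(nr for _, nr in P)
    m = max(range(1, len(P)), key=lambda i: P[i][0] + P[i][1])
    nWm, nRm = P[m]
    return [(nW - nWm, nR - nRm), (nWm, nRm)]

# Figure 4.8 example: P2=(2,1) is the largest switched group (nW,2 + nR,2 = 3).
print(to_single_switched([(1, 1), (1, 1), (2, 1)]))  # [(2, 2), (2, 1)]
```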
For each test case, we specify a multi-port RAM using the multi-switched-ports approach proposed in this chapter. In addition, we show other multi-port RAM designs with fixed ports [1], true ports [56], and a single switched port [2] that address the same requirements.

4.3.2.4 Results

Table 4.10 lists the experimental results of the ten random test cases defined in Table 4.9, implemented using the four design styles. All synthesized test cases have one byte of data width, 8k lines in depth, new-data-RAW bypassing, and use a binary-coded I-LVT [1]. BRAM consumption, ALMs, and Fmax are given directly in the table. Table 4.11 lists the percentage change in these parameters compared to our proposed method. Compared to the single-switched-port [2] and the fixed-port [1] methods, ALM consumption and Fmax are similar while BRAM consumption is reduced by 18% on average. On the other hand, comparing our proposed method to the true-ports method shows a 42% BRAM reduction, 53% fewer ALMs, and 15% higher Fmax.

Table 4.9: Heterogeneous multi-ported RAM testcases. Writes and reads for each port, (nW,i, nR,i) from Equation 4.4; the true-ports method uses nt ports of (1,1) each.

            Multi-switched                                    Fixed    True                 Single-switched
  Testcase  P0    P1    P2    P3    P4    P5    P6    P7      P0       P0     P1...Pnt  nt  P0       P1
  TC1       (1,1) (1,1) (2,1) (2,1) (1,2) (1,2) (2,3) (2,3)   (11,13)  (0,0)  (1,1)     15  (9,10)   (2,3)
  TC2       (1,1) (1,1) (2,1) (1,2) (2,2) (2,3) (3,2) (3,3)   (13,14)  (0,0)  (1,1)     15  (10,11)  (3,3)
  TC3       (1,2) (1,1) (1,2) (2,1) (2,2) (2,3) (3,2) -       (12,12)  (0,0)  (1,1)     15  (9,10)   (3,2)
  TC4       (2,1) (1,1) (1,1) (1,2) (1,3) (2,3) (3,3) -       (9,15)   (0,0)  (1,1)     15  (6,12)   (3,3)
  TC5       (1,1) (1,3) (2,1) (2,1) (2,2) (2,3) -     -       (10,10)  (0,0)  (1,1)     13  (8,7)    (2,3)
  TC6       (1,3) (1,1) (1,1) (1,3) (2,2) (3,3) -     -       (8,13)   (0,0)  (1,1)     13  (5,10)   (3,3)
  TC7       (2,2) (1,1) (2,4) (2,1) (1,3) -     -     -       (8,11)   (0,0)  (1,1)     14  (6,7)    (2,4)
  TC8       (2,1) (1,1) (1,4) (2,1) (3,1) -     -     -       (8,10)   (0,0)  (1,1)     14  (7,6)    (1,4)
  TC9       (2,3) (2,1) (1,4) (2,4) -     -     -     -       (7,12)   (0,0)  (1,1)     15  (5,8)    (2,4)
  TC10      (3,1) (1,2) (2,4) (3,4) -     -     -     -       (9,11)   (0,0)  (1,1)     14  (6,7)    (3,4)

Table 4.10: Experimental results. Each method is reported as BRAMs / ALMs / Fmax (MHz).

            Multi-switched        Single-switched [2]   True [56]             Fixed [1]
  TC1       826   3417   247.40   1054  3349   242.25   1290   6222   215.84  1034  3271   235.29
  TC2       1044  4781   224.16   1316  4828   228.10   1290   6222   215.84  1352  4840   225.33
  TC3       904   3793   232.56   1104  3717   237.30   1290   6222   215.84  1080  3634   241.90
  TC4       726   2732   250.75   882   2653   253.87   1290   6222   215.84  918   2617   245.04
  TC5       628   2434   248.32   756   2441   244.38   962    4503   235.68  780   2414   253.94
  TC6       568   2031   258.00   700   1948   260.42   962    4503   235.68  704   1916   246.37
  TC7       540   1801   257.67   608   1746   265.32   1120   5483   214.13  608   1703   260.96
  TC8       524   1675   265.82   576   1629   270.42   1120   5483   214.13  592   1614   266.03
  TC9       504   1499   279.17   556   1502   270.27   1290   6222   215.84  560   1444   275.71
  TC10      586   2321   252.33   690   2205   257.33   1120   5483   214.13  702   2154   247.46
  Avg.      685   2648.4 251.62   824.2 2601.8 253.00   1173.4 5656.5 219.30  833   2560.7 249.80

Table 4.11: Results comparison. Each column group compares our method against the single-switched [2], true [56], and fixed [1] methods, in that order.

            BRAM Reduction     ALM Reduction      Fmax Increase
  TC1       22%  36%  20%      -2%  45%  -4%       2%  15%   5%
  TC2       21%  19%  23%       1%  23%   1%      -2%   4%  -1%
  TC3       18%  30%  16%      -2%  39%  -4%      -2%   8%  -4%
  TC4       18%  44%  21%      -3%  56%  -4%      -1%  16%   2%
  TC5       17%  35%  19%       0%  46%  -1%       2%   5%  -2%
  TC6       19%  41%  19%      -4%  55%  -6%      -1%   9%   5%
  TC7       11%  52%  11%      -3%  67%  -6%      -3%  20%  -1%
  TC8        9%  53%  11%      -3%  69%  -4%      -2%  24%   0%
  TC9        9%  61%  10%       0%  76%  -4%       3%  29%   1%
  TC10      15%  48%  17%      -5%  58%  -8%      -2%  18%   2%
  Avg.      17%  42%  18%      -2%  53%  -3%      -1%  15%   1%

4.4 Conclusions

In this chapter, we have proposed a novel, modular, BRAM-based switched-multi-ported RAM architecture. In addition to unidirectional ports with fixed read/write, this switched architecture allows a group of write ports to switch with another group of read ports dynamically.
The proposed switched-ports architecture is less flexible than a true-multi-ported RAM, where each port is switched individually. Nevertheless, switched memories can dramatically reduce BRAM consumption compared to true or simple ports for systems with alternating port requirements.

In addition, we propose the new idea of having multiple switched ports in a multi-ported RAM design. This method requires a memory compiler to create a specific design instance, and the solution of a set cover problem to optimize its implementation. Our Computer-Aided Design (CAD) approach finds a minimal implementation for all of our test cases, but there is opportunity for further CAD research to improve run-time while remaining optimal. On average, over 10 random test cases, the suggested multi-switched-ports method reduces BRAM use by 18% compared to the best of previous methods, while maintaining ALM count and Fmax. Future research may extend the RAM port assignment problem to more complex cases where there are more than two states governing memory port usage.

The fully parameterized and generic Verilog modules, the algorithmic scripts, the multi-switched-ports memory compiler, and the flow manager are available online as an open source contribution [106, 107].

Chapter 5

2-Dimensional Hierarchical Search BCAMs (2D-HS-BCAMs)

In this chapter, a novel and efficient technique for constructing BCAMs out of standard SRAM blocks in FPGAs is proposed. The new technique divides a CAM into wide rows or sets, instead of the CAM being just a single pattern wide. Using Altera's Stratix V device, the traditional design method achieves up to a 64K-entry BCAM while the proposed technique achieves up to 4M entries. For the 64K-entry test case, the traditional method consumes 43 times more ALMs, requires 18 times longer mapping runtime, and achieves only one-third of the Fmax of the proposed method. A fully parameterized Verilog implementation is being released as an open source hardware library.
The library has been extensively tested using ModelSim and Altera's Quartus tools.

5.1 Introduction

In this chapter, a modular SRAM-based BCAM suitable for many storage entries is proposed. The CAM storage is divided into equal-sized sets. CAM lookup proceeds in two steps. First, a RAM structure stores hit or miss information for each set, where a set stores several patterns. Second, the specific set is searched in parallel for a match. The set data (the patterns themselves) are stored in a second RAM structure. To achieve a write and a match of the same pattern in the same cycle, a CAM bypassing mechanism is also provided.

The proposed method is device-independent; hence, it can be applied to any FPGA device containing standard dual-ported BRAMs. The proposed approach dramatically improves CAM storage efficiency and operation frequency compared to conventional methods. In contrast to other attempts that require several cycles to write or match [111–113], the proposed approach is high-throughput and can perform a pattern read (match) every cycle and a pattern write every two cycles.

Major contributions of this work are:

• A novel BCAM architecture capable of producing millions of CAM entries. Compared to other conventional BCAM approaches, the proposed technique exhibits the highest storage efficiency while providing improved overall performance. To the authors' best knowledge, the research and patent literature do not describe similar BCAM techniques.

• A CAM bypassing mechanism to write and match the same pattern in the same cycle.

• A parameterized Verilog implementation of the suggested methods, together with previous standard approaches. A flow manager to simulate and synthesize various designs with various parameters in batch using Altera's ModelSim and Quartus II is also provided.
The Verilog modules and the flow manager are available online [114].

To verify correctness, the proposed BCAM architecture is fully implemented in Verilog, simulated using Altera's ModelSim, and compiled using Quartus II [57]. A large variety of BCAM architectures and parameters, e.g., BCAM depth, pattern width, bypassing, and pipelining, are simulated in batch, each with over one million random BCAM write and match cycles. Stratix V, Altera's high-end performance-oriented FPGA [55], is used to implement and compare the proposed architecture with previous approaches.

The rest of this chapter is organized as follows. The motivation and method of the proposed 2-Dimensional Hierarchical Search BCAM (2D-HS-BCAM) approach to generate deep BCAMs is described in detail in Section 5.2. BCAM bypassing techniques are described in Section 5.3. Discussion of the suggested method and comparison to previous techniques are provided in Section 5.4. The experimental framework and the simulation and synthesis results are discussed in Section 5.5. Conclusions are drawn in Section 5.6.

5.2 The 2-Dimensional Hierarchical Search BCAM (2D-HS-BCAM) Approach

As shown in Figure 5.1, pattern lookup works in two stages. First, a RAM structure, called the Set Transposed Indicators RAM (STIRAM), is indexed by the pattern; for each pattern, it stores match information about whether the pattern is present in each possible set. This information can be used by a priority encoder to identify the address or ID of a set containing the pattern; the set ID forms the upper bits of the match address. The set ID also indexes into a second RAM structure, called the Sets RAM (SetRAM), to produce the set content. Second, the wide set is searched in parallel for a match to the pattern; the location of the match within the set produces the lower bits of the match address.

While pattern-cascadable BCAMs require indicators from every address location at every stage, this requirement can be alleviated if the BCAM will not be cascaded in pattern width.
Instead of storing pattern indicators for each address location separately, an indicator is generated for a group of addresses, indicating whether the pattern exists at any of these addresses. An address set of width SW is a group of SW successive addresses, hence an address range of depth SW. For a CD-entry BCAM with set width SW, ⌈CD/SW⌉ sets exist, and they can be enumerated as

    S = { 0, 1, ..., ⌈CD/SW⌉ − 1 }.    (5.1)

A set indicator I_{p,s} indicates whether any of the addresses in set s, hence addresses SW·s up to SW·(s+1)−1, contains the pattern p, namely,

    ∀s ∈ S, p ∈ P : I_{p,s} = ⋁_{a = SW·s}^{SW·(s+1)−1} (RAM[a] = p),    (5.2)

where P is the set of patterns. The STIRAM is therefore

    STIRAM = [ I_{0,0}      I_{0,1}      ···  I_{0,|S|−1}
               I_{1,0}      I_{1,1}      ···  I_{1,|S|−1}
               ...          ...          ···  ...
               I_{|P|−1,0}  I_{|P|−1,1}  ···  I_{|P|−1,|S|−1} ].    (5.3)

Figure 5.1: The match portion of the 2D-HS-BCAM system.

The STIRAM provides information about matched patterns within a set, not the exact location. To detect the exact pattern location, an auxiliary RAM stores the patterns associated with each set, with all patterns of a set in one RAM address; hence, it is called the Sets RAM (SetRAM). If a match is found in a specific set, this set is fetched from the SetRAM and the patterns within this set are compared concurrently to the match pattern.

Figure 5.2 illustrates the complete 2D-HS-BCAM system, whereas an example of an 8-entry, 2-bit BCAM with SW = 2 is given in Figure 5.3. Two RAM structures are required: the first is the STIRAM, a transposed RAM, which is addressed by patterns and stores set indicators; the second is the SetRAM, which stores the data patterns for each set in one RAM line. The reference RAM in Figure 5.3 describes the content of the BCAM, but is not required for the 2D-HS-BCAM implementation.
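As an illustration, the set-indicator construction of Equations 5.1–5.3 can be modeled behaviorally. The following Python sketch (an illustrative reference model, not the released Verilog; the function name and toy BCAM content are ours) builds the STIRAM matrix for a small BCAM:

```python
from math import ceil

def set_indicators(ram, s_w, p_w):
    """Build the STIRAM matrix of Equation 5.3: entry [p][s] is 1 iff
    pattern p occurs at some address of set s (Equation 5.2)."""
    num_sets = ceil(len(ram) / s_w)          # |S| = ceil(CD/SW), Equation 5.1
    sti = [[0] * num_sets for _ in range(2 ** p_w)]
    for addr, patt in enumerate(ram):
        sti[patt][addr // s_w] = 1           # address `addr` lies in set addr // SW
    return sti

# An 8-entry, 2-bit BCAM with SW = 2 (four sets of two addresses each):
content = [0b01, 0b11, 0b01, 0b01, 0b11, 0b10, 0b11, 0b01]
print(set_indicators(content, s_w=2, p_w=2)[0b11])   # [1, 0, 1, 1]: sets 0, 2, 3
```

Pattern 0b11 lives at addresses 1, 4, and 6, i.e., in sets 0, 2, and 3, so only those three set indicators are raised in its STIRAM row.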
Instead, the BCAM content is stored in the SetRAM as sets of data patterns.

Figure 5.2: The complete 2D-HS-BCAM system.

Figure 5.3: An 8×2, SW = 2 example of the 2D-HS-BCAM.

The match operation checks the STIRAM for a match within a set, detects the first matching set using a priority encoder, then fetches the corresponding set patterns (with a match) from the SetRAM, and compares all patterns with the match pattern in parallel to detect the exact match location. The match operation is described in detail as follows.

1. Detect a match among the sets:
   1.1. MPatt is provided to the STIRAM for search.
   1.2. The STIRAM detects which sets contain MPatt, outputting a match indicator for each set.
   1.3. The sets priority encoder (PE) generates the binary address of the first set with a matching pattern. This address composes the higher part of MAddr.
   1.4. The sets PE also generates the binary Match signal, which indicates that a match was found.

2. Detect the exact match location within the set:
   2.1. The address of the first matched set (step 1.3) is provided to the SetRAM to fetch the entire set.
   2.2. Each pattern within the set is compared to MPatt.
   2.3. A priority encoder (intra-set PE) detects the first matching pattern.
The address of the first matching pattern produces the lower part of MAddr.

While detecting a match is completed by reading the STIRAM in one cycle, computing the exact match address requires another cycle to read the SetRAM. Hence, the match operation latency is two cycles. The match operation throughput is a single cycle, since both the STIRAM and the SetRAM are read concurrently.

Similar to the Brute-Force Transposed Indicators (BF-TI) writing mechanism (see Appendix A), writing to the 2D-HS-BCAM requires two cycles: one cycle for new pattern insertion, and a second cycle for old pattern deletion. However, before clearing the old data indicator in the STIRAM, the set should be checked for other occurrences of the old data. If another occurrence of the deleted pattern is found in the same set, the set indicator in the STIRAM should not be cleared. Figure 5.2 illustrates the additional circuitry required for writing the BCAM. In detail, the 2D-HS-BCAM write is performed as follows.

1. Cycle 1: STIRAM write; SetRAM read
   1.1. Write the STIRAM with WPatt (also called WData) and the higher ⌈log2(CD/SW)⌉ bits of WAddr to set the corresponding set indicator.
   1.2. Read the entire corresponding set from the SetRAM (addressed by the higher ⌈log2(CD/SW)⌉ bits of WAddr).
        1.2.1. The pattern-to-remove MUX selects the pattern that is being rewritten from the corresponding set. The selector is the lower ⌈log2 SW⌉ bits of WAddr.
        1.2.2. The occurrence indicators are a compare of the currently rewritten pattern with all the other patterns in the set, to detect other occurrences.
        1.2.3. The final masking stage masks the indicator of the currently rewritten pattern, since only other occurrences should be detected. All the indicators are OR'ed to detect any other occurrence.

2. Cycle 2: SetRAM write; STIRAM conditional erase
   2.1. The SetRAM is written with WPatt. Byte-enable is used to write only the corresponding pattern in the set.
   2.2.
If no other occurrences of the currently rewritten pattern are detected (stage 1.2.3 above), the STIRAM indicator for the replaced pattern and the current address is cleared.

The read and write operations described above rely upon an efficient implementation of a very wide priority encoder. The design we use is described in Appendix B.

To implement a BCAM with CD entries and PW pattern width, namely a CD×PW BCAM, the 2D-HS-BCAM approach requires ⌈CD/SW⌉ × (PW·SW) SRAM cells for the SetRAM and 2^PW × ⌈CD/SW⌉ SRAM cells for the STIRAM, a total of

    ⌈CD/SW⌉ × PW·SW + 2^PW × ⌈CD/SW⌉.    (5.4)

Assuming a maximum RAM width of R_{W,max}, an upper bound estimate for the number of BRAMs needed to construct the STIRAM is

    ⌈2^PW / R_{D,min}⌉ · ⌈CD/SW⌉ / R_{W,max}.    (5.5)

The SetRAM is a true-dual-port RAM. R_{W,byte} denotes the width of data controlled by a single byte-enable. A single BRAM accommodates R_{W,max}/R_{W,byte} individual byte-enable-controlled fields, and each pattern requires ⌈PW/R_{W,byte}⌉ of them. Finally, SW lines of this structure are required. Therefore, the number of BRAMs needed to construct the SetRAM is

    ⌈ SW · ⌈PW/R_{W,byte}⌉ / (R_{W,max}/R_{W,byte}) ⌉.    (5.6)

5.3 BCAM Bypassing

Writing a new pattern to a register-based BCAM takes effect on the triggering clock edge, and the pattern is ready to be matched immediately. In contrast, the BRAM-based BCAM methods, namely BF-TI and 2D-HS-BCAM, require two cycles for writing. Hence, matching a pattern that is being written in the same cycle will match old indicators, introducing a read-after-write hazard. However, some applications, e.g., caches and TLBs, require immediate matching of recently written patterns; hence, pattern bypassing is required.

Figure 5.4 shows the bypassing circuitry for both BF-TI and 2D-HS-BCAM. In both, the write address (WAddr) is one-hot encoded; the insertion mask (bitwise OR) forces '1' into the match indicators (MIndc) at location WAddr. Similarly, the removal mask (bitwise AND) forces '0' into the match indicators (MIndc) at location WAddr.
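The two-stage match and the two-cycle write procedure above can be captured in a small behavioral reference model. The Python sketch below is illustrative (the class and method names are ours, not from the released library; bypassing and cycle-accurate timing are not modeled); it models the STIRAM/SetRAM interaction, including the conditional STIRAM erase when another occurrence of the old pattern remains in the set:

```python
from math import ceil

class HS2DBCAM:
    """Behavioral sketch of the 2D-HS-BCAM match/write procedures."""

    def __init__(self, cd, pw, sw):
        self.sw = sw
        self.setram = [None] * cd                       # patterns, SW per set
        self.stiram = [[0] * ceil(cd / sw) for _ in range(2 ** pw)]

    def match(self, mpatt):
        """Stage 1: sets PE over the STIRAM row; stage 2: intra-set compare."""
        indc = self.stiram[mpatt]
        set_id = next((s for s, i in enumerate(indc) if i), None)
        if set_id is None:
            return None                                 # Match deasserted
        base = set_id * self.sw
        the_set = self.setram[base:base + self.sw]      # fetch the set from SetRAM
        return base + the_set.index(mpatt)              # MAddr = {set ID, offset}

    def write(self, waddr, wpatt):
        """Cycle 1: insert the new indicator and read the old set;
        cycle 2: write the pattern and conditionally erase the old indicator."""
        set_id, offset = divmod(waddr, self.sw)
        base = set_id * self.sw
        old = self.setram[waddr]                        # pattern-to-remove MUX
        self.stiram[wpatt][set_id] = 1                  # cycle 1: STIRAM insert
        others = [p for a, p in enumerate(self.setram[base:base + self.sw])
                  if a != offset]                       # masked occurrence indicators
        self.setram[waddr] = wpatt                      # cycle 2: byte-enabled write
        if old is not None and old != wpatt and old not in others:
            self.stiram[old][set_id] = 0                # conditional STIRAM erase

cam = HS2DBCAM(cd=8, pw=2, sw=2)
cam.write(5, 0b10)
cam.write(4, 0b11)
cam.write(5, 0b11)          # overwrites 0b10; no other occurrence, so its indicator clears
print(cam.match(0b11))      # 4 (first matching location)
print(cam.match(0b10))      # None
```

The `old != wpatt` guard is our addition to keep the model consistent when an address is rewritten with the same pattern; the hardware hazard cases are handled by the bypassing logic of Section 5.3.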
In the TIRAM approach, if the match pattern (MPatt) is equivalent to the written pattern (WPatt), the insertion mask output is passed to the match indicators output (MIndc); otherwise, the removal mask output is passed. In the 2D-HS-BCAM approach, there is another case: the removal mask output is allowed to be passed only if MPatt equals the removed pattern (RmPatt) and there are no other occurrences of the removed pattern in the same set (negated MultiPatt).

Figure 5.4: Bypassing logic for (top) the Brute-Force Transposed Indicators (BF-TI) approach and (bottom) the proposed 2D-HS-BCAM approach.

5.4 Comparison and Discussion

The BCAM storage efficiency µs is first introduced in this dissertation, and is defined as the SRAM cell utilization of a specific BCAM implementation. In other words, µs is the ratio between the total BCAM bits and the total SRAM cells used to implement the BCAM. Using Equation 2.11 and Equation 2.13, µs for the Brute-Force Transposed Indicators (BF-TI) is estimated as follows.

    µs(BF-TI) ≈ 1 / (1 + 2^PW / PW)                      (uncascaded)
                1 / (1 + R_{D,min} / log2(R_{D,min}))    (pattern-cascaded)    (5.7)

Equation 5.7 shows that the storage efficiency of the uncascaded BF-TI implementation is inversely proportional to the exponent of PW; hence, it decays rapidly as PW increases, as shown in Figure 5.5.

The efficiency of the pattern-width-cascaded Brute-Force Transposed Indicators (BF-TI) is unrelated to PW; hence, the efficiency is not affected when PW increases. However, the efficiency is affected by an intrinsic BRAM characteristic, namely, the minimal depth R_{D,min} (associated with the maximum width).
The efficiency is inversely proportional to R_{D,min}; hence, shallow and wide BRAMs with a smaller R_{D,min} (and a larger R_{W,max}) will exhibit higher storage efficiency.

Using Equation 5.4, µs for the 2D-HS-BCAM is estimated as follows.

    µs(2D-HS) ≈ 1 / (1 + 2^PW / (SW·PW))    (5.8)

Similar to the uncascaded BF-TI, the efficiency of the proposed 2-Dimensional Hierarchical Search BCAM (2D-HS-BCAM) approach is inversely proportional to the exponent of PW. However, the exponential relation is mitigated by the set width SW, an external and user-driven parameter.

For example, with SW = 4K, R_{D,min} = 512 (as in Altera's M20K), and a pattern width of PW = 12, the storage efficiency of the proposed 2D-HS-BCAM is µs = 0.923, while the pattern-width-cascaded BF-TI approach provides µs = 0.017. The 2D-HS-BCAM approach provides the highest efficiency up to PW = 22, compared to the BF-TI. To generalize, 2D-HS-BCAM is superior to the BF-TI method up to

    PW = −W₋₁(−ln2/a) / ln2,  a = SW · R_{D,min} / log2(R_{D,min}),    (5.9)

where W₋₁ is the lower branch of the Lambert product logarithm (also called the omega function).

Practically, the set width SW is limited by the minimal depth of the block RAM used in the SetRAM. The SetRAM depth is ⌈CD/SW⌉ and is limited by R_{D,min}; hence, to achieve maximum storage efficiency, SW is bounded by

    SW ≤ CD / R_{D,min}.    (5.10)

For deep memories, SW can be set to higher values, allowing higher storage efficiency. Furthermore, providing shallow and wide BRAMs, namely a lower R_{D,min}, will allow higher SW values, hence a higher storage efficiency.

To reduce ALM consumption, it is recommended to divide the priority encoders in the 2D-HS-BCAM design equally, hence SW ≈ √CD. Each priority encoder will then be of width √CD.
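These efficiency formulas are easy to check numerically. The Python sketch below (the helper names are ours) reproduces the µs = 0.923 versus µs = 0.017 example and recovers the PW = 22 crossover of Equation 5.9 by direct search rather than through the Lambert W function:

```python
from math import log2

def mu_bf_ti(p_w, r_dmin=512, cascaded=True):
    """Storage efficiency of brute-force transposed indicators (Equation 5.7)."""
    return 1 / (1 + r_dmin / log2(r_dmin)) if cascaded else 1 / (1 + 2 ** p_w / p_w)

def mu_2dhs(p_w, s_w):
    """Storage efficiency of the 2D-HS-BCAM (Equation 5.8)."""
    return 1 / (1 + 2 ** p_w / (s_w * p_w))

# The example above: SW = 4K, RD,min = 512 (M20K), PW = 12.
print(round(mu_2dhs(12, 4096), 3))   # 0.923
print(round(mu_bf_ti(12), 3))        # 0.017

# Equation 5.9 crossover, found by direct search instead of Lambert W:
a = 4096 * 512 / log2(512)
pw_th = max(pw for pw in range(2, 64) if 2 ** pw / pw <= a)
print(pw_th)                         # 22
```

The search simply finds the largest PW for which 2^PW/PW ≤ a, which is exactly the inequality that Equation 5.9 solves in closed form.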
Furthermore, dividing the priority encoders equally will balance the priority-encoder depths, hence increasing Fmax and reducing the maximum pipe stages in the pipelined design.

Figure 5.5: Storage efficiency (µs) as a function of pattern width (PW) for the uncascaded BF-TI and the 2D-HS-BCAM.

5.5 Experimental Results

To verify and simulate the suggested approach and compare it to standard techniques, fully parameterized Verilog modules have been developed. Register-based, pattern-width-cascaded and uncascaded brute-force TIRAM, and the proposed 2D-HS-BCAM methods have been implemented. To simulate and synthesize these designs with various parameters in batch using Altera's ModelSim and Quartus II, a run-in-batch flow manager has also been developed. The Verilog modules and the flow manager are available online [114].

To verify correctness, the proposed architecture is simulated using Altera's ModelSim. A large variety of different BCAM architectures and parameters, e.g., bypassing, depth, pattern width, and set width, are swept and simulated in batch, each with over one million random cycles. All BCAM design modules were implemented using Altera's Quartus II on Altera's Stratix V 5SGXMA7N1F45C1 device [55]. This is a high-performance device with 235k ALMs and 2560 M20Ks.

Figure 5.6 and Figure 5.7 plot feasible BCAM depth and pattern width sweeps implemented on Altera's Stratix V device. Within the device limitation, the proposed 2D-HS-BCAM approach is able to reach 4M CAM entries, while the BF-TI and the register-based BCAMs cannot exceed 64K and 32K entries, respectively. The number of Altera's M20K blocks used to implement each BCAM configuration is plotted in Figure 5.6 (bottom).
Even for shallow memories of 64K and 32K entries, the proposed approach demonstrates lower BRAM consumption compared to the other methods. The columns in Figure 5.6 show the BRAM consumption of the two RAM structures that compose the 2D-HS-BCAM: the SetRAM and the STIRAM. The SetRAM stores the data patterns; hence, if the SetRAM BRAM consumption dominates the STIRAM consumption, the storage efficiency will be higher.

The proposed 2D-HS-BCAM method exhibits a significantly lower ALM count and a higher Fmax due to splitting the priority encoder, as shown in Figure 5.6 (middle and top). The register-based BCAM consumes the most ALMs due to massive register usage. To achieve high Fmax, all testcases are fully pipelined. Pipelining and BRAM access latency increase the overall system latency in both the traditional and the proposed approaches. The latency overhead is reported in Figure 5.7 (bottom). The brute-force TIRAM approach has a latency lower by at most one cycle in some shallow testcases; the latency of both approaches in these shallow testcases is 9 or 10 cycles. The growth of latency as depth increases is logarithmic, since the PE depth is logarithmic. The latency of the deepest 4M-entry 2D-HS-BCAM is 13 cycles.

Figure 5.7 (lower middle) plots the optimal set width. As the depth increases, the optimal set width grows. The same holds for pattern width: if the pattern width increases, the optimal set width increases to counter the increase in STIRAM BRAM consumption for wide patterns.

The storage efficiency is plotted in Figure 5.7 (upper middle). The 2D-HS-BCAM overtakes the brute-force TIRAM in the 32K and 64K CAMs. Furthermore, storage efficiency is inversely related to pattern width, as expected from the 2D-HS-BCAM analysis. As the 2D-HS-BCAM grows deeper, the efficiency increases. The 4M×9 BCAM testcase demonstrates a high storage efficiency of 0.9.

Figure 5.7 (top) plots the full Quartus II flow runtime. 2D-HS-BCAM synthesis is faster than that of register-based or BF-TI BCAMs.
2D-HS-BCAM synthesis runs up to 3 hours for its largest, 4M-entry testcase.

Figure 5.6: Results for several BCAM depth and pattern width sweeps: (top) Fmax, (middle) ALM count, (bottom) M20K count.

Figure 5.7: Additional results for several BCAM depth and pattern width sweeps: (top) runtime, (upper middle) storage efficiency, (lower middle) set width, (bottom) match latency.

5.6 Conclusions

In this chapter, a novel BCAM architecture for FPGAs is proposed. The approach is fully BRAM-based and employs hierarchical search to reduce BRAM consumption. While traditional brute-force approaches keep a pattern match indicator for each single address, the proposed approach maintains a single pattern match indicator for each address set. The suggested method is capable of implementing a 4M-entry CAM and significantly improves area and performance. In contrast, traditional methods cannot exceed 64K entries in depth. For the 64K-entry testcase, the traditional method consumes 43 times more ALMs, requires 18 times longer mapping runtime, and achieves only one-third of the Fmax of the proposed method.
The suggested 2D-HS-BCAM design completely dominates all past designs in BRAM and ALM consumption, Fmax, and runtime.

A fully parameterized Verilog implementation of the suggested methods is provided as open source hardware [114].

Chapter 6
Indirectly Indexed Hierarchical Search BCAMs (II-HS-BCAMs)

In this chapter, a novel, efficient, and modular technique for constructing BCAMs out of standard SRAM blocks in FPGAs is proposed. Hierarchical search is employed to achieve high storage efficiency. While pattern width cascading requires producing the match indicator of every single address in every cascaded stage, the previous 2-Dimensional Hierarchical Search BCAM (2D-HS-BCAM) approach in Chapter 5 provides only a single matching address. Hence, the 2D-HS-BCAM approach cannot be cascaded in pattern width; this incurs an exponential increase of RAM consumption as the pattern width increases. The Indirectly Indexed Hierarchical Search BCAM (II-HS-BCAM) approach, however, efficiently regenerates a match indicator for every single address by storing indirect indices for the address match indicators. Hence, the proposed
Our approach, however, is superior to the2D-HS-BCAM approach since it efficiently regenerates match indicators for everysingle address by storing indirect indices for address match indicators. Thus, unlikehierarchical search, the proposed method can support wide patterns by patternwidth cascading; this transforms exponential RAM growth into linear.The proposed method is device-independent; hence, it can be applied to anyFPGA device containing standard dual-ported BRAMs. The proposed approachdramatically improves CAM area efficiency compared to conventional methods.In contrast to algorithmic approaches (e.g., hashes and tries) or other BCAMtechniques that require several nondeterministic cycles to write or match [111–113], our approach is high-throughput and can perform a pattern read (match) everycycle and a pattern write every two cycles.146Major contributions of this chapter are:• A novel highly efficient BCAM architecture. Compared to other BCAMapproaches, the proposed technique provides up to nine times wider BCAMs.To the authors’ best knowledge, research and patent literature do not havesimilar BCAM techniques.• A parameterized Verilog implementation of our method, together with otherapproaches. A flow manager to simulate and synthesize designs with variousparameters in batch using Altera’s ModelSim and Quartus II is also provided.The Verilog modules and the flow manager are available online [115].To verify correctness, the proposed BCAM architecture is fully implemented inVerilog, simulated using Altera’s ModelSim, and compiled using Quartus II [57].A large variety of BCAM architectures and parameters, e.g., BCAM depth andpattern width are simulated in batch, each with over one million random BCAMwrite and match cycles. 
Stratix V, Altera’s high-end FPGA, is used to implementand compare the proposed architecture with previous approaches.The rest of this chapter describes in detail the 2-Dimensional HierarchicalSearch BCAM (2D-HS-BCAM) approach and is organized as follows. The moti-vation and key idea for this work are explained in Section 6.2. The design methodis described in Section 6.3. Section 6.4 describes a device-specific instance forAltera’s Stratix device family. Discussion of the suggested method and comparisonto previous techniques are provided in Section 6.5. The experimental framework,simulation and synthesis results, are discussed in Section 6.6. Conclusions are147drawn in Section 6.7.6.2 Motivation and Key IdeaAs shown in Chapter 5, the previous 2-Dimensional Hierarchical Search BCAMs(2D-HS-BCAMs) cannot be cascaded in pattern width. Hence, the required mem-ory size grows exponentially with pattern width. This limitation is crucial sincethe vast majority of applications require wide patterns. To support BCAM patternwidth cascading, all match indicators for every single BCAM address shall begenerated, as in the brute-force approach (Section 2.3.2). However, storing allmatch indicators requires wide RAM and incurs high memory overhead.Our proposed Indirectly Indexed Hierarchical Search BCAM (II-HS-BCAM)is based on 2-Dimensional Hierarchical Search BCAMs (2D-HS-BCAMs) andutilizes the sparsity of the Set Transposed Indicators RAM (STIRAM) to storeonly the required match indicators. It is able to regenerate all match indicators forevery BCAM address, allowing it to be cascaded in pattern width. The followingtheorem is the basic keystone of our technique.Lemma 6.2.1 (TIRAM column match indicators bound). The number of binary‘1’ (matches) in each RAM column (CAM location) of the TIRAM matrix fromEquation 2.10, is exactly 1, namely∀a ∈ A : ∑p∈PIp,a = 1. (6.1)Proof. Similar to RAM, each address of a BCAM contains one and only one valid148pattern. 
A pattern p′ located at address a′ means I_{p′,a′} = 1 and I_{p,a′} = 0 for every p ≠ p′. Hence, the corresponding column of address a′ in the TIRAM matrix has a binary '1' only in I_{p′,a′} and zeros for all the other patterns.

Theorem 6.2.2 (STIRAM column match indicators bound). The number of binary '1' (matches) in each column (set) of the STIRAM matrix from Equation 5.3 is limited to the set width SW, namely,

    ∀s ∈ S : Σ_{p∈P} I^{SW}_{p,s} ≤ SW.    (6.2)

Proof. A match indicator of a set is defined in Equation 5.2 and Equation 5.3 and indicates whether any of the addresses in the set has a match. Given a set s′ of SW addresses, Equation 5.2 and Equation 5.3 provide I^{SW}_{p,s′} = ⋁_{a = SW·s′}^{SW·(s′+1)−1} I_{p,a}. Lemma 6.2.1 ensures that each address a ∈ {SW·s′, ..., SW·(s′+1)−1} has one and only one p such that I_{p,a} = 1. Since the set s′ has SW addresses, exactly SW different I_{p,a} = 1 exist. Hence, in the entire set, a maximum of SW patterns have a match (fewer than SW in the case where two or more addresses hold the same pattern).

The significance of Theorem 6.2.2 lies in the measure it provides of the STIRAM matrix sparsity: it gives an upper bound on the number of binary '1' (matches) for a set (a column in the STIRAM). Instead of storing match indicators for every address and pattern pair as in the brute-force approach (Figure 6.1 (left)), or set indicators as in the hierarchical search approach (Figure 6.1 (middle)), we store all address indicators only for sets with a match, as they are limited to SW.

Figure 6.1: Indicators arrangement for three different approaches.
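Theorem 6.2.2 can also be checked empirically: for any BCAM content, the number of distinct patterns asserting a set's indicator can never exceed SW, since a set has only SW addresses and each address holds exactly one pattern (Lemma 6.2.1). A quick Python sketch (illustrative helper name, random toy content):

```python
from math import ceil
from random import randint, seed

def stiram_column_weights(ram, s_w):
    """Per-set count of distinct patterns asserting the set indicator,
    i.e., the STIRAM column sums bounded by Theorem 6.2.2."""
    sets = [set() for _ in range(ceil(len(ram) / s_w))]
    for addr, patt in enumerate(ram):
        sets[addr // s_w].add(patt)     # one valid pattern per address (Lemma 6.2.1)
    return [len(s) for s in sets]

seed(0)
SW = 32
content = [randint(0, 2 ** 9 - 1) for _ in range(4096)]   # random 4K-deep, 9-bit BCAM
assert all(w <= SW for w in stiram_column_weights(content, SW))
```

Duplicated patterns within a set lower the column weight below SW, which is exactly the slack the II-HS-BCAM exploits.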
To reduce memory consumption, the address match indicators are saved in an auxiliary structure, while the original STIRAM holds indices into the auxiliary structure (Figure 6.1 (right)).

6.3 Design and Functionality

As depicted in Figure 6.2, a single stage of the proposed II-HS-BCAM consists of three parts. First, the match RAM, where the set indices and match indicators are stored; this is the most memory-consuming structure. Second, the status RAM, where the system status is stored; it feeds the control logic with the system status. Finally, the control and steering logic, which generates control signals to control the entire structure (based on the system status), feeds the match RAM with indicators and indices, and updates the status RAM.

1. Match RAM: As described in the previous subsection, instead of storing
Hence, rows with different indices can be fetchedfrom each set, depending on the corresponding index from the STIRAM.Figure 6.2 (bottom) shows an example where pattern ‘4’ is matched. TheSTIRAM reveals one index in set1 pointing to address ‘2’ of the IndRAM(shaded field in the STIRAM). Reading address ‘2’ from the IndRAM in set1(shaded field in the IndRAM) shows a single match in offset address ‘2’ of151set1 (circled indicator in the IndRAM).2. Status RAM: The status RAM is updated at each write to reflect the systemstatus and consists of three RAM structures as follows.(a) Sets RAM (SetRAM): Similar to the hierarchical search method, thisRAM holds all CAM patterns, with each set’s patterns packed in onerow that can be fetched in a single cycle. Its size is therefore⌈CDSW⌉×(PW ·SW ). At each write, the SetRAM will be updated with the newpattern.(b) Indices RAM (IdxRAM): The Indices RAM stores the index of eachpattern in the BCAM, arranged similar to the SetRAM, namely eachset’s indices in one row. Its size is⌈CDSW⌉×(⌈log2 SW⌉ ·SW)At each write, the IdxRAM will be updated with the index that havebeen assigned to the new pattern. For instance, Figure 6.2 (bottom)shows an example where pattern ‘4’ is located in the SetRAM in set1,the corresponding field in IdxRAM shows index ‘2’; the correspondingindex in STIRAM.The IdxRAM is mainly used when two identical patterns are located inthe same set. In this case, the same index should be assigned these twoidentical patterns. Hence, the index of the old pattern will be read fromthe IdxRAM and assign back to the new (and identical) pattern.(c) Vacancy RAM (VacRAM): The Vacancy RAM indicates for each rowof the IndRAM whether it is vacant or holds valid indicators. The152IndRAM consists of⌈CDSW⌉RAM blocks (for each set), each with SWrows. 
   The VacRAM holds the status of each IndRAM block in one row; hence, its depth is ⌈CD/SW⌉. It marks the validity of all the SW IndRAM rows; hence, its width is SW.

   At each write, if a new index needs to be assigned to the new pattern, the VacRAM is used to detect the first available row of the corresponding set in the IndRAM. The VacRAM is then updated to indicate that this IndRAM row has been taken and is no longer vacant.

   For instance, the example in Figure 6.2 (bottom) shows that the corresponding field in the VacRAM for the recently written pattern '4' holds a binary '1' (shaded field in the VacRAM), indicating that the corresponding row in the IndRAM has been taken.

3. Control and steering logic: Based on the current system status and the required write pattern and write address, it generates write indicators for the IndRAM and indices for the IndRAM and the STIRAM. Furthermore, it updates the IdxRAM and VacRAM status.

Table 6.1: BRAM usage of a single^a stage of the II2D-BCAM.

Structure   Blocks#     Depth       Width
STIRAM      1           2^PW        ⌈CD/SW⌉ · ⌈log2 SW⌉
IndRAM      ⌈CD/SW⌉     SW          SW
SetRAM      1           ⌈CD/SW⌉     PW·SW
IdxRAM      1           ⌈CD/SW⌉     ⌈log2 SW⌉ · SW
VacRAM      1           ⌈CD/SW⌉     SW

^a For a pattern-width-cascaded II2D-BCAM, the cascading method from Section 2.3.3
is employed; P_W = P_W,opt = ⌊log₂ R_D,min⌋ is used for each stage, while n_c = ⌈P_W,total / P_W,opt⌉ stages are required.

Figure 6.2: II2D-BCAM single stage: (top) high-level architecture; (bottom) an 8×3, S_W = 4 example; pattern '4' is highlighted with all related RAM content.

6.4 Feasibility on Altera's Stratix Devices

An area-efficient implementation of the proposed II2D-BCAMs requires heterogeneous BRAMs on the FPGA. The relatively small BRAMs are used to construct the IndRAM and store match indicators. A perfect candidate is the LUT configuration RAM, known as LUT RAM, which is supported by both Altera and Xilinx devices. On the other hand, the relatively large BRAMs are used to construct the other structures; M20K BRAMs are used for this purpose.

Altera's Stratix V memory architecture provides a 640-bit LUT RAM memory called the MLAB (Memory Logic Array Block) as well as 20Kbit BRAMs called M20Ks. Both MLABs and M20Ks are used in their shallowest/widest configuration modes. Hence, the MLABs are 32×20, and the M20Ks are 512×40. Since the depth of the MLABs is 32, and they are used to implement the IndRAM, we set S_W = 32. With an index width of ⌈log₂ S_W⌉ = 5, a single M20K line can store up to 8 indices.
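The RAM dimensions of Table 6.1 and the index packing above can be sanity-checked with a short Python sketch. This is our own illustration — the function name, parameter names, and example values are not from the thesis:

```python
import math

def ii2d_bcam_rams(cam_depth, set_width, pattern_width):
    """Return {name: (blocks, depth, width)} per Table 6.1 (single stage)."""
    sets = math.ceil(cam_depth / set_width)      # ceil(C_D / S_W)
    idx_w = math.ceil(math.log2(set_width))      # ceil(log2 S_W), index width
    return {
        "STIRAM": (1, 2 ** pattern_width, sets * idx_w),
        "IndRAM": (sets, set_width, set_width),
        "SetRAM": (1, sets, pattern_width * set_width),
        "IdxRAM": (1, sets, idx_w * set_width),
        "VacRAM": (1, sets, set_width),
    }

# Stratix V-style parameters from Section 6.4: S_W = 32 (MLAB depth),
# so indices are 5 bits wide and a 40-bit M20K line packs 40 // 5 = 8 of them.
rams = ii2d_bcam_rams(cam_depth=1024, set_width=32, pattern_width=9)
print(rams["IndRAM"])   # (32, 32, 32): one 32x32 MLAB-sized block per set
print(40 // 5)          # 8 indices per M20K line
```

For these example parameters each of the 32 sets gets one 32×32 IndRAM block, and the 5-bit indices pack eight per 40-bit M20K line, consistent with the text above.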
To write a single 5-bit index into the STIRAM, the M20K's mixed-width port is used to write a single 5-bit field [55].

6.5 Comparison and Discussion

Table 6.2 shows storage efficiency estimates for different BCAM architectures, derived from Equation 2.11, Equation 2.13, Equation 5.4 and Table 6.1. The storage efficiency of the uncascaded (in pattern width) Brute-Force Transposed Indicators (BF-TI) implementation is inversely proportional to the exponent of P_W; hence it decays rapidly as P_W increases, as shown in Figure 5.5.

The efficiency of BF-TI when cascading the pattern width is independent of the value of P_W.

Table 6.2: Storage efficiency µs (inverse shown).

                           | Uncascaded                            | Pattern Width Cascaded
Brute-force TI RAM (BF-TI) | µs(BFTI)⁻¹ ≈ 1 + 2^P_W / P_W          | µs(BFTI)⁻¹ ≈ 1 + R_D,min / log₂ R_D,min
Hierarchical Search (HS)   | µs(2DHS)⁻¹ ≈ 1 + 2^P_W / (P_W · S_W)  | µs(IIHS)⁻¹ ≈ 1 + (1 + S_W + log₂ S_W · (1 + R_D,min / S_W)) / log₂ R_D,min

However, the efficiency is affected by an intrinsic BRAM characteristic, the minimal depth R_D,min (associated with the maximum width). As shown in Figure 6.3 (top), the efficiency is inversely proportional to R_D,min; hence, shallow and wide BRAMs with a smaller R_D,min (and a larger R_W,max) exhibit higher storage efficiency.

Similar to the uncascaded BF-TI, the efficiency of the 2-Dimensional Hierarchical Search BCAM (2D-HS-BCAM) is inversely proportional to the exponent of P_W. However, the exponential relation is mitigated by the set width S_W, an external and user-driven parameter. On the other hand, the efficiency of our II-HS-BCAM approach does not depend on the pattern width P_W, but improves with S_W.

The 2D-HS-BCAM is more efficient than the II-HS-BCAM for narrow patterns, namely, for patterns narrower than a specific pattern width threshold P_W,th, as follows:

µs(2DHS) ≥ µs(IIHS) ⟺ P_W ≤ P_W,th. 
(6.3)

To solve the left side of the inequality, the storage efficiencies of the 2D-HS-BCAM and the II-HS-BCAM from Table 6.2 are used; hence,

µs(2DHS) ≥ µs(IIHS) ⟺ µs(2DHS)⁻¹ ≤ µs(IIHS)⁻¹

(substituting the storage efficiencies from Table 6.2, where the 2D-HS-BCAM has a set width of S_W,2D and the II-HS-BCAM has a set width of S_W,II)

⟺ 1 + 2^P_W / (P_W · S_W,2D) ≤ 1 + (1 + S_W,II + log₂ S_W,II · (1 + R_D,min / S_W,II)) / log₂ R_D,min

(subtract 1)

⟺ 2^P_W / (P_W · S_W,2D) ≤ (1 + S_W,II + log₂ S_W,II · (1 + R_D,min / S_W,II)) / log₂ R_D,min

(multiply by P_W · S_W,2D)

⟺ 2^P_W ≤ P_W · S_W,2D · (1 + S_W,II + log₂ S_W,II · (1 + R_D,min / S_W,II)) / log₂ R_D,min

(isolate constants)

⟺ 2^P_W ≤ a · P_W,  a = S_W,2D · (1 + S_W,II + log₂ S_W,II · (1 + R_D,min / S_W,II)) / log₂ R_D,min

(an exponential inequality of the form b^x ≤ a·x can be solved with the Lambert Product Logarithm (omega) function, where W₋₁ denotes its lower branch)

⟺ P_W ≤ −W₋₁(−ln 2 / a) / ln 2,  a = S_W,2D · (1 + S_W,II + log₂ S_W,II · (1 + R_D,min / S_W,II)) / log₂ R_D,min.  (6.4)

From Equation 6.4 and the right side of Equation 6.3, the 2D-HS-BCAM is more efficient than the II-HS-BCAM for patterns narrower than the threshold

P_W,th = −W₋₁(−ln 2 / a) / ln 2,  a = S_W,2D · (1 + S_W,II + log₂ S_W,II · (1 + R_D,min / S_W,II)) / log₂ R_D,min.  (6.5)

For the 2D-HS-BCAM approach, the set width S_W is limited by the minimal width of the SetRAM it is stored in. The SetRAM depth is ⌈C_D/S_W⌉ and is limited by R_D,min. Hence, to achieve maximum efficiency, S_W is bounded by

S_W ≤ C_D / R_D,min.  (6.6)

On the other hand, the set width S_W for our II-HS-BCAM is bounded by internal RAM parameters, specifically the depth of the MLAB (LUT RAM) and the BRAM's allowable write port width in mixed-width mode. As described in Section 6.4, Altera's Stratix V MLAB depth (for the widest configuration) is 32, and the mixed-width mode of the M20K supports a native width of log₂(32) = 5 for write data, so S_W = 32 is a suitable set width. This restriction is in agreement with Figure 6.3 (bottom), where the storage efficiency µs is given as a function of the set width S_W.
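The closed form in Equation 6.5 uses the Lambert W₋₁ branch, but since 2^P_W eventually outgrows a·P_W, the same threshold can be found by direct search. The sketch below is our own illustration; the function and parameter names are ours, not the thesis's:

```python
import math

def pw_threshold(rd_min, sw_ii, sw_2d):
    """Largest integer P_W satisfying 2^P_W <= a * P_W (Equations 6.4/6.5),
    found by direct search instead of the Lambert W_-1 closed form."""
    a = sw_2d * (1 + sw_ii + math.log2(sw_ii) * (1 + rd_min / sw_ii)) \
        / math.log2(rd_min)
    pw = 1
    while 2 ** (pw + 1) <= a * (pw + 1):   # inequality still holds one bit wider?
        pw += 1
    return pw

# Typical Stratix V parameters: R_D,min = 512, S_W,II = 32, S_W,2D = 4096.
print(pw_threshold(rd_min=512, sw_ii=32, sw_2d=4096))  # 20
```

Direct search avoids a SciPy dependency for `lambertw` and makes the integer threshold explicit; a larger 2D set width S_W,2D raises the threshold, as expected from Equation 6.5.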
Figure 6.3 (bottom) shows that the storage efficiency decreases for narrow sets, since the STIRAM portion of the II-HS-BCAM is not efficiently compressed using narrow sets. On the other hand, the storage efficiency decreases for wide sets due to the growth of the IndRAM portion.

Applying the typical parameters of Altera's Stratix V M20K BRAM and MLAB LUT RAM, i.e., R_D,min = 512, S_W,II = 32, and S_W,2D = 4k, to Equation 6.5 shows that the previous 2D-HS-BCAM is more efficient than the II-HS-BCAM only up to P_W = 20 bits. For these parameter settings, the storage efficiency of the II-HS-BCAM is µs = 0.08. A sweep of different values is shown in Figure 6.3.

Figure 6.3: Storage efficiency µs as a function of (top) the BRAM shallowest depth (R_D,min) and (bottom) the set width (S_W) for the 2D-HS-BCAM and the pattern-width-cascaded BF-TI.

6.6 Experimental Results

To verify and simulate the suggested II-HS-BCAM approach and compare it to standard and previous techniques, fully parameterized Verilog modules have been developed. Register-based, pattern-width cascaded and uncascaded Brute-Force Transposed Indicators (BF-TI), hierarchical search BCAM, and the proposed II-HS-BCAM methods have been implemented. A run-in-batch flow manager has also been developed to simulate and synthesize these designs with various parameters in batch using Altera's ModelSim and Quartus II. The Verilog modules and the flow manager are available online [115].

To verify correctness, the proposed architecture is simulated using Altera's ModelSim.
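The behavior being verified can be captured in a few lines of software. The following Python sketch is our own illustrative model, not the thesis's Verilog: the STIRAM maps a pattern to one IndRAM row index per set, and the selected IndRAM rows are concatenated into per-address match indicators. `None` stands in for an invalid index, which the hardware tracks via the VacRAM instead.

```python
SW = 4  # set width, as in the 8x3 example of Figure 6.2

def match(pattern, stiram, indram):
    """Return per-address match bits: one IndRAM row is selected per set
    via the STIRAM index, then rows are concatenated across sets."""
    bits = []
    for s, idx in enumerate(stiram[pattern]):        # one row index per set
        row = indram[s][idx] if idx is not None else [0] * SW
        bits.extend(row)                             # S_W indicators per set
    return bits

# Tiny 8-address, 2-set instance: pattern 4 lives only at address 6
# (set 1, offset 2), and set 1's STIRAM entry points at IndRAM row 0.
stiram = {4: [None, 0]}                  # set 0: no valid row; set 1: row 0
indram = [[[0] * SW] * SW,               # set 0: all rows empty
          [[0, 0, 1, 0]] + [[0] * SW] * 3]  # set 1, row 0: offset 2 set
print(match(4, stiram, indram))          # [0, 0, 0, 0, 0, 0, 1, 0]
```

The single '1' at position 6 (set · S_W + offset = 1·4 + 2) reproduces the single-match case of the Figure 6.2 walkthrough.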
A large variety of BCAM architectures and parameters, e.g., BCAM depth and pattern width, are swept and simulated in batch, each with over a million random cycles. All BCAM design modules were implemented using Altera's Quartus II on Altera's Stratix V 5SGXMABN1F45C2 device. This is a speed grade 2 device with 360k ALMs and 2640 M20Ks. Half of the ALMs can be used to construct MLABs, while a single MLAB consists of 10 ALMs.

Figure 6.4 plots feasible BCAM depth and pattern width sweeps (both P_W and C_D are on the x-axis). Within the device limits, the proposed II-HS-BCAM approach is able to reach a 153-bit pattern width, while the other approaches cannot exceed 45 bits for 16K entries. The number of Altera's M20K blocks used to implement each BCAM configuration is plotted in Figure 6.4 (bottom). The BF-TI and our II-HS-BCAM exhibit linear growth of M20K consumption as pattern width increases, since both methods are pattern-width cascaded. However, the II-HS-BCAM growth rate is lower since addresses are grouped into sets. On the other hand, the 2D-HS-BCAM suffers from exponential growth, hence it cannot exceed a very narrow pattern width of 20 bits.

As shown in Figure 6.4 (middle), the proposed II-HS-BCAM method and the BF-TI also exhibit linear ALM count growth as pattern width increases. The priority encoder in the 2D-HS-BCAM approach is split in two, hence its ALM count is lower. Furthermore, our II-HS-BCAM approach uses ALMs as MLABs, hence it has higher ALM consumption. The register-based BCAM consumes the most ALMs due to massive register usage.

Figure 6.4 (top) plots the Fmax of all BCAM architectures. The 2D-HS-BCAM exhibits the highest Fmax for very narrow patterns, where it is feasible. However, Fmax drops dramatically as pattern width increases due to a massive increase in M20K usage. On the other hand, the II-HS-BCAM performs better than the BF-TI and register-based BCAM for shallow memories.
As can be seen, Fmax decreases mildly as pattern width increases due to pattern width cascading.

Similar to the BF-TI and 2D-HS-BCAM, the II-HS-BCAM is capable of matching a pattern every cycle and writing a pattern every two cycles. Pipelining is employed to increase Fmax, which adversely increases latency. The longest combinational path goes through the output priority encoder and is pipelined at every stage. For the device-specific PE described in Appendix B, ⌈log₄⌈C_D/S_W⌉⌉ pipeline stages are required. The 2D-HS-BCAM benefits from a higher S_W, hence it can exhibit lower latency. On the other hand, sets are not used in the BF-TI, hence its match latency is ⌈log₄ C_D⌉.

Figure 6.4: Results for several BCAM depth and pattern width sweeps: (bottom) M20K count, (middle) ALM count, (top) Fmax at T = 0°C.

6.7 Conclusions

In this chapter, a novel BCAM architecture for FPGAs is proposed. The approach is fully BRAM-based and employs hierarchical search to reduce BRAM consumption. While traditional brute-force approaches keep a pattern match indicator for every single address, the proposed approach groups addresses into sets and maintains a single pattern match indicator for every set. The hierarchical search BCAMs from Chapter 5 cannot be cascaded in pattern width since they provide a single matching address; this incurs an exponential increase in RAM consumption as pattern width increases. On the other hand, the approach in this chapter efficiently regenerates a match indicator for every single address by storing indirect indices to the address match indicators. Hence, the proposed method can be cascaded in pattern width and the exponential growth is reduced to linear. 
The storage efficiency of our approach is five times that of the brute-force approach. Furthermore, our technique supports up to nine times wider patterns compared to the 2D-HS-BCAM approach in Chapter 5. Compared to brute-force techniques, this method requires at most 18% of the RAM storage, while enhancing clock speed by 45% on average.

A fully parameterized Verilog implementation of the suggested methods is provided as open-source hardware [115].

Chapter 7

Conclusions

In this chapter, we summarize the results of this dissertation, which show that the proposed I-LVT dominates and should replace the existing LVT and XOR methods for MPRAMs. Also, key results and contributions for CAMs are discussed. We conclude with suggestions for future work.

7.1 Dissertation Summary

This dissertation attempts to resolve the memory bottleneck of massively parallel reconfigurable systems by providing efficient, parallel and customizable embedded memory structures. Although concurrent multi-ported memories are important, their high implementation cost means they are used sparingly. As a result, FPGA vendors provide only standard dual-ported memories to handle the majority of usage patterns. This dissertation describes a novel, efficient and modular approach to construct multi-ported memories out of basic dual-ported Block-RAMs.

Another massively parallel memory structure investigated in this dissertation is the content-addressable memory (CAM), a hardware implementation of associative arrays. Despite their importance, FPGAs lack an area-efficient CAM implementation. This dissertation proposes new methods to construct SRAM-based, efficient and modular binary CAMs (BCAMs).

In Chapter 3, an invalidation-based live-value-table, or I-LVT, is used to build modular SRAM-based multi-ported memories. The I-LVT determines the most recently written data bank. It generalizes and replaces two prior techniques, the LVT and XOR-based approaches.
A general I-LVT is described, along with two specific implementations: binary-coded and thermometer-coded. Both methods are purely SRAM-based, so they scale well with memory depth. The original LVT approach can use an infeasible number of registers. In contrast, the I-LVT register usage is not directly proportional to memory depth; hence it requires orders of magnitude fewer registers. Furthermore, the proposed I-LVT method can reduce BRAM consumption by up to 44% and improve Fmax by up to 76% compared to the previous XOR-based approach. The thermometer-coded I-LVT method exhibits the highest Fmax, while keeping BRAM consumption within 6% of the minimal required BRAM count. Meanwhile, the binary-coded I-LVT uses fewer BRAMs than the thermometer-coded one when there are more than 3 write ports. Based on our results, the past XOR and LVT approaches are recommended only for narrow data widths or shallow depths, respectively. In all other cases, the new I-LVT approaches are superior.

In Chapter 4, we have proposed a novel, modular, BRAM-based, switched-multi-ported RAM architecture. In addition to unidirectional ports with fixed read/write, this switched architecture allows a group of write ports to switch with another group of read ports dynamically. The proposed switched-ports architecture is less flexible than a true multi-ported RAM where each port is switched individually. Nevertheless, switched memories can reduce BRAM consumption compared to true or simple ports for systems with alternating port requirements.

When multiple switched ports exist in a multi-ported RAM design, a memory compiler can be used to create a specific design instance. The compiler must solve a set cover problem to optimize the implementation. Our CAD approach always finds a minimal implementation for all of our test cases, but there is opportunity for further CAD research to improve run-time while still being optimal.
On average, over 10 random test cases, the suggested multi-switched-ports method reduces BRAM use by 18% compared to the best of previous methods, while maintaining ALM count and Fmax. Future research may extend the RAM port assignment problem to more complex cases where there are more than two states governing memory port usage.

In Chapter 5, a novel BCAM architecture for FPGAs is proposed. The approach is fully BRAM-based and employs hierarchical search to reduce BRAM consumption. While the traditional brute-force approach keeps a pattern match indicator for each single address, the proposed approach maintains a single pattern match indicator for a set of addresses. The suggested method is capable of implementing a 4M-line CAM and significantly improves area and performance. In contrast, traditional methods cannot exceed 64K in depth. The suggested 2-Dimensional Hierarchical Search BCAM (2D-HS-BCAM) design completely dominates all past designs in BRAM and ALM consumption, Fmax and runtime. However, this hierarchical search BCAM cannot be cascaded in pattern width since it provides a single matching address; this incurs an exponential increase in RAM consumption as pattern width increases. Hence, it supports only narrow pattern widths.

In Chapter 6, the narrow pattern width restriction is addressed. While traditional brute-force approaches keep a pattern match indicator for every single address, the hierarchical approach groups addresses into sets and maintains a single pattern match indicator for every set. The new approach here efficiently regenerates a match indicator for every single address by storing indirect indices to the address match indicators. Hence, the proposed method can be cascaded in pattern width and the exponential growth is reduced to linear. 
Our technique supports up to four times wider patterns compared to the brute-force BCAM and up to nine times wider patterns compared to the hierarchical search BCAM.

7.2 Future Directions

Future directions for the parallel memory structures described in this dissertation are explored in this section.

7.2.1 Invalidation-Table Multi-Ported Memories

The suggested multi-ported memories can be tested with various other FPGA vendors' tools and devices. Furthermore, these methods can also be tested for ASIC implementation using dual-ported RAMs as building blocks, and compared against memory compiler results. Also, to improve Fmax, time-borrowing techniques can be utilized. The goal would be to recover the frequency drop due to the multi-ported RAM's additional logic, feedback and bank selection logic. One possible approach uses shifted clocks to provide more reading and writing time [116]. However, adapting this method to multi-ported memories is not trivial due to internal timing paths across the I-LVT.

The true multi-ported RAM proposed by others [56, 73] can utilize our I-LVT method to implement a BRAM-based LVT instead of a register-based LVT, which eliminates the need for register-based memories and allows higher RAM capacities.

There is opportunity for further CAD research to improve the run-time of the multi-switched-ports RAM compiler while still being optimal. Future research may extend the RAM port assignment problem to more complex cases where there are more than two states governing memory port usage.

7.2.2 BRAM-Based Content-Addressable Memories

The suggested BCAMs can be tested with other FPGA vendors' tools and devices. Furthermore, these methods can be tested for ASIC implementation using dual-ported RAMs as building blocks, and compared against custom-designed BCAMs.

Increased pipelining and time-borrowing techniques can be used to improve Fmax. The goal would be to recover the frequency drop due to the comparators and priority encoder. 
One possible approach uses shifted clocks to provide more reading and writing time [116]. However, adapting this method to BCAMs is not trivial due to internal timing paths across the BCAM.

To fit applications that require single-cycle writing, e.g., caches and TLBs, the proposed technique can be enhanced to support single-cycle writes by using multi-write RAM [1].

7.2.3 Parallel Reconfigurable Computing with Customizable Concurrent Memories

Since FPGAs must fix their RAM block designs for generic designs, it is too costly to provide highly specialized RAMs with a large number of ports. Due to this restriction, state-of-the-art parallel reconfigurable systems support only reconfigurable distributed dual-ported memory blocks. As future research, reconfigurable systems that can take advantage of the multi-ported memories and content-addressable memories proposed in this dissertation can be investigated. Future research could also perform a design space exploration of new parallel reconfigurable architectures with customizable concurrent memories.

Bibliography

[1] A. M. S. Abdelhadi and G. G. F. Lemieux. Modular Multi-Ported SRAM-Based Memories. In ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), pages 35–44, February 2014. → pages iv, v, 10, 81, 109, 111, 115, 117, 118, 121, 122, 170

[2] A. M. S. Abdelhadi and G. G. F. Lemieux. Modular Switched Multi-Ported SRAM-Based Memories. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 9(3):22:1–22:26, July 2016. → pages iv, vi, 11, 96, 115, 116, 117, 118, 121, 122

[3] A. M. S. Abdelhadi and G. G. F. Lemieux. A Multi-Ported Memory Compiler Utilizing True Dual-Port BRAMs. In IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 200–207, May 2016. → pages v, vi

[4] A. M. S. Abdelhadi and G. G. F. Lemieux. Deep and Narrow Binary Content-Addressable Memories Using FPGA-Based BRAMs. 
In International Conference on Field-Programmable Technology (FPT), pages 318–321, December 2014. → pages v, vi, 12

[5] A. M. S. Abdelhadi and G. G. F. Lemieux. Modular SRAM-Based Binary Content-Addressable Memories. In IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 207–214, May 2015. → pages iv, v, vi, 13

[6] J. H. Tseng and K. Asanović. Banked Multiported Register Files for High-Frequency Superscalar Microprocessors. In ACM/IEEE International Symposium on Computer Architecture (ISCA), pages 62–71, June 2003. → pages 3, 22
Journal of King Saud University Computer andInformation Sciences, 11:45–67, January 1999. → pages 5[14] M. Peng and S. Azgomi. Content-Addressable Memory (CAM) and itsNetwork Applications. In International IC–Taipei Conference, pages 1–3,May 2001. → pages 5[15] Qutaiba Ali. A Flexible Design of Network Devices Using ReconfigurableContent Addressable Memory. International Arab Journal of InformationTechnology (IAJIT), 8(3):235–243, July 2011. → pages172[16] H. Chen, Y. Chen, and D. H. Summerville. A Survey on the Application ofFPGAs for Network Infrastructure Security. IEEE CommunicationsSurveys Tutorials, 13(4):541–561, April 2011. → pages[17] K. McLaughlin, N. O’Connor, and S. Sezer. Exploring CAM Design ForNetwork Processing Using FPGA Technology. In Advanced InternationalConference on Telecommunications and International Conference onInternet and Web Applications and Services (AICT-ICIW), pages 84–84,February 2006. → pages 5[18] I. Y.-L. Hsiao and C.-W. Jen. A New Hardware Design and FPGAImplementation for Internet Routing Towards IP over WDM and TerabitRouters. In IEEE International Symposium on Circuits and Systems(ISCAS), pages 387–390 vol.1, May 2000. → pages 5, 40[19] W. Jiang, Q. Wang, and V. K. Prasanna. Beyond TCAMs: AnSRAM-Based Parallel Multi-Pipeline Architecture for Terabit IP Lookup.In IEEE Conference on Computer Communications (INFOCOM), April2008. → pages[20] H. Le, W. Jiang, and V. K. Prasanna. A SRAM-Based Architecture forTrie-based IP Lookup Using FPGA. In IEEE International Symposium onField-Programmable Custom Computing Machines (FCCM), pages 33–42,April 2008. → pages[21] H. Le, W. Jiang, and V. K. Prasanna. Scalable High-ThroughputSRAM-Based Architecture for IP-Lookup Using FPGA. In InternationalConference on Field-Programmable Logic and Applications (FPL), pages137–142, September 2008. → pages[22] H. Le and V. K. Prasanna. Scalable High Throughput and Power EfficientIP-Lookup on FPGA. 
In IEEE International Symposium onField-Programmable Custom Computing Machines (FCCM), pages167–174, April 2009. → pages 40[23] D. Unnikrishnan, R. Vadlamani, Y. Liao, A. Dwaraki, J. Crenne, L. Gao,and R. Tessier. Scalable Network Virtualization Using FPGAs. InACM/SIGDA International Symposium on Field-programmable Gate Arrays(FPGA), pages 219–228, February 2010. → pages173[24] N. Mudaliar. Design and Implementation of MobilityFirst Router on theNetFPGA Platform. Master’s thesis, Dept. of Electrical and ComputerEngineering, Rutgers, The State University of New Jersey, New Brunswick,NJ, May 2013. → pages 5[25] L. Bu and J. A. Chandy. FPGA Based Network Intrusion Detection UsingContent Addressable Memories. In IEEE International Symposium onField-Programmable Custom Computing Machines (FCCM), pages316–317, April 2004. → pages 5[26] B.-K. Kim, Y.-J. Heo, and J.-T. Oh. High-Performance Intrusion Detectionin FPGA-Based Reconfiguring Hardware. In Asia-Pacific NetworkOperations and Management Symposium (APNOMS), pages 563–573,September 2005. → pages[27] A. Kaleel Rahuman and G. Athisha. Reconfigurable Hardware Architecturefor Network Intrusion Detection System. American Journal of AppliedSciences, 9(10):1618–1624, October 2012. → pages[28] H. Song and J. W. Lockwood. Efficient Packet Classification for NetworkIntrusion Detection Using FPGA. In ACM/SIGDA International Symposiumon Field-programmable Gate Arrays (FPGA), pages 238–245, February2005. → pages[29] F. Yu. High Speed Deep Packet Inspection with Hardware Support. Phdthesis, Dept. of Electrical Engineering and Computer Sciences, Universityof California, Berkeley, Berkeley, CA, November 2006. → pages 5[30] V. Pusˇ and J. Korenek. Fast and Scalable Packet Classification UsingPerfect Hash Functions. In ACM/SIGDA International Symposium onField-programmable Gate Arrays (FPGA), pages 229–236, February 2009.→ pages 5, 40[31] Y. Qi, J. Fong, W. Jiang, B. Xu, J. Li, and V. Prasanna. 
Multi-DimensionalPacket classification on FPGA: 100 Gbps and Beyond. In InternationalConference on Field-Programmable Technology (FPT), pages 241–248,December 2010. → pages[32] R. Wei, Y. Xu, and H. Chao. Block Permutations in Boolean Space toMinimize TCAM for Packet Classification. In IEEE Conference on174Computer Communications (INFOCOM), pages 2561–2565, March 2012.→ pages 5[33] U. Dhawan and A. DeHon. Area-Efficient Near-Associative Memories onFPGAs. In ACM/SIGDA International Symposium on Field-programmableGate Arrays (FPGA), pages 191–200, February 2013. → pages 5, 40[34] H. Wong, V. Betz, and J. Rose. Comparing FPGA vs. Custom CMOS andthe Impact on Processor Microarchitecture. In ACM/SIGDA InternationalSymposium on Field-programmable Gate Arrays (FPGA), pages 5–14,February 2011. → pages 5[35] H. Wong, V. Betz, and J. Rose. Quantifying the Gap Between FPGA andCustom CMOS to Aid Microarchitectural Design. IEEE Transactions onVery Large Scale Integration (VLSI) Systems, 22(10):2067–2080, October2014. → pages 5[36] H. Wong, V. Betz, and J. Rose. Efficient Methods for Out-of-OrderLoad/Store Execution For High-Performance Soft Processors. InInternational Conference on Field-Programmable Technology (FPT), pages442–445, December 2013. → pages 5[37] Pedro Alcocer and Colin Phillips. Using Relational Syntactic Constraints inContent-Addressable Memory Architectures for Sentence Parsing. SpecialIssue of Topics in Cognitive Science on Computational Psycholinguistics,April 2012. → pages 5[38] K. Suman N. Manonmani and C.Udhayakumar. Dynamic BasedReconfigurable Content Addressable Memory for Fast String Matching.International Journal of Engineering Science and Innovative Technology(IJESIT), 3(1):429–435, January 2014. → pages[39] G. Nilsen. A Variable Word-Width Content Addressable Memory (CAM)for Fast String Matching. Master’s thesis, Dept. of Informatics, Universityof Oslo, Oslo, Norway, May 2013. → pages 5[40] R.-Y. Yang and C.-Y. Lee. 
High-Throughput Data Compressor DesignsUsing Content Addressable Memory. In IEEE International Symposium onCircuits and Systems (ISCAS), pages 147–150 vol.4, May 1994. → pages 5175[41] J. V. Oldfield, R. D. Williams, N. E. Wiseman, and M. R. Brule.Content-Addressable Memories for Quadtree-Based Images. InEurographics Conference on Advances in Computer Graphics Hardware(EGGH), pages 67–84, September 1988. → pages 5[42] Y. C. Shin, R. Sridhar, V. Demjanenko, P. W. Palumbo, and S. N. Srihari. ASpecial-Purpose Content Addressable Memory Chip for Real-Time ImageProcessing. IEEE Journal of Solid-State Circuits (JSSC), 27(5):737–744,May 1992. → pages 5[43] D. Agrawal and A. E. Abbadi. Hardware Acceleration for DatabaseSystems Using Content Addressable Memories. In International Workshopon Data Management on New Hardware (DaMoN), June 2005. → pages 5[44] A. Goel and P. Gupta. Small Subset Queries and Bloom Filters UsingTernary Associative Memories, with Applications. In ACM SIGMETRICSInternational Conference on Measurement and Modeling of ComputerSystems, pages 143–154, June 2010. → pages 5[45] S. A. Guccione and E. Keller. Gene Matching Using JBits. In InternationalConference on Field-Programmable Logic and Applications (FPL), pages1168–1171, September 2002. → pages 5[46] R. V. Satya, A. Mukherjee, and U. Ranga. A Pattern Matching Algorithmfor Codon Optimization and CpG Motif-Engineering in DNA ExpressionVectors. In IEEE Computer Society Conference on Bioinformatics (CSB),pages 294–305, August 2003. → pages 5[47] S. Ahmad and R. Mahapatra. TCAM Enabled On-chip Logic Minimization.In Design Automation Conference (DAC), pages 678–683, June 2005. →pages 5[48] Y. Tatsumi and H. J. Mattausch. Fast Quadratic Increase ofMultiport-Storage-Cell Area with Port Number. Electronics Letters, 35(25):2185–2187, December 1999. → pages 7[49] K. Pagiamtzis and A. Sheikholeslami. Content-Addressable Memory(CAM) Circuits and Architectures: a Tutorial and Survey. 
IEEE Journal ofSolid-State Circuits (JSSC), 41(3):712–727, March 2006. → pages 8, 29176[50] J. P. Wade and C. G. Sodini. A Ternary Content Addressable Search Engine.IEEE Journal of Solid-State Circuits (JSSC), 24(4):1003–1013, August1989. → pages[51] A. J. McAuley and C. J. Cotton. A Reconfigurable Content AddressableMemory. In IEEE Custom Integrated Circuits Conference (CICC), pages24.1/1–24.1/4, May 1990. → pages[52] K. J. Schultz and P. G. Gulak. Fully Parallel Integrated CAM/RAM UsingPreclassification to Enable Large Capacities. IEEE Journal of Solid-StateCircuits (JSSC), 31(5):689–699, May 1996. → pages[53] K. J. Schultz and P. G. Gulak. Fully-Parallel Multi-Megabit IntegratedCAM/RAM Design. In IEEE International Workshop on MemoryTechnology, Design and Testing (MTDT), pages 46–51, August 1994. →pages 8, 29[54] Altera Corp. APEX 20K Programmable Logic Device Family Data Sheet.San Jose, CA, USA, March 2004. Version 5.1. → pages 8, 29, 41[55] Altera Corp. Stratix V Device Handbook. San Jose, CA, USA, May 2013.→ pages 9, 14, 15, 30, 32, 38, 58, 115, 126, 140, 156, 188[56] J. Choi, K. Nam, A. Canis, J. Anderson, S. Brown, and T. Czajkowski.Impact of Cache Architecture and Interface on Performance and Area ofFPGA-Based Processor/Parallel-Accelerator Systems. In IEEEInternational Symposium on Field-Programmable Custom ComputingMachines (FCCM), pages 17–24, April 2012. → pages 11, 25, 83, 96, 115,117, 118, 121, 122, 169[57] Altera Corp. Quartus II Handbook. San Jose, CA, USA, November 2013.Version 13.1. → pages 14, 126, 147[58] A. M. S. Abdelhadi. GitHub Repository, 2014. URLhttps://github.com/AmeerAbdelhadi. Accessed September 2016. → pages14[59] G. A. Malazgirt, H. E. Yantir, A. Yurdakul, and S. Niar. ApplicationSpecific Multi-Port Memory Customization in FPGAs. In InternationalConference on Field-Programmable Logic and Applications (FPL), pages1–4, September 2014. → pages 19177[60] H. E. Yantir and A. Yurdakul. 
An Efficient Heterogeneous Register File Implementation for FPGAs. In International Parallel and Distributed Processing Symposium Workshops and PhD Forum (IPDPSW), pages 293–298, May 2014. → pages

[61] H. E. Yantir. A Systematic Approach for Register File Design in FPGAs. Master's thesis, Dept. of Computer Engineering, Boğaziçi University, Istanbul, Turkey, January 2014. → pages

[62] H. E. Yantir, S. Bayar, and A. Yurdakul. Efficient Implementations of Multi-pumped Multi-port Register Files in FPGAs. In Euromicro Conference on Digital System Design (DSD), pages 185–192, September 2013. → pages 19

[63] A. Muddebihal. Area Efficient Multi-Ported Memories with Write Conflict Resolution. Master's thesis, Dept. of Electrical Engineering and Computing Systems, University of Cincinnati, Cincinnati, OH, April 2014. → pages 19

[64] B. A. Chappell, T. I. Chappell, M. K. Ebcioglu, and S. E. Schuster. Virtual Multi-Port RAM Employing Multiple Accesses During Single Machine Cycle, July 1996. US Patent 5,542,067. → pages 19

[65] H. Yokota. Multiport Memory System, May 1990. US Patent 4,930,066. → pages 19

[66] G. S. Ditlow, R. K. Montoye, S. N. Storino, S. M. Dance, S. Ehrenreich, B. M. Fleischer, et al. A 4R2W Register File for a 2.3GHz Wire-Speed POWER™ Processor with Double-Pumped Write Operation. In IEEE International Solid-State Circuits Conference (ISSCC), pages 256–258, February 2011. → pages 19, 21

[67] R. E. Kessler. The Alpha 21264 Microprocessor. IEEE Micro, 19(2):24–36, March 1999. → pages 21

[68] K. R. Townsend, O. G. Attia, P. H. Jones, and J. Zambreno. A Scalable Unsegmented Multiport Memory for FPGA-Based Systems. International Journal of Reconfigurable Computing, December 2015. → pages 22

[69] H. J. Mattausch. Hierarchical N-Port Memory Architecture Based on 1-Port Memory Cells. In European Solid-State Circuits Conference (ESSCIRC), pages 348–351, September 1997. → pages

[70] W. Ji, F. Shi, B. Qiao, and H. Song. Multi-Port Memory Design Methodology Based on Block Read and Write.
In IEEE International Conference on Control and Automation (ICCA), pages 256–259, May 2007. → pages

[71] W. Zuo, Q. Zuo, and J. Li. An Intelligent Multi-Port Memory. In International Symposium on Intelligent Information Technology Application Workshops (IITAW), pages 251–254, December 2008. → pages 22

[72] D. Alpert and D. Avnon. Architecture of the Pentium Microprocessor. IEEE Micro, 13(3):11–21, May 1993. → pages 22

[73] C. E. LaForest, Z. Li, T. O'Rourke, M. G. Liu, and J. G. Steffan. Composing Multi-Ported Memories on FPGAs. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 7(3):16:1–16:23, September 2014. → pages 23, 25, 26, 169

[74] V. R. K. Naresh, D. J. Palframan, and M. H. Lipasti. CRAM: Coded Registers for Amplified Multiporting. In IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 196–205, December 2011. → pages 27

[75] K. Locke. Parameterizable Content-Addressable Memory, 2011. Application Note XAPP1151. → pages 34, 36, 41

[76] J.-L. Brelet. Using Block RAM for High Performance Read/Write CAMs, 2000. Application Note XAPP204. → pages 41

[77] J.-L. Brelet and L. Gopalakrishnan. Using Virtex-II Block RAM for High Performance Read/Write CAMs, 2002. Application Note XAPP260. → pages

[78] J. L. Brelet. Methods for Implementing CAM Functions Using Dual-Port RAM, March 2002. US Patent 6,353,332. → pages 34, 36, 41

[79] Altera Corp. Implementing High-Speed Search Applications with Altera CAM, July 2001. Application Note 119, Version 2.1. → pages 34, 36

[80] C. A. Zerbini and J. M. Finochietto. Performance Evaluation of Packet Classification on FPGA-Based TCAM Emulation Architectures. In IEEE Global Communications Conference (GLOBECOM), pages 2766–2771, December 2012. → pages 34

[81] W. Jiang. Scalable Ternary Content Addressable Memory Implementation Using FPGAs. In ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), pages 71–82, October 2013. → pages 35

[82] Z. Ullah, K. Ilgon, and S. Baeg.
Hybrid Partitioned SRAM-Based Ternary Content Addressable Memory. IEEE Transactions on Circuits and Systems I: Regular Papers, 59(12):2969–2979, December 2012. → pages 35

[83] Z. Ullah, M. K. Jaiswal, Y. C. Chan, and R. C. C. Cheung. FPGA Implementation of SRAM-Based Ternary Content Addressable Memory. In International Parallel and Distributed Processing Symposium Workshops and PhD Forum (IPDPSW), pages 383–389, May 2012. → pages

[84] Z. Ullah, M. K. Jaiswal, and R. C. C. Cheung. Design Space Explorations of Hybrid-Partitioned TCAM (HP-TCAM). In International Conference on Field-Programmable Logic and Applications (FPL), pages 1–4, September 2013. → pages

[85] Z. Ullah, M. K. Jaiswal, and R. C. C. Cheung. Z-TCAM: An SRAM-Based Architecture for TCAM. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 23(2):402–406, February 2015. → pages 35

[86] S. Guccione, D. Levi, and D. Downs. A Reconfigurable Content Addressable Memory. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 882–889, May 2000. → pages 38

[87] S. A. Guccione, D. Levi, and D. J. Downs. Content-Addressable Memory Implemented Using Programmable Logic, February 2002. US Patent 6,351,143. → pages

[88] J. Ditmar, K. Torkelsson, and A. Jantsch. A Dynamically Reconfigurable FPGA-Based Content Addressable Memory for Internet Protocol Characterization. In International Conference on Field-Programmable Logic and Applications (FPL), pages 19–28, August 2000. → pages

[89] Z. Baruch and C. Savin. Reconfigurable Content-Addressable Memory. In IEEE International Conference on Intelligent Engineering Systems (INES), pages 459–463, September 2004. → pages

[90] P. B. James-Roxby and D. J. Downs. An Efficient Content-Addressable Memory Implementation Using Dynamic Routing. In IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 81–90, March 2001. → pages

[91] A. Jamadarakhani and S. K. Ranchi.
Implementation and Design of High Speed FPGA-Based Content Addressable Memory. International Journal for Scientific Research and Development (IJSRD), 1(9):1835–1842, December 2013. → pages

[92] Q. Ibrahim. Design and Implementation of High Speed Network Devices Using SRL16 Reconfigurable Content Addressable Memory (RCAM). International Arab Journal of e-Technology (IAJeT), 2(2):72–81, June 2011. → pages 38

[93] S. Guccione, D. Levi, and P. Sundararajan. JBits: Java Based Interface for Reconfigurable Computing. In International Conference on Military and Aerospace Applications of Programmable Devices and Technologies (MAPLD), September 1999. → pages 38

[94] H. Qin, T. Sasao, and J. T. Butler. Reconfigurable Computing: Architectures and Applications, chapter Implementation of LPM Address Generators on FPGAs, pages 170–181. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006. ISBN 978-3-540-36863-2. → pages 39

[95] H. Qin, T. Sasao, and J. T. Butler. On the Design of LPM Address Generators Using Multiple LUT Cascades on FPGAs. International Journal of Electronics, 94(5):451–467, May 2007. → pages

[96] T. Sasao and J. T. Butler. Implementation of Multiple-Valued CAM Functions by LUT Cascades. In IEEE International Symposium on Multiple-Valued Logic (ISMVL), pages 11–11, May 2006. → pages

[97] H. Nakahara, T. Sasao, and M. Matsuura. A CAM Emulator Using Look-Up Table Cascades. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 1–8, March 2007. → pages 39

[98] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, Cambridge, MA, 3rd edition, 2009. ISBN 978-0-262-03384-8. → pages 40

[99] A. Broder and M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. Internet Mathematics, 1(4):485–509, 2003. → pages 40

[100] Y. Qiao, T. Li, and S. Chen. Fast Bloom Filters and Their Generalization. IEEE Transactions on Parallel and Distributed Systems (TPDS), 25(1):93–103, January 2014. → pages 40

[101] J.-L. Brelet.
An Overview of Multiple CAM Designs in Virtex Family Devices, 1999. Application Note XAPP201. → pages 41

[102] J.-L. Brelet and B. New. Designing Flexible, Fast CAMs With Virtex Family FPGAs, 1999. Application Note XAPP203. → pages 41

[103] Lattice Semiconductor Corp. Content Addressable Memory (CAM) Applications for ispXPLD Devices, 2002. Application Note AN8071. → pages 41

[104] Actel Corp. Content-Addressable Memory (CAM) in Actel Devices, 2002. Application Note AC194. → pages 41

[105] A. M. S. Abdelhadi and G. G. F. Lemieux. Multi-Ported RAM Verilog Source Code, 2015. URL https://github.com/AmeerAbdelhadi/Multiported-RAM. Accessed September 2016. → pages 44, 68, 76

[106] A. M. S. Abdelhadi and G. G. F. Lemieux. Switched Multi-Ported RAM Verilog Source Code, 2015. URL https://github.com/AmeerAbdelhadi/Switched-Multiported-RAM. Accessed September 2016. → pages 82, 123

[107] A. M. S. Abdelhadi and G. G. F. Lemieux. Multi-Ported Memory Compiler Verilog Source Code, 2016. URL https://github.com/AmeerAbdelhadi/Multiported-RAM-Compiler. Accessed September 2016. → pages 97, 114, 123

[108] R. M. Karp. Complexity of Computer Computations, chapter Reducibility among Combinatorial Problems, pages 85–103. Springer US, Boston, MA, 1972. ISBN 978-1-468-42001-2. → pages 104

[109] R. Fourer, D. M. Gay, and B. W. Kernighan. AMPL: A Modeling Language for Mathematical Programming. Scientific Press series. Duxbury Press, Pacific Grove, CA, 2nd edition, 2003. ISBN 978-0-534-38809-6. → pages 114

[110] GLPK (GNU Linear Programming Kit), 2012. URL https://www.gnu.org/software/glpk/. Accessed September 2016. → pages 114

[111] S. J. E. Wilton, C. W. Jones, and J. Lamoureux. An Embedded Flexible Content-Addressable Memory Core for Inclusion in a Field-Programmable Gate Array. In IEEE International Symposium on Circuits and Systems (ISCAS), pages II-885–8 Vol. 2, May 2004. → pages 125, 146

[112] C. W. Jones and S. J. E. Wilton.
Content-Addressable Memory with Cascaded Match, Read and Write Logic in a Programmable Logic Device, September 2003. US Patent 6,622,204. → pages

[113] G. R. Schlacter. Emulation of Content-Addressable Memories, June 2004. US Patent 6,754,766. → pages 125, 146

[114] A. M. S. Abdelhadi and G. G. F. Lemieux. 2D Binary Content-Addressable Memory (BCAM) Verilog Source Code, 2015. URL https://github.com/AmeerAbdelhadi/2D-Binary-Content-Addressable-Memory-BCAM. Accessed September 2016. → pages 126, 139, 144

[115] A. M. S. Abdelhadi and G. G. F. Lemieux. Indirectly Indexed 2D Binary Content-Addressable Memory (BCAM) Verilog Source Code, 2016. URL https://github.com/AmeerAbdelhadi/Indirectly-Indexed-2D-Binary-Content-Addressable-Memory-BCAM. Accessed September 2016. → pages 147, 161, 164

[116] A. Brant, A. Abdelhadi, A. Severance, and G. G. F. Lemieux. Pipeline Frequency Boosting: Hiding Dual-Ported Block RAM Latency Using Intentional Clock Skew. In International Conference on Field-Programmable Technology (FPT), pages 235–238, December 2012. → pages 169

Appendix A

Brute-Force Transposed Indicators (BF-TI) BCAM Writing Mechanism

Writing to the Brute-Force Transposed Indicators (BF-TI) structure requires setting the new indicator and clearing the old indicator. As shown in Figure 2.11, a RefRAM is used in parallel with the Transposed Indicators RAM (TIRAM) to track the BCAM content and provide, for a given address, the pattern that is already stored and must be removed. This is needed for the BCAM write operation, where the old indicator must be cleared; the RefRAM provides the old pattern at the currently written address. A BCAM write consumes two cycles as follows.

1. Set cycle:
   1.1. Set (write '1') the new pattern indicator in the TIRAM
   1.2. Read the old data (pattern) from the RefRAM
2. Clear cycle:
   2.1. Clear (write '0') the old indicator from the TIRAM (the location is already provided by step 1.2)
   2.2.
Write the new data (pattern) to the RefRAM

Writing to the TIRAM structure requires writing a single bit to set or clear a pattern indicator. As described in Figure A.1 (top), this can be achieved by employing the BRAM mixed-width capability in simple dual-ported mode, which is supported by most FPGA vendors. The write-port width is set to a single bit, while the read port is set to the maximum available width. The byte-enable functionality can also be utilized to write part of the data line, as described in Figure A.1 (middle); however, byte-enables usually do not control fine-grained parts of the data line, which makes them impractical for the TIRAM implementation. Both the mixed-width and byte-enable methods can be combined, as shown in Figure A.1 (bottom).

[Figure A.1 depicts three dual-port RAM configurations with their write-port, read-port, and byte-enable widths; only the caption is retained here.]

Figure A.1: Transposed Indicators RAM (TIRAM) implementation using (top) mixed-width BRAM, (middle) byte-enable, and (bottom) combined methods.

Appendix B

Wide Priority Encoders in FPGAs

The proposed 2D-HS-BCAM technique successfully reduces the width of the CD-wide priority-encoder used by the brute-force approach and splits it into two narrower priority-encoders. However, the priority-encoder delay still affects overall performance. Hence, a fast priority-encoder design is essential.

Our proposed BCAM architecture consists of a ⌈CD/SW⌉-wide priority-encoder and another SW-wide priority-encoder.
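This two-level split can be checked with a small software model. The following Python sketch is not part of the Verilog package; the function names are illustrative, and it only models the indexing behavior, not the hardware timing:

```python
def priority_encode(bits):
    """Return (valid, index) of the first '1' in a bit list."""
    for i, b in enumerate(bits):
        if b:
            return True, i
    return False, 0

def two_level_priority_encode(bits, sw):
    """Model a wide priority-encoder split into ceil(len(bits)/sw)
    segment encoders plus one segment-select encoder."""
    segments = [bits[i:i + sw] for i in range(0, len(bits), sw)]
    results = [priority_encode(seg) for seg in segments]           # per-segment (valid, low index)
    seg_valid, seg_idx = priority_encode([v for v, _ in results])  # first valid segment
    low = results[seg_idx][1]                                      # location within that segment
    return seg_valid, seg_idx * sw + low                           # higher part steers the lower part
```

For any segment width, the two-level result matches a flat priority-encoder over the whole vector, which is the property the hardware decomposition relies on.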
For deep BCAMs, the priority-encoder delay considerably affects the overall performance; hence, a fast priority-encoder design is essential.

A priority-encoder, also called a Leading Zero Detector (LZD) or Leading Zero Counter (LZC), receives an n-bit input vector and detects the index of the first binary '1' in the input vector. A valid signal indicates whether any binary '1' was detected in the input vector, and hence whether the index is valid.

As depicted in Figure B.1, the suggested priority-encoder is recursively constructed. The input vector is split into k equal fragments of n/k bits each. A priority-encoder PE(n/k) with a narrower width of n/k is applied to each fragment. The valid bit of each of the k PE(n/k)'s goes to a k-bit PE(k) to detect the first valid fragment. The location of this fragment forms the higher part of the overall index, and it steers the exact location within the fragment itself to produce the lower part of the overall index.

[Figure B.1 shows the priority-encoder symbol and its recursive decomposition into k narrower priority-encoders plus a k-bit segment-select encoder; only the caption is retained here.]

Figure B.1: Priority-encoder (left) symbol and (right) recursive definition.

The depth of the proposed structure is ⌈log_k(n)⌉, while the hardware area complexity is O(n). If Altera's Stratix V [55] or an equivalent device is used, k = 4 is recommended to achieve higher performance and area compression, since the mux can be implemented using a 6-LUT, hence an entire ALM.

Appendix C

Verilog IPs User Guide

This appendix is a user guide for the Switched Multi-ported RAM (SMPRAM) module Verilog package provided with this dissertation. The SMPRAM module is compatible with Verilog-2001. The provided Verilog is generic; however, it has been tested using Altera's ModelSim (version 10.0d) only. The SMPRAM module, including interface signals and configuration parameters, is described in Figure C.1. Table C.1 lists all interface ports, while Table C.2 lists all configuration parameters for the SMPRAM module.
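The packed bus widths in Table C.1 follow directly from the configuration parameters. The helper below is a hypothetical Python sketch written for this guide (it is not shipped with the Verilog package) that computes those widths:

```python
import math

def smpram_port_widths(MEMD, DATW, nRPF, nWPF, nRPS=0, nWPS=0):
    """Compute the packed SMPRAM bus widths of Table C.1.
    MEMD must be a power of 2; each address is log2(MEMD) bits."""
    addr_w = int(math.log2(MEMD))
    return {
        "WEnb":  nWPF + nWPS,             # one write enable per write port
        "WAddr": (nWPF + nWPS) * addr_w,  # packed write addresses
        "WData": (nWPF + nWPS) * DATW,    # packed write data
        "RAddr": (nRPF + nRPS) * addr_w,  # packed read addresses
        "RData": (nRPF + nRPS) * DATW,    # packed read data
    }

# e.g., a 512 x 32-bit RAM with 2 fixed write, 2 fixed read, and 2 switched read ports
widths = smpram_port_widths(MEMD=512, DATW=32, nRPF=2, nWPF=2, nRPS=2)
```

Fixed ports occupy the least-significant slices of each packed bus, with switched ports packed above them.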
The code in Listing C.1 describes an SMPRAM module instantiation. To instantiate the SMPRAM module, all *.v and *.vh files in this package must be present in your working directory. The following command line clones the package from its GitHub repository:

git clone https://github.com/AmeerAbdelhadi/Switched-Multiported-RAM.git

[Figure C.1 shows the SMPRAM module block with its clk, rst, RdWr, write-port, and read-port buses, annotated with the configuration parameters MEMD (memory depth), DATW (data width), nRPF/nWPF (number of fixed read/write ports), nRPS/nWPS (number of switched read/write ports), ARCH (architecture type), BYPS (bypass mode), and FILE (initialization mif file); only the caption is retained here.]

Figure C.1: Switched Multi-ported RAM (SMPRAM) module block.

Table C.1: List of SMPRAM module interface ports

clk (input, 1 bit): Global clock.
rst (input, 1 bit): Global synchronous reset.
rdWr (input, 1 bit): If high, enables switched read ports and disables switched write ports; vice versa otherwise.
WEnb (input, nWPF+nWPS bits): Write enables for the nWPF (LSB) fixed write ports and nWPS switched write ports.
WAddr (input, (nWPF+nWPS)·log2(MEMD) bits): Write addresses, packed from the nWPF (LSB) fixed write ports and nWPS switched write ports; log2(MEMD) bits each.
WData (input, (nWPF+nWPS)·DATW bits): Write data, packed from the nWPF (LSB) fixed write ports and nWPS switched write ports; DATW bits each.
RAddr (input, (nRPF+nRPS)·log2(MEMD) bits): Read addresses, packed from the nRPF (LSB) fixed read ports and nRPS switched read ports; log2(MEMD) bits each.
RData (output, (nRPF+nRPS)·DATW bits): Read data, packed from the nRPF (LSB) fixed read ports and nRPS switched read ports; DATW bits each.

Table C.2: List of SMPRAM module parameters

MEMD (integer, no default; power of 2): Memory depth.
DATW (integer, no default; 1 ≤ DATW): Data width.
nRPF (integer, no default; 1 ≤ nRPF): Number of fixed read ports.
nWPF (integer, no default; 0 ≤ nWPF): Number of fixed write ports.
nRPS (integer, default 0; 0 ≤ nRPS ≤ nRPF): Number of switched read ports.
nWPS (integer, default 0; 0 ≤ nWPS and 1 ≤ nWPS+nWPF): Number of switched write ports.
ARCH (string, default "AUTO"; one of "AUTO", "REG", "XOR", "LVTREG", "LVTBIN", "LVTTHR"): Multi-port RAM architecture. Use "AUTO" to choose automatically, "REG" for register-based RAM, "XOR" for XOR-based, "LVTREG" for register-based LVT, "LVTBIN" for binary-coded I-LVT-based, or "LVTTHR" for thermometer-coded I-LVT-based.
BYPS (string, default "RAW"; one of "NON", "WAW", "RAW", "RDW"): Bypassing type. Use "NON" to prevent an additional bypassing circuit, "WAW" to allow Write-After-Write, "RAW" to read new data on Read-After-Write, or "RDW" to read new data on Read-During-Write.
FILE (string, not initialized by default; "mif" file name without extension): Initialization file in "mif" format, optional.

Listing C.1: Switched Multi-ported RAM (SMPRAM) module instantiation

// instantiate a multi-ported RAM
smpram #(
  .MEMD (MEMD ), // integer: memory depth
  .DATW (DATW ), // integer: data width
  .nRPF (nRPF ), // integer: # fixed read ports
  .nWPF (nWPF ), // integer: # fixed write ports
  .nRPS (nRPS ), // integer: # switched read ports
  .nWPS (nWPS ), // integer: # switched write ports
  .ARCH (ARCH ), // string: multi-port RAM architecture
  .BYPS (BYPS ), // string: bypass mode
  .FILE (""   )  // string: initialization file, optional
) smpram_inst (
  .clk   (clk  ), // global clock
  .rst   (rst  ), // global reset
  .rdWr  (rdWr ), // enables read or write switched ports
  .WEnb  (WEnb ), // write enables   [(nWPF+nWPS)-1:0]
  .WAddr (WAddr), // write addresses [(nWPF+nWPS)*log2(MEMD)-1:0]
  .WData (WData), // write data      [(nWPF+nWPS)*DATW-1:0]
  .RAddr (RAddr), // read addresses  [(nRPF+nRPS)*log2(MEMD)-1:0]
  .RData (RData)  // read data       [(nRPF+nRPS)*DATW-1:0]
);
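The LVT-based architectures selected by the ARCH parameter can be understood through a simple behavioral model: each write port owns its own RAM bank, and a Live Value Table (LVT) records, per address, which bank was written last. The Python sketch below is illustrative only (class and method names are invented here); it models fixed ports and ignores switching and bypassing:

```python
class LVTMultiPortRAM:
    """Behavioral model of a register-LVT multi-ported RAM:
    one bank per write port, plus a live-value table that
    records, per address, the bank holding the newest data."""

    def __init__(self, depth, n_write_ports):
        self.banks = [[0] * depth for _ in range(n_write_ports)]
        self.lvt = [0] * depth  # last-written bank per address

    def write(self, port, addr, data):
        self.banks[port][addr] = data  # each write port updates only its own bank
        self.lvt[addr] = port          # remember which bank is now live

    def read(self, addr):
        # any number of read ports can be served concurrently: the LVT
        # steers each read to the bank written last for this address
        return self.banks[self.lvt[addr]][addr]

ram = LVTMultiPortRAM(depth=16, n_write_ports=2)
ram.write(0, 5, 0xAA)  # port 0 writes address 5
ram.write(1, 5, 0xBB)  # port 1 overwrites address 5
assert ram.read(5) == 0xBB
```

Replicating one bank per write port is what lets simple dual-ported BRAMs emulate many write ports; the LVT itself is the only truly multi-ported element, and it is narrow (log2 of the write-port count per entry).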
