UBC Faculty Research and Publications

IO and data management for infrastructure as a service FPGA accelerators Moorthy, Theepan; Gopalakrishnan, Sathish Aug 22, 2017


Moorthy and Gopalakrishnan, Journal of Cloud Computing: Advances, Systems and Applications (2017) 6:20
DOI 10.1186/s13677-017-0089-9

RESEARCH / Open Access

IO and data management for infrastructure as a service FPGA accelerators

Theepan Moorthy* and Sathish Gopalakrishnan

Abstract

We describe the design of a non-operating-system-based embedded system to automate the management, reordering, and movement of data produced by FPGA accelerators within data centre environments. In upcoming cloud computing environments, where FPGA acceleration may be leveraged via Infrastructure as a Service (IaaS), end users will no longer have full access to the underlying hardware resources. We envision a partially reconfigurable FPGA region that end users can access for their custom acceleration needs, and a static "template" region offered by the data centre to manage all Input/Output (IO) data requirements to the FPGA. Our low-level, software-controlled system thus allows for standard DDR access to off-chip memory, DMA movement of data to and from SATA-based SSDs, and access to Ethernet stream links. Two use cases of FPGA accelerators are presented as experimental examples to demonstrate the area and performance costs of integrating our data-management system alongside such accelerators. Comparisons are also made to fully custom data-management solutions implemented solely in RTL Verilog to determine the tradeoffs of using our system with regard to development time, area, and performance. We find that for a class of accelerators in which the physical data rate of an IO channel is the limiting bottleneck to accelerator throughput, our solution drastically reduces the logic development time spent on data management without any associated performance loss.
However, for a class of applications where the IO channel is not the bottleneck, our solution trades increased area usage for shorter design times while maintaining acceptable system throughput in the face of degraded IO throughput.

Keywords: Infrastructure as a service, FPGA acceleration in data centres, Heterogeneous computing

*Correspondence: tmoorthy@ece.ubc.ca. The Department of Electrical and Computer Engineering, The University of British Columbia, Vancouver, BC, Canada

Introduction

Parallel and heterogeneous computing with the use of discrete device accelerators such as GPUs, FPGAs, and even ASICs is one of the recent system-level attempts to address an increasing demand for computational throughput [1, 2]. However, heterogeneous computing is challenging in both the systems-level implementation and the resource-management aspects of deployment. Data centres are therefore starting to fill a market need for computing horsepower, without end users having to tackle such challenges directly. This sort of market has been termed Infrastructure as a Service (IaaS) within the more general category of cloud computing services.

Within the IaaS domain, there have been recent attempts in the literature to facilitate the integration of device accelerators into cloud computing services [3–6]. These works primarily focus on scaling the use of accelerators in IaaS in three different ways: first, by treating the accelerator as a virtualized hardware resource; second, in the case of FPGAs, by making use of their dynamic reconfiguration capabilities directly without virtualization; or third, by offloading greater portions of traditionally host-CPU-based code onto accelerator-based soft or hard processor cores. The work in this article is complementary to these efforts to integrate accelerators into cloud computing services, but we focus on a singular aspect of this broader challenge: the systems-level implementation of integrating accelerators.
We draw attention to the fact that present-day accelerators may rely on direct access to IO data channels to offer effective acceleration.

High Level Synthesis (HLS) has gained some traction in offering acceptable levels of performance relative to manual RTL design for FPGA-based hardware accelerators. More specifically, HLS is presently capable of transforming task specifications in a high-level programming language (such as C or Java) into near-optimal hardware implementations for the compute portions of hardware acceleration [7]. However, successful acceleration of the compute portions alone does not translate into the desired application-level acceleration. The increase in computational throughput must be supported by a corresponding increase in Input/Output (IO) data throughput to achieve system-level acceleration. Unfortunately, HLS at present offers very little support or design abstraction for implementing custom IO data transfers and their corresponding organization in the memory subsystem, leaving system designers to fall back on manual Register Transfer Level (RTL) design for these tasks. In fact, managing ingress and egress data across multi-level on-chip and off-chip memory hierarchies is an open HLS problem [7].

© The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

To mitigate some of these existing limitations of HLS in virtualizing FPGA usage in the cloud, recent work has provided further innovation [8, 9]. Ma et al.
in their work [8], have combined existing development on Domain Specific Languages and the use of FPGA overlays to offer a runtime interpreter solution to managing FPGA resources. However, because they have taken an overlay approach, as they themselves note, they must also offer standard accelerator interface templates for any IO to the FPGA. Such IO interfaces would indeed provide a great amount of flexibility for FPGA accelerator usage, but would sacrifice optimality for any specific accelerator's data access pattern. Chen et al. [9], in their related work on FPGA-based cloud computing, go further to recognize that specific accelerators may require miscellaneous peripherals such as Ethernet controllers and various memory controllers. As such, they make reference to a context controller switch to handle varying IO sources, but do not elaborate on whether any optimization is made on this IO interface on a per-accelerator-application basis.

In IaaS, this issue of customized IO and memory throughput for accelerators is further exacerbated by the fact that an end user of a cloud computing environment does not have access to modify the bare underlying hardware's IO in any way. This is due to security limitations in the case of FPGAs [3]; in the case of GPUs and ASICs, once their IO controller data paths are designed and deployed they are fixed. Thus even if end users are willing to commit the manual RTL development hours required to customize the IO interfaces for their accelerated application, their lack of control and visibility over the underlying physical IO channels would prohibit them from doing so. Moreover, it is sometimes desirable to virtualize accelerators and enable time-sharing, and allowing for significant user customization may not be possible in this scenario.

With such limitations in mind, if IaaS is to be a viable option for those seeking FPGA-based acceleration, a flexible method to access IO data channels must be offered by the infrastructure itself.
Furthermore, any such method must also offer a level of abstraction, so that the underlying infrastructure can be upgraded or changed without requiring changes in the end user's applications as a result.

We seek to complement HLS-based acceleration by abstracting the design efforts required to integrate IO and memory data channels. We do so by investigating the use of the existing embedded ecosystem on modern FPGAs to handle IO data channels in software. Although this approach may not offer the same performance as automated custom configuration methods [10], and is far less effective than manual RTL design, it still offers design times that are comparable to an HLS approach. More importantly, it allows data transfers to be handled in software. Certain classes of FPGA devices today have an embedded ecosystem readily available as hardened components; making use of these embedded resources via software control abstracts away the exact details of the embedded processor and physical IO channels. This gives IaaS providers the freedom to carry out infrastructure upgrades without disrupting their customers' workloads. And for the end-user customers, this provides the option to harness any future hardware upgrades or changes in underlying IO technology at the software level, should they wish to do so.

Towards demonstrating the value of an embedded IO processor in FPGA-based accelerators, we use two case studies to demonstrate the value of IO management in data-intensive applications as well as the possible performance implications of embedded IO processors.
Although we have focused on FPGAs as the substrate for implementing accelerators, we believe that the central idea of utilizing an embedded IO processor will be an attractive approach for other accelerator implementations as well.

FPGAs as managed hardware in IaaS data centres

Whether FPGAs are integrated into data centres via proposed virtualized methods [3] or by more direct OpenCL-based methods [11, 12], the FPGA fabric itself will need to consist of two separate regions. A static region of logic will exist in order for the FPGA to bootstrap itself and configure the minimum communication protocols necessary for it to interact with the host system. The second (dynamic) region will be an area of the fabric that is partially reconfigurable at runtime to allow custom logic to be placed onto the device. This custom logic will then accelerate a specific compute-intensive task of the application's algorithm, or run the application as a whole entirely on the FPGA.

The dynamic region of the FPGA will be accessible to the end user to configure via the FPGA vendor's partial reconfiguration tool flow. The static region's base "template", however, will not be accessible to the end user and must be provided and maintained by the data centre host. The boundary (Fig. 1) between these static and dynamic regions of the FPGA fabric is what this article concerns itself with.

We refer to the static region as a template above because this region is what defines the structure of the external IO data channels that the FPGA end user sees and has access to. The static region accomplishes this by instantiating the necessary DRAM memory controllers, PCI Express endpoint blocks, and any other specific IO controllers that it wishes to provide.
It is worth noting here that although the FPGA device itself may be physically wired to multiple IO channels on the Printed Circuit Board (PCB) that was manufactured for its data centre usage (Fig. 2), the static region does not necessarily need to instantiate a controller for all of the available IO channels. What this allows for is different price points at which a data centre can rent out its FPGA resources within IaaS, while keeping the actual physical FPGA infrastructure uniform at a fixed cost. For example, a data centre may have all of its FPGA PCBs designed with four Ethernet ports available, but intentionally restrict an end user to accessing only one or two based on what level of access they have paid for. In Fig. 2 this is illustrated by depicting multiple IO protocols and channels that may be physically wired to FPGA boards; however, the actual IO resources that an end user has access to can easily be limited by a static region bitstream in the SDAccel flow. The static region bitstream will not have an IO controller for all of the available IO protocols, such as Ethernet, SATA, JESD, or SERDES, unless the end user has subscribed to a design flow that includes these options (at a cost).

Apart from the business-side advantages that mandating a static region on the FPGA provides, a static region also allows physical IO channel technology upgrades with minimal disruption to existing applications and customers. Consider a data centre that upgrades the solid-state storage devices its FPGA customers have access to from SATA 2 to SATA 3 SSDs. In this scenario, only the SATA controllers within the static template would need to be redesigned and deployed, keeping the partially reconfigurable dynamic portion of FPGA logic unaffected. However, this begs the question of whether this is feasible, or even possible, in the actual implementation of these static and dynamic regions.
Given that there is to be very tightly linked and high-throughput communication between these two regions, if a component within the static region changes even slightly, will this not necessitate a redesign of the modules in the dynamic region that it interacts with? The answer to this question rests completely on what the boundary logic between these two regions consists of.

The boundary region at the implementation level merely represents the set of architectural choices that are made to design and determine the type of interfaces that the static and dynamic regions will use to accomplish data transfers between them. This work contributes an embedded method to manage data channels that strays away from what conventional RTL design practices would dictate. We also put forth the novelty of using our embedded method in data centres where FPGAs may contribute as managed hardware resources within IaaS environments. This domain is an emerging and rapidly evolving field where end users are given increasingly less control over the bare-metal hardware that they seek to use. They must also accept certain levels of abstraction in their usage if they are to reap the economically cheaper rewards of scaling their FPGA-acceleration-based needs.

Fig. 1 FPGA Fabric's Static, Boundary, and Dynamic Regions on a PCIe based Data Centre accelerator PCB card

Fig. 2 Varying Physical IO Channels Directly Accessible by an FPGA within a Data Centre Host Environment, for an FPGA vendor's (Xilinx) software development flow

We explain our solution to IO data channel management in this environment by way of two use-case examples that follow in the "Bioinformatics application study" and "Video application study" sections.
Towards demonstrating the value of our solution, in the first case we show that there are no performance penalties in adopting IaaS acceleration for the majority of data-intensive or IO-bound applications. In the second case, where IO performance loss does present itself as a problem, we show that the acceleration itself can be scaled to compensate for or offset the IO performance loss at the application level if need be.

Bioinformatics application study

Application characteristics

Applications that require access to, or that produce, large volumes of data are presently loosely defined as "Big Data" applications. Big Data applications where FPGAs are utilized for acceleration can be found in various domains, ranging from financial computing and bioinformatics to conventional data centre usage [13–15]. The particular application chosen for the experiments of this section falls under the bioinformatics domain. However, the IO, memory, and data-volume characteristics of this application allow the results to carry over to most FPGA-accelerated Big Data applications in general. Traditional, in-house, computing-cluster-based infrastructures for such applications may not be economical as their data needs continue to scale, and can be costly to maintain. The takeaway that we would like to stress in this section is that, in IO-constrained cases of these applications, performance does not have to be sacrificed when moving to IaaS.

The underlying common denominator of such applications, from a hardware perspective, is that they all require the use of a non-volatile storage device to handle the large data sets. The use of non-volatile devices, such as Solid State Drives (SSDs) or other flash-memory-based hardware, often necessitates a memory hierarchy within the system design process.
A typical hierarchy will include off-chip flash memory for large data sets, followed by off-chip DRAM for medium-sized data sets, and finally on-chip SRAM. For the off-chip memory components, the physical IO data channel by which these components are accessed also adds a lot of variation to the system-level design. For example, in the financial computing domain, large volumes of off-chip data might stream in from multiple Ethernet channels, whereas in data centres the incoming data might arrive over PCI Express links connected to SSD storage. This variation in the physical IO channels creates the need for multiple IO controllers that rely on different IO protocols for communication, which in turn results in varying interface requirements for each controller.

The FPGA application accelerators in this class will require the IO controllers to supply a throughput of data that at least matches the targeted scale of acceleration throughput. This required acceleration throughput can often be much greater than the physical bandwidth of the underlying IO device, the throughput rate of the controller that is accessing the underlying device, or any combination of the two. This dynamic results in the high throughput potential of the accelerator being limited by low IO throughput at the system level. For system architects that have the financial resources to do so, overcoming this problem might be possible by investing in higher-throughput IO channels (i.e., converting from 2nd-generation SATA to 3rd-generation, or increasing the number of lanes in a PCI Express link).
However, cases where even state-of-the-art physical IO channels cannot match accelerator throughput requirements are not rare [16].

The use-case application chosen for this section demonstrates the operation of two different IO protocols, Ethernet and SATA, and more importantly illustrates the effort that is required to integrate their respective controllers into the FPGA-accelerated data path. In doing so, we find that the limitations in controller throughput do indeed become the bottleneck to system-level performance. In the latter part of this section, we demonstrate how such IO bottlenecks in the system can be exploited to create interfaces to these controllers that are software driven and less complex to design.

The DIALIGN algorithm

Bioinformatics algorithms generally lend themselves well to hardware acceleration due to their inherent degree of parallelism. DIALIGN [17] takes an input query sequence against another input reference sequence, and aligns/matches segments of the query sequence to the regions of the reference sequence that offer the highest (optimal) degree of similarity. The output of DIALIGN is a mapping of the query-sequence coordinates to various reference-sequence coordinates (i.e., regions).

The input data to DIALIGN is a series of 1-byte characters that represent either DNA or protein strands of query and reference sequences for comparison (cross comparisons between DNA strands and protein strands are never made). DNA representation requires only 4 distinct characters, so a 2-bit data representation would suffice; protein variations, however, require more than 16 characters to represent. Thus an 8-bit data representation is favoured to accommodate both types of input sequences.

Acceleration in hardware is achieved by a systolic array architecture implemented by Boukerche et al. [18]. Their systolic array architecture (Fig.
3) stores each character of the query sequence within a PE, then streams the reference sequence across the chain of PEs to compute a matrix scoring scheme. Ideally, this architecture performs best when all of the characters of the query sequence fit entirely into the total number of synthesizable PEs. Given that query-sequence sizes of interest today are in the mega-character range, this ideal scenario is not feasible even with relatively large FPGA devices. Thus several passes of the reference sequence must be made across multiple partitions of the query sequence.

DIALIGN implemented with fully RTL based IO interfaces

There are two IO channels that the DIALIGN application running on our FPGA platform relies on. The first is the Ethernet channel for external communication, and the second is the SATA channel for internal deep-storage access.

Between the two IO channels, the Ethernet channel has the simpler interface to the accelerator. Here we use the word interface not in the sense of a standard protocol by which the ports of two or more communicating hardware modules must be connected, but in the more general sense of any necessary changes in frequency, data width, or timing that must be performed in order to achieve a viable data path between any two hardware components. The two primary components of interest in this article are the accelerator (or, more specifically, the systolic array for this particular application) and an IO controller. The interface is then the mechanism by which the IO controller transfers data to the native input or output ports of the accelerator. This mechanism, by definition, should at a minimum support two basic features: data-width conversion and cross-clock-domain stability.

Our system was implemented on the Digilent XUP-V5 development board [19], which features a 1-Gbps physical Ethernet line, two SATA headers/connectors, and the Xilinx XC5VLX110T Virtex 5 FPGA.
Our system consumed 88% of the device logic with the accelerator and the (hardware) IO interfacing resources combined. With the accelerator synthesized to utilize 50 PEs, the system is capable of operation at 67.689 MHz. The SSD connected to our board is the Intel SSD_SA2_MH0_80G_1GN model, with a capacity of 80 GB. It offers SATA 2 (3 Gbps) line rates, and its manufacturer's technical specifications state 70 and 250 MB/s sequential write and read bandwidths respectively. A 1-gigabyte reference sequence and a 200-character query sequence were synthetically generated on the host side for experimentation (Fig. 4).

The reference sequence is at most 1 byte per character, and at our synthesized accelerator operational speed of 67.689 MHz, this requires only 68 MB/s of Ethernet controller throughput to sustain the accelerator's maximum throughput level. Therefore, this required Ethernet throughput rate of 68 MB/s does not become a bottleneck to the system, and the Ethernet interface specifications will not be discussed further. In comparison to the Ethernet controller, however, the SATA controller does pose significant system bottleneck issues.

Fig. 3 Array Architecture for Single-Partition processing [18]

Intermediate data storage support between partitions

Given that partition switches across the PEs are necessary to support increasing lengths of query sequences, the demand that this process places on intermediate storage requirements when gigabyte reference sequences are streamed is now described. Three partition switches across a systolic-array architecture of 50 PEs daisy-chained together effectively create the logical equivalent of 200 PEs being used in a similar daisy-chain fashion [20]. Thus, all of the data being produced by the last (50th) PE at each clock cycle of a single partition's operation must be collected and stored.
This stored data will then be looped back and fed as contiguous input to the 1st PE when the next partition is ramped up, thereby creating the required illusion that 200 PEs are actually linked together (Fig. 5).

During the systolic-array operation of the PEs, there are seven pieces of intermediate data that flow from one pipeline stage to the next. The abstract Load/Store FIFO Blocks (Fig. 5) are implemented as 7 banks of 32-bit-wide FIFOs (Fig. 6), with each bank capturing one of the 7 output values from the terminal PE. This flow of intermediate data generated by the DIALIGN accelerator is output or input at a significantly high throughput rate of 1896 MB/s. There are seven native accelerator ports in each direction that push and receive this intermediate data to and from external storage, with each port being 4 bytes wide. The accelerator operates at 67.689 MHz, and at 28 bytes every clock cycle (assuming no stalls by the Ethernet controller) this produces the one-way throughput of 1896 MB/s.

The fourteen accelerator ports mentioned above are what feed the Store FIFO Block and receive data from the Load FIFO Block in Fig. 6. At the very bottom of Fig. 6, data heading downstream is received by the SATA controller ports. The native port width of our particular SATA controller [21] is 16 bits. The controller clock runs at a frequency of 150 MHz, such that its 2-byte input port offers a bandwidth of 300 MB/s, which is the maximum bandwidth of SATA 2 devices. Observed SATA throughput, however, is a combination of the controller's implementation and the particular drive that it is connected to. With the Intel SSD used in our system, we experimentally observed maximum sequential write and read throughput rates of 66.324 and 272.964 MB/s respectively.

Fig. 4 Ethernet and SATA IO channels to the XUP 5 Development Board

Fig.
5 FPGA based Storage Interfaces for Query-Sequence Partitioning

The accelerator-to-IO-controller interface for this channel consists of all the data-path and control units depicted in Fig. 6. It serves all three of the functions within our definition of an interface: data-width conversion, clock-domain crossing, and flow control. Figure 6 is an intentionally simplified block diagram of this interface that hides the details of implementation. In this section, such details will be elaborated so that they may be fairly compared to the alternative software interfaces in later sections.

Within the Store FIFO Block of Fig. 6, below the 7 FIFO segments, a multiplexer is depicted to convert the 224-bit (7 x 32-bit) data path from the accelerator to only 16 bits. This is an oversimplification; in reality, a mux-control FSM in conjunction with seven other 2-to-1 muxes is used to accomplish this task. The FSM also helps to perform another mandatory task: framing the data for SATA protocol writes. The SATA controller writes to disk in single-sector segments of 512 bytes per sector. The 28 bytes of data produced by the accelerator at each cycle cannot be evenly packed into 512-byte frames; every 18 cycles of accelerator operation produce only 504 bytes of data that must be packed into a 512-byte frame. The FSM packages the payload data with 4 bytes each of Start-of-Frame and End-of-Frame encoding, such that a single frame can be successfully detected, and stripped of its payload data, during the return (backward) path of data flow.

Fig. 6 Data flow to and from the Load/Store FIFOs to SSDs (not all connections to the CLU are depicted)

Stuffing the remaining 8 bytes of a 512-byte sector with the first 8 bytes from another cycle of accelerator operation would create the additional overhead of realigning FIFO-bank lines when the sectors are eventually read back.
Moreover, if a FIFO-bank line crosses two sectors, and the controller has not yet returned the 2nd sector, some individual FIFOs within the bank would become empty while others are not, further exacerbating the synchronization issues that would have to be resolved.

Lastly, the mentioned mux-control FSM is designed to operate at the faster SATA controller clock frequency of 150 MHz, such that 16 bits of output can be produced at the maximum bandwidth of 300 MB/s. Note that the 224-bit to 16-bit ratio means that an accelerator running at even 1/14th of the SATA frequency is capable of providing enough data to sustain a throughput of 300 MB/s from this FSM.

To briefly recap the primary hardware components that went into this interface: there were two FSMs, one at the front end to perform data-width conversion and data framing, and a back-end FSM to issue SATA commands and handle error recovery. Alongside these primary FSMs, non-trivial arrangements of FIFOs and muxes were utilized to assist with data packing and flow control. The most important aspect of these hardware considerations is that all of these components were designed with the architectural requirement of running at the native SATA controller frequency of 150 MHz, such that data bandwidth was never compromised by the interfacing. However, the much slower SATA write throughput imposed by the actual physical SSD negates this over-provisioning of the interface logic.
This then allows other, simpler (and slower) IO interfacing options to be considered while still achieving a comparable system-level runtime. The question that we will soon resolve is: given that the SATA controller's actual end write throughput is much lower than its input-port bandwidth capacity, can we relax the aforementioned frequency requirements of the interface in favour of a simpler architectural solution?

DIALIGN implemented with software IO interfaces

Adopting software-controlled IO interfaces by definition requires at least one embedded processor to execute the software control. As such, the IO controllers in the system must also support communication over whatever bus standard is supported by the chosen processor.

The development board that was used in the hardware interface experiments is held constant and is used within this section as well. As such, the Xilinx MicroBlaze [22] soft-core processor is the embedded processor around which our software solution will be explored. The MicroBlaze ISA supports the following three bus protocols for external communication:

a) Local Memory Bus (LMB): low-latency direct access to on-chip block-RAM memory modules;
b) Processor Local Bus (PLB): a shared bus intended for processor peripherals and memory-mapped IO;
c) Xilinx CacheLink (XCL): similar to the PLB, but supporting burst transactions and exclusive to the processor's instruction and data caches.

The two IO controllers used for the DIALIGN application with fully RTL interfaces in the "DIALIGN implemented with fully RTL based IO interfaces" section were the SIRC-based Ethernet controller [23] and the Groundhog SATA controller [21]. Because the Ethernet controller makes use of only block RAMs at its top level for data intake and delivery, it can be adapted to connect with the MicroBlaze LMB interface.
However, the Groundhog SATA controller, although it allows a lot of design flexibility to interface directly with its SATA command layer via custom hardware, is not capable of being connected directly to a PLB interface without significant design additions. As such, the SATA2 controller [24], another open-source SATA controller that is more amenable to embedded solutions, is adopted in this section. Both controllers are comparable in their SATA 4 KiB write tests, with the SATA2 controller performing slightly worse using comparable SSDs. Therefore, the alternative software solution being explored in this section is not unfairly biased in its favour via a better SATA controller.

The SATA2 controller makes available the following two bus protocols for communication with the core: PLB and Native Port Interface (NPI). The PLB has been previously introduced; the NPI, however, is another protocol, exclusive to Xilinx's DRAM controller, that is being introduced here. For embedded systems' use, Xilinx makes available their Multi-Port Memory Controller (MPMC). The MPMC supports various interfaces on its ports, two of them in this case being PLB and NPI interfaces. An NPI port, as its name implies, is meant to provide the closest matching signals and data-width to those of the physical DRAM being controlled. For example, the off-chip DRAM modules on the XUP5 development board have a 64-bit wide combined data path, and thus the NPI in this case would also accommodate a 64-bit port. This is in contrast to the PLB port, which would be set to a 32-bit port as dictated by the MicroBlaze ISA. Therefore, an MPMC instantiated with both PLB and NPI ports allows the MicroBlaze processor access to DRAM, and also allows a Direct Memory Access (DMA) module or peripheral access to native memory widths respectively (Fig.
7). With the addition of a complementary NPI-based DMA, the SATA2 core has direct access to DRAM with little MicroBlaze intervention, thereby avoiding having the throughput to the SSD restricted by PLB limitations as well.

HW/SW partitioned architecture

Given the feasibility of interconnecting the Ethernet and SATA controllers utilized by the DIALIGN algorithm to standard bus protocols, the existing EDA tools offered by FPGA vendors for embedded systems design can be leveraged to manage data flow in software, while allowing the computationally intensive accelerator portions to remain in hardware. Newer interfaces such as AXI [25] exist to interconnect the programmable-logic data path to an embedded processor; however, due to the limitations on IP for the Virtex 5 development board, our architecture utilizes Fast Simplex Link (FSL) [26] based communication.

Fig. 7 Embedded MicroBlaze access to the SATA2 Core and DRAM [24]

Our architecture combines a varying number of FSL channels, along with the Ethernet and SATA controllers, to form a flexible and easily controllable data path for hardware acceleration (Fig. 8). The number of FSL links used to connect the configurable accelerator to the embedded system depends on the needs of the application being accelerated. Each link, as mandated by the FSL interface, has configurable FIFOs built in. Although a minimal dual-cycle latency for data is possible via the FSL links, the total bandwidth between the accelerator and the processor will be limited by the lower clock speed of the MicroBlaze processor.
Nonetheless, the potential for ease of automated configuration is high, because the FSL lanes can be scaled according to the number of input and output ports required by the accelerator, and because the data-width of each port can be set accordingly as well. The low aggregate bandwidth between the accelerator and the processor that comes with the use of FSL lanes does not introduce a new throughput bottleneck into the system for all applications. If the application already has an IO channel throughput bottleneck that is lower than the FSL channel rate, the ease of automation gained by using the FSL lanes will have no adverse effect on overall system throughput.

Soft-processor data management

As depicted in Fig. 8, all input to the alignment accelerator is received over the Ethernet channel. During accelerator operation, one DNA character (a single byte) of the reference sequence is required per cycle. The accelerator runs at a maximum frequency of 67.689 MHz and thus requires 68 MB/s of input bandwidth to sustain operation at full throughput. Through experimentation, we have verified that the SIRC Ethernet controller can provide a stream bandwidth rate between 58.88 MB/s and 60.35 MB/s. However, this does not cause the SIRC stream to become the bottleneck of the system. The greater bottleneck is discussed below.

Fig. 8 Hardware/Software Partitioned Infrastructure

The SATA 2 (3 Gbps line rate) protocol allows for a maximum theoretical bandwidth of 300 MB/s. Real SSDs, however, offer much lower rates, especially on write bandwidth. The SSD used in our experiments revealed maximum sequential write rates of 66 MB/s.
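The IO budgets on both sides of the accelerator can be tallied with a short host-side sketch (function names are ours; the seven-word output figure is detailed in the next paragraph):

```c
/* IO budget bookkeeping for the DIALIGN accelerator (host-side
 * sketch; names are ours).  Rates are in MB/s, with 1 MB/s taken
 * as 10^6 bytes per second. */

/* Input side: one reference-sequence byte per cycle at 67.689 MHz. */
double input_required_mb_per_s(double accel_mhz)
{
    return accel_mhz * 1.0;              /* ~68 MB/s needed */
}

/* Output side: seven 32-bit words per cycle (detailed below). */
double output_produced_mb_per_s(double accel_mhz)
{
    return accel_mhz * 7.0 * 4.0;        /* ~1895 MB/s produced */
}

/* Ceiling on what a 32-bit MicroBlaze at 125 MHz can move. */
double microblaze_ceiling_mb_per_s(double mb_mhz)
{
    return mb_mhz * 4.0;                 /* 125 * 4 = 500 MB/s */
}
```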
The DIALIGN accelerator produces seven 32-bit words of intermediate data per cycle at a clock rate of 67.689 MHz, or 1895 MB/s of write data. The design goal of this section is to capture the 1895 MB/s of data produced by the accelerator logic, bring it into the soft-processor-based embedded system domain, and feed it to the SSD controller. As long as this process can be implemented without dropping below the physical SSD write rate of 66 MB/s, the use of an embedded system for data management does not degrade system performance relative to custom RTL-based data management. Given that a conventional MicroBlaze softcore processor can be run at a frequency of 125 MHz and that it operates on 32-bit word operands, this provides an upper bandwidth limit of 500 MB/s to work within. Realistic bus transfer rates, along with actual DRAM write/read rates in implementation, will be considered to determine the degradation from this 500 MB/s upper limit.

The first stage of data transfer is achieved via seven FSL links that give the MicroBlaze core access to each of the seven accelerator output operands using distinct FSL channel ID numbers in software. The Xilinx FSL link interface includes built-in FIFOs on each channel. All seven FIFOs on the links are set to the same depth. The accelerator concurrently writes to all FIFOs on each clock cycle of its operating frequency; the processor, however, reads from the FIFOs in round-robin fashion. Accelerator operation can be paused in response to any almost-full control signals from the FSL FIFOs.

The XUP5 development board used in our design uses DDR2 DRAM, controlled by the Multi-Port Memory Controller (MPMC) IP [27]. Although the PLB [28] interface supports burst-mode data transfers to the MPMC, MicroBlaze does not support burst transfers over the PLB. As a workaround, a data cache can be incorporated into the MicroBlaze design, and the Xilinx Cache Link (XCL) [22] can then be utilized for burst-mode data transfers to DRAM.
This is the optimization we have employed to achieve native 64-bit-wide data writes into DRAM via write bursts issued by the MicroBlaze data cache. We also ensured that the "Use Cache Links for all Memory Accesses" parameter was set, so that data is pushed onto the XCL even if the data cache is disabled.

Software functions to read/write the FSL channels are provided within Xilinx's SDK environment, and 32-bit words of data can be accessed at single-cycle speeds. Thus the processor running at 125 MHz can consume data from the FSL links at a bandwidth of 500 MB/s. From there, the rate at which the data-cache bursts write to the MPMC will be the first bandwidth funnel. The upper bound on throughput from the MicroBlaze to the MPMC is affected by several factors. On the source (MicroBlaze) side, data may not always be injected at the upper limit of 500 MB/s, because some MicroBlaze overhead cycles will be spent gathering the data from multiple FSL channels. On the sink (MPMC) side, write throughput will depend on any other concurrent reads or writes occurring to the same physical DRAM (via the SATA controller) and on how effectively the arbitration policy of the MPMC is executed. These complexities prohibit the derivation of any practical analytical method to infer reasonable throughput levels between the MicroBlaze and the MPMC.

After the data has been streamed by the MicroBlaze from the FSL channels into DRAM, the SATA2 core can begin to consume it from there. The modified SATA2 core [24] that we employ uses the Native Port Interface (NPI) [27] of the MPMC for its own internal DMA-based transfers. The DMA setup and start commands are still issued in software by the MicroBlaze communicating with the core over the PLB. The SATA2 core's internal NPI-based DMA engine can obtain data from the MPMC at a throughput of 240 MB/s.
This rate has been experimentally verified, and it drops to no less than 200 MB/s when the MicroBlaze is simultaneously issuing data writes to the MPMC via its XCL port. In this scenario, the MPMC internally performs round-robin arbitration between the XCL writes and the NPI reads. Nonetheless, even a 200 MB/s rate of data delivery to the SATA2 controller far exceeds the 66 MB/s write rate of the actual SSD. Thus these empirical results, which could not have been obtained in a practical analytical manner, establish that system throughput is not degraded by this software solution.

In creating this embedded-system-based data path, all of the custom logic depicted in Fig. 6 for the hardware interface in the "DIALIGN implemented with fully RTL based IO interfaces" section is no longer necessary. FIFOs to collect the data from an accelerator at variably wide data-widths (up to sixteen 32-bit words) are automatically available within the FSL IP, transfer of that data into DRAM is then controlled in software, and finally the repacking of data words into the appropriate sector write frames for SSDs can also be handled in software.

Furthermore, should the underlying physical SATA core be changed to support future SATA-3 generation devices, or merely wider data-widths of the transceivers used even within SATA-2 devices, these changes can be supported in software. For example, Virtex 5 transceivers (GTP tiles) utilize 16-bit SERDES (Serializer-Deserializer) circuits for their SATA physical-layer links, whereas Virtex 6 transceivers (GTX tiles) were upgraded to 32-bit SERDES transceivers.
These changes propagate all the way up to the command layer of the SATA HBA controller and require the buffers or FIFOs that supply data to the controller to also be converted to matching 32-bit words. If our data path were solely an RTL design ("DIALIGN implemented with fully RTL based IO interfaces" section), then the muxes, FIFOs, and, more importantly, the FSMs would all have to be modified accordingly to pack wider words. In the presently described embedded solution, not only can the 32-bit upgrade be supported entirely in software, but we can also maintain full backward compatibility with any legacy 16-bit SATA controllers, through 32-bit input-word conversion into half-word data writes to DRAM in software.

Resource utilization across interface solutions

In this section, the hardware utilization associated with the solely hardware and the HW/SW partitioned interface solutions will be compared and discussed. Table 1 lists the amounts and percentages of resources consumed on the Virtex 5 XUP5 development board [19] for both interface options. It can be observed that there is roughly a 20% increase in LUT and DFF logic consumed when moving to a HW/SW partitioned solution. The majority of this extra logic can be attributed not only to the MicroBlaze softcore itself but also to the MPMC, its supporting busses, the FSL channels, and the extra DMA unit within the SATA2 controller. As expected for this application, the actual runtime performance of the DIALIGN accelerator is equal between the custom HW and HW/SW interfaces, and thus is not reported in the table for comparison. What is gained by this increase in hardware is a significant reduction in design time and debugging effort. The custom hardware interfaces are very resource efficient; however, they can require 6 months of design to architect, implement, and verify.
Furthermore, they cannot easily be reused when an IO controller update is required.

Video application study

Motion estimation implemented with fully RTL based IO interfaces

As a motivator for the case study described in this section, the reader is urged to consider the system-level challenges of a commercial application such as Netflix [29]. Although an application such as Netflix may in its infancy start out with a fully customizable computing-cluster environment, as its customer base and data requirements expand, a full warehouse data-centre infrastructure is often inevitable [30]. In this scenario, customized acceleration of the encoding required for all standards of input video sources, and of the output video stream resolutions produced, will be challenging. Therefore, cloud-based solutions are often sought to handle the scale and possible automation of the encoding workloads [31]. It is precisely this sort of environment that our solution aims to target. In this environment, the flexibility to scale the accelerators that handle the compute-bound portions of the application's code can be leveraged to manage IO constraints if, and only if, the IO interface architecture is appropriately designed and matched.

Acceleration of Motion Estimation (ME) for the H.264 video encoding standard [32] on FPGAs uses only a memory controller for off-chip DRAM access as its single channel for IO data. We examine ME in this section because, from an IO perspective, it is fundamentally different from the previously covered DIALIGN application, in that neither its physical IO channel device (DDR2 DRAM) nor its IO controller (the Multi-Port Memory Controller) is a bottleneck to accelerator performance. Nonetheless, ME is still a relatively IO-dependent application. What will be demonstrated in this section is that the requirements of this application, and how these requirements are exploited, combine to prevent IO throughput from becoming an issue.
We show that this form of application-requirement exploitation can be achieved through the careful customization of the application's interface to the memory controller.

Table 1 DIALIGN alignment accelerator with custom HW interfaces vs. HW/SW IO interfaces

                              LUTs   LUTs   DFFs   DFFs   LUTs %     DFFs %     Accelerator       MicroBlaze
                              #(K)   %      #(K)   %      increase   increase   frequency (MHz)   frequency (MHz)
  HW Interface                51.0   73.8   36.9   53.4   -          -          67.7              N/A
  HW/SW Interface with
  MicroBlaze                  64.5   93.3   52.0   75.2   19.5       21.8       67.7              125

  (Area: Xilinx's Virtex 5 devices use 4 DFFs & 4 6-input LUTs per Slice)

External DRAM throughput requirements

A memory hierarchy (Fig. 9) is used to move portions of a large data set stored at the non-volatile level (i.e., a video file) into a medium data set stored in DRAM (video frames), and then finally into small data sets of on-chip buffering (sub-frame blocks) to support acceleration. It will soon be detailed that a throughput of only ~620 MB/s from off-chip DRAM to on-chip block rams suffices for this application. Thus even a modest DDR2 memory controller will not become a bottleneck, given the required 620 MB/s of read throughput (to supply data to the accelerator) and the required 125 MB/s of write throughput (to load video data coming in from the SSD at a speed of at most 60 fps).

A single pixel is often encoded as a single byte. Thus a single 1920×1088 High Definition (HD) frame occupies 2.09 MB of memory. Since on-chip FPGA memory is commonly limited to between 2 and 6 MB, video frames must be encoded in partial segments sequentially. The 1920×1088 frame can be encoded as eight equal segments of 960×272 Sub-Frame (SF) portions at a time.
The 960×272 SF allows a more manageable 261.1 KB of pixels to be encoded within on-chip memory at a time. The VBSME algorithm [32, 33] requires the video frame that is to be encoded (the present frame) to be compared to at least four other (reference) frames before conclusive calculations of its corresponding motion vectors are reached. Therefore, in addition to the 1/8th portion of the present frame (PSF), four other such portions of reference frames (RSFs) must be held within on-chip memory as well (Fig. 10), which brings the total memory space thus far to 1.3 MB. Here, we introduce the design concept of double buffering: the first buffer is initially loaded with data to be processed, processing then commences, and simultaneously the second buffer is loaded with the next set of sub-frame data, thereby overlapping memory access latency with processing time. This brings the total required on-chip memory allocation to 2.6 MB for double buffering, which even modest FPGA devices can support. FPGA devices with larger amounts of on-chip memory can opt to double buffer larger sub-frame portions, but we note that doing so only reduces the required memory-controller throughput rate. This is because, for VBSME, an increase in the data to be processed is not linearly proportional to processing time. Therefore, double-buffering larger sub-frame segments allows the acceleration to be more compute-bound than IO-bound, and thus lowers the required external memory throughput.

The basic building blocks for which motion vectors are produced as outputs in motion estimation are 16×16 blocks of pixels, commonly referred to as macro-blocks. Within the 960×272 sub-frame portion size that we are using in this work, there are 1020 macro-blocks that need to be encoded.
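The sizing arithmetic above, and the external-throughput figure derived from it in the next subsection, can be reproduced with a short model (ours, with illustrative names; one byte per pixel, and the 99-cycles-per-macro-block figure is taken from the accelerator's 16-PPU operation described later):

```c
/* On-chip buffer sizing and required DRAM throughput for VBSME
 * (host-side model; names are ours, one byte per pixel). */

enum { SF_W = 960, SF_H = 272, MB_DIM = 16, REF_FRAMES = 4 };

/* Present sub-frame plus four reference sub-frames: ~1.3 MB. */
unsigned long working_set_bytes(void)
{
    return (unsigned long)(1 + REF_FRAMES) * SF_W * SF_H;  /* 1 305 600 B */
}

/* Double buffering doubles the allocation: ~2.6 MB. */
unsigned long double_buffered_bytes(void)
{
    return 2UL * working_set_bytes();
}

/* (960/16) x (272/16) = 60 x 17 = 1020 macro-blocks per sub-frame. */
unsigned macro_blocks_per_sf(void)
{
    return (SF_W / MB_DIM) * (SF_H / MB_DIM);
}

/* Required DRAM read throughput in MB/s at a 200 MHz accelerator:
 * 99 cycles per macro-block per reference frame (16-PPU case). */
double required_dram_mb_per_s(void)
{
    unsigned long cycles = (unsigned long)macro_blocks_per_sf()
                         * 99UL * REF_FRAMES;       /* 403 920 cycles */
    return (double)working_set_bytes() / cycles * 200e6 / 1e6;  /* ~646.5 */
}
```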
For reasons that will be clarified in the sections to come, the number of clock cycles required to encode a single macro-block against a single reference frame is 99 cycles when the accelerator is scaled to its highest level of acceleration. We intentionally use the highest level of scaling in these calculations to demonstrate the upper bound on the required external memory throughput. Next, recall that VBSME requires the present frame to be compared against four reference frames when calculating the motion vectors for its macro-blocks. Therefore, the 99 cycles become 396 cycles in total per macro-block. We can now state the entire processing time of a sub-frame portion to be 403 920 cycles (1020 macro-blocks × 396 cycles per macro-block).

Fig. 9 VBSME Accelerator Memory Hierarchy at 16-PPUs Data Rate

Fig. 10 Memory Controller to Accelerator Buffering

As detailed earlier in this section, the total data required to process a sub-frame against its four reference frames is 1.3 MB (1.3056 MB exactly, Fig. 10). The maximum frequency of the implemented VBSME accelerator on our chosen FPGA device is 200 MHz. Thus the final required external memory throughput with the use of double buffering can now be derived as 646.5 MB/s (1.3056 MB / 403 920 cycles × 200 MHz).

VBSME accelerator operation

For the purposes of this article, it suffices to view the VBSME accelerator as a scalable black-box accelerator. The scalability of our black-box accelerator is measured in terms of the number of Pixel Processing Units (PPUs) that it is instantiated (synthesized) with. In our system implementation, the offered set of scaling levels is 1, 2, 4, 8, or 16 PPUs. Looking into this black box slightly, for the purpose of understanding its IO requirements, each PPU can be viewed as a single 16×16 two-dimensional array of Processing Elements (PEs) (Fig. 11).
A single PPU requires two separate memory busses, each 16 bytes wide. The necessity of this dual-bus architecture stems from the need to avoid any pipeline bubbles within the hardware architecture for VBSME [34]. These dual buses are each independently connected to two physically separate memory partitions, A and B, that contain two vertically segregated logical partitions of the search-window memory space (Fig. 12).

Although each PPU requires, in total, an input path of 64 bytes across its two buses, fortunately this relationship does not hold when scaling the VBSME architecture to use multiple PPUs in parallel. The relationship for the number of bytes required when scaling upwards to n PPUs is 15 + n bytes on each bus [35]. Thus an instance of a VBSME accelerator scaled to 16 PPUs requires only 31 bytes on each bus. This derives from the fact that each additional PPU always shares 15 pixels in common with its preceding PPU (Fig. 13).

Search-window buffer scaling

Because the input bus-width requirements of the VBSME accelerator vary as it scales, as discussed previously, the search-window buffers must also be scaled (synthesized) in varying patterns of data organization to accommodate the scaling.

Fig. 11 PPU Black Box View, with 2 Input Buses [39]

Fig. 12 Logical to Physical Segregation of Memory Partitions A and B [39]

The VBSME accelerator processes a search window by accessing its rows (top to bottom) one cycle at a time. Accesses to the last 15 rows of the search window, coming from Memory Partition B (Fig. 12), are always overlapped with the 33 rows being accessed from Memory Partition A [34]. Thus 33 cycles are all that is required to finish a vertical sweep of the search window.
However, not all of the 64 columns of pixels within a 64×48 search window are read on each row access. The number of columns (15 + n) read per row depends on the number of PPUs that the accelerator is scaled to. Once a vertical sweep of these columns is performed, a horizontal shift to the right is performed to start the next vertical sweep. The granularity of this horizontal shift to the right is exactly equal to the number of PPUs being employed. Therefore, the number of vertical sweeps required to sweep a search window of width w in a top-down, left-to-right manner is given by (w − (15 + n))/n, rounded up to the next whole sweep.

Thus, in our implementation, the VBSME accelerator scaled to 16 PPUs requires a total of 3 vertical sweeps to process the search window ((64 − (15 + 16))/16, rounded up). These 3 vertical sweeps of 33 cycles each result in a total of 99 cycles needed to completely process the search window. As another example, if the VBSME accelerator is scaled down to a single PPU, the number of required vertical sweeps is 48 ((64 − (15 + 1))/1), resulting in a total of 1584 cycles required to complete the search.

Figures 14 and 15 represent the logical columns (Lx_A) of the horizontal shifts that are required when the VBSME accelerator is scaled to 16 PPUs and 8 PPUs respectively. In both cases, the width of each Lx_A partition is equal to the number of PPUs being used. The number of vertical sweeps, and thus the total processing time, is also labeled in the figures for each case according to the previously explained formula.

In order to support the access of these logical partitions, which are in turn based on the granularity of the PPUs being scaled, different mappings of these logical partitions onto a varying number of physical block-ram banks must be implemented.
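The sweep counts above follow from the formula with the division rounded up (our reading, since 33/16 and 41/8 are not exact); a short model with illustrative names:

```c
/* Vertical-sweep and cycle counts for a w-pixel-wide search window
 * processed by n PPUs (host-side model; names are ours).  Each sweep
 * takes 33 cycles with memory partitions A and B overlapped. */

enum { CYCLES_PER_SWEEP = 33 };

unsigned vertical_sweeps(unsigned w, unsigned n_ppus)
{
    unsigned cols = 15 + n_ppus;                 /* columns read per row */
    return (w - cols + n_ppus - 1) / n_ppus;     /* (w-(15+n))/n, rounded up */
}

unsigned search_window_cycles(unsigned w, unsigned n_ppus)
{
    return vertical_sweeps(w, n_ppus) * CYCLES_PER_SWEEP;
}
```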
The number of block-ram banks required for this logical-to-physical translation of search-window data is determined by (15 + n)/n [35]. The logical-to-physical implementation of search-window buffering for the 16-PPU and 8-PPU cases of scaling is represented in Figs. 16 and 17 respectively. As shown in the figures, the required hardware includes both block-rams and multiplexers to implement the required data-path flexibility during the various clock cycles of search-window processing. An important component not depicted in the figures is the Control Logic Unit (CLU). These CLUs are implemented as Finite State Machines (FSMs) for each level of accelerator scaling, and perform the duties of generating the appropriate block-ram addresses and mux selection control signals.

Fig. 13 Input Bus Pixels Shared Amongst PPUs [35]

Fig. 14 Three Vertical Sweeps (99 cycles) for a 16-PPUs Case

A custom RTL implementation consisting of the above-mentioned block-rams, muxes, and FSMs must exist within the "Double Buffering of Search Windows" functional block of Fig. 10, within each of the search windows shown. This requires a significant amount of RTL development and debugging to be performed for each level of VBSME accelerator scaling that is to be implemented. In the following section, an embedded soft-core approach to memory organization that obviates the need to redesign these hardware components for each level of scaling is presented.

Motion estimation implemented with software IO interfaces

Before any video compression algorithm can be run, it is of course necessary for the raw video frames to be available within DRAM.
From the details of the previous section, it is clear how the FPGA device itself, within our proposed embedded system, can directly access raw video data from a secondary storage device such as SATA SSDs. Once video data is held within DRAM, the 256 MB or more of DRAM capacity is enough to sustain the buffering of frames in a manner that does not render the SATA-core-to-DRAM bandwidth the bottleneck [35].

To support the VBSME algorithm introduced in the "External DRAM throughput requirements" section in terms of physical design within the embedded system, only two main parameters need to be resolved: first, how many FSL channels need to be instantiated for each level of accelerator scaling, and second, what the data-width of each channel needs to be. If this is done correctly, the manner in which the memory subsystem is implemented will be transparent to the accelerator logic. Apart from answering these two parameter questions, the data pattern by which pixels should be loaded into the FSL FIFOs is another important algorithmic challenge.

The number of FSL channels to be instantiated can be resolved via the following formula for VBSME scaling:

x = 2 × (15 + n)/4

Here x refers to the number of FSL channels that need to be instantiated, and n refers to the number of PPUs implemented according to the level of accelerator scaling. 15 + n is the number of pixels that must be forwarded to the accelerator unit for its n units of scaling ("Search-window buffer scaling" section). Since each pixel is represented by 8 bits, a single 32-bit FSL channel can forward 4 pixels of information. The factor of 2 derives from the accelerator requirement of having dual memory partitions ("VBSME accelerator operation" section). If the number of pixels required per clock cycle by the accelerator (the numerator in the above equation) is exactly divisible by 4, then each FSL channel will be set to the maximum width of 4 bytes (32-bit words).
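The channel-count formula above can be sketched as follows (our model, with illustrative names; we assume the division rounds up, which is consistent with the five channels per partition used for the 2-PPU example later in this section, and that a non-multiple-of-four pixel count narrows the last channel):

```c
/* FSL channel provisioning for a VBSME accelerator scaled to n PPUs
 * (host-side model; names are ours).  Each of the two memory
 * partitions needs 15 + n one-byte pixels per cycle, carried over
 * 32-bit (4-pixel) FSL channels. */

unsigned fsl_channels(unsigned n_ppus)
{
    unsigned pixels = 15 + n_ppus;
    return 2 * ((pixels + 3) / 4);     /* x = 2 * (15 + n)/4, rounded up */
}

/* Byte width of the last channel in each partition's set: a full
 * 4 bytes when (15 + n) divides evenly by 4, else the remainder. */
unsigned last_channel_bytes(unsigned n_ppus)
{
    unsigned rem = (15 + n_ppus) % 4;
    return rem ? rem : 4;
}
```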
However, if the quotient leaves a remainder when dividing by four, then a modulus function can be applied to determine the data width of the last FSL channel. Since the FSL widths are configurable, this customization can be achieved.

Fig. 15 Six Vertical Sweeps (198 cycles) for an 8-PPUs Case

Fig. 16 Logical to Physical Memory Mapping for 16-PPUs

In the "Search-window buffer scaling" section, block-rams and muxes were introduced as the components that constituted the underlying memory substructure. In this section, we detail the use of the built-in FIFOs in the FSL channels as a transparent substitute for the custom RTL-designed components. In a pure block-ram-based buffering system, memory addressing can be used to access data repetitively. However, if the block-rams are instantiated as FIFOs, once a specific memory-data read occurs, the data cannot be accessed again without being re-pushed into the FIFO queue. Thus, although a memory interface based on FIFOs is simpler to integrate and control, the continual loading of data into these FIFOs will follow a more complex pattern. This is the trade-off to be made, and this is also where we leverage the flexibility of soft-core-based transfers from DRAM to support the complex patterns of FIFO loading.

Fig. 17 Logical to Physical Memory Mapping for 8-PPUs

Once the number of FSL channels has been set according to the previous formula, columns of pixel data from the search window of interest must be initially transferred into these channels in a left-to-right manner. After this initial filling of the FIFOs within the FSL channels, the sweeps across the search window by the accelerator ("Search-window buffer scaling" section) are accomplished by horizontal pixel-column shifts in the search window that equate to the number of PPUs the accelerator is scaled to. In Figs.
14 and 15, this shift-width can be logically viewed as the partition width (e.g., blocks L0_A, L1_A, etc.). In the case of the accelerator scaled to 16 PPUs, the partition widths will be 16 pixels; in the case of 8 PPUs, 8 pixels; and so forth.

When using the FSLs for pixel-column shifts, since their channel width is set to the maximum of 4 pixels (to optimize data transfers), shifts that occur in multiples of 4 can be handled by relatively straightforward 32-bit memory-word reads from DRAM. Such is the case for the accelerator scaling levels of 16, 8, and 4 PPUs. However, as the accelerator scaling falls to 2 PPUs, or even a single PPU, the required FSL granularity of shifting becomes 1/2 FSL width and 1/4 FSL width respectively. Since the MicroBlaze ISA is byte addressable, word accesses from DRAM can be packed/re-ordered such that a finer granularity of shifting pixels is supported. At the micro-architectural level, MicroBlaze may make use of barrel shifting within its ALU to get the desired byte from a 32-bit data-bus read. Therefore, the MicroBlaze core was synthesized with its additional barrel-shift logic parameter enabled.

Four-pixel (32-bit word) reads from memory for a 2-PPU accelerator are realigned before being pushed into the FSL channel FIFOs at various time steps, such that they hold the appropriately shifted pixel columns within them (Fig. 18).

For simplicity's sake, Fig. 18 shows only the logical-to-physical mapping of pixel rows 1 to 33 of a search window (i.e., physical memory partition A in Figs. 14 and 16); the reader should keep in mind that another separate set of 5 FSL channels is implemented for pixel rows 34 to 48 of partition B. In Fig.
18, a single time step refers to 33 cycles being completed, or in other words one vertical sweep ("Search-window buffer scaling" section) of the search window being completed.

Fig. 18 Logical pixel columns mapped to 5 Physical FSL Channel FIFOs

Experimental results

Before we present the experimental results of the measured bandwidths in our soft-core-controlled system, we would like to address why experiments are necessary in the first place. One could argue that analytical analysis alone could suffice to determine the feasibility of our system, since the bandwidths and operational frequencies of our embedded system components are already published within their respective technical data sheets. This is true in that such an analysis would result in an upper theoretical bound on what level of system performance is possible. However, this estimate is likely to be overly optimistic and may downplay the realistic result of accelerator throughput degradation in actual implementation.

Many of the embedded system components have non-deterministic latencies and throughput. These components include the MPMC (Multi-Port Memory Controller), the PLB and XCL buses to the MPMC, and even the MicroBlaze processor itself, given the varying loads on the buses with which it interacts. All of these factors combine to make the throughput of an embedded-system-based data transfer mechanism highly variable in nature. This is even more pronounced when the accelerators that draw data from this system belong to varying application domains.

The Virtex 5 device used in the synthesis results for the VBSME accelerator is the XC5VLX330. This chip contains 51 840 Virtex 5 slices (each slice has four 6-input LUTs and four FFs).
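The slice count just given fixes the device's LUT and flip-flop totals, from which the utilization percentages reported for the VBSME accelerator follow (a quick check with illustrative names):

```c
/* Capacity check for the XC5VLX330 (host-side sketch; names are
 * ours): 51 840 slices, each with four 6-input LUTs and four FFs. */

enum { SLICES = 51840, LUTS_PER_SLICE = 4 };

unsigned total_luts(void)
{
    return SLICES * LUTS_PER_SLICE;    /* 207 360 LUTs (and as many FFs) */
}

/* Percent of device LUTs used by a design of luts_thousands K LUTs. */
double lut_percent(double luts_thousands)
{
    return luts_thousands * 1000.0 * 100.0 / total_luts();
}
```

For example, the 16-PPU design's 154K LUTs come to about 74.3% of the device, matching the reported utilization.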
Table 2 presents the area and performance results of the VBSME accelerator when paired with custom RTL implementations of a memory substructure for each level of scaling; Table 3 then compares the area and performance of the accelerator paired with our soft-processor based system of memory data delivery. Similar results for the DIALIGN DNA alignment accelerator were previously listed in Table 1 (reproduced below as Table 4 for reader convenience). Unlike the VBSME application, DIALIGN did not suffer any performance degradation from the adoption of a HW/SW MicroBlaze interface, and thus performance metrics are not listed in that table for comparison. The DIALIGN accelerator was implemented on the 5vlx110tff1136-1 Virtex 5 device, on the XUPV5 development board [19]. As seen in the table, moving to the easier development flow of a HW/SW solution does come at the cost of roughly a 20% increase in resource utilization. This device is a relatively small chip with 17,280 Virtex 5 slices.

Total development time for the embedded-system based design of data movement for both of these application accelerators was measured in weeks, and was under a month at maximum, including debug time. In comparison, custom RTL based interface design and memory management for these accelerators took closer to 6 months, exclusive of debugging hours. Furthermore, the RTL designs cannot easily be reused when an IO controller update is required.

For the embedded system, the development time is quantifiable in two steps. The first step, the actual configuration of the embedded system, can take up to a week or two to design and test, to ensure that the FSLs are connected to the accelerator correctly. The second step is the writing of the embedded software itself, to transfer data from the SSD to the accelerator.
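The shape of that embedded software is essentially a read-and-push loop. The sketch below mocks the FSL push as a host-side function so it can run anywhere; on the actual MicroBlaze the push would be Xilinx's blocking `putfsl` macro, and the DRAM buffer would be filled by the SATA/DMA driver. All names here are illustrative, not from the original code:

```c
#include <stdint.h>
#include <stddef.h>

#define FIFO_DEPTH 16

/* Host-side mock of one FSL channel FIFO; on MicroBlaze the hardware FIFO
 * and the blocking putfsl(word, id) macro take this role. */
uint32_t fifo[FIFO_DEPTH];
size_t fifo_count = 0;

void fsl_put(uint32_t word)
{
    if (fifo_count < FIFO_DEPTH)    /* the real putfsl stalls when full */
        fifo[fifo_count++] = word;
}

/* Stream a DRAM buffer (as filled by a hypothetical SATA/DMA block read)
 * to the accelerator, one 32-bit FSL word at a time. Returns the number
 * of words now queued for the accelerator. */
size_t stream_block_to_accel(const uint32_t *buf, size_t n_words)
{
    for (size_t i = 0; i < n_words; i++)
        fsl_put(buf[i]);
    return fifo_count;
}
```

In the real system this loop is where the application-specific packing and reordering (such as the pixel realignment described earlier) is inserted before each push.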
This resulted in roughly only 500 to 1000 lines of code between the two applications, with the debug hours being greater than the code-writing hours.

Table 2 VBSME accelerator area and performance results with a custom RTL-designed memory subsystem^a

| # of PPUs | LUTs #(K) | LUTs % | DFFs #(K) | DFFs % | Target resolution | Freq (MHz) | fps |
|---|---|---|---|---|---|---|---|
| 1 | 8.71 | 4.20 | 3.42 | 1.65 | 640×480 (VGA) | 200.6 | 28 |
| 2 | 18.5 | 8.92 | 5.49 | 2.65 | 800×608 (SVGA) | 199.0 | 34 |
| 4 | 37.8 | 18.2 | 9.64 | 4.65 | 1024×768 (XVGA) | 198.3 | 42 |
| 8 | 76.4 | 36.8 | 18.0 | 8.68 | 1920×1088 (HD video) | 198.3 | 31 |
| 16 | 154 | 74.3 | 34.6 | 16.7 | 1920×1088 (HD video) | 198.3 | 62 |

^a Xilinx's Virtex 5 devices use 4 DFFs & 4 6-input LUTs per slice

In contrast, the previous work on custom RTL interfaces [35] consisted of five different Verilog modules, one for each level of memory subsystem scaling. Five further top-level Verilog modules were also required to accomplish the correct wiring of the scalable accelerator to the chosen memory subsystem module. Each of the Verilog memory subsystem modules contained the sub-components previously described, such as block RAMs, muxes, and, most time consuming of all, an FSM-based sub-component acting as the CLU. On average there were close to 600 lines of RTL across the five Verilog memory subsystem files.

For the DIALIGN accelerator use case, the hardware interface to the SATA controller consisted of eight Verilog modules, each roughly 500 lines of RTL code. Since these lines of RTL interact with a physical external device, simulation alone is often not enough to diagnose bugs during the debug process; internal logic analyzers were required to pinpoint issues at run time. This level of hardware debugging requires multiple iterations to complete and suffers from long synthesis and place-and-route times between iterations. Visibility into simultaneous variables (i.e. registers) is also limited, usually confined to 1024-clock-cycle sample windows at a time. Thus the verification process quickly grows to the order of months before the system functions as intended.

In contrast, the MicroBlaze HW/SW partitioned interface solutions can be implemented and verified in less than a month, inclusive of debugging hours, since debugging the interfaces in software does not require the FPGA design to be synthesized or placed and routed for each debug iteration. The software is also highly flexible towards future IO controller updates, while retaining backwards compatibility. Above all of these factors that favour an embedded-system approach: when custom RTL access to bare-metal IO controllers is not permitted at all in IaaS environments, an embedded solution may be the only viable option to provide end customers with a flexible method to manage their IO data.

The MicroBlaze processor chosen for the experiments was implemented using the minimum-area tool setting in XPS (Xilinx Platform Studio) and was clocked at 125 MHz. It included barrel-shift logic as the only additional hardware component in its ALU. The VBSME accelerator, using its stand-alone custom RTL memory substructure, could be clocked at no less than 198.3 MHz at 16-PPU scaling, and at a maximum frequency of 200.6 MHz at the single-PPU level of scaling. The bandwidth required to support its native operating frequency is far above what the FSL channels can supply (up to 500 MB/s). Thus the accelerator was underclocked to 100 MHz, to run slower than the MicroBlaze, allowing the MicroBlaze processor to keep up with data delivery demands.
This results in a roughly 50% frame-rate drop in performance across all levels of scaling.

Table 3 VBSME accelerator area and performance results with MicroBlaze (125 MHz) based data delivery^a

| # of PPUs | LUTs #(K) | LUTs % | DFFs #(K) | DFFs % | LUTs % increase | DFFs % increase | Freq (MHz) | fps |
|---|---|---|---|---|---|---|---|---|
| 1 | 22.7 | 10.9 | 6.70 | 9.11 | 6.70 | 7.46 | 100 | 15 |
| 2 | 32.5 | 15.7 | 6.78 | 10.1 | 6.78 | 7.45 | 100 | 16 |
| 4 | 51.8 | 25.0 | 6.80 | 12.1 | 6.80 | 7.45 | 100 | 20 |
| 8 | 90.4 | 43.6 | 6.80 | 16.1 | 6.80 | 7.42 | 100 | 15 |
| 16 | 168 | 81.0 | 6.70 | 24.2 | 6.70 | 7.50 | 100 | 30 |

^a Xilinx's Virtex 5 devices use 4 DFFs & 4 6-input LUTs per slice

Table 4 DIALIGN alignment accelerator with custom HW interfaces vs. HW/SW IO interfaces^a

| Interface | LUTs #(K) | LUTs % | DFFs #(K) | DFFs % | LUTs % increase | DFFs % increase | Accelerator freq (MHz) | MicroBlaze freq (MHz) |
|---|---|---|---|---|---|---|---|---|
| HW interface | 51.0 | 73.8 | 36.9 | 53.4 | N/A | N/A | 67.7 | N/A |
| HW/SW interface with MicroBlaze | 64.5 | 93.3 | 52.0 | 75.2 | 19.5 | 21.8 | 67.7 | 125 |

^a Xilinx's Virtex 5 devices use 4 DFFs & 4 6-input LUTs per slice

Discussion

For the VBSME accelerator system, its custom memory substructure was never the bottleneck of the system. The IO channel that its memory substructure relied on was DDR2 or better device memory, which always supplied a surplus of memory bandwidth beyond what the accelerator required [35]. Thus, in this case, transitioning to a MicroBlaze soft-core data delivery method hurt system performance by 50% or more (i.e. the MicroBlaze based FSL channels became a new IO bottleneck).

Despite the less than acceptable frame rates (< 26 fps) in Table 3 across all but one level of resolution scaling, an important insight is still gained when the data in Tables 2 and 3 are analyzed together. In Table 2, the RTL-designed interface, at the highest 1920×1088 HD target resolution, scaled to 16 PPUs, results in a frame rate of 62 fps. The 62 fps rate, for most video applications, is beyond what is necessary (26 to 30 fps).
However, the same level of scaling, using a software IO interface in Table 3, still produces a usable frame rate of 30 fps. What this reveals is an important tradeoff: dialing up the computational performance of the accelerator can indeed compensate for limited IO throughput, and compensate well enough to achieve acceptable system performance.

At first, this revelation seems fairly counter-intuitive. One would expect that as the computational power of an accelerator is scaled up, existing IO bottlenecks would be exacerbated further, resulting in unacceptable system performance. However, what we see in the frame rates of Table 3 is the opposite. As the accelerator is scaled from a single PPU up to 16 PPUs, the frame rates increase correspondingly as well (except in the 8-PPU row, which is an outlier in both Tables 2 and 3 and will be explained shortly). The reason for this lies in the architecture of the VBSME accelerator, which pairs each doubling of computational performance with only a single-byte increase in the required input bandwidth. Based on this, one can reason that at some level of accelerator scaling, even a modest amount of IO throughput, such as that offered via a MicroBlaze solution, will produce the required frame rate for the system.

The 8-PPU row is an outlier with respect to the correlation between frame rate and the number of PPUs. This is because we are not measuring frame-rate performance while holding the frame resolution constant; the frame rate we measure is against a targeted resolution per number of PPUs used. When video industry standards moved to 1080 HD video from earlier VGA resolutions, the screen aspect ratio was widened. This widening of the frame means there are many more columns of macro-blocks that must be processed during VBSME. Thus the number of computations is significantly higher for HD video.
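The tradeoff above can be captured in a back-of-the-envelope throughput model: the achievable clock is either the accelerator's native frequency or the frequency at which the IO channel can still feed it, and frame rate scales with the PPU count times that effective clock. The constants below are illustrative, not measured values from our system:

```c
/* Back-of-the-envelope model of the compute-vs-IO tradeoff described in
 * the text. All parameter values are illustrative. */
double modeled_fps(unsigned ppus, double f_acc_hz, double io_bw_bytes_per_s,
                   double candidates_per_frame)
{
    /* Required input bytes per cycle: +1 byte per doubling of PPUs,
     * mirroring the VBSME architecture's property noted above. */
    double bytes_per_cycle = 1.0;
    for (unsigned p = ppus; p > 1; p >>= 1)
        bytes_per_cycle += 1.0;

    /* Effective clock: native frequency, or throttled to what the IO
     * channel can sustain at this per-cycle demand. */
    double f_io_limit = io_bw_bytes_per_s / bytes_per_cycle;
    double f_eff = (f_acc_hz < f_io_limit) ? f_acc_hz : f_io_limit;

    /* Assume each PPU retires one candidate computation per cycle. */
    return ppus * f_eff / candidates_per_frame;
}
```

Because doubling the PPU count doubles the numerator while raising the per-cycle byte demand only additively, the modeled frame rate keeps climbing with scaling even under a fixed IO budget, which is exactly the behaviour observed in Table 3.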
As a result, even though we increase the number of PPUs to 8, the increase in HD-resolution workload outweighs the extra PPUs, and thus we do not see an increase in frame rate relative to the preceding row. However, when the HD resolution is held constant and the number of PPUs is further increased to 16, we see the expected doubling of the frame rate.

For the DIALIGN DNA alignment accelerator, even with a custom memory/data delivery system, the slow SSD write rate of 66 MB/s was the IO bottleneck. In this case, transitioning to a MicroBlaze software-controlled data infrastructure offered significant advantages in design time, testing, and future flexibility, with no performance loss at all. However, the area overhead cost of including the Ethernet and SATA controllers as software-driven microprocessor peripherals (versus custom FSM-based controller logic) was just under 20% (for the LUTs used).

It should also be noted that the particular Virtex 5 device used did not contain a hardcore processor or memory controller; thus the MicroBlaze softcore and its required IP peripherals had to consume programmable fabric resources in order to be implemented. This, however, is not the case with all modern FPGA devices on the market today. FPGA vendors have adopted the System-on-Chip model, where a hardened processor and memory controller are well integrated with the programmable logic by default. Furthermore, sales volumes do not necessarily place such devices at higher price points either.
In the case of such devices, the programmable logic overhead cost of 20% reported here would not exist, and the reduction in development time and effort could be gained without incurring the increased logic-area penalty.

Present advances and shortcomings of hardware IO interfaces

On the path to making FPGA design more amenable to software developers, FPGA vendors have adopted software frameworks such as OpenCL [11, 36] for accelerator development. These frameworks, as a side effect of having raised the abstraction level of the hardware design process, have also made it easier to integrate certain IO devices.

As an example, the Ethernet channel used in this article's DIALIGN application was integrated using third-party open-source SIRC cores, which made the controller and a suitable interface for Ethernet access available. Today, FPGA vendors such as Xilinx have made Ethernet controllers directly available via their OpenCL development platforms (Fig. 19).

This step certainly solves the first problem of having at least a controller readily available to access the IO channel. However, the application requirements of how the IO data should be organized and buffered, or when back pressure should be asserted and released, still fall into the user design space. That these design choices fall into the user design space is not in itself a shortcoming; from an architectural point of view, requirements that vary per application should of course fall into that application user's design space. The real limitation is that no abstractions are available in software to manage the IO data stream within the application accelerator's design space.

Fig. 19 FPGA vendor supplied Ethernet infrastructure

The present means of emulating, within the OpenCL environment, the same level of control that an embedded soft-core or hardened processor would have over the data is as follows. The user must first create custom data structures, such as arrays, to contain the incoming IO data. For DIALIGN, the query-sequence and reference-sequence Ethernet data would be segregated, for example, into two separate arrays. Then the control sequences by which the data stream is fed to these two arrays would also be written. Since all of this is still performed in software, there is no drawback between our proposed solution and this framework as of yet.

After all of the data organization and control has been programmed against the available OpenCL APIs, the burden falls completely on the High Level Synthesis (HLS) tool to implement the fine-grained data movement that is desired. As discussed previously, having HLS efficiently manage data across varying clock domains and interfaces is still an open problem [7]. Due to this limitation, the fine-grained data pattern intended by the accelerator developer may not match what is actually implemented, making throughput and latency worse than a manual implementation. This degradation, as shown in this article, may not affect the overall application performance. However, in scenarios where it does, an embedded-core solution will offer better IO performance relative to HLS.

Using an embedded IO solution also offers other significant benefits to the FPGA vendor or the data centre that is offering FPGA nodes as Infrastructure as a Service (IaaS). The integration of future IO controllers yet to be released (such as new SATA controllers) will not require their own individual FIFO and bridge IP logic for integration into the accelerator design.
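The array-based stream segregation described earlier (splitting DIALIGN's Ethernet data into query and reference buffers, with software-asserted back pressure) can be sketched as follows. The one-byte tag convention, buffer sizes, and function names are our own illustration, not an API from any vendor platform:

```c
#include <stdint.h>
#include <stddef.h>

#define SEQ_BUF_LEN 4096

/* Two software-managed destination buffers, one per sequence stream. */
struct seq_buffers {
    uint8_t query[SEQ_BUF_LEN];
    size_t  query_len;
    uint8_t ref[SEQ_BUF_LEN];
    size_t  ref_len;
};

/* Segregate one incoming packet payload by its leading tag byte
 * ('Q' = query sequence, anything else = reference sequence).
 * Returns 1 if accepted, 0 to assert back pressure (buffer full). */
int ingest_packet(struct seq_buffers *b, const uint8_t *pkt, size_t len)
{
    if (len < 1)
        return 0;
    uint8_t *dst = (pkt[0] == 'Q') ? b->query : b->ref;
    size_t  *n   = (pkt[0] == 'Q') ? &b->query_len : &b->ref_len;
    if (*n + (len - 1) > SEQ_BUF_LEN)
        return 0;                     /* full: hold off the sender */
    for (size_t i = 1; i < len; i++)  /* copy payload after the tag */
        dst[(*n)++] = pkt[i];
    return 1;
}
```

In an embedded soft-core solution this control lives in software and can be changed without resynthesis; in the OpenCL flow the equivalent logic must survive HLS translation into hardware.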
Furthermore, these IO resources can be virtualized over multiple users' accelerators without the security compromise of allowing each user to access the IO controller directly.

The same arguments that we have made so far for IO controllers also apply to the device DRAM memory controller. And in the context of device DRAM being shared by multiple accelerator users, the case for securing the bare-metal controller is even stronger.

One of the more recent interfaces to FPGA vendor-provided memory controllers is the AXI Stream interface [25]. This type of interface allows custom accelerators and OpenCL-based accelerators alike to access DRAM memory via the accompanying AXI interconnect infrastructure (Fig. 20).

However, as can be seen in the architecture of Fig. 20, there is no inherent support for organizing memory data buffering beyond relying on HLS or custom RTL development. As this article has shown, this organization and control of DRAM data into on-chip buffers can significantly affect application-level performance.

Fig. 20 FPGA vendor (Xilinx) supplied Memory-Controller infrastructure

Conclusion

This article has highlighted the need for standardized mechanisms to handle IO and data transfers to FPGA accelerators within data centres that offer their Infrastructure as a Service. This need arises from the fact that the end-user of such environments will never have full access to the underlying FPGA hardware. Thus data transfer methods that can be controlled in software, while still offering performance levels equal to those of custom RTL interfaces, are necessary.

To this end, corporations such as Amazon (F1 instances [37]) and Microsoft (Azure [38]) already offer, at no extra cost, FPGA "shells" as wrappers that developers can call to gain IO functionality on their cloud-based FPGA instances.
This shell is similar in concept to the static-region "template" described within this work, in that it too eases the pain of handling IO on FPGAs for the end customer by providing pre-implemented IO controllers for standard interfaces such as PCIe, DDR memory, or even Ethernet in the case of Microsoft's Azure. However, this is where the similarity ends. What is proposed in this article goes a step beyond the abstraction of providing the IO controllers: we have presented a method by which the incoming data from the IO controllers can be effectively tailored to the performance needs of a particular accelerated algorithm. In contrast, Amazon and Microsoft defer the end customer to the FPGA vendor's HLS tool to handle the data after it crosses the IO controller boundary. And if HLS fails to provide optimal data partitioning and management of IO at the on-die buffering level for a unique application's data path, then end-users are left to implement their own custom solution in RTL, if they have the skills and development time to do so.

We, on the other hand, have demonstrated an embedded-system solution for data transfer and, more importantly, have shown that for a class of applications that have physical IO devices as their system-level bottleneck, system-level performance is not degraded.
For other classes of applications that have no such IO bottlenecks to begin with, our embedded solution does degrade system performance; however, we have also shown that in such cases scaling the acceleration is enough to counterbalance the IO performance loss caused by our solution.

Within this work, we have also compared the development effort, performance, and area tradeoffs between the custom RTL design methodology used when end-users have full access to their FPGAs, and our embedded solution to be offered by data centres when such levels of access cannot be granted to end-users. Through our comparisons of experimental results, we make an argument in favour of the embedded-system approach. That conclusion was derived not only from the ease of automation available within the current embedded-system EDA space of FPGA vendors, but also from the experimentally verified finding that the aggregate performance and throughput of the accelerators is not compromised. The total effect on power consumption of our embedded-system approach to data transfers is predicted to be higher; however, a detailed power analysis was beyond the scope of this paper and remains future work.

Authors' contributions

TM performed all of the RTL and embedded-systems design necessary to conduct the experiments discussed in this article. SG conceived of the study to contrast embedded-systems based IO control with purely RTL based approaches. Both authors participated in the selection and framing of the applications chosen for this study. Both authors read and approved the final manuscript.

About the authors

Theepan Moorthy (S'04) received the B.A.Sc. in Computer Engineering from Queen's University in Kingston, ON, Canada in 2004.
After spending two years in industry working on wireless chipset designs, he returned to academia to pursue graduate degrees, earning the M.A.Sc. degree from Ryerson University, Toronto, ON, Canada in 2008. He later began his doctoral program at the University of British Columbia, Vancouver, Canada, transitioning from FPGA acceleration for video encoding to acceleration of bioinformatics algorithms. During his PhD he has held internships at PMC-Sierra and Xilinx Research Labs.

Sathish Gopalakrishnan is an Associate Professor of Electrical & Computer Engineering at the University of British Columbia. His research interests centre around resource allocation problems in several contexts, including real-time embedded systems and wireless networks. Prior to joining UBC in 2007, he obtained a PhD in Computer Science and an MS in Applied Mathematics from the University of Illinois at Urbana-Champaign. He has received awards for his work from the IEEE Industrial Electronics Society (Best Paper in the IEEE Transactions on Industrial Informatics in 2008) and at the IEEE Real-Time Systems Symposium (in 2004).

Competing interests

The authors declare that they have no competing interests.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Received: 2 April 2017 Accepted: 8 August 2017

References

1. Haidar A, Cao C, Yarkhan A, Luszczek P, Tomov S, Kabir K, Dongarra J (2014) Unified development for mixed multi-GPU and multi-coprocessor environments using a lightweight runtime environment. In: Parallel and Distributed Processing Symposium, 2014 IEEE 28th International. pp 491–500
2. Liu C, Ng HC, So HKH (2015) QuickDough: A rapid FPGA loop accelerator design framework using soft CGRA overlay. In: Field Programmable Technology (FPT), 2015 International Conference on. pp 56–63
3. Byma S, Steffan JG, Bannazadeh H, Garcia AL, Chow P (2014) FPGAs in the cloud: Booting virtualized hardware accelerators with OpenStack. In: Field-Programmable Custom Computing Machines (FCCM), 2014 IEEE 22nd Annual International Symposium on. pp 109–116
4. Putnam A, Caulfield AM, Chung ES, Chiou D, Constantinides K, Demme J, Esmaeilzadeh H, Fowers J, Gopal GP, Gray J, Haselman M, Hauck S, Heil S, Hormati A, Kim JY, Lanka S, Larus J, Peterson E, Pope S, Smith A, Thong J, Xiao PY, Burger D (2014) A reconfigurable fabric for accelerating large-scale datacenter services. In: 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA). pp 13–24
5. Lin Z, Chow P (2013) ZCluster: A Zynq-based Hadoop cluster. In: Field-Programmable Technology (FPT), 2013 International Conference on. pp 450–453
6. Jouppi N (2016) Google supercharges machine learning tasks with TPU custom chip. [Online]. Available: https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html
7. Cong J, Liu B, Neuendorffer S, Noguera J, Vissers K, Zhang Z (2011) High-level synthesis for FPGAs: From prototyping to deployment. IEEE Trans Comput Aided Des Integr Circ Syst 30(4):473–491
8. Ma S, Andrews D, Gao S, Cummins J (2016) Breeze computing: A just in time (JIT) approach for virtualizing FPGAs in the cloud. In: 2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig). pp 1–6
9. Chen F, Lin Y (2015) FPGA accelerator virtualization in OpenPOWER cloud. In: OpenPOWER Summit
10. Milford M, McAllister J (2016) Constructive synthesis of memory-intensive accelerators for FPGA from nested loop kernels. IEEE Trans Signal Process 99:4152–4165
11. Munshi A (2009) The OpenCL Specification. Khronos OpenCL Working Group
12. Altera. Altera SDK for OpenCL. [Online]. Available: https://www.altera.com/products/design-software/embedded-software-developers/opencl/overview.html
13. Morris GW, Thomas DB, Luk W (2009) FPGA accelerated low-latency market data feed processing. In: 2009 17th IEEE Symposium on High Performance Interconnects. pp 83–89
14. Chrysos G, Sotiriades E, Rousopoulos C, Pramataris K, Papaefstathiou I, Dollas A, Papadopoulos A, Kirmitzoglou I, Promponas VJ, Theocharides T, Petihakis G, Lagnel J (2014) Reconfiguring the bioinformatics computational spectrum: Challenges and opportunities of FPGA-based bioinformatics acceleration platforms. IEEE Design Test 31(1):62–73
15. Lockwood JW, Monga M (2015) Implementing ultra low latency data center services with programmable logic. In: 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects. pp 68–77
16. Erdmann C, Lowney D, Lynam A, Keady A, McGrath J, Cullen E, Breathnach D, Keane D, Lynch P, Torre MDL, Torre RDL, Lim P, Collins A, Farley B, Madden L (2014) 6.3 A heterogeneous 3D-IC consisting of two 28nm FPGA die and 32 reconfigurable high-performance data converters. In: 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC). pp 120–121
17. Morgenstern B, Frech K, Dress A, Werner T (1998) DIALIGN: Finding local similarities by multiple sequence alignment. Bioinformatics 14(3):290–294
18. Boukerche A, Correa JM, de Melo ACMA, Jacobi RP (2010) A hardware accelerator for the fast retrieval of DIALIGN biological sequence alignments in linear space. IEEE Trans Comput 59(6):808–821
19. Xilinx. Xilinx University Program XUPV5-LX110T Development System. [Online]. Available: http://www.xilinx.com/univ/xupv5-lx110t.htm
20. Moorthy T, Gopalakrishnan S (2014) Gigabyte-scale alignment acceleration of biological sequences via Ethernet streaming. In: Field-Programmable Technology (FPT), 2014 International Conference on. pp 227–230
21. Woods L, Eguro K (2012) Groundhog - A Serial ATA host bus adapter (HBA) for FPGAs. In: IEEE 20th International Symposium on Field-Programmable Custom Computing Machines. pp 220–223
22. Xilinx. MicroBlaze Soft-Processor Reference Guide. [Online]. Available: http://www.xilinx.com/support/documentation/sw_manuals/xilinx11/mb_ref_guide.pdf
23. Eguro K (2010) SIRC: An extensible reconfigurable computing communication API. In: 2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines. pp 135–138
24. Mendon AA, Huang B, Sass R (2012) A high performance, open source SATA2 core. In: Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on. pp 421–428
25. ARM. AMBA AXI Protocol Specification. [Online]. Available: http://www.arm.com/products/system-ip/amba-specifications.php
26. Xilinx. Fast Simplex Link IP Protocol Specification. [Online]. Available: http://www.xilinx.com/products/intellectual-property/fsl.html
27. Xilinx. Multi-Port Memory Controller IP Protocol Specification. [Online]. Available: http://www.xilinx.com/products/intellectual-property/mpmc.html
28. Xilinx. Processor Local Bus IP Protocol Specification. [Online]. Available: http://www.xilinx.com/products/intellectual-property/plb_v46.html
29. Summers J, Brecht T, Eager D, Gutarin A (2016) Characterizing the workload of a Netflix streaming video server. In: 2016 IEEE International Symposium on Workload Characterization (IISWC). pp 1–12
30. Delimitrou C, Kozyrakis C (2013) The Netflix challenge: Datacenter edition. IEEE Comput Archit Letters 12(1):29–32
31. Aaron A, Li Z, Manohara M, Lin JY, Wu ECH, Kuo CCJ (2015) Challenges in cloud based ingest and encoding for high quality streaming media. In: Image Processing (ICIP), 2015 IEEE International Conference on. pp 1732–1736
32. Wiegand T, Sullivan GJ, Bjontegaard G, Luthra A (2003) Overview of the H.264/AVC video coding standard. IEEE Trans Circ Syst Video Technol 13(7):560–576
33. ITU Telecommunication Standardization Sector (2003) Advanced video coding for generic audiovisual services. ITU-T Recommendation H.264, May 2003
34. Liu Z, Huang Y, Song Y, Goto S, Ikenaga T (2007) Hardware-efficient propagate partial SAD architecture for variable block size motion estimation in H.264/AVC. In: Proc 17th Great Lakes Symposium on VLSI. pp 160–163
35. Moorthy T, Ye A (2008) A scalable computing and memory architecture for variable block size motion estimation on field-programmable gate arrays. In: Proc 2008 IEEE Conference on Field Programmable Logic and Applications. pp 83–88
36. Stone JE, Gohara D, Shi G (2010) OpenCL: A parallel programming standard for heterogeneous computing systems. Comput Sci Eng 12(3):66–73
37. Amazon. Amazon EC2 F1 Instances. [Online]. Available: https://aws.amazon.com/ec2/instance-types/f1/
38. Microsoft. Microsoft Azure. [Online]. Available: https://azure.microsoft.com/en-us/resources/videos/build-2017-inside-the-microsoft-fpga-based-configurable-cloud/
39. Moorthy T (2008) Scalable FPGA hardware acceleration for H.264 motion estimation. Ryerson University, Theses and Dissertations

