UBC Theses and Dissertations
Automated space/time scaling of streaming task graphs on field-programmable gate arrays
Omidian Savarbaghi, Hossein
Parallel computing platforms provide good performance for streaming applications within a limited power budget. However, these platforms can be difficult to program. Moreover, when the size of the target computing platform changes, users must manually reallocate resources and parallelism. This thesis provides a framework to retarget applications described by a Streaming Task Graph (STG) for implementation on different platforms, where the framework can automatically scale the solution size to fit the available resources or to meet performance targets. First, we explore automated space/time scaling for STGs targeting a pipelined coarse-grained architecture. We produce a tool that analyzes the degrees of parallelism in a general stream application and finds possible bottlenecks. We introduce possible optimization strategies for STGs, and two algorithmic approaches: a classical approach based upon Integer Linear Programming (ILP), and a novel heuristic approach which introduces a new optimization and produces better results (using 30% less area) with a shorter run-time. Second, we explore automated space/time scaling for STGs targeting a fine-grained architecture, the Field-Programmable Gate Array (FPGA). We propose a framework on top of a commercial High-Level Synthesis (HLS) tool which adds the ability to automatically meet a defined area budget or target throughput. Within the framework, we use similar ILP and heuristic approaches. The heuristic automatically achieves over 95% of the target area budget on average while improving throughput over the ILP. It can also meet the same throughput targets as the ILP while saving 19% area. Third, we investigate automated space/time scaling of STGs in a hybrid architecture consisting of a Soft Vector Processor (SVP) and selected custom instructions. To achieve higher performance, we investigate using dynamic Partial Reconfiguration (PR) to time-share the FPGA resources. The results show speedups far beyond what a plain SVP can accomplish: an 8-lane SVP achieves a speedup of 5.3 on the Canny-blur application, whereas the PR version is another 3.5 times faster, for a net speedup of 18.5 over the ARM Cortex A9 processor. By increasing the dynamic PR rate beyond what is available today, we also show that a further 5.7 times speedup can be achieved (105.9x speedup versus the Cortex A9).
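The speedup figures above compose multiplicatively, since each factor is measured relative to the previous configuration. The short sketch below only illustrates that arithmetic; the function and variable names are hypothetical and do not come from the thesis, and the inputs are the rounded factors quoted in the abstract.

```python
from math import prod


def net_speedup(*stage_speedups: float) -> float:
    """Multiply successive speedup factors into one net speedup over the baseline."""
    return prod(stage_speedups)


# Rounded factors quoted in the abstract (Canny-blur application).
svp_over_arm = 5.3    # 8-lane SVP relative to the ARM Cortex A9
pr_over_svp = 3.5     # PR version relative to the 8-lane SVP
faster_pr_gain = 5.7  # further gain from a faster dynamic PR rate

# Net speedup of the PR version over the Cortex A9.
pr_over_arm = net_speedup(svp_over_arm, pr_over_svp)

# Projected net speedup with the faster PR rate.
projected = net_speedup(pr_over_arm, faster_pr_gain)
```

With these rounded inputs the products come out to about 18.55 and about 105.7, close to the reported 18.5 and 105.9. The small gap is presumably because the published factors are themselves rounded.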
Attribution-NonCommercial-NoDerivatives 4.0 International