@prefix vivo: . @prefix edm: . @prefix ns0: . @prefix dcterms: . @prefix skos: . vivo:departmentOrSchool "Science, Faculty of"@en, "Earth, Ocean and Atmospheric Sciences, Department of"@en ; edm:dataProvider "DSpace"@en ; ns0:degreeCampus "UBCV"@en ; dcterms:creator "Petrenko, Artyom"@en ; dcterms:issued "2014-04-17T17:08:21Z"@en, "2014"@en ; vivo:relatedDegree "Master of Science - MSc"@en ; ns0:degreeGrantor "University of British Columbia"@en ; dcterms:description """An implementation of seismic wave simulation on a platform consisting of a conventional host processor and a reconfigurable hardware accelerator is presented. This research is important in the field of exploration for oil and gas resources, where a 3D model of the subsurface of the Earth is frequently required. By comparing seismic data collected in a real-world survey with synthetic data generated by simulated waves, it is possible to deduce such a model. However this requires many time-consuming simulations with different Earth models to find the one that best fits the measured data. Speeding up the wave simulations would allow more models to be tried, yielding a more accurate estimate of the subsurface. The reconfigurable hardware accelerator employed in this work is a field programmable gate array (FPGA). FPGAs are computer chips that consist of electronic building blocks that the user can configure and reconfigure to represent their algorithm in hardware. Whereas a traditional processor can be viewed as a pipeline for processing instructions, an FPGA is a pipeline for processing data. The chief advantage of the FPGA is that all the instructions in the algorithm are already hardwired onto the chip. This means that execution time depends only on the amount of data to be processed, and not on the complexity of the algorithm. The main contribution is an implementation of the well-known Kaczmarz row projection algorithm on the FPGA, using techniques of dataflow programming. This kernel is used as the preconditioning step of CGMN, a modified version of the conjugate gradients method that is used to solve the time-harmonic acoustic isotropic constant density wave equation. Using one FPGA-based accelerator, the current implementation allows seismic wave simulations to be performed over twice as fast, compared to running on one Intel Xeon E5-2670 core. I also discuss the effect of modifications of the algorithm necessitated by the hardware on the convergence properties of CGMN. Finally, a specific plan for future work is set-out in order to fully exploit the accelerator platform, and the work is set in its larger context."""@en ; edm:aggregatedCHO "https://circle.library.ubc.ca/rest/handle/2429/46512?expand=metadata"@en ; skos:note """Accelerating an Iterative Helmholtz Solver UsingReconfigurable HardwarebyArt PetrenkoB.Sc. Physics, McGill University, 2008a thesis submitted in partial fulfillmentof the requirements for the degree ofMaster of Scienceinthe faculty of graduate and postdoctoral studies(Geophysics)The University Of British Columbia(Vancouver)April 2014c© Art Petrenko, 2014AbstractAn implementation of seismic wave simulation on a platform consisting of a conventional hostprocessor and a reconfigurable hardware accelerator is presented. This research is importantin the field of exploration for oil and gas resources, where a 3D model of the subsurface of theEarth is frequently required. 
By comparing seismic data collected in a real-world survey withsynthetic data generated by simulated waves, it is possible to deduce such a model. Howeverthis requires many time-consuming simulations with different Earth models to find the one thatbest fits the measured data. Speeding up the wave simulations would allow more models to betried, yielding a more accurate estimate of the subsurface.The reconfigurable hardware accelerator employed in this work is a field programmablegate array (FPGA). FPGAs are computer chips that consist of electronic building blocks thatthe user can configure and reconfigure to represent their algorithm in hardware. Whereas atraditional processor can be viewed as a pipeline for processing instructions, an FPGA is apipeline for processing data. The chief advantage of the FPGA is that all the instructions inthe algorithm are already hardwired onto the chip. This means that execution time dependsonly on the amount of data to be processed, and not on the complexity of the algorithm.The main contribution is an implementation of the well-known Kaczmarz row projectionalgorithm on the FPGA, using techniques of dataflow programming. This kernel is used asthe preconditioning step of CGMN, a modified version of the conjugate gradients method thatis used to solve the time-harmonic acoustic isotropic constant density wave equation. Usingone FPGA-based accelerator, the current implementation allows seismic wave simulations to beperformed over twice as fast, compared to running on one Intel Xeon E5-2670 core. I also discussthe effect of modifications of the algorithm necessitated by the hardware on the convergenceproperties of CGMN.Finally, a specific plan for future work is set-out in order to fully exploit the acceleratorplatform, and the work is set in its larger context.iiPrefaceThe suggestion that Kaczmarz row projections should be implemented on the FPGA acceleratorwas made by Felix J. Herrmann. The design of the accelerated kernel was done by the author,with feedback from Diego Oriato of Maxeler Technologies. This feedback consisted of videocalls on a semi-regular weekly basis, over the course of a few months. Diego Oriato also assistedin the debugging of the design during a two week visit by the author to Maxeler Technologies,in London. Simon Tilbury, also of Maxeler Technologies provided a crucial contribution tothe design by implementing the buffer that enables “accelerator ordering” of the rows of theHelmholtz matrix.The MATLAB function implementing CGMN, CARPCG.m was initially written by Tristan vanLeeuwen and modified by the author to interface with the automatically generated MATLABclass that is used to run the accelerator. The C MEX implementation of the Kaczmarz sweepsused in the reference implementation mentioned in Chapter 4 was initially also written byTristan van Leeuwen, and modified by the author to optimize memory access. Tristan vanLeeuwen wrote the MATLAB function that generates the Helmholtz matrix, this code was usedunmodified. The design and execution of the numerical experiments measuring performance ofthe accelerated CGMN implementation were done entirely by the author, with minor suggestionsby Rafael Lago. Analysis of the profile and solution data, including making the plots presented,were done entirely by the author.This work (in its entirety) has led to the following two conference submissions, which havebeen accepted:A. Petrenko, D. Oriato, S. Tilbury, T. van Leeuwen, and F. J. Herrmann. 
Accelerating an iterative Helmholtz solver with FPGAs. In OGHPC, 03 2014. URL http://rice2014oghpc.blogs.rice.edu/files/2014/03/petrenkoOGHPC2014aih_poster.pdf

A. Petrenko, D. Oriato, S. Tilbury, T. van Leeuwen, and F. J. Herrmann. Accelerating an iterative Helmholtz solver with FPGAs. In EAGE, 06 2014

The first has been presented and published as a poster at the 2014 Rice University Oil & Gas High-Performance Computing Workshop in Houston, Texas. The second will be presented orally at the June 2014 European Association of Geoscientists and Engineers Conference and Exhibition in Amsterdam, the Netherlands, at which time it will also be published as an expanded abstract. The contributions of the co-authors to the work presented are the same as outlined above for the present thesis. The first author wrote the text and made the figures for both submissions.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
1.1 Full-waveform inversion
1.2 Accelerating algorithms
2 Solving the wave equation: Mathematical background
2.1 The Helmholtz system: Discretizing the wave equation
2.2 The normal equations
2.3 Row projections: The Kaczmarz algorithm
2.4 Preconditioning conjugate gradients: The CGMN algorithm
3 Implementation of the Kaczmarz algorithm on reconfigurable hardware
3.1 Related work on accelerators
3.2 Data transfer
3.3 Buffering: Reducing access to slow memory
3.4 Parallelizing the Kaczmarz sweeps: Dealing with row projection latency
3.5 Number representation and storage
3.6 The backward sweep: Double buffering
3.7 Bandwidth requirements
3.8 Related work: computed tomography
4 Performance results and discussion
4.1 Execution time and FPGA resource usage
4.2 Effects of reordering rows of the Helmholtz matrix
4.3 Multiple double Kaczmarz sweeps per CGMN iteration
4.4 Planned improvements
5 Conclusions
Bibliography
A An alternative conjugate gradient formulation

List of Tables

Table 3.1 Possible bitwidths for complex elements
Table 3.2 Kaczmarz double sweep communication bandwidth requirements
Table 4.1 CGMN (future) communication bandwidth requirements

List of Figures

Figure 2.1 The Kaczmarz algorithm
Figure 3.1 High-level overview of the compute node used
Figure 3.2 Distribution of computation time before acceleration
Figure 3.3 Storing a 3D volume in 1D memory
Figure 3.4 The Helmholtz linear system of equations
Figure 3.5 Overcoming latency of memory access
Figure 3.6 Matrix row storage efficiency
Figure 3.7 Effect of sweep precision on CGMN convergence
Figure 3.8 Double buffering during the backward Kaczmarz sweep
Figure 4.1 A part of the SEG/EAGE Overthrust velocity model
Figure 4.2 An example pressure wavefield solution
Figure 4.3 Time to execute 100 CGMN iterations
Figure 4.4 Distribution of computation time before and after acceleration
Figure 4.5 FPGA resource usage
Figure 4.6 Impact of matrix row ordering on CGMN convergence
Figure 4.7 Effect of multiple sweeps per CGMN iteration on CGMN convergence
Figure 4.8 Communication bandwidth requirements

Acknowledgements

First of all I would like to thank my supervisor Felix Herrmann, whose intuition that implementing a wave equation solver on FPGAs would pay off turned out to be correct. Thank you also to Michael Friedlander and Christian Schoof, members of my committee, and Steve Wilton, my external examiner.

My gratitude goes out to Henryk Modzelewski for his administration and maintenance not only of the Maxeler compute node used for the bulk of this research, but of the entire hardware and software environment used by the SLIM group.

I received guidance about using Maxeler's dataflow computing machine from Diego Oriato of Maxeler Technologies; his help was very useful. Simon Tilbury, also with Maxeler Technologies, developed a crucial piece of code (the row-reordering buffer) that enables the algorithm described in this work to run at full speed. I thank the entire Maxeler team that graciously welcomed me for a three week visit to their London office in the fall of 2013. I learned a lot from this partnership.

Thank you to Eddie Hung for stimulating discussions during the early stages of implementation and to Tristan van Leeuwen for developing the MATLAB versions of many of the algorithms mentioned in this thesis.
His work served as a launchpad for my own modifications.Thank you also to Rafael Lago, Lina Miao and Brendan Smithyman for proof-reading versionsof the manuscript.Finally thank you to the entire SLIM group for sustaining an environment where I couldsucceed.This work was financially supported in part by the Natural Sciences and Engineering Re-search Council of Canada Discovery Grant (RGPIN 261641-06) and the Collaborative Researchand Development Grant DNOISE II (CDRP J 375142-08). This research was carried out as partof the SINBAD II project with support from the following organizations: BG Group, BGP, BP,Chevron, ConocoPhillips, CGG, ION GXT, Petrobras, PGS, Statoil, Total SA, WesternGeco,Woodside.ixChapter 1IntroductionSystems of linear equations are ubiquitous in science and engineering. Being able to solvelarge systems in a short amount of time is a mathematical and computational challenge. Thesubject of this work is an implementation of a particular linear system solver on unconventionalhardware, namely a system consisting of a CPU host and a reconfigurable hardware-basedaccelerator. The design, construction, testing and analysis of this implementation are themain contributions of the work. In this chapter I describe the context which motivated theimplementation: the field of seismic exploration for oil and gas resources. I then motivate thechoice of the reconfigurable computing platform.1.1 Full-waveform inversionSeismic exploration aims to characterize the earth’s subsurface to assist the discovery andproduction of hydrocarbon reservoirs. In a seismic survey, vibrational energy is sent into theground using a source. This could be a dynamite explosion, a vibroseis truck, or in the caseof marine exploration, an air gun. Receivers, which may be geophones or hydrophones, recordthe ground (or water) motion and pressure changes that result from waves coming back to thesurface. The raw data is then analysed, and subsurface properties like acoustic velocity, densityor anisotropy are inferred.Seismic waves are modelled by a partial differential wave equation (PDE) where the inputis medium parameters and a source signature, and the solution is a wavefield. In practice,the source signature is not known and must be estimated, although this is not treated inthe current work. The problem faced in seismic exploration (and many other disciplines) isthe inverse problem: given some measurement of the wavefield, find medium parameters thatexplain the data. The most scientifically rigorous method of analysing seismic data is full-waveform inversion (FWI), first proposed by Tarantola [58]. FWI starts from an initial guessat the medium model m and solves the wave equation for each source qs. The resultingsynthetic wavefields us are sampled at the receiver locations and compared to the measuredwavefield data, ds. Based on the difference, the model m is updated and the algorithm iterates1until a satisfactory model has been produced. Mathematically this corresponds to solving theoptimization problemminmΦ({ds},m, {qs}), (1.1)where {ds} is the set of results for all the source experiments {qs}. A traditional choice for themisfit function Φ is the sum of the squared `2-norms of the data residual vectors:Φ =1nsrcnsrc∑s=112‖ds −F(m,qs)‖2 . (1.2)The symbol F represents the wave equation solution operator that generates the syntheticwavefields and samples them at the receivers. There are nsrc source experiments in total.Other misfit functions could be chosen, as described by Aravkin et al. 
[3].A full exposition of FWI is beyond the scope of this work, the reader is referred to Virieuxand Operto [63] for details. Here it suffices to note that to calculate the model update δm, it isnecessary to compute the first derivative of the misfit function Φ. By applying the chain rule,we find that∇mΦ = (∇mF∗)(1nsrcnsrc∑s=1(ds −F(m,qs))), (1.3)where ∗ denotes the adjoint of an operator (complex conjugate transpose for matrices). Theoperator∇mF∗ maps a perturbation in the measurement of the wavefield at the receivers to oneof the many possible perturbations in the model consistent with that measurement. Two waveequation solves are needed to implement ∇mF∗, as mentioned by Pratt et al. [49]. Dependingon the particular algorithm for minimizing Φ in Expression 1.1, further PDE solves may berequired to compute the second derivative (the Hessian) of the misfit function, or during aline-search along the gradient direction ∇mΦ. Thus, each iteration of FWI must solve a waveequation at least twice; this forms most of the computational burden of the method, as shownfor example by Brossier et al. [5].When run in industry on modelled domains of 1000 × 1000 × 1000 or larger, FWI cantake up to a month to converge to an acceptable earth model (first author of Li et al. [33],private communication). The high computational cost limits how many FWI iterations can beperformed, to the detriment of the final result. Research is on-going on various dimensionalityreduction techniques to process less of the data without sacrificing accuracy, including work byLi et al. [33] and Aravkin et al. [3]. In addition to fundamental algorithmic improvements inFWI, it is possible to accelerate existing methods with a view to increasing the number of PDEsolves that can be performed within a given time budget. Virieux and Operto [63] remark that“efficient numerical modelling of the full seismic wavefield is a central issue in FWI, especiallyfor 3D problems.” The authors go on to say that “fields of investigation should address theneed to speed up the forward problem by means of providing new hardware [implementations].”This is the direction of the present work.21.2 Accelerating algorithmsThe use of many-core and GPU (graphic processing unit) based accelerators is growing inhigh-performance computing, as evidenced by the “top 500” list of the world’s most capablemachines, compiled by Meuer et al. [38]. As of 2013, four of the top ten machines on that listhave a CPU host + accelerator architecture.Commodity machines using GPUs typically do not have very large memories dedicatedto the accelerators, relying instead on the PCIe (peripheral component interconnect express)bus to exchange data between the CPU (central processing unit) and accelerator. On the onehand, Johnsen and Loddoch [26] overcome the PCIe bottleneck in data transfer by exploitingparallelism and re-use of data in the time domain method of solving the wave equation. On theother hand, Knibbe et al. [30] report that using GPUs in the frequency domain method did notresult in significant improvements over using the CPU host exclusively, citing the bandwidthof the PCIe bus as the bottleneck. (See the next chapter for a discussion of the relative meritsof the time and frequency domain approaches to solving the wave equation.)The use of reconfigurable hardware accelerators such as FPGAs (field programmable gatearrays) has been more reluctant, due in part to the different programming paradigms necessaryto take advantage of such units. 
However accelerating an algorithm such as FWI by usingFPGAs to handle the computationally intensive kernel of the algorithm is a possibility worthinvestigating, due in part to the large dedicated memories available for FPGA-based accelerators(see Chapter 3 for a specification of the system used in this work).Consider the differences between FPGAs and conventional CPU cores. A CPU core is ageneral-purpose piece of hardware: no physical part of the core corresponds to the algorithmbeing run on it. CPUs boast high operational frequency (about 3 GHz), and come with ready-to-use components like caches for frequently used data. Such mechanisms greatly simplify thedesign of programs since “low level” details are taken care of for the programmer. Conse-quently it is possible to use a very high level of abstraction when designing an algorithm for aconventional CPU. On the other hand, as mentioned by Dally [8] among others, the increasein processor frequencies that used to guarantee an increase in performance every few years hascome to a halt. Increase in performance on today’s processors (this includes GPUs) comes bymaking use of many parallel processor cores, a task that is often far from trivial.In contrast to a CPU, an FPGA is (almost) a blank slate. It contains building blocks ofdigital logic that can be configured (and reconfigured) by the end user to execute the task athand in hardware. Once an FPGA is configured it is a hardware representation of the algo-rithm. FPGAs operate at frequencies an order of magnitude lower than CPUs (about 200 MHz)but consequently consume far less power. Until recently, algorithms for an FPGA had to bedesigned at a relatively low level of abstraction, which made the development process arduous.However the introduction by Maxeler Technologies [36] of a set of Java classes offering commonfunctionality (like memory controllers) has made FPGA programming much more accessible.(I note that the recent announcement of support of the OpenCL programming framework by3some FPGA manufacturers (Singh [52]) is also progress in the direction of accessibility.) Never-theless, in the scope of the current project, it was not practical to re-write a full full-waveforminversion algorithm to run on an FPGA. Instead, the Kaczmarz sweep, a basic and frequentlyused routine in a particular implementation of the wave equation solution operator F and itsadjoint, was selected for porting.CPU cores work by processing a stream of instructions. Modern processors have longinstruction pipelines to keep computational units supplied with work, and process several suchpipelines in parallel, as documented by Intel Corporation [25]. Nevertheless, an algorithm thathas more instructions will generally take longer than a simpler algorithm with fewer instructions.This is not the case for FPGAs. FPGAs work by processing a stream of data, rather than astream of instructions. All the instructions of the algorithm are hard-wired onto the chip andhence all of them are executed on every tick of the FPGA clock. (I use the terms FPGA and chipinterchangeably throughout this work.) The FPGA can be thought of as a filter for data thatflows through it. Hence, unless the algorithm is limited by the bandwidth of an interconnect(such as a link to memory), speed of execution is directly proportional to the amount of datato be processed. Apart from their power efficiency, this is the chief advantage of FPGAs incontrast to CPUs.There are other alternatives besides FPGAs to using CPU or GPU cores. 
Application-specific integrated circuits (ASICs) are hardware representations of an algorithm that are fasterand more efficient in logic-gate use than FPGAs (Kuon and Rose [32]). Unlike FPGAs however,they cannot be reconfigured. Furthermore, due to the fixed, unchangeable nature of an ASIC,development times for ASIC algorithms are longer than for FPGAs (Hauck and DeHon [20]).The rest of this thesis is organized as follows. Chapter 2 outlines the method I use to solvethe wave equation. Chapter 3 describes the implementation of that method on reconfigurablehardware; this is the main contribution of this work. Chapter 4 presents the results of usingthe implementation, and Chapter 5 presents conclusions and a list of improvements that shouldbe accomplished when the project continues.4Chapter 2Solving the wave equation:Mathematical backgroundThis chapter starts with a comparison of two approaches to solving the wave equation and thenproceeds to describe in greater detail the approach taken in the present work: treating theproblem as a system of linear equations. I then describe the chosen solution method, due toBjörck and Elfving [4], which is based on a combination of three ideas: the normal equations,projections onto matrix rows, and the method of conjugate gradients.There are two main approaches to finite difference modelling of seismic waves: formulatingthe problem in the time domain (TD) or the frequency domain (FD) (see review by Virieuxand Operto [63]). Solving the wave equation in the time domain corresponds to an initial valueproblem where the wavefield is stepped from one point in time to the next, using informa-tion local to each point in the numerical grid. Solving the problem in the frequency domaincorresponds to solving a large, sparse system of linear equations for each frequency ωf to bemodelled.There are two main factors to consider when choosing between these approaches. The firstis how the choice of forward modelling scheme effects the overlying full-waveform inversionalgorithm. FWI is prone to failing to converge to a reasonable earth model because of gettingstuck in local minima of the misfit function Φ, as noted by van Leeuwen and Herrmann [60]among others. To reduce the chance of this happening, the problem can be solved progressively,starting from the easier parts. One way to do this is to invert low frequency data first, followedby higher frequencies. This is particularly simple if the forward modelling step of FWI isperformed in the frequency domain, since in that case each solution of the forward modelcorresponds to one frequency. Another way to regularize the FWI problem, mentioned byVirieux and Operto [63], is to window the data by arrival time, something for which forwardmodelling in the time domain is appropriate. The time domain method needs storage of thewavefields of some intermediate time steps (a procedure known as checkpointing, describedrecently by Symes [56]), which is avoided in the frequency domain method. Furthermore,5implementation of the adjoint of the Jacobian of the modelling operator (∇F∗) is non-trivialin the time domain; see however the work of Gockenbach et al. [15].The second factor to consider when choosing between time domain and frequency domainmethods is computational expense. In the time domain computational complexity is well definedby the number of time-steps simulated, and is O(nsrcN), where N is the total number of grid-points in the three-dimensional discretized earth model m. 
This has been shown by Plessix[47]; see also Table 1 in the work of Knibbe et al. [30]. However in the frequency domainformulation, there are two different ways to solve the associated system of linear equations:with a direct or an iterative solver. Direct solvers are theoretically well-suited to problems witha large number of sources, however they use a prohibitively large amount of memory for storingintermediate factors of the matrix that are not as sparse as the original matrix. This limits theiruse on large 3D FWI problems. Iterative solvers only need storage for the original matrix anda few intermediate vectors, however they typically need a preconditioner to be effective. Whencomparing execution time for the time domain approach and the frequency domain approachusing an iterative solver, Plessix [48] concludes that both approaches have approximately thesame performance in terms of execution time requirements for 3D problems. Finally, Virieuxand Operto [63] recommend a hybrid method due to Sirgue et al. [53]: model in the time domainbut progressively build up the Fourier transform of the wavefield so that when the modellingstep is complete, full-waveform inversion can be performed in the frequency domain.In this work I formulate the forward modelling problem in the frequency domain.2.1 The Helmholtz system: Discretizing the wave equationWhen simulating waves in the frequency domain, the PDE that describes the motion of thewave through a heterogeneous medium can be written as(ω2m + ∆)u = q, (2.1)and is known as the Helmholtz equation. As written above, the Helmholtz equation representsthe special case of a constant density isotropic medium which only supports acoustic waves.Damping effects of viscosity are modelled heuristically by allowing m to be complex-valued. Iignore the case of elastic and anisotropic media to keep the resulting implementation relativelysimple. The symbol ∆ represents the Laplacian operator. I take the subsurface earth modelto be the slowness squared, m = 1/v2, where v is the sound speed of the medium. u is the(complex) Fourier transform of the pressure with respect to time, and q is the amplitude of thesource at angular frequency ω. The Laplacian operator here also implements perfectly matchedlayer (PML) boundary conditions that eliminate reflection artefacts from the boundaries of thedomain by setting a damping layer consisting of complex velocities. (See Equation (2) in thework by Operto et al. [42].) In the frequency domain, Equation 2.1 must be solved for eachfrequency ω that is to contribute to the final wavefield us for a given source.6When m, u and q represent quantities that have been discretized on a three-dimensionalCartesian grid with N grid-points, Equation 2.1 can be represented as a large system of linearequations and succinctly written in matrix notation,A(m, ω)u = q (2.2)where A is the N ×N Helmholtz matrix. The elements of A are calculated by the method offinite differences, following the strategy of Operto et al. [42], which consists of two main parts.First, the Laplacian operator is discretized using a 3× 3× 3 cube stencil. This 27-point stencilis a weighted average of eight different second-order 7-point star stencils, each on their owncoordinate grid. The coordinate grids are rotated and scaled versions of the original Cartesiangrid. The grids are designed in such a way that even though their axes are not parallel to eachother, the grid-point locations of all eight grids coincide. 
This allows the cube stencil to useall 27 points in the three-dimensional neighbourhood of a given central point. The weightingcoefficients are tuned to minimize numerical anisotropy, as described by Operto et al. [42].Second, the value of the earth model m at each grid-point is re-distributed to the 27 neigh-bouring points that make up the cube stencil for that point, a process known as mass-averaging.Mass-averaging is done using a second set of weighting coefficients, and results in a matrix withthe same pattern of non-zero entries as the discretized Laplacian. By choosing optimal valuesfor the mass-averaging coefficients, Operto et al. [42] showed that numerical dispersion of thestencil is minimized, which enhances (by a constant factor) the stencil’s accuracy. This allowsto use as little as 4 grid-points per wavelength [42], although 6 grid points per wavelength areused in this work. The need for accuracy by using finer grid spacings (more grid-points perwavelength) must be balanced against the computational expense of simulating on a larger grid.As noted by Operto et al. [42], 4 grid-points per wavelength is the limit at which the modellingstep is accurate, without modelling wavefield features that are in any case too small to be ofuse to full-waveform inversion in resolving the earth model. A further advantage of the stencilintroduced by Operto et al. [42] is that it is compact. A stencil with a large extent in thelast dimension (for example as in 5-point star stencil) implies that more intervening grid-pointsneed to be buffered in short-term memory (see Section 3.3 for details).The mass matrix and the discretized Laplacian are added together to make the Helmholtzmatrix A, which is very sparse: while N is at least 107 for a realistic model, the number ofnon-zeros per row, determined by the finite difference stencil, is at most only 27. A is alsovery structured: its non-zeros are arranged in 27 diagonals. This means that the locations ofthe non-zero elements, taken together, of each matrix row, do not repeat. In other words, thesupport of each matrix row is unique (although the support of rows that correspond to gridpoints adjacent to the edge of the grid is a subset of the support of the rows that correspondto adjacent points). These special properties mean that the rows of the Helmholtz matrix arelinearly independent and hence theoretically (disregarding round-off errors) A is invertible.72.2 The normal equationsA simple method of solving invertible linear systems is the method of conjugate gradients (CG)(Hestenes and Stiefel [24]), which requires the matrix to be symmetric positive definite (SPD)to converge to the unique solution. However the matrix A is not symmetric for arbitrarysubsurface models m, and can be indefinite, so CG cannot be applied to it directly. (See Ernstand Gander [13] for a discussion of how the indefiniteness of the Helmholtz equation makes itdifficult to solve.) A preconditioner must be used to transform the system into an equivalentSPD form. As has been set out in the work of van Leeuwen et al. [62], desired characteristicsfor a Helmholtz preconditioner for FWI are independence from the matrix (because the matrixchanges on each iteration of FWI) and low storage requirements. These conditions are satisfiedby an approach based on the normal equations, as is explained below.The normal equations for the error (NE) for A are given byAA∗y = q,where u = A∗y. 
(2.3)

For invertible matrices the normal equations form a symmetric positive definite system (see Saad [51] for a comprehensive treatment). In this case the solution of Equation 2.3 will be the same as the solution to the original system, Equation 2.2. However, working with the matrix AA∗ explicitly is prohibitively expensive because it will be much less sparse than A itself, and thus take up more memory. Instead, it is possible to solve Equation 2.3 without forming AA∗ using a row projection method, as described by Saad [51].

2.3 Row projections: The Kaczmarz algorithm

The method of solving a system of linear equations Ax = b by successively projecting an iterate xk onto the hyperplane associated with each row ai of the matrix A was discovered early on by Kaczmarz [27]. An English translation is available in [28]. A projection onto the ith hyperplane 〈ai, x〉 = bi is given by

xk+1 = xk + λ (bi − 〈ai, xk〉) a∗i / ‖ai‖², (2.4)

where bi is the corresponding element of the right-hand side vector b. If the Kaczmarz algorithm is applied directly to solve the Helmholtz system (Equation 2.2), b would correspond to q and x would correspond to u. The relaxation parameter λ determines whether the projection is exactly onto the hyperplane (λ = 1), falls short (0 < λ < 1) or extends beyond it (1 < λ < 2). It is set to 1.5 in this work, following the discussion by Gordon and Gordon [16]. The row projections are conventionally done in either a forward sweep from first to last, a backward sweep, or a double sweep from first to last and back again (k: 1 → 2N; i: 1 → N, N → 1). In practice, the rows of A are normalized once at the beginning of the algorithm. The iterations of the Kaczmarz algorithm are shown schematically in Figure 2.1.

Figure 2.1: The Kaczmarz algorithm solves a linear system by successively projecting the initial guess onto each of the hyperplanes (shown here as dashed lines) associated with the rows of the matrix. Adapted from a similar diagram by the author of van Leeuwen [59].

The connection between the Kaczmarz algorithm and solving the normal equations was made by Björck and Elfving [4], who noted that each minor step in the forward Gauss-Seidel method (see Saad [51], Sections 4.1 and 8.2) applied to Equation 2.3 is equivalent to one Kaczmarz row projection using the original system with λ set to 1. The authors went on to show that varying the relaxation parameter λ away from 1 makes a Kaczmarz row projection equivalent to one minor step of the method of successive over-relaxation (SOR) applied to Equation 2.3. If a forward full SOR iteration is followed by a backward iteration, the result is known as the symmetric SOR (SSOR) method, or SSOR-NE when applied to the normal equations (NE, Equation 2.3) for a matrix. Therefore, one iteration of SSOR-NE is equivalent to a double Kaczmarz sweep: projecting onto the rows of the original matrix from first to last and back again.

The order in which the rows of A are selected to project onto will affect the convergence path of the Kaczmarz algorithm, as well as the overlying CGMN algorithm (discussed in the next section). Kak and Slaney [29] recommend choosing orthogonal (or nearly orthogonal) rows for successive Kaczmarz projections. The impact of different row orderings on the Helmholtz system is examined in Chapter 4.
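Because a double sweep is defined entirely by Equation 2.4 together with a chosen row order, it is compact to state in code. The sketch below is illustrative only: it is written in NumPy/SciPy rather than the MATLAB and C used in this work, it assembles a small Helmholtz-type matrix with a simple second-order 7-point Laplacian instead of the 27-point stencil of Operto et al. [42], and the names dkswp and helmholtz_7pt are invented for this example.

import numpy as np
import scipy.sparse as sp

def dkswp(A, x, b, lam=1.5):
    # Double Kaczmarz sweep over the rows of the CSR matrix A: Equation 2.4
    # is applied once per row, first for i = 1..N and then for i = N..1.
    x = np.asarray(x, dtype=complex).copy()
    N = A.shape[0]
    for i in list(range(N)) + list(range(N - 1, -1, -1)):
        lo, hi = A.indptr[i], A.indptr[i + 1]
        cols, vals = A.indices[lo:hi], A.data[lo:hi]   # non-zeros of row a_i
        resid = b[i] - vals @ x[cols]                  # b_i - <a_i, x>
        x[cols] += lam * resid / np.real(vals @ vals.conj()) * vals.conj()
    return x

def helmholtz_7pt(n, h, omega, m):
    # Tiny (omega^2 m + Laplacian) matrix on an n x n x n grid, with a
    # 7-point Laplacian standing in for the 27-point stencil of the thesis.
    I = sp.identity(n)
    L1 = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)) / h**2
    lap = (sp.kron(sp.kron(L1, I), I) + sp.kron(sp.kron(I, L1), I)
           + sp.kron(sp.kron(I, I), L1))
    return (sp.diags(omega**2 * m.ravel()) + lap).astype(complex).tocsr()

# One double sweep on a 10 x 10 x 10 problem with a point source.
n = 10
m = np.full((n, n, n), 1.0 / 1500.0**2)      # slowness squared, v = 1500 m/s
A = helmholtz_7pt(n, h=10.0, omega=2 * np.pi * 5.0, m=m)
q = np.zeros(n**3, dtype=complex)
q[n**3 // 2] = 1.0
x = dkswp(A, np.zeros(n**3), q)
print(np.linalg.norm(A @ x - q))             # residual after a single sweep

Note that the sweep visits the rows in whatever order they are handed to it; the choice of that order is exactly the question discussed here and revisited for the accelerator in Section 3.4.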
Saad [51] notes that the effect of row-ordering on Kaczmarzalgorithm convergence for general sparse matrices is an open problem.Placed in a modern linear algebra context, the Kaczmarz algorithm is best viewed as a wayof implementing SSOR on the normal equations (Equation 2.3) that reduces memory use bynot forming the matrix AA∗ explicitly. All the convergence properties of SSOR-NE carry overdirectly. The advantage of the Kaczmarz algorithm lies in the fact that it allows us to workwith an SPD system (the normal equations) by using only the non-zeros of the original matrixA. Since it is not necessary to form or store AA∗, the Kaczmarz algorithm is appealing frommemory usage and computational complexity points of view.92.4 Preconditioning conjugate gradients: The CGMNalgorithmDespite the advantages mentioned in the last section, the Kaczmarz algorithm (SSOR-NE)converges slowly, thus it is not suitable for direct application to the Helmholtz system (Equa-tion 2.2). Instead, Björck and Elfving [4] showed that it can be used to accelerate the methodof conjugate gradients, calling the resulting algorithm CGMN. Recent studies of how CGMNfares in solving the Helmholtz equation include work by van Leeuwen [59] and Gordon andGordon [18]. In the latter case, CGMN is equivalent to the sequential (non-parallel, runningon only one processor core) version of the algorithm CARP-CG, introduced by Gordon andGordon [17]. I now describe the CGMN algorithm.First, it is useful to represent the Kaczmarz sweeps in matrix notation. Following Tanabe[57], let Qi be the projection matrix onto the hyperplane defined by 〈ai,x〉 = 0:Qi = I −λ‖ai‖2a∗i ai.The double sweep can then be written asDKSWP(A,u,q, λ) = Q1 · · ·QNQN · · ·Q1u +Rq= Qu +Rq.(2.5)Since A is invertible, SSOR-NE will converge to the solution of Equation 2.2, as mentioned byBjörck and Elfving [4]. At that point, the iterate u will be a fixed point of the relation 2.4,which means that Equation 2.5 can be re-written as a linear system:(I −Q)u = Rq, (2.6)where I is the identity matrix. As mentioned by Björck and Elfving [4] and proved by, forexample, Gordon and Gordon [17], the system in Equation 2.6 is consistent, symmetric andpositive semi-definite. Björck and Elfving [4] show in their Lemma 5.1 that this is sufficientfor CG to converge to the pseudoinverse (minimum `2-norm) solution of Equation 2.6, which is(by construction) the same as the solution of the original system (Equation 2.2). Note that thematrices Q and R do not have to be formed explicitly, as their action on a vector is calculatedusing a double Kaczmarz sweep, as in Equation 2.5.Thus, CGMN is the use of the method of conjugate gradients to solve the SSOR-NE iterationsystem (Equation 2.6) for the fixed point of that iteration. SSOR-NE is implemented efficientlyusing Kaczmarz row projections. Björck and Elfving [4] also note that it is possible to viewCGMN as solving a variant of the Helmholtz system (Equation 2.2), preconditioned from theleft by a matrix derived from a decomposition of AA∗.Pseudo-code for the CGMN algorithm is given below. Note that the double Kaczmarz sweepon line 1 of the algorithm is performed with an initial guess of zero because only the action of R10is required. Similarly, the double sweep on line 2 is done with a right hand side of zero becauseonly the action of Q is required. 
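As a complement to the pseudo-code given below as Algorithm 1, the same iteration can be transcribed line by line into NumPy. The sketch is illustrative only and is separate from the MATLAB implementation (CARPCG.m) used in this work; dkswp_dense is a plain dense-matrix double Kaczmarz sweep, and the small test problem at the end is arbitrary.

import numpy as np

def dkswp_dense(A, x, b, lam=1.5):
    # Double Kaczmarz sweep (Equation 2.4), forward then backward, dense A.
    x = np.asarray(x, dtype=complex).copy()
    N = A.shape[0]
    for i in list(range(N)) + list(range(N - 1, -1, -1)):
        ai = A[i, :]
        x += lam * (b[i] - ai @ x) / np.real(ai @ ai.conj()) * ai.conj()
    return x

def cgmn(A, q, lam=1.5, tol=1e-12, maxit=500):
    # Line-by-line transcription of Algorithm 1 (Bjorck and Elfving [4]).
    N = A.shape[0]
    u = np.zeros(N, dtype=complex)
    zero = np.zeros(N, dtype=complex)
    Rq = dkswp_dense(A, zero, q, lam)              # line 1
    r = Rq - u + dkswp_dense(A, u, zero, lam)      # line 2
    p = r.copy()                                   # line 3
    rnorm = np.vdot(r, r).real
    for _ in range(maxit):                         # line 4: while ||r||^2 > tol
        if rnorm <= tol:
            break
        s = p - dkswp_dense(A, p, zero, lam)       # line 5: s = (I - Q)p
        alpha = rnorm / np.vdot(p, s).real         # line 6
        u += alpha * p                             # line 7
        r -= alpha * s                             # line 8
        rnorm_new = np.vdot(r, r).real
        beta = rnorm_new / rnorm                   # line 9
        rnorm = rnorm_new                          # line 10
        p = r + beta * p                           # line 11
    return u

# Example on a small, well-conditioned random complex system.
rng = np.random.default_rng(0)
N = 50
A = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N)) + 20 * np.eye(N)
q = rng.standard_normal(N).astype(complex)
u = cgmn(A, q)
print(np.linalg.norm(A @ u - q) / np.linalg.norm(q))   # small relative residual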
Also note that although u was used as the Kaczmarz iterate in deriving the equivalent system (Equation 2.6), the double sweep in the main while-loop of CGMN (line 5 of the algorithm) is performed on the CGMN search direction p, and not on the wavefield. The double sweep would only be performed on the wavefield if the Kaczmarz algorithm were used to solve the Helmholtz system directly, which is not being done. This double sweep is equivalent to the matrix-vector product with the system matrix of Equation 2.6 that would otherwise be required. References to the Kaczmarz iterate indicate intermediate vectors in the calculation of s, in contrast to the CG iterate, which is the wavefield u.

Algorithm 1 CGMN (Björck and Elfving [4])
Input: A, u, q, λ
1: Rq ← DKSWP(A, 0, q, λ)
2: r ← Rq − u + DKSWP(A, u, 0, λ)
3: p ← r
4: while ‖r‖² > tol do
5:   s ← (I − Q)p = p − DKSWP(A, p, 0, λ)
6:   α ← ‖r‖² / 〈p, s〉
7:   u ← u + αp
8:   r ← r − αs
9:   β ← ‖r‖²curr / ‖r‖²prev
10:  ‖r‖²prev ← ‖r‖²curr
11:  p ← r + βp
12: end while
Output: u

I remark that it is possible to perform more than one double Kaczmarz sweep, one after the other, in CGMN (line 5 of Algorithm 1). If the Kaczmarz sweeps are viewed as a preconditioner for CG, as mentioned above, multiple sweeps correspond to a more exact preconditioner. Fewer outer CGMN iterations would have to be performed, at the cost of several invocations of DKSWP in the main while-loop. Because the Kaczmarz algorithm takes more sweeps to converge to a solution than CG takes iterations, such a trade-off might be viewed as exchanging a precise tool for a less precise one, and hence undesirable. However, as I show in Section 4.1, the characteristics of the Kaczmarz algorithm as implemented on an FPGA-based accelerator might encourage such an adjustment.

In this chapter I have described the mathematical formulation of the approach I take to solving the seismic wave simulation problem. The reader will recall that this problem is relevant because of the role it plays in full-waveform inversion, a key algorithm in seismic exploration. The next chapter will focus on a description of the implementation of this problem on an FPGA-based hardware accelerator.

Chapter 3
Implementation of the Kaczmarz algorithm on reconfigurable hardware

This chapter starts with an introduction to FPGAs and a description of the particular CPU host + FPGA-based accelerator platform used in the work. I then survey previous implementations of linear solvers and seismic processing algorithms on accelerators (both FPGAs and GPUs). The core of the chapter gives details of the implementation of the Kaczmarz algorithm on the FPGA-based accelerator, and describes the technical challenges overcome. This is my main contribution. The chapter concludes with a comparison of this implementation with closely related work by Grüll et al. [19].

The functional unit in an FPGA consists of one or more lookup tables (LUTs) and one or more flip-flops (FFs). LUTs implement functions: for example, a lookup table with six inputs, each of which is a wire carrying a binary signal, has 2⁶ entries, one for each possible combination of its inputs. Flip-flops implement storage: one flip-flop can store one bit of information. In addition to these logic elements, FPGAs also have blocks dedicated to certain common operations. These include digital signal processing (DSP) blocks for multiply-and-accumulate operations, and bulk storage memory blocks (BRAM). Together these components are known as the resources of the FPGA. The blocks are embedded in an interconnect fabric that carries signals between them.
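As a toy illustration of the lookup-table idea only (it has no counterpart in the thesis code), a 6-input LUT can be pictured as a 64-entry table that the FPGA tool chain fills in to reproduce any desired Boolean function of its inputs:

# Tabulate an arbitrary Boolean function of 6 binary inputs into the
# 2^6 = 64 entries of a lookup table; the carry-out function chosen below
# is an arbitrary example for this sketch.
def make_lut(func):
    table = []
    for value in range(64):
        bits = [(value >> n) & 1 for n in range(6)]   # the six input wires
        table.append(int(func(*bits)))
    return table

# Example: carry-out of adding two 3-bit numbers (a2 a1 a0) + (b2 b1 b0).
carry_lut = make_lut(lambda a0, a1, a2, b0, b1, b2:
                     (a0 + 2 * a1 + 4 * a2 + b0 + 2 * b1 + 4 * b2) >= 8)
print(len(carry_lut), carry_lut[-1])   # 64 entries; 7 + 7 produces a carry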
The (automated) pro-cess of making sure the signals carried by the interconnect fabric arrive at each computationalstage at the right time is known as meeting timing. The interested reader is referred to thetext by Hauck and DeHon [20] for details. As mentioned in section 1.2, the key advantage ofFPGAs is that all of the operations that have been hard-wired onto the chip are executed atonce, on different points of the data-stream. Making an algorithm more complex by addingarithmetical operations does not affect the execution time. Of course whether a design willsuccessfully build on an FPGA is subject to resource availability on the particular chip model.FPGAs can be used on their own, or as accelerators alongside conventional CPUs. In this case12FPGAacceleratorMemory(LMem)24 GBIntel CPU coresMemory130 GB PCIe2 GB/s 38 GB/sFigure 3.1: High-level overview of the compute node used in this work. Notethat the effective PCIe bandwidth is 2 GB/s in each direction independently. Thememory bandwidth is total in both directions, and is 96 bytes per memory clocktick. At the maximum memory clock frequency (400 MHz, Diamond [9]) thistranslates to 38 GB1/s. The memory clock is run slower in this work, giving anLMem bandwidth of 29 GB/s.the CPU handles most of the source code lines of the program, and supplies the FPGA withdata to execute in hardware the few source code lines responsible for most of the time spent.The target platform for our implementation is described by Pell et al. [44] and shown inFigure 3.1. Four accelerators (only one of which is shown in the figure), consisting of FPGAsand their associated dedicated large memory (LMem), are connected via a PCIe bus to theCPU host. The CPU host is an Intel Xeon E5-2670. The accelerator model used in this workis called Vectis. Each Vectis board has two FPGAs. The interface FPGA (Xilinx Virtex-6XC6VLX75T) is used to communicate with the CPU host, while the compute FPGA (XilinxVirtex-6 XC6VSX475T) is used for calculation. For a detailed specification of the FPGAs thereader is referred to documentation by Xilinx [64]. The operation of the interface FPGA iscompletely transparent to the user. When referring simply to the FPGA, I mean the computeFPGA. In addition to the dedicated large memory of the accelerator, each compute FPGA alsohas 4.6 MB of on-chip BRAM (also termed FMem, for fast memory), which has lower latencyof access. The accelerator is operated via a compiled binary that is called from a high levelnumerical processing environment (MATLAB) using an automatically generated C intermediatelayer known as a MEX file.Conceptually, an algorithm on an FPGA can be described as a directed graph, as is done byMaxeler Technologies [36]. The nodes of the graph are operators, and the edges represent dataflowing from one operation to the next along the interconnect fabric. Practically, this graph isdescribed using the Java programming language as documented in [36, 37].3.1 Related work on acceleratorsFPGA-based accelerators have been used in seismic processing for the last decade. An earlyexample of the acceleration of the kernel of pre-stack Kirchoff time migration with FPGAs isgiven by He et al. [21]. Migration is an alternative to full-waveform inversion for estimating the1Recall that 1 GB = 109 bytes.13Earth model m, and He et al. do not solve a linear system but rather use the FPGA as a filter toperform operations on a stream of seismic data. Fu et al. 
[14] have also used FPGAs to accelerateseismic migration, focusing on reducing floating point precision to increase throughput. Morerecently, Pell et al. [44] implement the time domain acoustic wave equation on the CPU host+ FPGA-based accelerator platform used in this work, and report a performance equivalent tothat of over 50 conventional CPU cores.Solving linear systems using FPGA-based accelerators has also seen recent developments.The Jacobi algorithm (described by Saad [51] and others), being highly parallel and simple toimplement, was ported to an FPGA accelerator early on by Morris and Prasanna [39]. Thatimplementation is applicable to general sparse matrices stored in compressed sparse row format.Conjugate gradients is more complex from a computational point of view and work has focusedon porting the matrix-vector multiplication that forms the kernel of CG to accelerators. Lopesand Constantinides [35] present such an implementation, and give a brief overview of pastefforts in this field. The CG matrix-vector product is parallelizable in various ways; Lopesand Constantinides use multiple linear systems to pipeline their design, which is optimizedfor small dense matrices. Alternative CG implementations are given by Morris et al. [40] andStrzodka and Göddeke [54]. The latter authors increase the amount of computation that canfit onto one chip by reducing precision of intermediate computations in a process known asiterative refinement. Iterative refinement is also used by Sun et al. [55] in an FPGA-basedimplementation of a direct solver using LU -decomposition. Another direct solver acceleratedwith FPGAs, this time via Cholesky decomposition of the normal equations, is presented byYang et al. [67].For completeness we also mention some of the work with GPU-based accelerators, althoughsince such units are more common than their FPGA counterparts, this survey is not compre-hensive. Elble et al. [12] have examined the improvements that CGMN and related algorithmsexperience on GPUs. Krueger et al. [31] compares the CPU, GPU and FPGA performance ofthe acoustic wave equation in the time domain using both computational and energy efficiencymetrics. Knibbe et al. [30] compare a GPU-accelerated time domain wave equation solver withan iterative frequency domain solver implemented solely on conventional CPU cores and findthe two approaches competitive with each other as long as the number of computational nodesper source is greater than two.Finally, Grüll et al. [19] present an implementation of the simultaneous algebraic reconstruc-tion technique (SART) on the same hardware platform (CPU host and Vectis model accelerator)used in the present work. SART is closely related to Kaczmarz row projections, and, to myknowledge, this is the most similar previous work to my own contribution. I defer a discussion ofthat implementation until Section 3.8, and now present the main technical challenges overcomein the course of the present work.14Number of gridpoints (×107)Time (s)0 0.5 1.0 1.5 2.02000400060008000104Kaczmarz sweeps (C)Rest of CGMN (MATLAB)Figure 3.2: Distribution of computation time before acceleration. The time spentin the Kaczmarz sweeps (implemented in C) and the rest of CGMN (implementedin MATLAB) are compared. Both parts of the algorithm are running on a con-ventional Intel Xeon core. CGMN solved the Overthrust system as described inChapter 4, with the right-hand side equal to A1. 
The algorithm was allowed toconverge to a relative residual tolerance of 10−6, so each system size ran for a dif-ferent number of iterations. The fraction of time CGMN spent in the Kaczmarzsweeps for this range of system sizes is 89± 0.4%.3.2 Data transferAlthough the CGMN algorithm is quite similar to the conventional conjugate gradient algo-rithm, the FPGA-based implementation proposed in the current work differs in an importantway from the CG implementations discussed in the previous section. Namely, the CG matrix-vector product is implemented via a double Kaczmarz sweep (line 5 of Algorithm 1). Thisallows us to apply CGMN to solve systems where the matrix A may be indefinite. Since thedouble Kaczmarz sweep implements the only matrix-vector operation in CGMN, it is the mosttime-consuming part of that algorithm. This statement is supported by the results of timingruns of CGMN before any acceleration is applied, shown in Figure 3.2, clearly motivating theidentification of DKSWP as the kernel of the PDE solver.As mentioned in Section 1.2, data transfer is usually the bottleneck when using accelerators.All instances of DKSWP in CGMN share the same matrix A(m, ω), so it is transferred to theaccelerator’s dedicated large memory at the start of execution. The first sweep of the CGMNinitialization phase (line 1 in Algorithm 1) requires the seismic source q as its right-hand side,15while all subsequent sweeps in CGMN require the zero vector as the right-hand side. Hencethis zero vector is also only transferred to the accelerator once. Together, these transfers takeapproximately (27 + 21 + 1 + 1)N ticks of the FPGA clock, where the extra factor of 21N isdue to zero-padding of the matrix, discussed in Section 3.5.Dataflow during the double Kaczmarz sweep using the accelerator proceeds in three stages.For concreteness I describe the stages referring to the data they use during the main while-loopof CGMN. First the iterate p is copied from the CPU host to LMem over the PCIe bus. Thistakes N ticks of the FPGA clock. Then the forward sweep is performed, during which p, theright-hand side 0 and rows of the matrix A are streamed to the FPGA from LMem and pis updated according to Equation 2.4. The result FKSWP(A,p,0, λ) is streamed back out toLMem, replacing p. The third stage consists of the backward sweep, during which the rows ofA, the elements of 0 and the result of the forward sweep are streamed in reverse order. Theresult DKSWP(A,p,0, λ) is this time streamed out to the CPU host via the PCIe bus. Becauseof the time necessary to fill and flush the buffers holding a sliding window onto the iterate (seeSection 3.3), each of the latter two steps take approximately N(1 + 2/niter) ticks of the FPGAclock. The matrix and right-hand side remain in LMem, ready to be read during the nextsweep which will occur on the next CGMN iteration. The CPU carries out the elementwiseand reduction operations necessary for CGMN, such as multiplying a vector (of length N) bya scalar, adding two vectors together, or taking inner products. These operations are on lines6–12 of Algorithm 1).3.3 Buffering: Reducing access to slow memoryFor an arbitrary matrix (not necessarily sparse), all elements of the Kaczmarz iterate xk needto be read to calculate the dot product in Equation 2.4. Each element then needs to be writtento with the updated values of xk+1. This would entail two passes over the entire computationalgrid on which xk is defined. 
However, because the Helmholtz matrix arises from a finite difference stencil (see Operto et al. [42]), it has very favourable structure: 27 non-zero diagonals (in the case of the 3 × 3 × 3 cube stencil used in this work), and zeros everywhere else. These diagonals are clustered around the main diagonal at offsets given by

i + j · nfast + k · nfast · nmed, where i, j, k ∈ {−1, 0, 1}, (3.1)

where indices i, j, k are used for convenience and do not refer to the indices defined in the Kaczmarz row projection formula (Equation 2.4). The quantities nfast and nmed are the number of grid points along the first two dimensions of the grid. This terminology refers to the fact that a three-dimensional quantity like m or u can be stored in linear memory in a variety of ways. One possibility is to store contiguously values corresponding to a line along the first dimension, followed by neighbouring lines that make up a two-dimensional slice, and finally neighbouring slices. Since memory is optimized for consecutive access (see Drepper [11] for a thorough treatment of issues relating to computer memory), elements along the first dimension can then be accessed quickly, while elements along other dimensions take longer. The correspondence between a three-dimensional grid and its storage in (one-dimensional) memory is illustrated in Figure 3.3. Note that a contiguous 3D sub-volume of the domain is not stored contiguously in memory. Instead, adjacent elements from neighbouring lines along the fast dimension are separated by small gaps of size approximately nfast, and adjacent elements from neighbouring slices are separated by large gaps of size approximately nfast · nmed.

Figure 3.3: Storing a 3D volume in 1D memory (illustrated for a 5 × 5 × 5 system). Three adjacent slices of a three-dimensional grid are shown, along with the corresponding storage locations in memory. The grid locations are labelled with their offsets in memory from the central point (marked with 0).

Due to the "few non-zero diagonals" structure of the Helmholtz matrix, illustrated in Figure 3.4, only 27 elements of the iterate need to be read and updated during each Kaczmarz row projection. These elements correspond to the non-zeros in row ai. Furthermore, the location of the needed elements changes smoothly as the projections sweep through the rows. This means that rather than needing random access to elements of the Kaczmarz iterate x, a relatively small contiguous working set of the elements is sufficient. This working set can be viewed as a sliding window over the iterate elements, or, alternatively, as a FIFO (first in, first out) pipeline. Because communicating with the large memory on the accelerator incurs a significant latency, the elements in the sliding window are buffered in the on-chip BRAM.

Figure 3.4: The Helmholtz linear system of equations (Au = q), illustrating that non-contiguous elements of the iterate (shown in red) must be read and written during each Kaczmarz row projection. Blue elements must be read but are not written. The smooth up-and-down arrow represents sequential element access, whereas the arrow that jumps around represents accelerator ordering, defined in Section 3.4.

The size of the window is approximately equal to 2 nfast nmed grid points; it contains the elements to be updated during a given row projection and all of the elements in between.
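The offsets of Formula 3.1 and the quoted window size are easy to check numerically. The short sketch below is illustrative only; the values of nfast and nmed are arbitrary examples.

# Offsets of the 27 non-zero diagonals (Formula 3.1) for example grid sizes.
nfast, nmed = 100, 100
offsets = sorted(i + j * nfast + k * nfast * nmed
                 for i in (-1, 0, 1) for j in (-1, 0, 1) for k in (-1, 0, 1))
print(len(offsets))                                # 27 diagonals

# The sliding window must hold every iterate element between the lowest and
# highest offset touched by one row projection: about 2 * nfast * nmed points.
window = offsets[-1] - offsets[0] + 1
print(window, 2 * nfast * nmed + 2 * nfast + 3)    # both 20203 for this example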
As yet unmodified elementsof xk are read into the sliding window from large memory, undergo the operations for all therow projections they are involved in (27 projections each) and are written back as elements ofxk+1. This is analogous to a CPU cache storing frequently used variables. The important dif-ference is that a CPU cache is managed by transparent low-level processes, whereas the designof buffering structures on the FPGA and orchestrating re-use of data is at least partially upto the programmer. The ability to use a sliding window approach to update elements of theKaczmarz iterate x means that rather than needing two data passes over the entire grid foreach row projection, only one pass is necessary for each of the forward and backward Kaczmarzsweeps.The FPGA’s fast on-chip memory has two limitations. First, a whole BRAM block (2304 bytes,Xilinx [64]) must be used at a time, even if only a small fraction of that block will actually bewritten to. More importantly, each block can only have two ports: for example, one read andone write port (see the documentation by Maxeler Technologies [37] for details). Therefore itis not possible to implement the sliding window into the Kaczmarz iterate xk as a single buffer.Instead, the contents of the sliding window are distributed among 26 individual buffers, oneeach for the 27 elements of xk (save the first) that are updated during one row projection. Thefirst element can be read directly into the row projection computation from LMem. Similarly,the element from the last buffer is written back to LMem after being updated in the computa-tion, since it will not receive any further updates during the given sweep. The addresses intothe buffers and into LMem are incremented every tick of the FPGA clock, with the buffer readaddress one entry ahead of the write address. The flow of data between LMem and the buffers18Figure 3.5: Overcoming latency of memory access by the use of sliding windowbuffers, described in the text for a 5 × 5 × 5 system. At any one point in time,substituting the contents of LMem with the contents of the buffers yields the Kacz-marz iterate xk. If the rows are processed from 1 to N in a forward sweep, k willbe the index of the middle element of the sliding window. The element writtenback to LMem from the last buffer will then be an element of FKSWP(A,x,b, λ).The row projection computation reads the oldest element from each buffer. (In theillustration this is the leftmost element in each buffer.)is shown in Figure 3.5. I remind the reader that it is necessary to run the FPGA for aboutnfastnmed clock cycles before activating the row projection calculation. This phase allows thebuffers to fill with valid data. Likewise, a flush phase at the end of each forward and backwardsweep drains the buffers while the updating of iterate elements is disabled. Because the size ofthe buffers must be fixed at compile time, the size of the fast and medium dimensions of thedomain must also be fixed.3.4 Parallelizing the Kaczmarz sweeps: Dealing with rowprojection latencyIf the rows of A are processed from 1 to N , each Kaczmarz iteration depends on the resultsof the last. Furthermore, each iteration’s computation takes many ticks of the FPGA clock.19This is the latency L, and is chiefly due to the pairwise summation of values in 〈ai,xk〉 (seeEquation 2.4).One method of dealing with latency is to slow down the rate of dataflow in the FPGA. Thiscan be done by reading a set of inputs (a row of A and elements of x and b) only once every Lticks. 
This will give time for the results of one row projection computation to be written to thebuffer, so that a consistent state of x is available for the next iteration. This approach has theFPGA processing data at 1/L of its actual operating frequency and is clearly unacceptable.To avoid having to wait for L ticks between iterations, L independent iterations are usedto fill the computational pipeline. Two sources of parallelism can be exploited. First, multipleHelmholtz problems sharing the same matrix A but different right-hand sides q can be solvedon the same accelerator. At each tick, the current element of one of the L different iterate andsource pairs would be read and the corresponding row projection started. Because the solutionof systems with multiple sources can be treated independently, updating one of the L iterateswould not affect the others. As this method requires sliding windows of the kind pictured inFigure 3.5 to be stored in the FPGA’s on-chip memory for the wavefield associated with eachsource, the size of fast memory is a limit to how many sources can be used for pipelining. Toavoid confusion for readers familiar with direct methods for solving linear systems, I emphasizethat working with multiple right-hand sides in this way is not analogous to the favourable scalingof execution time with the number of sources seen in direct solvers. Instead, parallelizationover several right-hand sides on the FPGA-based accelerator is a scheme to fully employ thecomputational facility of the FPGA, in the sense of processing one Kaczmarz row projectionevery tick of the FPGA clock.Another approach is to change the order in which DKSWP accesses the rows of A suchthat consecutive Kaczmarz iterations update disjoint sets of iterate elements. Mathematicallythis corresponds to picking L mutually orthogonal rows to process. From a geometrical pointof view, considering x as a 3D quantity, this corresponds to processing L successive 3 × 3 × 3cubes, such that none of the cubes overlap. Once L cubes have began undergoing computation,the row selection wraps back to process the iterate elements it missed the first time. Thiswrap-around repeats three times in total, until all the intervening cubes have been processed.I term this order of processing rows accelerator ordering, in contrast to the sequential orderingwhere the index i proceeds from 1 to N . In Section 4.2 I show experimental results for howsuch a reordering affects the convergence properties of the CGMN algorithm.Despite this modified order of row access, memory should be addressed sequentially as muchas possible, in order to keep the FPGA’s memory controller efficient. Hence, the rows of A andthe elements of q are reordered prior to being sent to the accelerator. However reordering theelements of the Kaczmarz iterate in the same way is not possible. To understand why, consideragain the layout of the buffers shown in Figure 3.5. The specific alternation of small, mediumand large buffers hard-codes the structure of the finite difference stencil. This structure, shownin Figure 3.3 and expressed as diagonal offsets of A in Formula 3.1, will only produce correct20results if the elements of x are ordered sequentially (lexicographically). Nevertheless, if L is nottoo large, it is possible to keep the reordering of the rows local enough that a sliding windowonto the iterate can still be maintained, at the cost of a more sophisticated buffer layout. 
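Note that rows whose indices differ by a stride of three along the fast dimension touch disjoint 3 × 3 × 3 neighbourhoods of the grid, which is the mutual orthogonality exploited here. As an illustration, the permutation of the sequential row indices 1 to N implied by this description could be generated as in the MATLAB sketch below, which takes every third row along a fast line before wrapping back; the exact ordering realized in hardware may differ in detail.

% Sketch of an "accelerator ordering" permutation of the row indices 1..N:
% within each line of nfast consecutive rows along the fast dimension, every
% third row is taken, wrapping back twice to pick up the rows skipped.
% (Illustrative only; the hardware ordering may differ in detail.)
nfast = 432; nmed = 240; nslow = 25;               % example grid, nfast = 3L
within_line = [1:3:nfast, 2:3:nfast, 3:3:nfast];   % order inside one fast line
nlines = nmed * nslow;
perm = zeros(nfast * nlines, 1);
for line = 1:nlines
  perm((line-1)*nfast + (1:nfast)) = (line-1)*nfast + within_line;
end
% A_reordered = A(perm, :);  q_reordered = q(perm);  % rows and source reordered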
SimonTilbury of Maxeler Technologies made a significant contribution to this work by devising sucha buffer layout and implementing it on the accelerator. In his scheme the fast dimension of thedomain should be 3L grid points long to enable efficient use of on-chip memory for the buffers.This revised buffering approach enables the full pipelining of the Kaczmarz row projectionalgorithm, allowing the computation to run at the operational frequency of the FPGA, ratherthan being slowed by a factor of 1/L.Both the multiple right-hand side approach and the row reordering approach could beimplemented together, with a multiplicative effect on addressing the latency. For example, ifthe latency L is equal to 100 FPGA clock ticks, one might process 10 different right-hand sides,while re-ordering the rows of A so that 10 mutually orthogonal rows are chosen successively.However at the present I use the row reordering pipelining method exclusively, and do not usemultiple sources to fill the row projection pipeline on the FPGA.3.5 Number representation and storageTo take advantage of the special structure of A, I do not use a general sparse matrix storageformat, but rather follow van Leeuwen et al. [62] in storing the diagonals of A in a rectangulararray and the offsets of the diagonals in a separate short vector idx. Since the chief operationin the Kaczmarz algorithm is an inner product between a matrix row and the iterate vector,the array of diagonals is stored in row major order (that is, the diagonals are not stored con-tiguously). This produces an efficient access pattern of the matrix from memory. The compiler,developed by Maxeler Technologies [36], that translates the Java graph-based description of theFPGA design into a lower level hardware description language offers a complex data type. Totake advantage of this, it is convenient to store complex vectors and the matrix with the realand imaginary parts interleaved. The data is thus rearranged from its native MATLAB format,which stores the real and imaginary parts separately, before being sent to the accelerator.Unlike the memory attached to the CPU host, the dedicated large memory attached toeach FPGA cannot be addressed at the byte granularity level. Instead, addresses into LMemhave to be multiples of the burst size, and each chunk returned by the memory controller isalso a multiple of the burst size. On the Vectis accelerator model used in this work, a burstsize is 384 bytes (3072 bits), as confirmed by Diamond [9]. Furthermore, each stream of inputinto the FPGA from LMem must be either a multiple or an integer fraction of the burst size.For example, a stream that requests one double precision floating point number (8 bytes) perFPGA clock tick is allowed, but a stream that requests five doubles is not allowed because(3072 bits)/(5 doubles× 8 bytes per double× 8 bits per byte) is not an integer.The Kaczmarz algorithm requires reading all 27 non-zeros in a row of the matrix every tickof the FPGA clock. This means that the row as stored in LMem must take up a multiple (or2125 30 35 40 45 50 55 6000.20.40.60.81Matrix row storage efficiencyNumber of bits in a real numberUsed in the current workFigure 3.6: Matrix row storage efficiency as a function of the number of bits used torepresent the real and imaginary parts of each matrix element. This figure assumes27 elements are stored per row. Note that the three efficiency “modes” use differentamounts of storage for one row: half a burst, a full burst (384 bytes) and two bursts,respectively. 
Single precision (32 bit real numbers, resulting in one burst per row)is used in this work.an integer fraction) of the burst size. The row datatype is made up of 27 complex numbersstored at some precision, and zero bit padding that is used to fill out the datatype to thenearest integer fraction or multiple of the burst size. During computation, the zero paddingis discarded. There are three factors to consider when deciding on the bitwidth of the rowdatatype. The first is efficiency of matrix storage in memory: the row datatype should be aswide as possible without going over the nearest integer fraction or multiple of the burst size.Memory efficiency of a range of row datatype bitwidths is shown in Figure 3.6. Note thatfor row datatypes that take up half a burst, the matrix may have to be padded to give it aneven number of rows. The second factor is the available memory bandwidth. Picking a veryprecise number representation that necessitates two bursts for each matrix row means that thedesign will be limited by memory bandwidth at lower FPGA operational frequencies. Finally,wide datatypes will use more FPGA resources (like LUTs and flip-flops) and make it moredifficult (or impossible) to arrange the design elements on the physical chip. One approach toincrease performance on both of these factors is to store and transfer numbers in a compressedformat, unpacking them only for the computation. Another is to implement different parts ofthe algorithm with different precision, which is done by Strzodka and Göddeke [54] and Sunet al. [55]. This adds an extra level of complexity to the development and has not yet beenattempted by the author.The Kaczmarz algorithm also requires one complex element from the iterate and right-hand22Number of bits in a realnumberNumber of bits in acomplex numberComplex numbersper burst24 48 6432 (single precision) 64 4848 96 3264 (double precision) 128 24Table 3.1: Possible bitwidths for complex elements of the Kaczmarz iterate andright-hand side vectors. Bitwidths are constrained by needing to be an integerfraction of the FPGA memory controller’s burst size. Single precision is used in thiswork.side vectors at every clock tick, which means that the vector elements must be an integer fractionof the burst size. Possible options for the bitwidth of the vector element datatype are shown inTable 3.1. As a compromise between the need for an adequately precise number representationand a design that fits onto the FPGA, I settle on 32 bits for a real number. As shown by thework of Strzodka and Göddeke [54], CG will still converge even if the preconditioner (which inour case is implemented with the Kaczmarz sweeps) computes with precision as low as 28 bits.In addition, I perform a toy model experiment comparing the convergence properties of CGMNwith double and single precision Kaczmarz sweeps. The results are shown in Figure 3.7, andconfirm that single precision is sufficient for Kaczmarz sweeps to ensure CGMN convergence.In contrast to Strzodka and Göddeke, I do not find that using a reduced precision (32 bits)for the Kaczmarz sweeps increases the number of CG iterations that have to be performed. Ialso note that Nemeth et al. [41] use single precision in their FPGA-based time domain waveequation simulations, while Pell et al. [44] use compression to represent single precision floatsin a mere 16 bits. Both authors mention the sufficiency of such precision schemes for (timedomain) geophysical applications. 
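The burst-size constraints described above amount to simple arithmetic. The MATLAB fragment below checks that a 64-bit complex element divides the 3072-bit burst (cf. Table 3.1) and computes the matrix-row storage efficiency of Figure 3.6 for a given real bitwidth; the list of candidate padded row sizes is illustrative.

% Check which complex-element bitwidths divide the 3072-bit burst (Table 3.1),
% and compute matrix-row storage efficiency for a given real bitwidth
% (cf. Figure 3.6). Purely illustrative arithmetic.
burst_bits = 384 * 8;                         % 3072-bit LMem burst
real_bits  = 32;                              % single precision, as used here
elem_bits  = 2 * real_bits;                   % complex = real + imaginary
assert(mod(burst_bits, elem_bits) == 0);      % 48 complex elements per burst
row_bits   = 27 * elem_bits;                  % one row: 27 complex non-zeros
% Pad the row up to the nearest integer fraction or multiple of a burst:
divisors   = [8 6 4 3 2 1];                   % some integer fractions of a burst
candidates = [burst_bits ./ divisors, burst_bits * (1:4)];
stored_bits = min(candidates(candidates >= row_bits));
efficiency  = row_bits / stored_bits;         % = 1728/3072 = 0.5625 for 32 bits

For 32-bit reals this gives 48 complex vector elements per burst and a row storage efficiency of 27·64/3072 ≈ 0.56, consistent with the one-burst-per-row mode of Figure 3.6.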
Although it is not necessary to store the individual non-zeroelements of a row at the same precision as the elements of the iterate and right-hand sidevectors, this is being done in the current accelerated implementation for simplicity. I note thatthis is not optimal in terms of matrix storage efficiency.3.6 The backward sweep: Double bufferingFor the backward sweep, the rows of the matrix and the iterate and right-hand side elementsneed to be accessed from LMem in reverse. This is simple for the matrix, provided each rowtakes up one or more bursts in memory; it suffices to have a decrementing counter as theaddress. The case of the vector is more complicated because while bursts can be accessed inreverse order, the vector elements within each burst will still be in forward order. To resolvethis issue I maintain a buffer two bursts long in BRAM. One half of the buffer is written to bythe input stream, one vector element per clock tick. The other half is being read from by theKaczmarz iteration computation. Every T ticks, where T is the number of vector elements in a23IterationRelative residual100 200 300 40010-610-410-211Figure 3.7: Effect of sweep precision on CGMN convergence. The relative residu-als of a single precision CGMN run (circles) and a double precision run (solid line)agree to within single precision; the two convergence curves overlap. The systemsolved is a subsampling of the SEG/EAGE Overthrust model (shown in Chapter4) that results in a 47 × 201 × 201 grid. A source at ω = 3 Hz, equal to A1 wasused.burst, the role of the two halves of the buffer switches. This is a well known technique knownas double buffering; it is illustrated in Figure 3.8. Note that if a row takes up only half a burst,the matrix will also have to be double buffered during the backward sweep. When the result ofBKSWP is returned back to the CPU host, it is in reverse order and needs to be post-processedbefore being used in the rest of the CGMN calculation. Likewise, if multiple sweeps are doneby the accelerator before returning back to the CPU host, forward sweeps after the first alsoneed to double buffer the input to return it to forward order.3.7 Bandwidth requirementsRecall that the advantage of an FPGA is that the algorithm can be made more complex (i.e.more arithmetic operations) without impacting the execution time. This is because of the factthat all of the computations on the chip happen every clock tick. Computation is done “inspace” as opposed to “in time”. Execution time depends on the amount of data to process.The underlying assumption is that the FPGA always has data to work on. However if thespeed with which data is processed exceeds the speed at which data is streamed into theFPGA from external sources, the algorithm is said to be bandwidth limited. There are two24First 48 ticksSecond 48 ticks burst iburst i - 1burst i - 2burst i - 1WRITEWRITEREADREADFigure 3.8: Double buffering during the backward Kaczmarz sweep is needed toreverse the order of vector elements which are accessed in burst-sized chunks frommemory.communication channels between the FPGA and the outside world, and hence two potentialsources of bandwidth limitation: the PCIe bus connecting the accelerator to the host, andthe memory bus connecting the FPGA and its dedicated large memory. This is illustrated inFigure 3.1. 
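Before quantifying these two channels, the burst-reversal idea of Section 3.6 can be emulated on the host in a few lines of MATLAB. The snippet reproduces only the net effect of the two-half buffer, not its tick-by-tick behaviour, and the burst length and vector size are illustrative.

% Host-side emulation of the net effect of the double buffer in Section 3.6:
% LMem returns bursts in reverse burst order, but elements inside each burst
% still arrive in forward order; reversing each burst-sized chunk as it passes
% through the two-half BRAM buffer yields the fully reversed vector.
T = 48;                                   % complex elements per burst
x = (1:10*T).';                           % a vector occupying 10 bursts
nburst = numel(x) / T;
stream_in = zeros(size(x));
for b = 1:nburst                          % bursts requested in reverse order...
  src = (nburst - b)*T + (1:T);           % ...but read forward within a burst
  stream_in((b-1)*T + (1:T)) = x(src);
end
reversed = reshape(flipud(reshape(stream_in, T, [])), [], 1);
assert(isequal(reversed, flipud(x)));     % matches the fully reversed vector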
It is worthwhile to note that the bandwidth of the PCIe link is actually determinedby the connection between the interface and compute FPGAs, not the link between the CPUhost and the interface FPGA, as confirmed by Pell [43]. Of course all bandwidths are theoreticalmaxima and will be impacted by the pattern in which data is accessed. In general, transferringa few large chunks is more efficient than transferring many smaller chunks (see Drepper [11] fordetails). The general strategy in implementing an algorithm on the accelerator is to minimizecommunication. Data transferred to the accelerator over the PCIe bus should be stored in LMemand re-used. If several problems share data, it should be loaded into LMem just once. If a streamcan be generated on the FPGA, thus doing away with data transfer completely, this should bedone. The communication bandwidth required by the current accelerated implementation ofthe double Kaczmarz sweep is given in Table 3.2. At the memory clock frequency of 303 MHzused in this work, the algorithm is memory bandwidth limited at an operational frequencyof the FPGA of 71 MHz or higher. In Section 4.4 I discuss improvements that will ease thisbandwidth limitation.3.8 Related work: computed tomographyIn this section I briefly describe closely related work by Grüll et al. [19]. I first introduce thedifference between Kaczmarz row projections and the algorithm used by those authors.The Kaczmarz algorithm (Equation 2.4) has been independently discovered in the medicalimaging community by Herman et al. [22], where it is known as the (unconstrained) algebraicreconstruction technique (ART). ART is used in computed tomography to reconstruct a threedimensional image of an object based on two dimensional projections onto a detector. Ratherthan being derived from a finite difference discretization of the wave equation, the matrix A25Loading phase FKSWP BKSWPVector elements from LMem 0 2: p,0 2: FKSWP(p),0Vector elements to LMem 1: p 1: FKSWP(p) 0Matrix rows from LMem 0 1 1Size of each vector element 8 bytesSize of matrix row 384 bytesDRAM bandwidth required 8 bytes 408 bytes 400 bytesVectors in from PCIe 1: p 0 0Vectors out to PCIe 0 1: FKSWP(p) 1: DKSWP(p)PCIe bandwidth required 8 bytes 8 bytes 8 bytesTable 3.2: Kaczmarz double sweep communication bandwidth requirements(per FPGA clock tick) for the current accelerated version. The notation used isapplicable to the main while-loop of CGMN; the names of the vectors transferredwill differ for sweeps in CGMN’s initialization phase. Note that the forward sweeptransfers its result to the CPU host even though it is not used in CGMN. Full sweepoperator arguments have been left out for brevity.in computed tomography is the discrete Radon transform, as noted by Yan [66]. The patternof non-zeros in matrix row ai corresponds to those voxels (small volume elements analogous topixels in 3D) that are projected onto the detector at pixel i. Because the number of pixels inthe 2D projection is smaller than the number of voxels in the volume to be reconstructed, thematrix A is underdetermined. It is possible to “stack” several such underdetermined systems,each corresponding to a single 2D projected image, and consider them as one (typically stillunderdetermined) system. I note that in full-waveform inversion the solution of the linearsystem 2.1 is only one step in an optimization algorithm to find the Earth model m. 
Incontrast, in computed tomography, solving the Radon system immediately yields the quantityof interest: the three dimensional volume.A variant of ART is the simultaneous iterative reconstruction technique (SIRT), which isequivalent to the Jacobi iteration on the normal equations for the error (Cimmino’s method, aspresented by Elble et al. [12]). SIRT proceeds by computing the row-wise update term for allthe rows at the same time and averaging the updates before applying them to the iterate, asdescribed by Kak and Slaney [29]. For “stacked” systems it is conventional to apply the updateone at a time to each block of rows that corresponds to one projected image. In this case theequivalence to Cimmino’s method is lost. Yet another popular variation is simultaneous ART(SART), described by Andersen and Kak [2], which differs from SIRT in how the updates areaveraged, as well as by replacing the square of the row norm ‖ai‖2 in Equation 2.4 by a sum ofthe row entries. This is the variation used by Grüll et al. [19]. SART is described in a rigorousway by Yan [66].SART features a higher degree of parallelism than ART and for this reason is favoured forGPU implementations, such as those by Castaño-Díez et al. [6] and Xu et al. [65]. This is26because all the row projections for a SART sweep can be calculated in parallel, averaged andapplied at the same time. Compare the inherently sequential nature of the Kaczmarz algorithm,which requires operating on mutually orthogonal rows in order to circumvent the latency of oneKaczmarz iteration. Instead of facing the challenge of pipelining a sequential algorithm, theFPGA-accelerated implementation presented by Grüll et al. [19] overcomes a different challenge.There is an important advantage enjoyed by the Helmholtz matrix (and other matrices derivedfrom finite difference methods) over the matrices used in tomography: regular banded structure.Grüll et al. give two alternatives for dealing with the highly variable location of non-zeros withina matrix row. The first involves row-wise access, but since this requires random access of theiterate elements (which are stored in memory), it is not practical. The other takes advantageof the fact that the matrix has only 4 non-zeros per column. (This is due to the fact that eachvoxel can contribute to at most 4 image pixels.) The implementation thus proceeds through thematrix column-wise, accumulating all of the N update terms needed for SART into temporarystorage. The heart of the implementation is the calculation of projection-ray-voxel intersectionsthat form the non-zeros of the matrix, rather than a buffering scheme to implement a slidingwindow onto the iterate, as in the present work.[\\This chapter has conceptually described the implementation of the Kaczmarz row projectionalgorithm on an FPGA-based accelerator. This algorithm is used as the kernel of an iterativelinear system solver, CGMN. While this chapter has outlined my contribution, my work is moreaccurately represented by the approximately 4000 lines of source code that make up the project,and 84 version control system revisions made from March 2013 to April 2014. The next chaptershows performance results of this implementation when it is used to simulate seismic wavesthrough a synthetic three-dimensional Earth model.27Chapter 4Performance results and discussionTo test the performance of the accelerated iterative Helmholtz solver, I compared two im-plementations. 
The reference implementation has CGMN implemented in MATLAB and theKaczmarz kernel implemented as a C MEX file. Both CGMN and its kernel are run on the CPUhost. The accelerated implementation also implements CGMN in MATLAB, but the kernel isinstead run on the Vectis FPGA-based accelerator at an operational frequency of 100 MHz. Itis worth noting that a MEX file is still required to interface between MATLAB and the accel-erator; it is automatically generated during the FPGA bitstream build process. As mentionedin Section 3, the host CPU is an Intel Xeon E5-2670.The reference implementation is run in single-threaded mode, using only one CPU core. Onthe one hand, this can be seen as a fair comparison with one FPGA-based accelerator, since itmeans that the degrees of parallelization in both the reference and accelerated implementationsare the same. Both implementations are running the serial linear system solver CGMN. On theother hand, the restriction of the scope of the experiment to one accelerator and one core issomewhat arbitrary. Working within a fixed space (1 rack unit), power or budgetary constraintwould be an alternative.The 3D acoustic velocity model used to generate a Helmholtz matrix for these tests is shownin Figure 4.1, it is a part of the SEG/EAGE Overthrust model developed by Aminzadeh et al.[1]. The first quadrant of the full model was used, corresponding to the coordinate ranges0 ≤ x < 432, 0 ≤ y < 240, 0 ≤ z < 187. The size of the system was varied by changing z, whichincreases down in Figure 4.1. A perfectly matched layer of 5 grid points was used on all sides ofthe domain, irrespective of system size. The frequency ω was fixed at 12 Hz to ensure at least6 grid points per wavelength throughout the model. This is within the limit recommended byOperto et al. [42].Two types of sources were used in the majority of the experiments: a point source placednear the middle of the top of the domain and a wavefield source. The wavefield source setsq equal to the solution of the Helmholtz system with a point source for a given domain size.(This solution must be calculated in advance.) The two types of sources emulate the forwardand adjoint modes of solution of the wave equation (see Section 1.1). A third source was used28246km/s432240187Figure 4.1: A part of the SEG/EAGE Overthrust velocity model was used toconstruct the Helmholtz matrix A(m, ω). This model was used for testing theperformance of the FPGA-based accelerator. The number of grid points is shownnext to each axis.0.50-0.5PaFigure 4.2: An example pressure wavefield solution u. The real part of the Fouriertransform of the pressure is shown. The ripple signature of the point source usedto create the wavefield is clearly visible. Using the accelerated implementationrunning on one Intel Xeon E5-2670 core and one Vectis FPGA-based accelerator,the solution shown took 2703 CGMN iterations to converge to a relative residualnorm of 10−6. This took 9826 s, or approximately 2:45 hours.298001000Number of gridpoints (×107)Time (s)0.5 1.0 1.5 2.00200400600Intel Xeon E5-2670 coreCore + FPGA-based acceleratorFigure 4.3: Time to execute 100 CGMN iterations on one Intel Xeon E5-2670 coreis compared to the time taken on a combination of one such core and one FPGA-based, Vectis model accelerator. The accelerator is running at 100 MHz. Thesystem solved is shown in Figure 4.1, with a source equal to A1.in the experiments described in the next section: q = A1. That is, the source was computedby multiplying a vector of ones by the Helmholtz matrix. 
This is done in order to know thetrue solution for evaluating how the error of the CGMN iterates behaves. I note that since theproblem is formulated in the frequency domain, all sources are time-harmonic, corresponding toa perfect sinusoidal oscillator at 12 Hz. An example solution wavefield is shown in Figure 4.2.4.1 Execution time and FPGA resource usageFigure 4.3 compares end-to-end execution times for 100 CGMN iterations using the referenceand accelerated implementations. These times do not include the time necessary to calculatethe Helmholtz matrix A, but do include the time during initialization necessary to transfer thismatrix and other vectors to the dedicated large memory of the accelerator. A speed-up of over2× is seen when the kernel is run on the accelerator. However when we examine the proportionof time spent in the kernel vs. the rest of CGMN, shown in Figure 4.4, we find that the kernelitself is accelerated by a factor of approximately 30, and accounts for only a small part ofthe time spent by the accelerated implementation. The rest is taken up by overhead resultingfrom two inefficiencies in the design. First, the real and imaginary parts of complex vectorsproduced in MATLAB are stored separately and must be interleaved before being passed to theaccelerator. Second, the vector that is the output of the backward sweep is in reverse order,30ReferenceImplementation AcceleratedImplementationKaczmarz sweepsRest of CGMNFigure 4.4: Distribution of computation time before and after acceleration. The“rest of CGMN” consists of reduction and elementwise operations. This proportionis stable with varying system size: the kernel takes up 89% ± 0.4 and 3% ± 0.3 ofthe time in the reference and accelerated implementations, respectively.Fraction of available resources0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1LUTsFFsBRAMsDSPsinner product: 〈ai,xk〉row projection component: × ai*updating the iterate: xk + ... buffers memory controllerkernel (other)automatically generatedlow-level components (manager)Figure 4.5: FPGA resource usage for lookup tables, flip-flops, block RAM (on-chipmemory), and digital signal processing blocks. The design components that usea large fraction of the resources are labelled. Note that the memory controller isgenerated automatically.31and needs to have its elements rearranged before being passed to the parts of CGMN that areimplemented in MATLAB.The fraction of FPGA resources used by the accelerated implementation is shown in Fig-ure 4.5.4.2 Effects of reordering rows of the Helmholtz matrixIn Section 3.4 we saw that in order to keep the computational pipeline of the FPGA full it isnecessary to access the rows of the matrix A non-sequentially. This accelerator ordering affectsthe number of iterations CGMN takes to converge. To evaluate whether this is a large effectthat might significantly detract from the speed-up per iteration demonstrated in the previoussection, I solved the Helmholtz system to a prescribed tolerance of 10−6. Three different systemsizes were used: 432×240×{25, 100, 187} and two different seismic sources: a point source anda wavefield source, as described at the beginning of this chapter.The results of this experiment are shown in Figure 4.6. For both the point and wavefieldsources, accelerator ordering actually somewhat improves convergence at lower relative residualnorm tolerance. For higher tolerance this effect is reversed, but also becomes less pronounced. 
Insummary, changing the order in which the rows of A are used for the Kaczmarz row projectionsin the CGMN kernel to accelerator ordering does not detract from the performance of CGMN.In fact it could even be advantageous in cases where imprecise PDE solves are used in theinitial iterations of full-waveform inversion, as suggested by van Leeuwen and Herrmann [61]and Herrmann et al. [23].4.3 Multiple double Kaczmarz sweeps per CGMN iterationThe top plot in Figure 4.7 shows how convergence of CGMN is affected by running more thanone double Kaczmarz sweep per CGMN iteration, as discussed in Section 2.4. As expected,doing more double Kaczmarz sweeps decreases the number of CGMN iterations required toconverge to a given tolerance, but each of those iterations is now more expensive.It is clear that since the Kaczmarz kernel is where most of the reference implementationspends its time (see Figure 4.4), doing more than one double sweep would not benefit thereference implementation. However for the accelerated implementation, since the Kaczmarzkernel is no longer the hotspot of the code, it makes sense to ask whether multiple sweeps perCGMN iteration might be advantageous. Because the overhead of each CGMN iteration islarge compared to the cost of running the sweeps, running additional sweeps comes with onlya small price.Nevertheless, the current version of the accelerated implementation does not completelyeliminate the increase in the time spent doing the reduction and elementwise operations ofCGMN as the number of multiple sweeps per iteration is increased. Thus, as seen in thebottom plot of Figure 4.7, multiple sweeps still do not offer a decrease in net computation time.32IterationRelative residual norm1 10 100 1000 10410-610-510-410-310-20.11pointsourcewavefieldsource1 to N (sequential)Accelerator orderingFigure 4.6: Impact of matrix row ordering on CGMN convergence. Two waysof ordering the rows are shown: 1 to N (sequential) and accelerator ordering.Accelerator ordering takes every third row along the fast (x) dimension and wrapsback around upon reaching the end of a line along that dimension. (See section3.4 for details.) The group of six curves toward the upper right of the plot weregenerated by solving a system with a wavefield source, and the six curves towardthe bottom left were generated by solving a system with a point source. Threedifferent system sizes are shown, corresponding to 2.6×106, 1.0×107 and 1.9×107gridpoints. Larger systems take progressively more iterations to converge for bothsources.It is interesting to note however that if the overhead due to each CGMN iteration were heldconstant as the number of multiple sweeps is increased, performing 2–16 double sweeps wouldoffer the best trade-off. Increasing the number of sweeps per CGMN iteration past this pointwould result in an increase in computational time.Finally, I note that running multiple double Kaczmarz sweeps per CGMN iteration is aninterim measure to attempt to make the most of the current version of the accelerated imple-mentation. 
The end goal is an implementation of all of CGMN on the FPGA-based accelerator,as discussed in the next section.33Number of double Kaczmarz sweeps / CGMN iteration1 10 100 10000100200300400500600Iterations to converge to a relative residual norm < 10-6Total timeKaczmarz sweepsRest of CGMNNumber of double Kaczmarz sweeps / CGMN iteration1 10 100 10000500100015002000Time to converge to a relative residual tolerance of 10-6Figure 4.7: Effect of multiple sweeps per CGMN iteration on CGMN conver-gence. The top plot shows the number of iterations required to converge, and thebottom plot shows the end-to-end time taken (by the accelerated implementation).The total time is the sum of the time spent in the Kaczmarz sweeps and the re-duction and elementwise operations that make up the rest of CGMN. Even thoughfewer CGMN iterations are required when multiple double Kaczmarz sweeps areperformed at each iteration, the decrease is outweighed by the corresponding in-crease in computational time. The system is a 432×240×25 part of the SEG/EAGEOverthrust system with a point source right-hand side.344.4 Planned improvementsIn the course of this work I have defined a to-do list of specific improvements to the currentaccelerated implementation. These fall into two categories: making the accelerated CGMNsolver more readily useful to my colleagues (measured by uptake of the solver by users), andimproving its performance (measured chiefly by end-to-end time taken by the solver to convergeto a particular tolerance).The main impediment to the use of my work in its current state is the requirement thatthe number of gridpoints along the fast dimension be equal to 3L = 432, and the compile-time definition of the number of grid-points along the medium dimension. Re-ordering thedimensions of the Earth model before using it to compute A allows to specify one dimensionat run time. Ultimately the size of the FPGA’s fast on-chip memory limits the size of the firsttwo dimensions of a block that can be processed at one time. A full solution would enablethe user to specify an arbitrary domain at run time and leave it to the solver to transparentlydecompose it into blocks that fit in the on-chip memory.As is clear from Figure 4.4, the most time-consuming part of the accelerated CGMN solver isno longer the Kaczmarz row projection kernel but the rest of CGMN. The most comprehensivemethod of dealing with this is to port all of CGMN to the accelerator. Once this has beendone, the next step will be to increase the throughput of CGMN by reducing the number ofpasses over the data to two per CGMN iteration (as discussed in Appendix A) and compilingthe design for a higher memory clock frequency and FPGA operational frequency. In orderto avoid becoming bandwidth-limited at these higher frequencies, one option will be to changethe representation of matrix elements to 28 bits per real number, as suggested by Figure 3.6.I term such a future implementation of CGMN on the accelerator the efficient matrix storageimplementation. However the most effective way to reduce bandwidth requirements will be totransfer m to the accelerator and generate the elements of A as they are needed. Such an onthe fly implementation would have the added benefit of moving algorithm complexity off theCPU host to the accelerator at no extra cost in computation time, and is the best way to realizethe full potential of this computing platform. An estimate of the expected speed-up of the “onthe fly” implementation can be gleaned from the right plot in Figure 4.4. 
Getting rid of thereduction and elementwise operations’ contribution to the execution time (since they will beimplemented on the FPGA) will yield an approximate speed-up of 30–60×, depending on theoperational frequency of the FPGA.Table 4.1 shows the bandwidth requirements of such future implementations of CGMN onthe accelerator, they should be compared with Table 3.2. Figure 4.8 illustrates how thoserequirements affect the range of FPGA operational frequencies.I note that the current accelerated implementation is limited by LMem bandwidth at anFPGA clock frequency of 71 MHz (94 MHz if the maximum memory clock frequency is set atcompile time). The efficient matrix storage implementation will be bandwidth limited beyond145 MHz. The “matrix on the fly” version is not limited by memory bandwidth within the35First pass (incl. FKSWP) Second pass (incl. BKSWP)Vector elements from LMem 5 (6†): r,u,p, s,0,m† 3 (4†): p, tmp,0,m†Vector elements to LMem 4: r,u,p, tmp 1: sMatrix rows from LMem 0 0Size of each vector element 8 bytesSize of matrix row 192 bytes, (0 bytes)†DRAM bandwidth required 264 bytes (80 bytes)† 216 bytes (40 bytes)†Vectors in from PCIe 0 0Vectors out to PCIe 0 0PCIe bandwidth required 0 bytes 0 bytesTable 4.1: CGMN (future) communication bandwidth requirements (per FPGAclock tick) for the implementation that puts all of CGMN on the accelerator, entirelydoing away with MATLAB for the main while loop. The table applies for theefficient matrix storage implementation; values for the “matrix on the fly” versionare marked with a dagger†. Note that these requirements apply for the main while-loop of CGMN; the matrix, iterate and right-hand side still have to be transferredto the accelerator’s dedicated large memory before CGMN starts. Likewise the finaliterate will be transferred back to the host when the algorithm either converges orreaches the maximum number of iterations.range of frequencies possible on the FPGA. Arranging the physical resources of the chip (LUTs,flip-flops, DSPs and BRAM) becomes difficult at high frequencies. This ability to meet timingwill limit the FPGA frequency for the future “matrix on the fly” implementation of CGMN onthe accelerator.Since CGMN requires a zero right-hand side for most of its double Kaczmarz sweeps, itmay be tempting to make a further change to the implementation by substituting a zero-streamgenerated on the FPGA for reading the vector b in from LMem. However since the reductionin communication this change would produce would be quite small (compared to improvingthe efficiency of matrix row storage or generating the matrix elements on the fly) and woulddecrease the generality of the sweeps, I do not recommend this change.Finally, I note that the compute node on which this work was done actually has four FPGA-based accelerators, each attached to the CPU host via its own PCIe link, and interconnectedwith each other in a square topology. Coarse-grained parallelization by solving problems withdifferent right-hand sides q (and even different Helmholtz matrices A) on each accelerator isimmediately possible, and has been done by the author in an ad-hoc, manual manner. The sys-tematic use of all four accelerators for parallelization of frequency domain FWI across sourcesand frequencies is a straightforward extension. 
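A minimal sketch of what such source-level parallelization could look like from the MATLAB host is given below; solve_helmholtz is a hypothetical wrapper standing in for the accelerated CGMN solve bound to one of the accelerators, and is not part of the present code base.

% Sketch of coarse-grained parallelization across the node's four accelerators:
% each parallel worker is assigned one accelerator and solves an independent
% Helmholtz problem (one right-hand side). solve_helmholtz is a hypothetical
% wrapper around the accelerated CGMN solve bound to accelerator number dev.
nacc = 4;                                 % FPGA-based accelerators on the node
% sources is a cell array of right-hand side vectors q, one per seismic source;
% A is the Helmholtz matrix (or its diagonal storage), assumed already built.
nsrc = numel(sources);
u = cell(1, nsrc);
% parpool(nacc);                          % one MATLAB worker per accelerator
parfor s = 1:nsrc
  dev = mod(s - 1, nacc);                 % accelerator assigned to this source
  u{s} = solve_helmholtz(A, sources{s}, dev);   % hypothetical wrapper
end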
The performance of the accelerated implemen-tation on multiple accelerators is expected to scale in the same way as performance of a hostapplication when using multiple nodes.I emphasize that the important goal is not to implement one form of parallelization inpreference to another, but to maximize the throughput of all computational units available36FPGA frequency (MHz)LMem and PCIe bandwidth (GB/s)100 150 200 2500510152025303540currentaccelerated75CGMN on the FPGA,matrix on the flyPCIe requirement for currentaccelerated versionCGMN on the FPGAefficient matrix storageFigure 4.8: Communication bandwidth requirements for LMem and the PCIe busare affected by the operating frequency of the FPGA and the details of the im-plementation. The two dashed red lines show the theoretical hardware bandwidthlimits for memory (top) and the PCIe bus (bottom). Note that the implementa-tions with all of CGMN on the FPGA have no host PCIe bandwidth requirementsafter CGMN initialization.to us. This should be done using the most straightforward mode of parallelism possible. Forthe platform studied in this work, reordering the rows of the Helmholtz matrix as discussed inSection 3.4, and solving different Helmholtz problems on each of the four accelerators is theapproach I recommend as long as the wavefield and other CGMN vectors fit into the 24 GBdedicated memory of one accelerator. I assume here that all of CGMN is implemented onthe accelerator. Solving problems whose domains do not fit into one accelerator’s memorycould be dealt with by using a row-block parallelization approach named CARP-CG, suggestedby Gordon and Gordon [17]. Those authors provide a block-parallelization of the Kaczmarzalgorithm termed CARP that performs double sweeps on sets of rows in parallel, averagingthose elements of u that are updated by sweeps on more than one set. Such an approach wouldnecessitate exchange of shared elements, which adds a layer of complexity to the dataflowparadigm.37Chapter 5ConclusionsI have demonstrated a time-harmonic iterative wave equation solver that offloads its compu-tational kernel (the Kaczmarz row projection algorithm) onto an FPGA-based accelerator. Inthe course of the work, challenges such as reading data from slow memory (including readingthe memory backward), orchestrating the flow of intermediate results to maximize data re-use,and seeking out parallelism to fill the computational pipeline of the accelerator were encoun-tered and overcome. The changes to the kernel algorithm dictated by the characteristics of thespecific hardware used were found to be benign or have positive side-effects. The end-to-endspeed-up of the current accelerated version of the wave equation solver is more than a factor oftwo compared to one Intel processor core. A planned future version that implements all of thelinear solver on the accelerator is estimated to give a combined performance improvement ofmore than an order of magnitude. The speed-up of the wave equation solver achieved via theaccelerator translates directly into being able to run more full-waveform inversion iterations ina given time, which leads to more meaningful FWI results.The accelerated implementation as presented here is specific to matrices having the bandedstructure arising from the 3× 3× 3 cube finite difference stencil. 
However now that the designmethodology has been explored, generalizing to FD matrices with different stencils will be easier.Indeed, provided that computational grid locality still translates into approximate memorylocality, and that some structure is present in the matrix, it should be possible to extend thepresent approach to more general matrices as well. Due to the availability of a time-steppinglibrary for the accelerator, comparisons with time domain methods are a further avenue ofresearch.Exciting opportunities lie in better tailoring the algorithm used in the iterative frequencydomain solver to the specific hardware. For example, since each pass over the data linearlyincreases the execution time of the algorithm, it behoves the author to search for an effectivealgorithm that only requires one pass per solver iteration. Other work involves determiningwhat, if any, advantages may be gained by implementing an asynchronous Kaczmarz algorithm,as suggested by Liu et al. [34]. Such parallel approaches will become increasingly important inthe future as dataset sizes increase, since Dongarra et al. [10] make the point that “exascale38computing restricts algorithmic choices, eliminating the use of important techniques such aslexicographic Gauss-Seidel smoothing [for being] too sequential.”Work on this project is ongoing, and while further development is needed in order to betterrealize the potential for acceleration inherent in the platform, the results presented herein givereason to be optimistic.39Bibliography[1] F. Aminzadeh, B. Jean, and T. Kunz. 3-D Salt and Overthrust Models. Society ofExploration Geophysicists, 1997. → pages 28[2] A. Andersen and A. Kak. Simultaneous algebraic reconstruction technique (SART): Asuperior implementation of the ART algorithm. Ultrasonic Imaging, 6(1):81–94, 1984.ISSN 0161-7346. doi: 10.1016/0161-7346(84)90008-7. URLhttp://www.sciencedirect.com/science/article/pii/0161734684900087. → pages 26[3] A. Y. Aravkin, M. P. Friedlander, F. J. Herrmann, and T. van Leeuwen. Robustinversion, dimensionality reduction, and randomized sampling. MathematicalProgramming, 134(1):101–125, 08 2012. → pages 2[4] Å. Björck and T. Elfving. Accelerated projection methods for computing pseudoinversesolutions of systems of linear equations. BIT Numerical Mathematics, 19(2):145–163,1979. ISSN 0006-3835. doi: 10.1007/BF01930845. URLhttp://dx.doi.org/10.1007/BF01930845. → pages 5, 9, 10, 11, 48[5] R. Brossier, S. Operto, and J. Virieux. Seismic imaging of complex onshore structures by2d elastic frequency-domain full-waveform inversion. Geophysics, 74(6):WCC105–WCC118, 2009. doi: 10.1190/1.3215771. URLhttp://library.seg.org/doi/abs/10.1190/1.3215771. → pages 2[6] D. Castaño-Díez, H. Mueller, and A. S. Frangakis. Implementation and performanceevaluation of reconstruction algorithms on graphics processors. Journal of StructuralBiology, 157(1):288–295, 2007. ISSN 1047-8477. doi: 10.1016/j.jsb.2006.08.010. URLhttp://www.sciencedirect.com/science/article/pii/S1047847706002760. → pages 26[7] A. Chronopoulos and C. Gear. s-step iterative methods for symmetric linear systems.Journal of Computational and Applied Mathematics, 25(2):153 – 168, 1989. ISSN0377-0427. doi: http://dx.doi.org/10.1016/0377-0427(89)90045-9. URLhttp://www.sciencedirect.com/science/article/pii/0377042789900459. → pages 47, 48[8] B. Dally. Efficiency and parallelism: The challenges of future computing, March 2014.URL http://rice2014oghpc.blogs.rice.edu/files/2014/03/Dally.pdf. Keynote lecture at Oil& Gas HPC Workshop. 
→ pages 3[9] R. Diamond. Maxeler developer exchange: Memory architecture, Sept 2011. URLhttps://groups.google.com/a/maxeler.com/d/msg/mdx/AY-p73PyGXI/0WcXDIE0U2oJ. →pages 13, 2140[10] J. Dongarra, J. Hittinger, J. Bell, L. Chacon, R. Falgout, M. Heroux, P. D. Hovland,E. Ng, C. Webster, and S. M. Wild. Applied mathematics research for exascalecomputing, 2014. URLhttp://www.netlib.org/utk/people/JackDongarra/PAPERS/doe-exascale-math-report.pdf.LLNL-TR-651000. → pages 38[11] U. Drepper. What every programmer should know about memory. Red Hat, Inc, 2007.→ pages 16, 25[12] J. M. Elble, N. V. Sahinidis, and P. Vouzis. GPU computing with Kaczmarz’s and otheriterative algorithms for linear systems. Parallel Computing, 36(5-6):215–231, June 2010.ISSN 0167-8191. doi: 10.1016/j.parco.2009.12.003. URLhttp://dx.doi.org/10.1016/j.parco.2009.12.003. → pages 14, 26[13] O. Ernst and M. Gander. Why it is difficult to solve Helmholtz problems with classicaliterative methods. In I. G. Graham, T. Y. Hou, O. Lakkis, and R. Scheichl, editors,Numerical Analysis of Multiscale Problems, volume 83 of Lecture Notes in ComputationalScience and Engineering, pages 325–363. Springer Berlin Heidelberg, 2012. ISBN978-3-642-22060-9. doi: 10.1007/978-3-642-22061-6_10. URLhttp://dx.doi.org/10.1007/978-3-642-22061-6_10. → pages 8[14] H. Fu, W. Osborne, R. G. Clapp, and O. Pell. Accelerating seismic computations onFPGAs–from the perspective of number representations. In 70th EAGE Conference &Exhibition, 2008. → pages 14[15] M. S. Gockenbach, D. R. Reynolds, P. Shen, and W. W. Symes. Efficient and automaticimplementation of the adjoint state method. ACM Trans. Math. Softw., 28(1):22–44,Mar. 2002. ISSN 0098-3500. doi: 10.1145/513001.513003. URLhttp://doi.acm.org/10.1145/513001.513003. → pages 6[16] D. Gordon and R. Gordon. CGMN revisited: Robust and efficient solution of stiff linearsystems derived from elliptic partial differential equations. ACM Trans. Math. Softw., 35(3):18:1–18:27, Oct 2008. ISSN 0098-3500. doi: 10.1145/1391989.1391991. URLhttp://doi.acm.org/10.1145/1391989.1391991. → pages 8[17] D. Gordon and R. Gordon. CARP-CG: A robust and efficient parallel solver for linearsystems, applied to strongly convection dominated PDEs. Parallel Computing, 36(9):495–515, 2010. ISSN 0167-8191. doi: 10.1016/j.parco.2010.05.004. URLhttp://www.sciencedirect.com/science/article/pii/S0167819110000827. → pages 10, 37[18] D. Gordon and R. Gordon. Robust and highly scalable parallel solution of the Helmholtzequation with large wave numbers. Journal of Computational and Applied Mathematics,237(1):182–196, 2013. ISSN 0377-0427. doi: http://dx.doi.org/10.1016/j.cam.2012.07.024.URL http://www.sciencedirect.com/science/article/pii/S0377042712003147. → pages 10[19] F. Grüll, M. Kunz, M. Hausmann, and U. Kebschull. An implementation of 3D electrontomography on FPGAs. In Reconfigurable Computing and FPGAs (ReConFig), 2012International Conference on, pages 1–5, 2012. doi: 10.1109/ReConFig.2012.6416732. →pages 12, 14, 25, 26, 2741[20] S. Hauck and A. DeHon. Reconfigurable Computing: The Theory and Practice ofFPGA-Based Computation. Systems on Silicon. Elsevier Science, 2010. ISBN9780080556017. → pages 4, 12[21] C. He, M. Lu, and C. Sun. Accelerating seismic migration using FPGA-based coprocessorplatform. In Field-Programmable Custom Computing Machines, 2004. FCCM 2004. 12thAnnual IEEE Symposium on, pages 207–216, April 2004. doi: 10.1109/FCCM.2004.12.→ pages 13, 14[22] G. T. Herman, A. Lent, and S. W. Rowland. 
ART: Mathematics and applications: Areport on the mathematical foundations and on the applicability to real data of thealgebraic reconstruction techniques. Journal of Theoretical Biology, 42(1):1 – 32, 1973.ISSN 0022-5193. doi: http://dx.doi.org/10.1016/0022-5193(73)90145-8. URLhttp://www.sciencedirect.com/science/article/pii/0022519373901458. → pages 25[23] F. J. Herrmann, A. J. Calvert, I. Hanlon, M. Javanmehri, R. Kumar, T. van Leeuwen,X. Li, B. Smithyman, E. T. Takougang, and H. Wason. Frugal full-waveform inversion:from theory to a practical algorithm. The Leading Edge, 32(9):1082–1092, 09 2013. doi:http://dx.doi.org/10.1190/tle32091082.1. URL https://www.slim.eos.ubc.ca/Publications/Public/Journals/The_Leading_Edge/2013/herrmann2013ffwi/herrmann2013ffwi.html. →pages 32[24] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems.Journal of Research of the National Bureau of Standards, 49(6), 1952. → pages 8[25] Intel Corporation. Intel 64 and IA-32 architectures optimization reference manual, 2012.URL http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf. Order Number: 248966-026. → pages 4[26] T. Johnsen and A. Loddoch. Hybrid CPU-GPU finite difference time domain kernels,March 2014. URL http://rice2014oghpc.blogs.rice.edu/files/2014/03/Johnsen.pdf. →pages 3[27] S. Kaczmarz. Angenäherte auflösung von systemen linearer gleichungen. BulletinInternational de l’Academie Polonaise des Sciences et des Lettres, 35:355–357, 1937. →pages 8[28] S. Kaczmarz. Approximate solution of systems of linear equations. International Journalof Control, 57(6):1269–1271, 1993. doi: 10.1080/00207179308934446. → pages 8[29] A. C. Kak and M. Slaney. Principles of Computerized Tomographic Imaging. Society forIndustrial and Applied Mathematics, 2001. → pages 9, 26[30] H. Knibbe, W. Mulder, C. Oosterlee, and C. Vuik. Closing the performance gap betweenan iterative frequency-domain solver and an explicit time-domain scheme for 3Dmigration on parallel architectures. Geophysics, 79(2):S47–S61, 2014. doi:10.1190/geo2013-0214.1. URL http://library.seg.org/doi/abs/10.1190/geo2013-0214.1. →pages 3, 6, 14[31] J. Krueger, D. Donofrio, J. Shalf, M. Mohiyuddin, S. Williams, L. Oliker, and F.-J.Pfreund. Hardware/software co-design for energy-efficient seismic modeling. InProceedings of 2011 International Conference for High Performance Computing,42Networking, Storage and Analysis, SC ’11, pages 73:1–73:12, New York, NY, USA, 2011.ACM. ISBN 978-1-4503-0771-0. doi: 10.1145/2063384.2063482. URLhttp://doi.acm.org/10.1145/2063384.2063482. → pages 14[32] I. Kuon and J. Rose. Measuring the gap between FPGAs and ASICs. In Proceedings ofthe 2006 ACM/SIGDA 14th International Symposium on Field Programmable GateArrays, FPGA ’06, pages 21–30, New York, NY, USA, 2006. ACM. ISBN 1-59593-292-5.doi: 10.1145/1117201.1117205. URL http://doi.acm.org/10.1145/1117201.1117205. →pages 4[33] X. Li, A. Y. Aravkin, T. van Leeuwen, and F. J. Herrmann. Fast randomizedfull-waveform inversion with compressive sensing. Geophysics, 77(3):A13–A17, 05 2012.URL https://www.slim.eos.ubc.ca/Publications/Public/Journals/Geophysics/2012/Li11TRfrfwi/Li11TRfrfwi.pdf. → pages 2[34] J. Liu, S. J. Wright, and S. Sridhar. An asynchronous parallel randomized Kaczmarzalgorithm. arXiv preprint arXiv:1401.4780, 2014. → pages 38[35] A. R. Lopes and G. A. Constantinides. A high throughput FPGA-based floating pointconjugate gradient implementation. In R. Woods, K. Compton, C. Bouganis, andP. 
Diniz, editors, Reconfigurable Computing: Architectures, Tools and Applications,volume 4943 of Lecture Notes in Computer Science, pages 75–86. Springer BerlinHeidelberg, 2008. ISBN 978-3-540-78609-2. doi: 10.1007/978-3-540-78610-8_10. URLhttp://dx.doi.org/10.1007/978-3-540-78610-8_10. → pages 14[36] Maxeler Technologies. Maxcompiler white paper, Feb 2011. URLhttp://www.maxeler.com/media/documents/MaxelerWhitePaperMaxCompiler.pdf. →pages 3, 13, 21[37] Maxeler Technologies. Multiscale Dataflow Programming. Maxeler Technologies, 2013.→ pages 13, 18[38] H. Meuer, E. Strohmaier, J. Dongarra, and H. Simon. Top 500 supercomputer sites,November 2013. URL https://www.top500.org. → pages 3[39] G. Morris and V. Prasanna. An FPGA-based floating-point Jacobi iterative solver. InParallel Architectures, Algorithms and Networks, 2005. ISPAN 2005. Proceedings. 8thInternational Symposium on, pages 8 pp.–, Dec 2005. doi: 10.1109/ISPAN.2005.18. →pages 14[40] G. Morris, V. Prasanna, and R. Anderson. A hybrid approach for mapping conjugategradient onto an FPGA-augmented reconfigurable supercomputer. InField-Programmable Custom Computing Machines, 2006. FCCM ’06. 14th Annual IEEESymposium on, pages 3–12, April 2006. doi: 10.1109/FCCM.2006.8. → pages 14[41] T. Nemeth, J. Stefani, W. Liu, R. Dimond, O. Pell, and R. Ergas. An implementation ofthe acoustic wave equation on FPGAs. In SEG Technical Program Expanded Abstracts2008, pages 2874–2878. Society of Exploration Geophysicists, 2008. doi:10.1190/1.3063943. URL http://library.seg.org/doi/abs/10.1190/1.3063943. → pages 23[42] S. Operto, J. Virieux, P. Amestoy, J.-Y. L’Excellent, L. Giraud, and H. B. H. Ali. 3Dfinite-difference frequency-domain modeling of visco-acoustic wave propagation using a43massively parallel direct solver: A feasibility study. Geophysics, 72(5):SM195–SM211,2007. doi: 10.1190/1.2759835. URLhttp://geophysics.geoscienceworld.org/content/72/5/SM195.abstract. → pages 6, 7, 16, 28[43] O. Pell. Maxeler developer exchange: DRAM and PCIe bandwidth, Oct 2013. URLhttps://groups.google.com/a/maxeler.com/d/msg/mdx/UWdIf82dvt4/dD_KlTGqWHEJ.→ pages 25[44] O. Pell, J. Bower, R. Dimond, O. Mencer, and M. J. Flynn. Finite-difference wavepropagation modeling on special-purpose dataflow machines. Parallel and DistributedSystems, IEEE Transactions on, 24(5):906–915, 2013. ISSN 1045-9219. doi:10.1109/TPDS.2012.198. → pages 13, 14, 23[45] A. Petrenko, D. Oriato, S. Tilbury, T. van Leeuwen, and F. J. Herrmann. Acceleratingan iterative Helmholtz solver with FPGAs. In EAGE, 06 2014. → pages[46] A. Petrenko, D. Oriato, S. Tilbury, T. van Leeuwen, and F. J. Herrmann. Acceleratingan iterative Helmholtz solver with FPGAs. In OGHPC, 03 2014. URLhttp://rice2014oghpc.blogs.rice.edu/files/2014/03/petrenkoOGHPC2014aih_poster.pdf. →pages[47] R. Plessix. Three-dimensional frequency-domain full-waveform inversion with an iterativesolver. Geophysics, 74(6):WCC149–WCC157, 2009. doi: 10.1190/1.3211198. URLhttp://library.seg.org/doi/abs/10.1190/1.3211198. → pages 6[48] R.-E. Plessix. A Helmholtz iterative solver for 3D seismic-imaging problems. Geophysics,72(5):SM185–SM194, 2007. → pages 6[49] G. Pratt, C. Shin, and Hicks. GaussâĂŞNewton and full Newton methods infrequencyâĂŞspace seismic waveform inversion. Geophysical Journal International, 133(2):341–362, 1998. ISSN 1365-246X. doi: 10.1046/j.1365-246X.1998.00498.x. URLhttp://dx.doi.org/10.1046/j.1365-246X.1998.00498.x. → pages 2[50] Y. Saad. 
Practical use of polynomial preconditionings for the conjugate gradient method. SIAM Journal on Scientific and Statistical Computing, 6(4):865–881, 1985. doi: 10.1137/0906059. URL http://epubs.siam.org/doi/abs/10.1137/0906059. → pages 47
[51] Y. Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, 2003. → pages 8, 9, 14
[52] D. Singh. Implementing FPGA design with the OpenCL standard. Altera whitepaper, 2011. URL http://www.altera.com/literature/wp/wp-01173-opencl.pdf. → pages 4
[53] L. Sirgue, J. Etgen, and U. Albertin. 3D frequency domain waveform inversion using time domain finite difference methods. In 70th EAGE Conference & Exhibition, 2008. → pages 6
[54] R. Strzodka and D. Göddeke. Pipelined mixed precision algorithms on FPGAs for fast and accurate PDE solvers from low precision components. In Field-Programmable Custom Computing Machines, 2006. FCCM ’06. 14th Annual IEEE Symposium on, pages 259–270, April 2006. doi: 10.1109/FCCM.2006.57. → pages 14, 22, 23
[55] J. Sun, G. Peterson, and O. Storaasli. High-performance mixed-precision linear solver for FPGAs. Computers, IEEE Transactions on, 57(12):1614–1623, Dec 2008. ISSN 0018-9340. doi: 10.1109/TC.2008.89. → pages 14, 22
[56] W. Symes. Reverse time migration with optimal checkpointing. Geophysics, 72(5):SM213–SM221, 2007. doi: 10.1190/1.2742686. URL http://library.seg.org/doi/abs/10.1190/1.2742686. → pages 5
[57] K. Tanabe. Projection method for solving a singular system of linear equations and its applications. Numerische Mathematik, 17(3):203–214, 1971. ISSN 0029-599X. doi: 10.1007/BF01436376. URL http://dx.doi.org/10.1007/BF01436376. → pages 10
[58] A. Tarantola. Inversion of seismic reflection data in the acoustic approximation. Geophysics, 49(8):1259–1266, 1984. doi: 10.1190/1.1441754. URL http://library.seg.org/doi/abs/10.1190/1.1441754. → pages 1
[59] T. van Leeuwen. Fourier analysis of the CGMN method for solving the Helmholtz equation. Technical report, Department of Earth, Ocean and Atmospheric Sciences, The University of British Columbia, Vancouver, 2012. URL http://arxiv.org/abs/1210.2644. → pages 9, 10
[60] T. van Leeuwen and F. J. Herrmann. Mitigating local minima in full-waveform inversion by expanding the search space. Geophysical Journal International, 195:661–667, 10 2013. doi: 10.1093/gji/ggt258. → pages 5
[61] T. van Leeuwen and F. J. Herrmann. 3D frequency-domain seismic inversion with controlled sloppiness. To appear in the SIAM Journal on Scientific Computing (SISC), 03 2014. URL https://www.slim.eos.ubc.ca/Publications/Public/Journals/SIAM_Journal_on_Scientific_Computing/2014/vanLeeuwen20143Dfds/vanLeeuwen20143Dfds.pdf. → pages 32
[62] T. van Leeuwen, D. Gordon, R. Gordon, and F. J. Herrmann. Preconditioning the Helmholtz equation via row-projections. In EAGE technical program. EAGE, 01 2012. URL https://www.slim.eos.ubc.ca/Publications/Public/Conferences/EAGE/2012/vanleeuwen2012EAGEcarpcg/vanleeuwen2012EAGEcarpcg.pdf. → pages 8, 21
[63] J. Virieux and S. Operto. An overview of full-waveform inversion in exploration geophysics. Geophysics, 74(6):WCC1–WCC26, 2009. doi: 10.1190/1.3238367. URL http://library.seg.org/doi/abs/10.1190/1.3238367. → pages 2, 5, 6
[64] Xilinx. Virtex-6 family overview, 2012. URL http://www.xilinx.com/support/documentation/data_sheets/ds150.pdf. DS150 (v2.4). → pages 13, 18
[65] W. Xu, F. Xu, M. Jones, B. Keszthelyi, J. Sedat, D. Agard, and K. Mueller. High-performance iterative electron tomography reconstruction with long-object compensation using graphics processing units (GPUs).
Journal of Structural Biology, 171(2):142–153, 2010. ISSN 1047-8477. doi: 10.1016/j.jsb.2010.03.018. URL http://www.sciencedirect.com/science/article/pii/S104784771000095X. → pages 26
[66] M. Yan. Convergence analysis of SART: Optimization and statistics. International Journal of Computer Mathematics, 90(1):30–47, 2013. doi: 10.1080/00207160.2012.709933. URL http://www.tandfonline.com/doi/abs/10.1080/00207160.2012.709933. → pages 26
[67] D. Yang, G. Peterson, H. Li, and J. Sun. An FPGA implementation for solving least square problem. In Field Programmable Custom Computing Machines, 2009. FCCM ’09. 17th IEEE Symposium on, pages 303–306, April 2009. doi: 10.1109/FCCM.2009.47. → pages 14

Appendix A
An alternative conjugate gradient formulation

An implementation of CGMN as shown in Algorithm 1, using the streaming dataflow paradigm natural to working with an FPGA accelerator, would require three passes over the data for each iteration. The two inner products 〈p, s〉 and ‖r‖² are reduction operations that must be completed in their entirety (over the entire stream) before the rest of the iteration can continue. These synchronization points were also an inconvenience for conjugate gradient implementations on vector and distributed computers during the 1980s. A solution, found by Saad [50] among others, was to reformulate the calculation of CG’s β. This reformulation, applied to CGMN, is shown in Algorithm 2 and allows the algorithm to be fit into two passes over the data, provided the matrix is banded (a MATLAB sketch of the recurrence is given below). It was suggested to me by Diego Oriato of Maxeler Technologies and can be found in the work of Chronopoulos and Gear [7], who apply it to CG. Chronopoulos and Gear also note that recalculating ‖r‖² every iteration directly from r ensures the numerical stability of CG.

Algorithm 2 The CGMN algorithm (Björck and Elfving [4]) split into two passes over the data, following Chronopoulos and Gear [7].
Input: A, u, q, λ
 1: Rq ← DKSWP(A, 0, q, λ)
 2: r ← Rq − u + DKSWP(A, u, 0, λ)
 3: p ← r
 4: s ← p − DKSWP(A, p, 0, λ)
 5: α ← ‖r‖² / 〈p, s〉
 6: β ← 0
 7: while ‖r‖² > tol do
 8:   Data pass 1
 9:   u ← u + αp
10:   r ← r − αs
11:   p ← r + βp
12:   tmp ← FKSWP(A, p, 0, λ)
13:   Data pass 2
14:   s ← p − BKSWP(A, tmp, 0, λ)
15:   α ← ‖r‖² / 〈p, s〉
16:   β ← (α² ‖s‖² − ‖r‖²) / ‖r‖²
17: end while
Output: u
"""@en ; edm:hasType "Thesis/Dissertation"@en ; vivo:dateIssued "2014-05"@en ; edm:isShownAt "10.14288/1.0167409"@en ; dcterms:language "eng"@en ; ns0:degreeDiscipline "Geophysics"@en ; edm:provider "Vancouver : University of British Columbia Library"@en ; dcterms:publisher "University of British Columbia"@en ; dcterms:rights "Attribution-NonCommercial 2.5 Canada"@en ; ns0:rightsURI "http://creativecommons.org/licenses/by-nc/2.5/ca/"@en ; ns0:scholarLevel "Graduate"@en ; dcterms:title "Accelerating an iterative Helmholtz solver using reconfigurable hardware"@en ; dcterms:type "Text"@en ; ns0:identifierURI "http://hdl.handle.net/2429/46512"@en .
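To make the two data passes of Algorithm 2 concrete, here is a minimal MATLAB sketch of the recurrence, intended only to illustrate how the Chronopoulos and Gear scalar updates fold into the second pass. The names are assumptions: cgmn_two_pass is hypothetical, dkswp, fkswp and bkswp stand in for the double, forward and backward Kaczmarz sweeps with an assumed signature y = kswp(A, x, b, lambda), and maxit is an added iteration cap not present in Algorithm 2. This is not the CARPCG.m routine or the accelerated kernel described in the thesis.

function u = cgmn_two_pass(A, u, q, lambda, tol, maxit)
  % Two-pass CGMN following Algorithm 2. Scalars are kept real because the
  % double-sweep operator I - Q is Hermitian positive semi-definite, so the
  % inner products below carry only round-off imaginary parts.
  z  = zeros(size(q));                     % reusable zero right-hand side
  Rq = dkswp(A, z, q, lambda);             % line 1: sweep applied to the source
  r  = Rq - u + dkswp(A, u, z, lambda);    % line 2: initial residual
  p  = r;                                  % line 3
  s  = p - dkswp(A, p, z, lambda);         % line 4: s = (I - Q)p
  nr2   = real(r' * r);                    % ||r||^2
  alpha = nr2 / real(p' * s);              % line 5
  beta  = 0;                               % line 6
  for k = 1:maxit                          % line 7: while ||r||^2 > tol
    if nr2 <= tol, break; end
    % Data pass 1: vector updates and the forward Kaczmarz sweep
    u   = u + alpha * p;                   % line 9
    r   = r - alpha * s;                   % line 10
    p   = r + beta * p;                    % line 11
    tmp = fkswp(A, p, z, lambda);          % line 12
    % Data pass 2: backward sweep and the scalar recurrences
    s     = p - bkswp(A, tmp, z, lambda);  % line 14
    nr2   = real(r' * r);                  % recompute ||r||^2 directly from r
    alpha = nr2 / real(p' * s);            % line 15
    beta  = (alpha^2 * real(s' * s) - nr2) / nr2;  % line 16
  end
end

Written this way, both inner products needed for the next iteration's α and β are available by the end of the second pass, so each iteration streams the banded matrix only twice rather than three times.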