Short-Read DNA Sequence Alignment with Custom Designed FPGA-based Hardware

by

Adam Hall

B.A., The University of Cambridge, 2007

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES (Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

November 2010

© Adam Hall, 2010

Abstract

The alignment of short-read DNA sequencing data to a human reference genome sequence has become a standard step in the analysis pipeline for this kind of data. As the rate at which short read DNA sequence data is being produced doubles every 5 months, analysing this data in a computationally efficient way is becoming increasingly important. We demonstrate how the "embarrassingly parallel" property of short read sequence alignment can be exploited in custom-designed hardware in FPGAs. Hardware is chosen, a system is designed, and this system is implemented. My FPGA-based hit finder was demonstrated to produce correct hit results, identical to those produced by the hit finder stage of the MAQ aligner. The performance of this single-FPGA implementation was measured at 71,000 seed hits found per hour on a human genome sized reference sequence. We demonstrate that the price/performance of this sliding-window FPGA aligner (∼355 seeds/hr/$) compares favorably to that of sliding-window software aligners (∼67.5 seeds/hr/$ for MAQ). However, software aligners based on the superior Burrows-Wheeler alignment algorithm still have a significant price/performance advantage over the FPGA-based approach (∼7,200 seeds/hr/$). We predict that, as chips continue to increase in size due to Moore's Law and computation moves into high-density cloud-computing datacenters, the FPGA-based approach will become preferable to current software aligners.
Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
  1.1 Aim
  1.2 Technical Background
    1.2.1 Illumina Short-Read DNA Sequencing
    1.2.2 The Short-Read Alignment Problem
    1.2.3 Field Programmable Gate Arrays (FPGA's)
    1.2.4 Programming FPGA's
    1.2.5 Instantiating a Soft-Core Processor in an FPGA
    1.2.6 Adoption of the "Cloud Computing" Model in Bioinformatics (Stein)(Baker)
    1.2.7 How BLAST (Basic Local Alignment Search Tool) and Other Related Algorithms Work
  1.3 Software Aligners
    1.3.1 The Indexing/Hit Finding/Hit Extension Paradigm
    1.3.2 Error Models
    1.3.3 ELAND
    1.3.4 MAQ (Mapping and Assembly with Qualities) (Li, Ruan, and Durbin)
    1.3.5 SOAP (Li)
    1.3.6 PASS (Campagna, Albiero, Bilardi, Caniato, Forcato, Manavski, Vitulo, and Valle)
    1.3.7 SeqMap (Jiang and Wong)
    1.3.8 Slider (Malhis, Butterfield, Ester, and Jones)
    1.3.9 Bowtie (Li and Durbin)
  1.4 Other Related Work
    1.4.1 Dynamic Programming in FPGA's
    1.4.2 Other Previous Uses of FPGA's in Bioinformatics
    1.4.3 A Previous Implementation of a Short Read Aligner in FPGA Hardware (McMahon)
2 Overall System Architecture
  2.1 Basic Idea
  2.2 First Implementation Attempt: Using the Cray XD1 FPGA-Accelerated Computer
  2.3 Second Implementation Attempt: Development of a PCI-Express based Accelerator Card for the Host Workstation
  2.4 Third, Final Implementation Attempt: Development of an Ethernet-based Appliance
  2.5 Choice of Development Tools
  2.6 Development Hardware Setup
  2.7 Design Decision: Where to Store the Reference Sequence
  2.8 Design Decision: Which Devices are Used on the DE2-70 Board and What Happens to the Rest
  2.9 Adapting the Basic Idea to Short Read Alignment with the DE2-70 Board
  2.10 Design Decision: Method of Getting Reference Sequence Data into the Query Generator Sliding Window
    2.10.1 Method 1
    2.10.2 Method 2
    2.10.3 Method 3
  2.11 System Components
3 SOPC Controller Development
  3.1 What the SOPC Controller Does
  3.2 Implementation Method
  3.3 The SOPC Builder
  3.4 Design Version 1
  3.5 Design Version 2
  3.6 Combining the two 32MB SDRAM Chips into a Single 64MB Memory Bank
  3.7 Clock Generation
    3.7.1 Method Used to Generate the Clocks
  3.8 Text FIFO Writer Custom SOPC Component
  3.9 The Seed Register Writer Custom SOPC Component
  3.10 The Seed Register Enable Bit Writer Custom SOPC Component
  3.11 The Control Flag Generator Module
  3.12 The Parameter Readout Module
  3.13 The Multi-Match Hit Vector Readout Module
  3.14 Getting Data From the Results FIFO into the SOPC
4 Alignment Pipeline Development
  4.1 Overall Pipeline Design
  4.2 Design Decision: Allowing Multiple Identical Seeds in a Batch
  4.3 Detailed Pipeline Design
  4.4 The Text FIFO
  4.5 The Query Generator
  4.6 The Seed Comparison Module
  4.7 The Priority Encoder
  4.8 The Results FIFO
  4.9 Computing the Stall Signal
  4.10 Alignment Pipeline Global Reset
  4.11 Increasing Clock Frequency with Pipelining
5 Embedded Software Development
  5.1 What the Embedded Software Does
  5.2 Overall Design
  5.3 States and State Transitions
  5.4 The Main Loop
    5.4.1 Initialization
    5.4.2 Loop Contents
    5.4.3 Shut Down
  5.5 The Bit Manipulation Functions
    5.5.1 Other Support Functions Written
  5.6 Retrieving Configuration Parameters from the Alignment Appliance Hardware
  5.7 Initialization of Seed Registers and Comparison Module Enable Registers
  5.8 Design Decision: How the Reference Genome is Contained in the SDRAM Text Buffer
  5.9 Other Functions Performed by the Embedded Software
  5.10 Development of Driver Software for the DM9000A Ethernet Interface
    5.10.1 Ethernet Interface Configuration
    5.10.2 Method to Read from the DM9000A Ethernet Frame Receive Buffer
    5.10.3 Method to Send an Ethernet Frame with the DM9000A Ethernet Interface Controller
6 Development of Appliance Control Application for the Workstation
  6.1 What the Workstation Software Does
  6.2 Choice of Library for Sending and Receiving Ethernet Frames
  6.3 Threads
  6.4 Type of Packets Used
  6.5 Flow Control
  6.6 Receiving Ethernet Frames
  6.7 Parsing the Command Line Arguments
  6.8 Writing Unsigned Integer Classes for Java
  6.9 Retrieving Configuration Parameters from the Alignment Appliance
  6.10 How a Reference Sequence is Loaded into the Alignment Appliance
    6.10.1 Loading the Reference Genome from Disk to Memory
    6.10.2 Transmitting the Reference Genome from Memory to the SDRAM on the DE2-70 Board
  6.11 How Reads are Uploaded into the Alignment Appliance
    6.11.1 Loading the Reads from Disk to Memory
    6.11.2 Packaging Seeds into Packets
  6.12 Packet Transmission and Reception While in Operation
7 Development of an Application to Resolve Ambiguities in Reference Genome Files
  7.1 Introduction to Ambiguity Resolution in Reference Genomes
  7.2 Design and Implementation of an Ambiguity Resolution Application
  7.3 Software Engineering Issues
  7.4 Testing the Ambiguity Resolution Application
8 Correctness Testing
  8.1 Choice of Test Reference Sequence
  8.2 Choice of Test Seed Dataset
  8.3 Generation of Correct Hit Results
  8.4 How the Two Sets of Hit Results are Demonstrated to be Identical
9 Performance Measurement
  9.1 Configuration of FPGA Binary Used for Performance Measurement
  9.2 Choice of Reference Sequence
  9.3 Choice of Read Dataset
  9.4 Method of Measuring Performance
  9.5 Performance Measurement Values
  9.6 Extrapolation of Performance Values to a Human-Size Reference Sequence
10 Conclusion and Future Work
  10.1 Comparison of FPGA-Based Hit Finder and Microprocessor-Based Hit Finders
    10.1.1 Hardware Cost
    10.1.2 Inexact Matching
  10.2 Demonstration that the FPGA Hit Finder Produces Identical Results to MAQ
  10.3 Ideas for Future Work
  10.4 Future Work: Handling Ambiguous Characters in the Reference Genome in Hardware
  10.5 Improvements that Could be Made to the Implementation
  10.6 Performance Scaling with Larger Chips
Bibliography

List of Tables

1.1 The sliding window algorithm.
1.2 Some values of C(2n, n), the number of hash tables needed for an n-mismatch alignment.
3.1 The SOPC controller components not shown on the diagram.
3.2 The flags which the control flag generator module can generate.
4.1 The reasons that the alignment pipeline can stall.
4.2 The effects of a stall event in the alignment pipeline.
5.1 The states the alignment appliance embedded software can be in.
5.2 The nine frame types that can be sent from the host workstation to the alignment appliance.

List of Figures

1.1 Analysis of data from a short read sequencing instrument.
1.2 The Verilog code from listing 1.2.4 shown in the equivalent circuit diagram form.
1.3 The three phases of short read alignment.
1.4 An example of 2b/nt encoding.
1.5 Generating query subsequences with a sliding window.
1.6 The table used for the direct address method.
1.7 The exact matching hash table method.
1.8 The one-mismatch hash table method.
1.9 The two-mismatch hash table method.
2.1 Method 1 for implementing custom hardware in an FPGA for embarrassingly parallel problems.
2.2 Method 2 for implementing custom hardware in an FPGA for embarrassingly parallel problems.
2.3 Properties of the subproblems of alignment.
2.4 Image of the Arria GX FPGA development board.
2.5 How the Ethernet-based appliance connects to the host workstation.
2.6 Image of the Altera DE2-70 board.
2.7 The hardware setup used during development and testing.
2.8 Adapting the basic multi-parallel-module to a hit finder on the DE2-70 board.
2.9 The design of a comparison module.
2.10 The hardware of a seed register.
3.1 How the SOPC Builder is used.
3.2 Version one of the controller SOPC.
3.3 The method used to generate flags in version one of the controller.
3.4 Version two of the controller SOPC.
3.5 How a single SDRAM controller drives two SDRAM chips.
3.6 The text FIFO writer custom SOPC component.
3.7 The seed register writer custom SOPC component.
3.8 The seed register enable bits writer custom SOPC component.
3.9 The aligner control flags module.
3.10 The configuration parameter readout module.
3.11 The multi-match hit vector readout module.
4.1 High level design diagram of the alignment pipeline.
4.2 Detailed design diagram of the alignment pipeline.
4.3 The internals of the seed comparison module.
4.4 A priority encoder constructed recursively.
4.5 My modified priority encoder constructed recursively with the extra multi_match signal. Note that "multi" is used as short for "multi_match" to save space on the diagram.
4.6 The pipelined version of the alignment pipeline detailed design.
5.1 The allowed state transitions of the alignment appliance embedded software controller.
5.2 The layout of the reference genome in the SDRAM of the appliance.
6.1 The transmission and reception of frames between the alignment appliance and host workstation.
7.1 Correctness testing the wrong way. Notice that the two aligners are aligning to different reference sequences because their ambiguity resolvers work differently.
7.2 Correctness testing the right way. Notice that the two aligners are aligning to identical reference sequences.
7.3 Internal structure of the ambiguity resolution Java application.
8.1 How the Test Datasets were Generated.
8.2 Producing hit finding results for correctness testing.
10.1 Hardware for Comparing a 4-bit Encoded Nucleotide with a 2-bit Encoded Nucleotide.

Acknowledgements

First and foremost I would like to thank Dr. Steven Jones for supervising me as a graduate student. Without the enormous amount of professional and personal support I received from him this thesis would not have been possible. Useful ideas were contributed by Dr. Steven Jones, Dr. Nawar Malhis, Nina Thiessen, Erin Pleasance and Rodrigo Goya. Thanks to Dr. Arvind Gupta, Dr. Steven Jones and Dr. Paul Pavlidis for supervising my three rotations. Thanks to Anthony Fejes and Kelsey Hamer for providing much patient help during those rotations. Thank you to my committee members Dr. Steven Jones, Dr. Artem Cherkasov and Dr. Stephan Flibotte for providing constructive feedback in my committee meetings. Thank you to Sharon Ruschkowski, Marilyn Gillespie and Becky Gillespie for providing help with various administrative matters. Thanks to the CIHR/MSFHR Bioinformatics Training Program for providing financial assistance and administrative support. Thanks also to anyone else who has helped with my graduate studies but who I haven't mentioned here.

Chapter 1

Introduction

1.1 Aim

The aim of this project is to investigate how FPGA's can be used to implement Illumina short-read DNA sequence alignment algorithms. In particular we look at the alignment of short read sequences to mammalian-size reference genome sequences. FPGA's are programmable logic chips which can be programmed to function as a user-designed custom chip. Short-read DNA sequence alignment is the problem of finding every occurrence in a reference genome of each short read produced in a run of an Illumina machine. Software-based approaches for short read sequence alignment already exist. Potential advantages of FPGA-based approaches over microprocessor-based approaches include

• better price/performance,
• better energy consumption/performance,
• better scalability to future large chips, and
• better reliability for large multi-chip systems.

1.2 Technical Background

1.2.1 Illumina Short-Read DNA Sequencing

An Illumina (previously Solexa) Genome Analyzer sequencing instrument can sequence one 8-lane flowcell of shotgun DNA reads (∼100Gbp) in one run taking approximately a week. Read lengths of ∼30bp were used initially. As time goes on, read lengths are increasing as better lab methods are developed and the hardware of the short read sequencing instruments is improved. At the time of writing (late 2010) read lengths of 150bp are possible with Illumina short read sequencing instruments. The reads are called short reads because they are shorter than reads produced by earlier Sanger-sequencing based DNA sequencing machines. Short read sequencing machines are also produced by other companies such as ABI and Helicos.
The DNA sequencing instruments produced by Illumina and Helicos both use slightly different versions of the "sequencing by synthesis" DNA sequencing method. The ABI instrument uses the fundamentally different "capillary electrophoresis" method of DNA sequencing. Each base call has a probability of error associated with it (i.e. the probability that the base call is incorrect and should be some other base, often denoted pe). Along the length of the read, pe increases gradually at first and then more rapidly after the initial "high quality" part of the read. Early on in the development of short read sequencing machines, the high quality part of the read was typically the first 28bp. The high quality part of the read is now substantially longer than the first 28bp and continues to increase in length as read lengths increase.

Figure 1.1 on page 4 shows a typical analysis pipeline for short read sequencing machine data.

Figure 1.1: Analysis of data from a short read sequencing instrument.

Short read analysis methods are often used to find differences between a standard reference genome and the genome of a particular individual. These kinds of approaches are generally known as resequencing. An example of this kind of analysis is SNP (Single Nucleotide Polymorphism) finding. A SNP is a single base pair difference between two genomes. It can be either an insertion of a single base pair, a deletion of a single base pair or a substitution of one base pair for another. SNPs have turned out to be a natural way of characterizing the differences between two genomes in a way that can be used for further analysis.

Personalized medicine (Jones, Laskin, Li, Griffith, An, Bilenky, Butterfield, Cezard, Chuah, Corbett et al.) is the tailoring of medical treatments to individual patients and individual instances of a disease. Short read sequencing machines make it possible to perform whole-genome analyses for humans. It's likely that future versions of short read sequencing machines and the analysis methods that are currently being developed alongside them will bring about personalized medical treatments. Examples include treatments tailored to specific cases of cancer, and predicting the probability a patient will experience desirable effects and/or side effects of a particular drug before it is administered.

1.2.2 The Short-Read Alignment Problem

The short read alignment problem (also known as genome mapping and read mapping) is the problem of finding every position of a reference sequence at which a DNA read motif appears as a locally-aligned subsequence (Trapnell and
Once we’ve aligned the high quality part of the read we can extend it to check whether the alignment is a real alignment or just a false positive. We sometimes call this high quality part of the read the seed, a term which came from the BLAST algorithm literature(Altschul, Gish, Miller, Myers, and Lipman). Places where the seed aligns to the reference are called hits. A seed length of 28bp is often used by default because the per-base probability of error of the position of the read increases rapidly after 28bp. Longer seeds are being used for longer reads which have a longer high quality segment. If a read maps to multiple places in a genome or zero places in a genome, we usually just have to discard it and it does not provide us with useful information. Only reads that map to exactly one place in a reference sequence provide useful information. We say that these reads are uniquely mapped. Developing algorithms which solve the short read alignment problem can make use of the existing string matching algorithm literature. String matching algorithms are an old and well established field of study. There is a very large 5  literature containing a wide variety of algorithms with useful properties. Many of the important publications in the string matching literature are from the 1970’s and 1980’s(Boyer)(Knuth)(Aho)(Karp). There are many good textbooks in the area(Crochemore, Hancart, and Lecroq)(Gusfield). These algorithms were typically used in compilers and word processors. In recent years there has been renewed interest in string matching algorithms due to the vast datasets used in bioinformatics(Li and Durbin).  1.2.3  Field Programmable Gate Arrays (FPGA’s)  FPGA’s were invented and patented by Ross Freeman in 1984(Ros). They are general purpose user-programmable logic chips. They are a more modern and sophisticated version of the PLD (Programmable Logic Device) chip. Internally an FPGA consists of a 2-dimensional array of programmable logic elements (LE’s) and a large programmable interconnect connecting together those logic elements. A typical logic element contains a user programmable 8-input look up table and one or two 1-bit registers. The logic elements and the interconnect can be programmed arbitrarily to produce any logic circuit the user requires, provided it fits into the available resources of the chip. Once programmed with a user’s design, an FPGA can be thought of as a custom logic chip, or ASIC (Application Specific Integrated Circuit). Very approximately the cost of an FPGA is 10 times higher than an ASIC(Kuon and Rose), and it runs at approximately one tenth of the speed. However, ASICs can only be produced economically in huge quantities. The ability to implement a hardware design in the soft hardware of an FPGA effectively makes it possible to produce custom logic chips in small quantity production runs. FPGA’s usually lose their configuration state when they are powered 6  down. Older FPGA’s held their configuration state in nonvolatile memory based on fuses or flash memory. Now virtually all FPGA’s hold their configuration state in volatile SRAM. FPGA’s are reprogrammed by either auto-uploading a binary from off-chip non-volatile memory on power up (in production systems) or from a workstation connected via an interface such as USB (in development systems). Reprogramming an FPGA’s configuration state typically takes a few seconds. The whole configuration state of the FPGA has to be changed at once: It cannot be changed a small piece at a time. 
Some FPGA's are specially made to have this property of being able to be changed a small piece at a time, which is known as partial reconfiguration. Most FPGA's, including the one I use for development, can't be partially reconfigured. This is because it's harder to design FPGA's which can be partially reconfigured than ones which can't.

FPGA's usually include special purpose blocks of logic, because certain functions can be constructed more efficiently as special-purpose hardware blocks than out of the FPGA's general purpose logic resources. Typical examples are blocks of memory (250 x 4.5KB blocks in the Cyclone II EP2C70) and hardware multipliers (150 18bit x 18bit multipliers in the Cyclone II EP2C70). The Cyclone II EP2C70 also has 4 PLL's (Phase Locked Loops) which can each generate up to 3 separate clocks of user-specified frequency and phase from an external input clock. These are needed because there are situations where multiple clocks are needed in a single virtual hardware design.

1.2.4 Programming FPGA's

FPGA's cannot be programmed with the same languages or approaches as microprocessors, such as C and Java. This is because the hardware is fundamentally different. Microprocessors execute a sequence of instructions to manipulate the contents of memory, whereas FPGA's are an array of programmable logic elements connected into a parallel simulation of a piece of user-defined hardware. An FPGA binary contains the information needed to connect the FPGA's internal resources into the user's virtual hardware design. In principle a user could construct this configuration information manually, as is done for PLD's. This would usually be impractical because the configuration information is too large and sophisticated to be produced by hand for real implementations in real-sized FPGA's. In practice Computer-Aided Design tools are used to produce a configuration binary for the FPGA from a user's virtual hardware design. A user's design consists of logic functions, state holding registers and possibly embedded functional blocks in the FPGA such as multipliers or memory.

Users enter their design into the workstation using CAD tools. Tools exist to allow the user to specify their designs on screen as a schematic circuit diagram. This is rarely used in practice; instead users enter their designs into the CAD software in the form of HDL's (Hardware Description Languages) such as Verilog. HDL's are programming languages designed to represent a logic circuit. The circuit diagram shown in figure 1.2 on page 10 can be represented by the equivalent Verilog code shown in code listing 1.2.4 on page 9. Even if the reader hasn't seen Verilog code before, the direct correspondence between the circuit diagram and the Verilog code description of it should be easy to see.

Lines 0-5 of listing 1.2.4 declare the input and output ports of the module. We have three 1-bit inputs (A, B, Clk) and a 2-bit output (R). These can be seen on the equivalent circuit diagram. Line 6 creates an AND gate with A and B as its two inputs and a wire called T as its output. Line 7 creates a 1-bit register whose output wire is called buf. Lines 9-12 create a 1-bit D-type flip flop register which registers its new data value when the rising edge of the Clk signal arrives. Line 14 creates a 1-bit inverter which takes buf as its input and drives its output onto the wire called buf_inv. Line 16 groups buf and buf_inv into a 2-bit bus called R which is the output of the module.
00  module example_Verilog_module (  01  input  A,  02  input  B,  03  input  Clk,  04  output [1:0] R  05  );  06  wire T = A && B;  07  reg  buf;  08 09  always @ (posedge Clk)  10  begin  11 12  buf <= T; end  13 14  wire buf_inv = !buf;  15 16 17  assign R = {buf_inv, buf}; endmodule The advantages of representing logic circuit diagrams with textual HDL’s  rather than schematic diagrams are the same as the advantages of using high 9  Figure 1.2: The Verilog code from listing 1.2.4 shown in the equivalent circuit diagram form. level languages such as C instead of flowcharts to represent programs for microprocessors. The use of HDL’s makes the activity of digital hardware design more like software developement. We can use text editors, version control systems, makefiles etc.. with an HDL design but not with a schematic design. Verilog is a widely used standard that can be taken as input by FPGA manafacturer’s synthesis tools to produce FPGA configuration binaries in an automated way.  1.2.5  Instantiating a Soft-Core Processor in an FPGA  It’s possible to describe a whole processor core in an HDL and then instantiate it in an FPGA. Reasons someone might want to do this are • as an embedded controller for a larger system, or • to instantiate a processor which has some unusual feature not available in an off-the-shelf processor. 10  In the first example a soft-core processor optimized to have low FPGA resource usage rather than to be fast could be used which uses up just a few percent of the FPGA’s resources. This would leave most of the FPGA resources free to contain the rest of the system. In the second example one might want to include an unusual functional unit in the processor which is not available in off-the-shelf hardware microprocessors, such as a “reverseWord” instruction which reversed the order of bits in a word. This would be useful in applications which made heavy use of it as a performance optimization. The FPGA design for the hardware aligner described later in this document instantiates a SOPC (System on a Programmable Chip) system which contains a soft-core processor.  1.2.6  Adoption of the “Cloud Computing” Model in Bioinformatics(Stein)(Baker)  The Traditional Model of Performing Computation The current way in which people perform computations is to purchase computers, storage hardware and networking hardware, then set them up and run software on them(Stein). This model extends from the beginning of the computer industry (1950’s) until about now (2010 at the time of writing). The Cloud Computing Model of Performing Computation In the cloud computing model, the physical hardware on which computation is performed is located at large Internet-connected datacentres. Users have only a small amount of storage and computation resources locally. They rent computational resources and storage space at the central datacentre on demand.  11  This model of renting computational resources on demand is known as “computation as a service”(Stein). Advantages of the cloud computing model compared to the traditional model: • The work involved in purchasing, setting up, renewing and maintaining the physical hardware is done by the cloud computing service provider rather than the user. This frees the user from the burdens of hardware ownership while still allowing them to benefit from the computational services provided by the hardware. • At peak time users can rent an enormous virtual cluster, at quiet time their resource usage can be zero. 
There is no wastage of computational resources by the user: they only pay for what they actually use. The service provider can multiplex hardware between multiple users, reducing wastage by the service provider as well, because different users' peak resource requirements occur at different times.

• Generally speaking, economies of scale make one large datacentre more cost effective to run than lots of small ones because it allows for bulk purchasing of electricity, hardware, floorspace, support services, etc.

Advantages of the traditional model of computation compared to the cloud computing model:

• A connection to a local datacentre is lower latency, higher bandwidth and cheaper than a connection over the Internet to a large datacentre. This may make some applications impossible to convert from the traditional model to the cloud computing model.

• Provision of the computational services is not reliant on an external service provider.

The Traditional Bioinformatics Ecosystem (Stein): In the traditional bioinformatics ecosystem, sequencing labs submit their data to one or more international archival databases. Six of these are:

• GenBank at NCBI (National Centre for Biotechnology Information) (Benson, Mizrachi, Lipman, Ostell, and Wheeler).
• EMBL at EBI (European Bioinformatics Institute) (Brooksbank, Cameron, and Thornton).
• DDBJ (DNA Database of Japan) (Sugawara, Ogasawara, Okubo, Gojobori, and Tateno).
• SRA (Short Read Archive) (Shumway, Cochrane, and Sugawara).
• GEO (Gene Expression Omnibus) (Barrett, Troup, Wilhite, Ledoux, Rudnev, Evangelista, Kim, Soboleva, Tomashevsky, Marshall et al.).
• ArrayExpress (Kapushesky, Emam, Holloway, Kurnosov, Zorin, Malone, Rustici, Williams, Parkinson, and Brazma).

These six organizations are responsible for maintaining a permanent archive of the world's genomic data and for distributing it (Stein). End users typically download data from these archives over the Internet, perform their computations on them with their local high-performance hardware and then discard the datasets once no longer needed (Stein).

Differences Between the Traditional and Cloud Computing Bioinformatics Ecosystems: The difference between the cloud computing bioinformatics ecosystem and the traditional bioinformatics ecosystem is that computation hardware, storage hardware and the data itself will all be physically located at the cloud datacentre. Users will log into the cloud remotely to perform their computations. Only small amounts of data will be transmitted between the user and datacentre, such as the analysis programs themselves and the results of analyses if they are small. Large datasets, intermediate files and results files will never physically leave the datacentre. This situation is summed up well by a quote in Lincoln Stein's paper (Stein) which I will paraphrase here: "In the traditional bioinformatics ecosystem the data moves to the programs, in the cloud computing bioinformatics ecosystem the programs move to the data."

There are underlying reasons why this transition is happening now rather than earlier in the development of genomics. Trends in computer hardware price/performance:

• Every 18 months the number of transistors that can be fabricated on a chip of fixed size and cost doubles. This is the famous "Moore's Law" (Moore et al.) which has been going on since 1965.
• Every 12 months the capacity of a hard disk of fixed cost doubles. This is known as "Kryder's Law" (Walter).
• Every 9 months the number of bits that can be transmitted over an optical network for a fixed cost doubles. This is known as “Butter’s Law.” Prior to the introduction of next-generation sequencing, the amount of genomic data that could be produced for a fixed cost doubled approximately every 19 months. Computer speed, disk space and Internet data transmission costs all doubled more frequently than this (every 18, 12 and 9 months 14  respectively), so the computational resources needed to keep pace with the amount of available DNA sequence data would always be available by the time they were needed. Since the introduction of next-generation short-read sequencing hardware in 2005, the amount of DNA sequence data available has been doubling every 5 months(Stein). This is substantially faster than the doubling time for all 3 computational resources. The result of this is that the required storage space, data transmission cost and data processing costs associated with a fixed cost of sequence data are increasing with time. Previously, in the pre-next-generation-sequencing era, they were decreasing. The cloud computing model helps with this because: • Each dataset is stored at fewer locations: just 1 copy at 1 datacentre rather than n copies at n different locations where it’s used for analysis. This reduces the total number of copies that need to be stored, hence reducing the total storage capacity requirement associated with the dataset. This is particularly helpful in bioinformatics as there are certain datasets (e.g. genomes, transcriptomes, pathway databases, ontologies, etc...) which are used repeatedly by a very large number of people in a wide variety of different analyses and so are currently stored at a very large number of different locations. • Data is transferred over internal high bandwidth interconnect in a datacentre when going from storage to processors rather than over low bandwidth, expensive, Internet when going from an international database to analysis hardware at an analysis facility. Certain popular datasets may even reside semi-permanently in the memory of analysis hardware at the datacentre rather than in disk storage.  15  How Cloud Computing is Relevant to FPGA-Based Computation The users of cloud computing services will be able to purchase computation services at a high level of abstraction. The correspondence between the computational problem they are trying to solve and the service itself will be very direct. For example, the type of service offered might be “Alignment of 1.0M short reads to the latest release of the human reference genome in 1 minute for $100” or “Perform structural rearrangement detection on a human short read dataset in 1 second for $1.”. The user needn’t think about how the underlying hardware and software implementation works at all and can just purchase and remotely invoke the standardized computation services that suit their specific needs. Multiple different cloud computing service providers will be able to offer the same set of standardized computation and storage services but with different underlying implementations. As better implementations and better hardware become available the datacentre operators can replace the underlying implementation with the new, better one transparently to the user. Competition between different datacentre operators will ensure adoption of better underlying implementations as they become available. 
For a small datacentre it does not make sense to invest in application specific equipment such as a hardware appliance which performs short read sequence alignment, even if it provides better price/performance than software running on generic hardware. This is because its utilization will be too low to make it competitive with software running on generic hardware. However, for a large cloud computing datacentre, purchasing this kind of equipment may be cost effective because of its larger number of users. For a large datacentre it may be worthwhile to use GPU's, FPGA's, custom ASIC's or some other non-standard hardware for a certain percentage of its computation infrastructure.

FPGA's are inherently low power. This is very useful in a datacentre environment as it saves electricity costs and cooling costs and allows them to be packed more densely than higher-power chips in rackmount equipment.

1.2.7 How BLAST (Basic Local Alignment Search Tool) and Other Related Algorithms Work

BLAST (Basic Local Alignment Search Tool) (Altschul, Gish, Miller, Myers, and Lipman), BLAT and related programs are used to search large databases of DNA and protein sequences for sequences which have high sequence similarity to a user's query sequence. BLAST makes use of the heuristic that homologous DNA and protein sequences are much more likely to have exactly-matching subsequences than non-homologous DNA and protein sequences. BLAST takes as input a query sequence and a sequence database. It produces as output a set of alignments of the query sequence to sequences in the sequence database.

Method Used by BLAST (Baxevanis and Ouellette)

Step 1: Generate every length-W subword of the query sequence; these are the query words. For a query sequence string S of length |S| we will generate (|S| − W + 1) ≈ |S| query words. By default we use W = 11 for DNA sequences and W = 3 for protein sequences. This process is known as seeding.

Step 2: A neighbourhood word of a string, P, is a word which can be created from P by performing substitutions with a given scoring matrix with a total substitution score below a threshold value, T, the neighbourhood score threshold. For each query word, we generate the set of all neighbourhood words of that query word. That set of words is called the neighbourhood of P. Our input parameters for generation of the neighbourhood are: the string P (of length k), the threshold score T and the substitution matrix. By choosing a large value for T we can find only exact and near-exact hits in the sequence database; a smaller value of T will also find more distantly related neighbourhood words from the query word P.

Step 3: Find every subsequence position of the sequence database at which each neighbourhood word appears as an exact match. A position at which a neighbourhood word appears in a sequence database and the position the corresponding query word appears in a query sequence are together known as a High-Scoring Segment Pair (HSP). There may be 0, 1 or multiple HSP's for a given neighbourhood word and sequence database.

Step 4: Extend each HSP in both directions along the database sequence. We typically score each HSP alignment using a fixed number of points for matches (a positive number), mismatches (a negative number) and gaps (a negative number). The alignment continues in both directions. As the length of the extension increases the cumulative score of the alignment will vary.
Once the cumulative score drops by more than a pre-specified amount, X, below the previous maximum cumulative score, the alignment length is set to the length at which the previous maximum cumulative score occurred and the extension in that direction is halted. This results in a local alignment of the query sequence to the sequence database.

BLAST computes an E-value for each HSP. This is the probability that BLAST would find that particular HSP by chance. The E-value is a probability estimate of whether the HSP is a false positive.

Why we Can't use BLAST for Short DNA Read Alignment

BLAST-type algorithms can be used to perform short-read sequence alignment in an obvious way: simply use each short read as a query sequence and use the reference sequence as the sequence database. Using an alignment policy of allowing 2bp mismatches, there are C(28, 2) = 378 ways to choose which 2 base pairs of a 28bp seed are mismatched. Each mismatch can be one of 3 base pairs, so for 2 mismatches there is a factor of 3^2 = 9 mismatch combinations to search for. So for 5.0M reads we need to search for a total of 5.0M × 378 × 9 ≈ 1.7 × 10^10 neighbourhood words. While doing this would produce correct alignment results, it would be too slow for real short DNA read datasets and real reference sequences. BLAST algorithms do not search for the seeds in an efficient way as they're only designed to search for a small number of seeds. Short read aligners build an index data structure of the seeds or the reference sequence which makes the hit finding process much more efficient than the hit finding process used by BLAST. This can only be done because we're searching for a large number of seeds in the same reference sequence.

1.3 Software Aligners

Several software short-read aligners designed to run on microprocessors have already been written and are in use.

1.3.1 The Indexing/Hit Finding/Hit Extension Paradigm

The short read alignment problem is often structured into 3 phases. This is similar to the way compilers are often structured into five phases (Aho and Ullman) and the way the TCP/IP stack is structured into four layers (Stevens). This aids with conceptual understanding of the aligner. The phases do not necessarily run one after another, nor do they correspond to a particular implementation technique. The paradigm is shown in figure 1.3 on page 21.

Figure 1.3: The three phases of short read alignment.

Indexing: A data structure is built containing all the seeds or the whole reference genome, called the index.

Hit Finding: The reference genome is scanned against the index, or the seed dataset is scanned against the index, and all the hits for each seed are found. This is usually done with a method which allows a small number of false positive hits to be generated.

Hit Extension: Each hit is extended to check if it's a false positive. If it is a false positive then it's simply discarded. If the hit is a true positive then additional processing may be carried out, such as computing an associated quality score.

1.3.2 Error Models

The error model of a string matching algorithm is the criteria by which an alignment of a pattern (i.e. a seed) to a text (i.e. a reference sequence) at a particular position is considered an alignment (i.e. a hit). The most obvious error model to use is to consider an alignment position where the pattern is identical to the subsequence of the text a hit. In practice for short read alignment we wish to allow one or two mismatches between the seed and reference at an alignment position. This allows for SNPs between the reference sequence and the sample sequence and individual base sequencing errors in the reads.
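As an aside (my own illustration, not code from any of the aligners discussed here), the mismatch-counting error model just described can be sketched in a few lines of C. The function and parameter names are hypothetical:

    #include <stdbool.h>
    #include <stddef.h>

    /* Returns true if `seed` aligns to the text at offset `pos` under a
     * k-mismatch error model, i.e. with at most `max_mismatches` substituted
     * characters. Both sequences are plain nucleotide character arrays.     */
    static bool is_hit(const char *text, size_t pos, const char *seed,
                       size_t seed_len, int max_mismatches)
    {
        int mismatches = 0;
        for (size_t i = 0; i < seed_len; i++) {
            if (text[pos + i] != seed[i] && ++mismatches > max_mismatches)
                return false;        /* too many substitutions: not a hit */
        }
        return true;
    }

Scanning every reference position with a check like this would implement the error model directly, but far too slowly for millions of seeds; the indexing and hashing methods described in the following sections exist to avoid exactly this brute-force scan.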
The choice of error model has an enormous influence on how computationally expensive the alignment problem is to perform. For a text of length n, subsequence matching has run time from O(n) to NP-complete depending on the error model used (Navarro).

1.3.3 ELAND

ELAND (Efficient Large-scale Alignment of Nucleotide Databases) was the first alignment program designed specifically to align whole-flowcell sized datasets of Illumina reads (typ. ∼12.5M reads) to a mammalian-sized reference genome (typ. ∼3Gbp). ELAND (previously known as IMPALA) was written in C++ by Anthony Cox at Illumina. The information in this section comes partly from the MAQ paper (Li, Ruan, and Durbin) and partly from comments in the ELAND source code. The source code and binary for ELAND are supplied by Illumina with the Illumina Genome Analyzer instrument. The source code and binary of ELAND cannot be downloaded by the general public; they are only distributed as part of the Genome Analyzer's software package. There is no publication associated with ELAND in the short read alignment literature. The only documentation available for ELAND is the comments in the source code and the information on its website (ELA). Being the first short read sequence aligner, ELAND influenced the development of later aligners such as MAQ.

Representing Seed Sequences and Reference Sequences with 2-bits-per-nt Encoding

ELAND represents the query seeds and reference genome it uses with 2-bits-per-nucleotide encoding. I will use the abbreviation 2b/nt for 2-bits-per-nucleotide from here on for brevity. Each nucleotide is simply represented as one of the four possible 2-bit words. Which 2-bit word is assigned to which nucleotide doesn't matter as long as the assignments are consistent throughout the whole implementation. It would make sense to assign the nucleotides in alphabetical order to the 2-bit strings in numerical order (i.e. A = 00, C = 01, G = 10, T = 11), so this is what I do later on when I use this technique in my implementation.

This is the most space efficient way we can represent DNA sequence data without using compression. This method has four times the space usage efficiency of representing each nucleotide with an 8-bit ASCII character code. This scheme has the nice feature that four nucleotides can be neatly packed into a byte with no wasted space. An example of 4 nucleotides encoded into a byte using 2b/nt encoding is shown in figure 1.4 on page 23.

Figure 1.4: An example of 2b/nt encoding.

Because all 4 2-bit words are used to represent sequence characters, it's impossible to include additional information alongside a 2b/nt encoded DNA sequence. For example, if we represent a reference genome with this method we can't include ambiguous characters or chromosome delimiters; there's just no way to represent them.

Sliding Window Query Generator for Hit Finding

In the hit finding phase of the alignment, ELAND needs to generate every subsequence position of the reference sequence. These are each looked up in turn in the index to find hits.
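The hit finder operates on 2b/nt-encoded data, so before looking at the query generator, here is a minimal sketch (mine, not ELAND's code) of the byte packing described in the previous subsection, assuming the A=00, C=01, G=10, T=11 assignment with the first base in the most significant bit pair:

    #include <stdint.h>

    /* Map an ASCII nucleotide to its 2-bit code (A=00, C=01, G=10, T=11).
     * Assumes the input contains only unambiguous A/C/G/T characters.      */
    static uint8_t encode_base(char c)
    {
        switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;   /* 'T' */
        }
    }

    /* Pack four nucleotides into one byte, first base in the top bit pair. */
    static uint8_t pack4(const char bases[4])
    {
        return (uint8_t)((encode_base(bases[0]) << 6) |
                         (encode_base(bases[1]) << 4) |
                         (encode_base(bases[2]) << 2) |
                          encode_base(bases[3]));
    }

For example, pack4("ACGT") yields the byte 00 01 10 11, i.e. 0x1B.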
To generate these positions, ELAND uses the "sliding window" (Crochemore, Hancart, and Lecroq) technique, which is an old and well known technique in the string matching algorithm literature. Given a text T[1..n], to generate all subsequences of length m one after another in the character array window Q[1..m], simply perform the algorithm in table 1.1 on page 24.

Table 1.1: The sliding window algorithm.

Step                    Action taken                            Time required
Initialization          Q[1..m] := T[1..m]; i = m;              O(m) ≈ O(1) as m << n
Iteration               Q'[1..m] := Q[2..m], T[i+1]; i++;       O(1)
Termination condition   When i = n.

The iteration step requires only time O(1) as we produce Q' from Q by applying a 1-character shift and then a 1-character insert rather than an m-character insert, which would require O(m) time. We can visualize this process as the window Q[1..m] sliding over the character sequence T[1..n], hence the name "sliding window". This is shown in figure 1.5 on page 24.

Figure 1.5: Generating query subsequences with a sliding window.

Hash Table Index Lookup Method for Hit Finding

When performing hit finding with a sliding window, the 2-part question we want to be able to answer is "Is this 28bp (56 bit) subsequence of the reference sequence one of the N x 28bp (56 bit) seeds in the seed dataset? If so, which one(s)?" We can then simply generate every 28bp subsequence of the reference sequence (with the sliding window method, or otherwise) and answer that question for each one.

One way to answer this question is to use the direct-address table data structure (Cormen). This data structure has the very desirable property of being able to answer the question in O(1) time. The table has 2^56 slots, numbered 0 to 2^56 − 1, each of which is simply a 1-bit boolean value. If the 56 bit seed, s, is in the seed dataset, we set the corresponding slot to TRUE, otherwise we leave that slot as FALSE. In practice we may want to store an unsigned integer indicating the number of times a particular seed occurs in the dataset, as each seed sequence can validly occur more than once in a particular seed dataset. Figure 1.6 on page 25 shows this method.

Figure 1.6: The table used for the direct address method.

It would be acceptable to use this method for small seeds, but 56-bit seeds require tables with 2^56 slots. Even with a slot size of 1 bit, a table this size would require 2^53 bytes ≈ 8,000,000GB of memory, which makes it impractical to use.

An improvement on this basic direct-addressing method is to hash each seed into a 24-bit integer and use the integer as the key value instead of the raw seed. This reduces the number of slots required in the hash table to 2^24 ≈ 16M, which is large but possible to implement on current hardware. A problem with this is that multiple different 56-bit seeds can have the same 24-bit hash value, so when a collision occurs in the hash table we wouldn't be able to uniquely identify the seed which caused that collision. To get around this we modify the table so that each slot stores the unique identifier(s) of the seed(s) stored instead of just a boolean value. Then when we find a collision in the hash table, indicating a hit, we can read out the unique identifiers of the seed(s) which the query collided with. Figure 1.7 on page 26 shows this exact match hashing method.

Figure 1.7: The exact matching hash table method.

This modified method is actually usable because it has a manageable memory requirement. It also still has the very desirable property of answering our query in approximately O(1) time. Having this O(1) runtime is dependent on the standard assumptions needed for hash tables to operate: that they don't get too full and the hashing function distributes keys to slots randomly (Cormen).
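To make the sliding-window plus hash-table process concrete, here is a rough C sketch of the hit finding pass. The 24-bit hash function and the table layout below are placeholders of my own, not the ones actually used by ELAND, and real slots would hold seed identifiers to be verified rather than a simple flag:

    #include <stdint.h>
    #include <stdio.h>

    #define SEED_LEN    28                         /* bases per seed           */
    #define SEED_BITS   (2 * SEED_LEN)             /* 56 bits                  */
    #define SEED_MASK   ((1ULL << SEED_BITS) - 1)
    #define TABLE_SLOTS (1u << 24)                 /* 2^24 slots after hashing */

    /* Placeholder hash: fold a 56-bit seed value down to 24 bits. */
    static uint32_t hash24(uint64_t seed)
    {
        return (uint32_t)((seed ^ (seed >> 24) ^ (seed >> 48)) & (TABLE_SLOTS - 1));
    }

    /* ref_bases holds one 2-bit code per byte for clarity (a real
     * implementation would pack four per byte). occupied[] marks hash slots
     * that contain at least one seed; a match here is only a candidate hit,
     * and the colliding seed IDs would still have to be read out.           */
    static void scan_reference(const uint8_t *ref_bases, uint64_t ref_len,
                               const uint8_t *occupied /* TABLE_SLOTS entries */)
    {
        uint64_t window = 0;
        for (uint64_t i = 0; i < ref_len; i++) {
            /* O(1) sliding-window update from table 1.1: shift in one base. */
            window = ((window << 2) | (ref_bases[i] & 3u)) & SEED_MASK;
            if (i + 1 < SEED_LEN)
                continue;                          /* window not yet full      */
            if (occupied[hash24(window)])
                printf("candidate hit at reference position %llu\n",
                       (unsigned long long)(i + 1 - SEED_LEN));
        }
    }

In a real run the table would first be built from the seed dataset during the indexing phase; the loop above corresponds to the hit finding pass over the reference.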
In practice we break a large seed dataset into multiple batches and perform multiple passes of the reference genome to stop the hash table getting too full. This exact-matching hash-table method would be fine to use if we only wanted to perform exact matching, but for short read hit finding we need to find hits which have one or two mismatches between the seed and reference sequence (as explained in section 1.3.2, page 20). Modifying the Hash Table Method to Work with Inexact Matching Suppose we have two strings each of length 2n characters, X and Y. We can divide each of them into 2 segments each of length n characters: X0 , X1 and Y0 , Y1 . If X and Y are identical (i.e. they have 0 mismatched characters) ⇒ Both segments will match ⇒ At least one segment will match If X and Y have exactly 1 mismatched character ⇒ Exactly one of the two segments will mismatch ⇒ Exactly one segment will match ⇒ At least one segment will match If X and Y have 2 or more mismatched characters ⇒ At least one of the two segments will mismatch ⇒ At most one segment will match So if we find alignment positions where at least one segment of the seed matches the corresponding segment of the query sequence, we will find all 27  0-mismatch hits, all 1-mismatch hits and some (2-or-more)-mismatch hits. We can use this observation to modify the hash table hit finding method to find 1-mismatch hits as well as 0-mismatch hits. This modified 1-mismatch hash table method is shown in figure 1.8 on page 28. Notice that the figure has been rotated 90 ◦ clockwise relative to figures 1.6 and 1.7. Making this modification to produce the 1-mismatch modified method introduces problems that the 0-mismatch original method doesn’t have: • Memory usage: It uses twice the amount of memory as the original method. This is because it uses 2 hash tables, whereas the original method uses just 1. • Speed: It is less than half the speed of the original method. This is because it has to perform 2 hash table inserts/searches where the original method performs just 1. It also has to perform two template applications per query, whereas the original doesn’t have to perform any. • Accuracy: The 0-mismatch modified method generates all the true positive hits and no false positive hits. The 1-mismatch modified method generates all the true positive hits and some false positive hits. This is not a problem in practice (explained below). The increased memory usage and runtime are not problems, the method is still usable. The generation of false positives would make the method unusable were it not for the fact that we can simply use the seed extension phase of the algorithm to filter out false positives. As long as the number of false positives generated is reasonable (say, 1 false positive for each true positive) the method remains usable. A really inaccurate hit finder could generate a huge number of 28  Figure 1.8: The one-mismatch hash table method. 29  false positives e.g. by generating a hit for every seed at every reference sequence position. The hit extender would still be able to filter all the false positive out but it would take an unusably long time to do so. We can extend this 1-of-2 subsequence matching method to the 2-of-4 subsequence matching method. This is the method actually used by ELAND. We divide the query sequence and seed sequence into 4 subsequences. 
If there are at most 2 mismatches between the query and the seed, there will be at most 2 mismatched segments between the query and the seed, so there will be at least 2 matching segments between the query and the seed. So finding all hits with at least 2 matching segments will find all 2-mismatch hits along with some false positives. This 2-mismatch hash table method is shown in figure 1.9 on page 31. ELAND was the first aligner to use this “2-of-4 subsequence method”. This method is known as the “pigeonhole principle”(Jiang and Wong). It’s also been used in a completely different application: detection of nearly-identical webpages(Manku, Jain, and Das Sarma).

In the general n-of-2n method we need C(2n, n) (i.e. “2n choose n”) hash tables if we want to allow n mismatches between the query sequence and seed sequence. Consequently an n-mismatch alignment takes C(2n, n) times longer to run than a 0-mismatch alignment and consumes C(2n, n) times the amount of memory whilst running. A few values of C(2n, n) are shown in table 1.2 on page 32. The function C(2n, n) = (2n)!/((n!)^2) grows very rapidly. It grows faster than 2^n. Really this technique of running multiple hash tables behind the scenes to cope with mismatches is a bit of a hack - it doesn’t scale up to large numbers of mismatches because the runtime and memory requirement quickly becomes unusable. However, it happens to have manageable resource requirements for very small values of n, so it can be used in practice. With a slot size of 4 bytes and for 6 hash tables, we require 6 ∗ 2^24 = 96MB of memory.

Figure 1.9: The two-mismatch hash table method.

Table 1.2: Some values of C(2n, n), the number of hash tables needed for an n-mismatch alignment.

n:        0  1  2   3   4    5    6   ...  r
C(2n, n): 1  2  6  20  70  252  924   ...  C(2r, r)

How ELAND fits into the Indexing/Hit Finding/Hit Extension Paradigm

Indexing: Generate the seed corresponding to each read in the input dataset. Load each seed into each of the 6 hash tables after first applying the corresponding template and then the hash function.

Hit finding: Scan a sliding window over the reference genome to generate queries. Apply each of the 6 templates to each query, then apply the hash function to each query. Look up each of the 6 x 24 bit hash integers in the corresponding hash table. If a collision occurs in any of the 6 hash tables, generate a hit of the collided seed at the reference sequence position the window is at.

Hit extension: None, raw hits are output with no post-processing.

Known Problems with ELAND

• If a read appears at two or more positions of the reference genome ELAND will find only one of them.
• ELAND cannot perform gapped alignment.
• All the reads aligned in a single run of ELAND must be the same length. This length is fixed at compile time so a single executable version of ELAND can’t be used on reads of different lengths.
• ELAND can only align reads of length 32bp or less.
• ELAND cannot handle ambiguous bases in the reference genome file.

1.3.4 MAQ (Mapping and Assembly with Qualities)(Li, Ruan, and Durbin)

MAQ is a free and open source suite of programs for performing short-read alignment. The MAQ aligner itself is written in C++. Its auxiliary tools are written in C and Perl. MAQ was written by Heng Li. At the time he was a postdoctoral researcher at the Sanger Institute in Cambridge, UK. MAQ’s hit finding method is identical to the method used by ELAND (explained in section 1.3.3 on page 24). MAQ’s implementation and ELAND’s implementation do not share any code.
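Since the hit-finding criterion ELAND and MAQ share reduces to the 2-of-4 subsequence test, the test itself is compact enough to sketch in software. The function below is an illustration only (it is not code from either aligner); it assumes seeds and queries are 28nt values packed at 2 bits per nucleotide into the low 56 bits of a 64-bit word, split into four 14-bit (7nt) subwords as described above.

```c
/* Software illustration of the 2-of-4 subsequence test used by
 * ELAND/MAQ-style hit finders.  Seeds and queries are assumed to be
 * 28 nt packed at 2 bits per nucleotide into the low 56 bits of a
 * 64-bit word; this is not code from either aligner.                    */
#include <stdint.h>
#include <stdbool.h>

/* Count how many of the four 14-bit subwords of the query equal the
 * corresponding subword of the seed.  Each mismatched nucleotide can
 * corrupt at most one subword, so every alignment with 0, 1 or 2
 * mismatched nucleotides is guaranteed to pass the "at least 2" test
 * (along with some false positives, which hit extension filters out).   */
static bool hit_2_of_4(uint64_t seed, uint64_t query)
{
    int matches = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t sub_seed  = (seed  >> (14 * i)) & 0x3FFF;
        uint64_t sub_query = (query >> (14 * i)) & 0x3FFF;
        if (sub_seed == sub_query)
            matches++;
    }
    return matches >= 2;
}
```

The same test reappears later in hardware form: each comparison module in the FPGA design described in chapter 2 evaluates four 14-bit equality comparisons and an at-least-two vote on exactly this subword layout.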
MAQ is a more feature-rich and higher quality package than ELAND.

Computation of Mapping Qualities for each Read

The most important difference between MAQ and ELAND is MAQ’s generation of a mapping quality value for each read aligned. The mapping quality of a read is an estimate of the probability that the alignment is a false positive, i.e. that the read didn’t come from the position to which it has been aligned. Each mapping quality value is a phred-scale probability score(Ewing, Hillier, Wendl, and Green). The phred-scale probability score can be computed from a probability value with the formula

Q = −10 · log10(Probability)

On this logarithmic scale a score of 0 corresponds to a probability of 1, a score of 10 corresponds to a probability of 0.1, a score of 20 corresponds to a probability of 0.01, and so on. Usually these values are stored in 8-bit unsigned bytes, making only the integers 0 to 255 inclusive valid as phred-scale probability scores.

If every character position of the seed matches the corresponding character position of the reference sequence at the position it is aligned, then the probability that the hit is a false positive (i.e. its mapping quality) is 0. If the seed has exactly one mismatched character position at the position it’s aligned then its mapping quality is the probability that that character position is incorrect, i.e. the probability of error (pe) of the read at the mismatching character position. If there are 2 or more mismatching character positions, then the probability that the read is incorrectly aligned is the product of the error probabilities of the mismatching characters of the read:

(probability alignment is incorrect) = Π(raw error probabilities of mismatched bases of the read)

Equivalently, using Q-values this can be expressed as

(mapping quality of alignment) = Σ(raw Q values of mismatched bases of the read)

Note that MAQ doesn’t consider the possibility of errors in the reference sequence; it assumes every sequence position of the reference is 100% accurate. It only takes into account the possibility of errors in the reads.

How MAQ Differs from ELAND

• Unlike ELAND, MAQ can handle ambiguous bases in reference genome files.
• MAQ can do paired end read alignment (ELAND cannot).
• MAQ computes a mapping quality for each hit - an estimate of the probability that the hit is a false positive.
• MAQ is generally a better designed, easier to use and more feature rich software package than ELAND.

1.3.5 SOAP(Li)

SOAP (Short Oligonucleotide Alignment Program) can search for either hits with a maximum number of mismatches, or hits with a single contiguous gap of up to a certain length. When searching for hits with a maximum number of mismatches, up to 2 mismatches are allowed. When searching for hits with a maximum gap size, maximum gap sizes of 1-3bp are permitted. SOAP represents the reference sequence and reads with standard 2b/nt encoding (see p 22). SOAP is a sliding-window aligner, as are MAQ and ELAND. However, the exact method it uses differs slightly from MAQ and ELAND. SOAP indexes and hashes the reference sequence rather than the reads. ELAND and MAQ hash the reads.

1.3.6 PASS(Campagna, Albiero, Bilardi, Caniato, Forcato, Manavski, Vitulo, and Valle)

PASS (Program to Align Short Sequences) fits into the indexing/hit finding/hit extension paradigm as follows:

Indexing: It produces an every-subsequence list of the reference genome as an index.
Hit finding: Each subsequence of each read is aligned to each element of the index using the Needleman-Wunsch algorithm(Needleman and Wunsch). This would be very slow if it weren’t for the fact that PASS uses precomputed tables of the results this algorithm generates. For subsequences of length w, these precomputed tables have size O(4^(2w)), so only short subsequences can be used. Usually a maximum of 7-8bp is used.

Hit extension: The hits are scored and displayed.

1.3.7 SeqMap(Jiang and Wong)

SeqMap was written in C++ and the associated publication(Jiang and Wong) was published in 2008. It uses the standard 2b/nt scheme to represent the reads and reference in memory. SeqMap can perform alignments with up to 5 substitutions and/or indels. SeqMap indexes the reads and scans the reference genome.

1.3.8 Slider(Malhis, Butterfield, Ester, and Jones)

Slider is the first aligner which makes use of the per-base probability-of-error estimates generated by the Illumina short-read sequencing instruments. Despite its name, it doesn’t actually use the sliding-window query generation method for the hit finder. It enumerates every subsequence of the reference sequence and stores them in a lexicographically-sorted list. It also lexicographically sorts a list of the seed dataset which it’s searching for. The two lists are stepped through by a pair of pointers, outputting reference sequence positions which match seeds as hits.

1.3.9 Bowtie(Li and Durbin)

Bowtie uses a more sophisticated hit finding algorithm than the sliding window approach. It uses the Burrows-Wheeler Transform(Adjeroh, Bell, and Mukherjee), a technique originally developed as a “compression boosting” algorithm.

1.4 Other Related Work

Implementing dedicated hardware in FPGA’s for performing bioinformatics algorithms is a well established field, and a large supporting literature exists.

1.4.1 Dynamic Programming in FPGA’s

There are many examples in the literature where the Smith-Waterman(Smith and Waterman) and Needleman-Wunsch(Needleman and Wunsch) dynamic programming algorithms have been implemented in FPGA’s(Hoang and Lopresti)(Li, Shum, and Truong)(Jacobi, Ayala-Rincón, Carvalho, Llanos, Hartenstein et al.)(Jiang, Liu, Xu, Zhang, and Sun). These approaches generally all take the same basic approach: they take advantage of the fact that, of the n cells in the dynamic programming matrix, approximately √n of them can be evaluated concurrently in a sweeping diagonal line. However, this does not scale well to larger FPGA’s. Only √n/n of the FPGA can be utilized, as only that much of the matrix can be evaluated concurrently. As n gets larger (it approximately doubles every 18 months, a phenomenon known as Moore’s Law), the utilization of the FPGA will get smaller. In (Masuno, Maruyama, Yamaguchi, and Konagaya) the authors take a slightly different approach: they perform 3-sequence, 3-dimensional alignment on an FPGA and make use of the fact that the FPGA hardware is reconfigurable.

1.4.2 Other Previous Uses of FPGA’s in Bioinformatics

BLAST-type algorithms: Many publications describe attempts to accelerate BLAST-type algorithms with FPGA implementations(Buhler, Lancaster, Jacob, Chamberlain et al.)(Lavenier, Xinchun, and Georges)(Sotiriades and Dollas)(Herbordt, Model, Sukhwani, Gu, and VanCourt).

Proteogenomic Mapping: In (Dandass) the authors present their design for an FPGA implementation of the Aho-Corasick algorithm(Aho). They optimize their design for proteogenomic mapping.
They show how the RAM blocks on an FPGA can be used to create high-performance amino acid sequence matchers. The hardware they implement is effectively a short read aligner.

HMMs: In (Gupta), the author uses FPGA’s to implement HMMs (Hidden Markov Models).

RNA Secondary Structure Prediction: In (Xia, Dou, Zhou, Yang, Xu, and Zhang) the authors implement an existing RNA secondary structure prediction algorithm in FPGA hardware.

1.4.3 A Previous Implementation of a Short Read Aligner in FPGA Hardware(McMahon)

A short read alignment appliance already exists. One was implemented for an M.Sc. project at the University of Cape Town in 2008. The method used was substantially different to the method I used: it performed a dynamic programming algorithm at every position of the reference sequence for every seed loaded into the FPGA. There were some similarities: it used the same method of storing the reference sequence in off-chip SDRAM, loading it into the FPGA and serializing it at runtime. Also, the alignment appliance was connected to the host workstation by Ethernet. The authors only managed to fit 18 seeds of 31bp into each FPGA. Because of the method used it was very slow: extrapolating from their performance measurements gives a speed of only ∼1,000 seeds/hr for a 3Gbp reference on a high end FPGA. This is about 1% of the speed I achieved (approx. 100,000 seeds per hour per FPGA).

Chapter 2

Overall System Architecture

This chapter describes the “big picture” view of how the design works, and the rationale behind the design decisions made.

2.1 Basic Idea

Hit finding is easy to parallelize: each seed can be scanned against the reference genome independently of the rest by a separate hardware module. The hits from each hardware module can then be collated into a single batch of results. If we have N identical hardware modules for performing hit finding we can obtain a speedup of approximately N by simply dividing our seed dataset into N equal parts and running them in parallel. Problems which have this property of having a factor of N speedup when run on N parallel modules are informally called “embarrassingly parallel.” A well-known example of where embarrassingly parallel problems occur is in some 3D graphics algorithms where the color of each pixel can be computed independently of the rest. This property is exploited by 3D graphics hardware.

Figure 2.1: Method 1 for implementing custom hardware in an FPGA for embarrassingly parallel problems.

Not all problems are embarrassingly parallel: for example, computing a sequence of values with the Newton-Raphson process(Ypma) cannot be sped up regardless of the number of hardware modules available, as each step of the iteration requires the value computed by the step before it and so cannot begin until the previous step completes. The obvious way to implement a system which exploits the massive parallelism of a problem in an FPGA is to implement N custom-designed hardware modules, each with its own bank of memory, and run them in parallel in the FPGA. This is shown diagrammatically in figure 2.1 on page 40. The problems with this method are:

• Each memory bank would have to be large enough to hold a human reference DNA sequence (i.e. ∼1GB using 2b/nt encoding). For N=100’s or N=1,000’s the cost of this amount of memory would be enormous compared to the cost of the FPGA.

• Each memory bank would require, say, ∼100 pins to connect to the FPGA. There are typically only ∼1,000 pins on an FPGA.
This would limit the number of custom modules that could be instantiated in the FPGA to around ∼10. This would be very wasteful of the space in the FPGA: Really we want to be able to implement hundreds or thousands of hardware modules in the FPGA. Fortunately hit finding with a sliding window allows an alternative, better way to exploit massive parallelism. When hit finding with a sliding window, our custom module is simply a register holding the seed in 2b/nt form and a comparator which tests if this seed is identical to the search seed at the current window position in the reference sequence. Because all N custom modules are aligning to the same reference sequence at the same clock speed, we can have a single memory interface streaming the reference genome into a window in the FPGA. This could then be broadcast over an internal broadcast bus in the FPGA to all N functional units simultaneously. This is shown in figure 2.2 on page 42. This method requires only a single interface to a memory bank and so avoids both the problems with the method shown in figure 2.3. Note that the reference sequence needn’t necessarily be stored in a bank of memory: It could be stored on a host workstation and streamed over an interface such as PCI, PCI-Express, USB, Ethernet, etc.. This is indicated in the diagram. Unfortunately hit extension can’t be used with this second better method. While hit extension is a massively parallel problem, each custom module needs  42  Figure 2.2: Method 2 for implementing custom hardware in an FPGA for embarrassingly parallel problems.  43  Figure 2.3: Properties of the subproblems of alignment. random access to its own bank of memory because it is aligning to a different part of the reference sequence. Fortunately hit extension is less computationally expensive than hit finding so it can be done as a post-processing step by a general purpose processor rather than custom hardware in the FPGA. This could take the form of an embedded soft processor in the FPGA, a hardware co-processor on the FPGA board or the host workstation itself. Which will be used in the implementation will be a design decision taken based primarily on the features of the development board used. Figure 2.3 on page 43 summarizes the relevant properties of hit finding and hit extension. This approach of combining microprocessors with FPGA’s is generally known as reconfigurable computing(Compton and Hauck). Being able to keep the utilization of the custom modules high relies on the number of seeds in the dataset being large compared to the number of custom units in the FPGA. This is an acceptable assumption to make as short read alignment datasets are typically millions of reads and the number of custom modules in the FPGA will probably number in the 1,000’s at most. This means when we run the aligner multiple times, the seed registers will be full for virtually all the runs. Programming the seeds into the custom modules takes non-zero time. 44  This prevents us having a truly linear speedup with N modules as a larger value of N has a larger reprogramming time as the custom modules are reprogrammed individually. However in practice the time taken to reprogram the custom modules is likely to be very small (i.e. virtually zero) compared to the time taken to scan the reference genome, so the speedup gained with N custom modules is almost linear. If we have 1,000 custom modules each with a 56 bit register we have 56kbits = 7kbytes of data to load into the seed registers. 
A 3.0Gbp reference sequence using 2b/nt encoding has size 750MB. Assuming we transfer data at the same speed, loading data into the seed registers takes only (7kbytes/750MB) = 0.0009% of the time it takes to scan the reference genome.  2.2  First Implementation Attempt: Using the Cray XD1 FPGA-Accelerated Computer  At first I used the Cray XD1 hardware. This was chosen mainly because it happened to be available at the GSC at the time and my supervisor was able to give me use of it. A cray XD1 system contains one or more rack-mount chassis. Each chassis contains a custom motherboard with 6 Opteron high-end processors and 6 Xilinx Virtex-II high-end FPGA’s. Each of the processors has an FPGA connected directly to its Hypertransport bus (i.e. its memory bus). This allows extremely low-latency, high-bandwidth communication between the processor and FPGA compared to systems which use FPGA’s that connect to expansion buses such as PCI or PCI-Express. The system doesn’t provide device drivers for the user code to interact with the FPGA’s, instead the system runs a 45  customized version of the Linux operating system which includes code for CPU-to-FPGA communication integrated into the kernel. I decided not to use the Cray XD1 for my final implementation for these reasons: • The documentation provided by Cray was good but I found it to be very difficult to use as I was essentially programming the “bare metal” of the Hypertransport interface. Familiarizing myself with the details of how Hypertransport works would be too much work for a graduate student working alone. • The Virtex-II FPGA’s in the Cray XD1 can only be used with development tools which Xilinx charge $2,500/year for. • The FPGA’s can only be programmed with VHDL, not Verilog. This is because the hardware “stubs” which the user’s FPGA code connects to at compile time provided by Cray are written in VHDL. VHDL is a similar but incompatible language to Verilog. I could potentially have learned VHDL but I would prefer to use Verilog as I have previous experience with it. • The hardware cost of the Cray XD1 is extremely high (∼$150,000 for a 6-Opteron, 6-FPGA system) as it uses a custom designed motherboard and operating system. This is necessary to provide the very low latency CPU-to-FPGA communication. My implementation does not need to have such low-latency communication between the FPGA and CPU, so this capability is wasted.  46  Figure 2.4: Image of the Arria GX FPGA development board. The FPGA is the large, silver chip near the centre of the board. The PCIExpress x4 connector is on the bottom edge of the card. The rest of the components form the power supply, programming logic and glue logic of the card.  2.3  Second Implementation Attempt: Development of a PCI-Express based Accelerator Card for the Host Workstation  The second attempt I made was to implement a plug-in card for a workstation which connected the FPGA to the workstation via PCI-Express. This is a similar idea to using a 3D graphics card. However, instead of offloading the 3D graphics calculations to a GPU, the short read sequence alignment computations would be offloaded from the CPU to the custom logic in the FPGA. An Arria GX development board was purchased from Altera for this purpose. Image 2.4 on page 46 shows the development board. After spending several weeks using the card I decided to abandon using it in favor of easier-to-use hardware. PCI-Express is a very complex interface to interface custom logic to and to write device drivers for. 
With no previous experience with it and working alone I decided that, while it would be possible to produce a PCI-Express based implementation, it would take too long.  47  Figure 2.5: How the Ethernet-based appliance connects to the host workstation.  2.4  Third, Final Implementation Attempt: Development of an Ethernet-based Appliance  The way I finally decided to implement the system was as an Ethernet-based network-accessible appliance. The user would communicate with the appliance over an Ethernet link. Software to allow the user to communicate with the appliance would have to be custom written. The appliance could contain multiple FPGA’s. For very high speed it would be possible to have hundreds or thousands of FPGA’s in a single appliance. Figure 2.5 on page 47 shows how an appliance would be connected to a host workstation. The FPGA development board I chose as the hardware to implement the appliance was the Altera DE2-70 board. It was chosen because it had an Ethernet transceiver, it was inexpensive ($330 academic pricing) and because I had previous experience with a similar board. An image of the DE2-70 board is shown in figure 2.6 on page 48. Once I had the board I began to experiment with implementation strategies and make design decisions. The DE2-70 board includes the following integrated components.  48  Figure 2.6: Image of the Altera DE2-70 board. The Cyclone II FPGA is the large chip slightly to the right of the centre of the board. The rest of the board comprises the power supply, IO devices, memory chips, programming logic and glue logic. • One Altera Cyclone II EP2C70 FPGA. • Integrated USB-connected FPGA programmer. • Integrated 50MHz clock generation hardware. • 64MB SDRAM. • 2MB SSRAM. • DM9000A 10/100Mbps Ethernet Adaptor. • Other IO devices (LED’s, switches, RS232, USB, monitor connector, audio connector, LCD display).  2.5  Choice of Development Tools  For the FPGA design development I used the Quartus II development environment and synthesizer. This is the development environment provided by Altera specifically for use with their FPGA’s. Altera update their development tools frequently, typically more often than once per year. I used the v7.2 development tools at first as they were supplied on CD with the DE2-70 board. 49  Figure 2.7: The hardware setup used during development and testing. I incrementally upgraded the tools by downloading the newest versions as they became available. The final version was built with the v9.1 development tools. In addition to Quartus II I used the MegaWizard tools (needed to use embedded on-chip memory in the FPGA), the SOPC Builder (needed to auto-generate the code for the embedded processor in the FPGA and the associated logic) and Eclipse(Ecl) for developing the Java software used on the host workstation. All HDL (Hardware Description Language) code in the implementation was written in Verilog 2001.  2.6  Development Hardware Setup  Figure 2.7 on page 49 shows the 3 pieces of equipment used during development and testing and how they are interconnected.  50  2.7  Design Decision: Where to Store the Reference Sequence  The reference sequence resides on the disk of the host workstation to start with. There are 2 ways it can be transferred into the alignment hardware of the FPGA: • Stream the reference sequence incrementally from the workstation into the sliding window of the alignment hardware in the FPGA. 
• Pre-load the reference sequence into the onboard SDRAM over the Ethernet link, then transfer the reference sequence into the sliding window of the alignment hardware in the FPGA only when it’s actually performing an alignment. I first implemented option 1 because it was easier to implement as it didn’t require me to use the SDRAM. It also has the advantage of allowing any size reference sequence to be used, whereas pre-loading the reference sequence into SDRAM limits the user to using a reference sequence that will fit into the amount of SDRAM available. Despite these advantages to option 1, I used option 2 in my final implementation because it has certain advantages: • Each time the aligner runs when using method 1, it generates many gigabytes of network traffic to transfer the reference sequence from the workstation to the FPGA. When running from the on-board SDRAM (option 2) this traffic is only generated once, when the reference sequence is transferred from the workstation to the SDRAM. When the aligner actually runs the reference sequence data is transmitted from the onboard SDRAM to the FPGA so no network traffic is generated. 51  • The Ethernet adaptor on the DE2-70 board only runs at 100Mbps, and is even slower in practice. This can deliver reference genome data at a maximum rate of 50Mnt/s using 2b/nt encoding. So if we used the Ethernet interface to stream reference sequence data into the alignment hardware, it would limit the alignment hardware to running at 50MHz or slower. This is likely to be a problem as I managed to implement alignment hardware with a maximum clock speed of over 100.0MHz. Another option would have been to store the reference sequence in the onboard 8MB flash memory of the DE2-70 board. The fact that flash memory wears out after a small number of write cycles wouldn’t matter as the reference genome would only have to be updated occasionally. The property of retaining its memory state when powered down would be desirable so that the reference genome wouldn’t have to be loaded into the appliance again once it had been powered down and then back up again. I didn’t use this flash memory to store the reference genome because its capacity of 8MB is too small and the guaranteed read rate of ∼20MB/s is too slow.  2.8  Design Decision: Which Devices are Used on the DE2-70 Board and What Happens to the Rest  The DE2-70 board includes multiple IO devices hardwired to the FPGA. It’s similar to a PC motherboard but with an FPGA in place of the processor. The only hardware components used on the board are the 2 x 32MB SDRAM chips,  52  the 2MB SSRAM chip and the DM9000A Ethernet Controller. None of the other IO devices are used. The rest of the hardware devices on the board are all deactivated. This is done to • Reduce power consumption by the unused components. • Prevent potential interference by these devices with the operation of the aligner. This could especially be a problem for the devices which share pins of the FPGA. The unused hardware devices cannot be physically disconnected from the FPGA, so I wrote a Verilog code “stub” to deactivate each of these unused devices and incorporated it into the final FPGA design. During development the switches, keys, led displays and lcd display were used to display debug information. These were temporarily enabled for debug purposes and then deactivated in the final design.  
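As an aside, the throughput argument behind storing the reference in the on-board SDRAM (section 2.7) is easy to make concrete. The sketch below simply restates figures from the text (a 100Mbps Ethernet link, the ∼533MB/s SDRAM bandwidth quoted in section 2.10, 2 bits per nucleotide and 1nt consumed per clock cycle); it is an illustration, not code from the implementation.

```c
/* Back-of-the-envelope check of whether a given data source can keep
 * the sliding-window query generator fed at 1 nt per clock cycle.
 * The data rates are the figures quoted in the text; the function
 * itself is illustrative and not part of the implementation.            */
#include <stdio.h>

/* Maximum pipeline clock (Hz) a source can sustain, at 2 bits per
 * nucleotide and one nucleotide consumed per cycle.                     */
static double max_pipeline_hz(double source_bytes_per_sec)
{
    double nt_per_sec = source_bytes_per_sec * 8.0 / 2.0;  /* 4 nt/byte */
    return nt_per_sec;                                     /* 1 nt/cycle */
}

int main(void)
{
    double ethernet = 100e6 / 8.0;  /* 100 Mbps link, theoretical peak   */
    double sdram    = 533e6;        /* ~533 MB/s, as quoted in sec. 2.10 */

    printf("Ethernet streaming limits the aligner to ~%.0f MHz\n",
           max_pipeline_hz(ethernet) / 1e6);               /* ~50 MHz    */
    printf("SDRAM streaming supports up to ~%.0f MHz\n",
           max_pipeline_hz(sdram) / 1e6);                  /* ~2132 MHz  */
    return 0;
}
```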
2.9  Adapting the The Basic Idea to Short Read Alignment with the DE2-70 Board  For the reasons outlined in section 2.1 on page 39, hit finding will be performed in the FPGA and hit extension will be performed by the host workstation. The method for parallel execution of a problem (figure 2.2, p 42) can be adapted for use with the DE2-70 board as shown in figure 2.8 on page 53. In my implementation, the custom module consists of a 56 bit hardware register (28nt, 2b/nt encoding) for storing a seed and a comparator to compare that seed to the 56-bit query sequence. In figure 2.8 they are shown as the small 53  Figure 2.8: Adapting the basic multi-parallel-module to a hit finder on the DE270 board.  54  blocks labeled “CM i” for i = 1 to N. The internals of one of these comparison modules is shown in figure 2.9 on page 56. The internals of each comparison module replicates the 2-of-4 subsequence matching method of ELAND and MAQ, but does this in hardware instead of software. The green area represents the FPGA development board and the boxes within it represent the components on the board. The pale yellow box represents the FPGA and its contents represent the virtual hardware design instantiated in it. The orange block within the FPGA contains the hit finding hardware which uses the majority of the FPGA’s resources (at least 90%, depending on the configuration). The dedicated hardware for performing hit finding shown in this orange block is called the alignment pipeline. The controller hardware occupies just a few % of the FPGA’s resources. The green block inside the FPGA represents the deactivation stubs connected to the rest of the hardware devices on the DE2-70 board, deactivating them. The control module is responsible for • Driving the DM9000A Ethernet adapter to transmit packets to the host workstation and receive packets from it. • Writing the reference sequence data into the SDRAM reference sequence buffer. • Writing the seed register data into the seed registers in the comparison modules. • Collecting hit results from the comparison modules and returning them to the host workstation. • Overall control of the alignment appliance 55  The control module is implemented as a SOPC with Altera’s SOPC Builder tool. It incorporates the NIOS-II embedded soft-processor along with several other off-the-shelf hardware components. When the hit finder is running, the reference sequence is transferred from the reference sequence SDRAM into the sliding window query generator. The window advances 1nt along the reference sequence per clock cycle. The sliding window query generator outputs 56-bit queries to a broadcast bus which is broadcast to all N comparison modules. Each comparison module compares the query to its stored seed using the 2-mismatch-permitting 2-of-4-subsequence comparison method. The results from the comparison module are collated and returned to the host workstation by the controller via the Ethernet interface. The SDRAM used as the reference sequence buffer has a capacity of 64MB. Using 2b/nt encoding this can store a reference sequence of maximum length 256Mbp. Each comparison module contains a 56 bit register to store a 28nt seed in 2b/nt encoding along with the hardware required to compare the stored seed to the query using the 2-of-4 subsequence comparison method. Figure 2.9 on page 56 shows the design of a comparison module. The seed sequence and query sequence are each split into 4 x 14-bit subwords. 
Each of the 4 subwords in the query is tested for equality with the corresponding subword of the seed. If 2 or more of the subwords are identical then the comparison module asserts its hit output signal to tell the controller a hit has been found. Figure 2.10 on page 58 shows the hardware used to read and write to the seed registers. Note that the register is horizontally inverted compared to the one in figure 2.9. The register is basically 2 x 28-bit shift registers in parallel. There have to be 2 bits at each position as it takes 2 bits to store each 56  Figure 2.9: The design of a comparison module.  57  nucleotide. The contents of each nt position can simply be read off as shown in the diagram. To write to the seed register, new data is presented to it 1nt (2 bits) at a time, then the write line is asserted causing both shift registers to shift by 1 bit and take the new 2-bit value presented on the data signal. Each piece of seed register write hardware is chained to the seed register writing hardware in the next comparison module. This allows the controller to write to the contents of the seed registers without having to have a dedicated data bus and address bus for writing to the seed registers, reducing FPGA resource usage. A side effect of this is that all the seed registers have to be updated together - they can’t be changed individually. This isn’t a problem though - when we load a new seed batch into the seed registers in the FPGA, we want to load all the seed registers at once anyway. Because we don’t always wish to use the full seed batch size that the FPGA can support, there needs to be a mechanism to deactivate certain comparison modules when they aren’t in use to stop them generating spurious hits. This is done by having a 1-bit enable bit in each comparison module. This bit is ANDed with the 2-of-4 subsequence detector to produce the hit output signal. When the enable bit is set the hit output signal is simply the output from the 2-of-4 subsequence detector. When the enable bit is clear the output is permanently off, regardless of whether the query and seed match. The seed register enable bits are chained and written to by the controller in the same way as the seed registers, for the same reason.  58  Figure 2.10: The hardware of a seed register.  59  2.10  Design Decision: Method of Getting Reference Sequence Data into the Query Generator Sliding Window  2.10.1  Method 1  For the first method I used, the reference sequence was not stored on the board, it was streamed incrementally from the workstation over the Ethernet interface into the query generator window. The SOPC controller wrote data into the query generator hardware using a 32-bit connection from a PIO (Programmed Input/Output) device from the NIOS-II processor in the SOPC controller. Both the NIOS-II and PIO are library components supplied by Altera. A PIO is simply a memory-mapped register which the processor can write to which connects onto the rest of the hardware design. There were two problems with this: 1. The Nios-II processor can only run a single thread at a time, and it also has other duties than writing data into the query generator window. When it’s doing other things it can’t be writing data into the query generator window. This stalls the entire alignment pipeline in the FPGA, reducing its performance. 2. The PIO is not designed for high performance. It’s simply too slow to keep up with the rate of data consumption by the query generator.  
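Returning briefly to the seed-loading scheme of figure 2.10: the chained shift-register behaviour can be modelled in ordinary software, which may make the loading order easier to follow. The model below is illustrative only; the real logic is Verilog inside the FPGA, and the module ordering convention, the names and the small example batch are assumptions made for the sketch.

```c
/* Behavioural model of the chained seed-register loading described for
 * figure 2.10.  The real logic is Verilog inside the FPGA; this C model,
 * its module ordering convention and its names are assumptions made for
 * the illustration only.                                                 */
#include <stdint.h>
#include <stdio.h>

#define N_MODULES 4                    /* small N for the example        */
#define SEED_BITS 56                   /* 28 nt at 2 bits per nucleotide */
#define SEED_MASK ((1ull << SEED_BITS) - 1)

static uint64_t seed_reg[N_MODULES];   /* one seed register per module   */

/* One write pulse: every register shifts in one nucleotide (2 bits).
 * Each register passes its oldest nucleotide to the next module in the
 * chain, and the new 2-bit value enters module 0, so the controller can
 * load every register without a dedicated address/data bus.             */
static void seed_chain_write(uint8_t new_nt2b)
{
    for (int i = N_MODULES - 1; i > 0; i--) {
        uint8_t carry = (uint8_t)((seed_reg[i - 1] >> (SEED_BITS - 2)) & 3);
        seed_reg[i] = ((seed_reg[i] << 2) | carry) & SEED_MASK;
    }
    seed_reg[0] = ((seed_reg[0] << 2) | (new_nt2b & 3)) & SEED_MASK;
}

int main(void)
{
    const uint64_t batch[N_MODULES] = {
        0x0123456789ABCDull, 0x00FFEEDDCCBBAAull,
        0x0055AA55AA55AAull, 0x03C3C3C3C3C3C3ull };

    /* Load the whole batch: the seed written first travels furthest down
     * the chain, so push the seed intended for the last module first.
     * A full batch costs N_MODULES * 28 write pulses.                    */
    for (int m = N_MODULES - 1; m >= 0; m--)
        for (int nt = 27; nt >= 0; nt--)
            seed_chain_write((uint8_t)((batch[m] >> (2 * nt)) & 3));

    for (int i = 0; i < N_MODULES; i++)
        printf("module %d seed = %014llx\n", i,
               (unsigned long long)seed_reg[i]);
    return 0;
}
```

In this model, loading a full batch costs N x 28 write pulses; for 1,000 modules that is 28,000 pulses (the 7kbytes of seed register data discussed in section 2.1), which is negligible next to a full reference scan.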
60  2.10.2  Method 2  I solved the first problem by inserting a hardware FIFO buffer between the NIOS-II processor and the query generator window hardware called the text FIFO. This works in the same way as a software FIFO buffer and is implemented as a piece of dual-ported memory. The processor writes to the top of the hardware FIFO and the query generator reads from the bottom of it. The data arrives from the processor in bursts and is consumed at a steady rate by the query generator window hardware. I solved the second problem by replacing the PIO with a custom SOPC component which interfaces directly to the data bus of the NIOS-II processor in the SOPC controller. I also modified the design so that the reference sequence is buffered in the SDRAM first and then streamed into the query generator window by the processor when an alignment is actually being performed, rather than being streamed over the Ethernet interface. The bandwidth to the SDRAM is ∼533MB/s, much higher than the ∼10MB/s offered by 100Mbps Ethernet. There was still a problem with this: The write was being performed by a loop in the C program embedded software running on the NIOS-II processor. I determined by measurement the maximum data rate the NIOS-II processor could sustain was 2.4MB/s. This corresponded to a rate of consumption of ∼9.6Mnt/s by the query generator sliding window. This would be limiting as the alignment pipeline is capable of running substantially faster than 9.6MHz.  2.10.3  Method 3  I made 2 changes to version to produce version 3:  61  1. I inserted a DMA (Direct Memory Access) library component to perform the reading from the SDRAM and writing to the query generator. The DMA component is controlled by the processor and instructed to actually perform the memory transfer. The processor can continue to perform other tasks while the DMA controller carries out the transfer. 2. I modified the controller SOPC so it had a “high performance bus” on physically separate hardware than the regular processor bus. The DMA transfers data from SDRAM to the query generator window over this high performance bus, so as not to use up all the bandwidth on the regular processor bus. The high performance bus is run at a higher clock speed that the regular processor bus (133MHz rather than 50MHz). The reference sequence memory is still written to by the NIOS-II processor, as it isn’t particularly important that the reference genome be able to be updated quickly as it’s only done occasionally. This is explained in more detail in the chapter on the development of the SOPC. This had the desired effect of increasing the rate at which reference sequence data could be delivered to the query generator window fast enough that the alignment pipeline doesn’t stall.  2.11  System Components  The whole system will consist of the implementation of the hardware appliance itself along with supporting software for the host workstation. It will consist of the following subcomponents. The design and implementation of each of them is described in its own chapter.  62  • A SOPC controller design. • The alignment pipeline. • Embedded software (firmware). • An appliance control application for uploading reference sequences to the alignment hardware and receiving hit results from it. • A reference sequence ambiguity resolution application.  63  Chapter 3 SOPC Controller Development This chapter outlines the development of the controller for the alignment appliance. 
It’s been put in its own section as it turned out to be a large and elaborate part of the design. The controller described in this section runs embedded software written in C. That software is described in its own separate chapter. The alignment pipeline is also described in its own separate chapter. The controller, alignment pipeline and embedded software are tightly integrated components of the FPGA design and were designed alongside each other in practice.  3.1  What the SOPC Controller Does  Version one of the SOPC controller performed the following functions: • Interfaces to the DM9000A Ethernet adaptor on the DE2-70 board for communication with the host workstation. • Updates the contents of the SDRAM reference sequence buffer. • Writes to the seed registers and to the comparison module enable registers. 64  These additional functions were added for version two: • Maintains configuration information about the appliance. • Generates the control flags which control the rest of the FPGA design. • Reads out the multi-match hit word from the alignment pipeline. • Controls the DMA controller to instruct it to transfer the contents of the reference genome memory into the query generator sliding window.  3.2  Implementation Method  First Idea: The first idea I had to implement the controller was as a hardware state machine. I would implement the controller as a Verilog module and incorporate it into the larger FPGA design. The problem with this is that the controller is simply far too complicated to implement this way, but it could have been done in principle. Second Idea: The second method I came up with, and the one I actually used for the implementation, was to instantiate an embedded soft-processor in the FPGA. This would allow me to describe the behavior in a high level language rather than an ad-hoc state machine, simplifying the design task. I used the off-the-shelf “Nios-II” soft processor library component supplied by Altera with their development tools rather than implementing my own. I also used several other of Altera’s library components (detail below). The controller design also included a small amount of my own hand-written Verilog code, hand-written embedded software for the embedded processor, code from the C standard library and code from Altera’s C libraries supplied with their development tools. 65  3.3  The SOPC Builder  Altera’s “SOPC Builder”(Altera) is a development tool supplied by Altera which will auto-generate Verilog code for a whole system-on-programmable-chip design. The SOPC Builder comes with the NIOS-II soft processor and a library of components including memories, off-chip memory controllers and interfaces, adaptors to other interfaces, adaptors to peripherals, etc.. The user enters the design with a GUI (Graphical User Interface)-based tool. The tool will then automatically generate the Verilog code for the system the user specifies. This code can then be synthesized into an FPGA configuration binary. This procedure is shown in figure 3.1 on page 66. The components of the system are connected with Altera’s proprietary Avalon bus. The ports of components are either Master ports or Slave ports. Each port can only be connected to ports of the opposite type. Also, it’s permissible for one Master to connect to multiple Slaves and one Slave to connect to multiple Masters. When the user specifies this in the design the Verilog code which handles the arbitration at runtime is auto-generated. 
Note that data can flow from Slave ports to Master ports as well as Master ports to Slave ports, but only Master ports can initiate bus transactions. Using the SOPC builder allows the designer to model at a higher level of abstraction than using Verilog. This is useful because it allows faster development and produces a finished product which probably has fewer bugs than if the Verilog code was written from scratch by the programmer.  66  Figure 3.1: How the SOPC Builder is used.  67  Figure 3.2: Version one of the controller SOPC.  3.4  Design Version 1  Figure 3.2 on page 67 shows the first version of the SOPC I implemented. Not every component in the implementation is shown in this diagram, only the ones needed to explain the overall way in which the design works. The diagram showing the final version that was actually included in the implementation is shown in the section explaining version 2 of the design. Notice there is no SDRAM connected to the SOPC in this first version. Notice that the NIOS-II processor has two bus connections - one for its data bus and one for its instruction bus. This is because internally the processor uses the “Harvard architecture”(Har) design where the processor expects the program and data to reside in physically separate banks of memory. Having to have two separate banks of memory would be inconvenient, so I  68  connected both the instruction and data ports of the NIOS-II processor to the SSRAM, which contains the program and data in separate segments. This can be seen from the diagram. The data bus of this version of the SOPC runs at 25MHz. This clock frequency was chosen because the data bus connects to the DM9000A Ethernet interface custom SOPC component which has to be clocked at 25MHz as that is the speed the DM9000A Ethernet interface hardware operates at. This custom SOPC component was supplied by Terasic, the manufacturer of the DE2-70 board. I had to modify the code of this component slightly. The original component was designed for a 16-bit bus but I was using a 32-bit bus so I had to increase the bus width to 32 bits. Considering that I was modifying the code of the DM9000A Ethernet interface component anyway, I also made some other changes to it: • removed a small syntax error that was generating a warning (a superfluous comma), and • changed the port interface names and grouped them in a logical way. Instead of modifying the code of the component so it had a 32-bit bus, I could have written a 32-bit to 16-bit interface, but I thought it would be neater to modify the component itself and saw no reason not to do this as the code was supplied. This version of the controller streams the reference sequence incrementally from the workstation into the sliding window query generator. The NIOS-II writes to the text FIFO of the alignment pipeline by setting the 32-bit word it wishes to write with a PIO device, and then asserting the write signal of the text FIFO for one clock cycle to push the new 32-bit word of data to it. The 69  Figure 3.3: The method used to generate flags in version one of the controller. method of asserting a flag for one cycle I implemented is shown in figure 3.3 on page 69. When the NIOS-II processor changes the state of the PIO, then for one cycle the registered value is different to the unregistered value. This is detected by the ex-or gate which consequently asserts its output for one cycle. There were several problems with version 1: • The data bus ran at 25MHz. 
It had to be this speed because the DM9000A Ethernet controller device can only run at 25MHz. • The method of generating 1-cycle asserted flags is flaky. • The performance of the NIOS-II when copying data from the Ethernet interface chip to the text FIFO is too slow to keep up with the rate of data consumption by the query generator window (explained in section 2.10.3).  3.5  Design Version 2  Changes made from version one to version two include:  70  • A 64MB SDRAM memory bank was added to the SOPC to store the reference sequence on the board, rather than stream it from the workstation. • A second “high performance” data bus was created on physically separate hardware from the main data bus. • A DMA hardware component was incorporated to copy the reference sequence from the SDRAM into the query generator of the alignment pipeline. It uses the high performance data bus to perform the write. • The data bus clock frequency was increased from 25MHz to 50MHz. A clock crossing bridge to the DM9000A Ethernet interface running at 25MHz was used to allow this. • Custom SOPC components were written for writing to the text FIFO, writing to the seed registers and writing to the seed register enabling bits. Version 2 of the SOPC controller is shown in figure 3.4 on page 71. This is the final implementation of the SOPC controller that was actually incorporated into the final build of the FPGA design. Altera library components are shown in blue, hardware components on the DE2-70 board outside the FPGA are shown in yellow, custom SOPC components are shown in orange. The diagram in figure 3.4 doesn’t show all the components. Table 3.5 on page 73 lists the components which aren’t shown explicitly in the figure. All components listed in this table connect directly to the data bus. The results FIFO outputs words of 64-bits and the maximum allowed size of a PIO is 32-bits, so my design includes two PIO’s to read the top and bottom half of the 64-bit word of the results FIFO. 71  Figure 3.4: Version two of the controller SOPC.  72  There are 2 banks of memory: a 2MB SSRAM chip for storing the program and data composing the embedded software, and a 64MB 2-chip bank of SDRAM for storing the reference sequence. In version one, only the 2MB SSRAM memory was used for the embedded software as the reference sequence was streamed incrementally over the Ethernet interface, not stored on the SDRAM of the board. I could have implemented a separate SOPC for the DMA controller. Having 2 separate SOPC’s in a single FPGA design is possible with the development tools. I think this would have been a conceptually better design than incorporating the DMA component into the single SOPC in my design. However, I decided to just incorporate the DMA controller into the main SOPC in the final design because this allowed me to connect the DMA to the rest of the design with auto-generated Verilog code output by the SOPC Builder rather than with hand-written code. Auto-generated code is faster to produce and less likely to contain bugs. When configuring the address space of the SOPC, I used the address space locking feature of the SOPC Builder to prevent address space clashes.  3.6  Combining the two 32MB SDRAM Chips into a Single 64MB Memory Bank  The DE2-70 board contains two 32MB SDRAM chips, each separately connected to the FPGA in its own physically separate bank. I decided to connect these two chips together into a single 64MB bank of SDRAM with a single address space. 
The benefits of doing this were to:  73  Table 3.1: The SOPC controller components not shown on the diagram. Component JTAG UART  Purpose Sends debug information from the embedded software to the host workstation. System ID peripheral Needed for debug and programming of processor during development. Enable register writer com- Custom SOPC component needed to write to the enable ponent registers. Aligner control flag con- Custom SOPC component needed to drive the aligner troller component flag generator module. System clock timer compo- Needed by the NIOS-II processor to execute timer-based nent C library functions like msleep(). 14-bit PIO (input) Reads the fill level signal from the text FIFO (in number of words). 13-bit PIO (input) Reads the fill level signal from the results FIFO (in number of words). 32-bit PIO (input) Reads data from the configuration parameter module. 32-bit PIO (input) Reads data from the hit vector of the alignment pipeline. 32-bit PIO (input) Reads the current sequence position from the alignment pipeline. 32-bit PIO (input) Reads bits 63:32 of the top word of the results FIFO. 32-bit PIO (input) Reads bits 31:0 of the top word of the results FIFO. 32-bit PIO (output) Outputs a new 32-bit value to be written to the sequence position counter in the alignment pipeline. 8-bit PIO (output) Selects which configuration parameter is to be read in from the configuration parameter module. 8-bit PIO (output) Selects which 32-bit subrange of the hit vector we want to read from. 1-bit PIO (output) Enables/disables the stall-on-empty-text-FIFO feature of the alignment pipeline. 1-bit PIO (input) Detects that a multi match has occurred in the alignment pipeline. 1-bit PIO (input) Detects that the alignment pipeline has completed its run.  74  1. Create a larger bank of usable memory for the reference sequence SDRAM than could be created with a single chip. 2. Double the available memory bandwidth. Each of the two 32MB memory chips contained four banks internally. Each of these four banks contain 4Mwords of 16-bit words. The two chips could be combined in two ways: 1. as 16Mwords (8Mwords + 8Mwords) of 16-bit words, or 2. as 8Mwords of 32-bit (16-bit + 16-bit) words. I chose option two. This is because this doubled the available bandwidth compared to using a single chip, as the clock speed is the same as a single chip but a 32-bit word is transferred from/to the bank instead of a 16-bit word on every write/read. The capacity of the resulting memory bank is the same in both cases (64MB). The SDRAM controller used was the standard SDRAM controller hardware library component provided by Altera. It was configured with the timing information from the SDRAM data sheet. It was also configured to have a 32-bit bidirectional data bus which was split into 2 16-bit bidirectional data busses by hand-written Verilog code. The control signals generated by the SDRAM controller are duplicated by hand-written Verilog code. One of the two sets of the control signals and one of the two 16-bit halves of the 32-bit data bus are fed to each of the SDRAM chips. Figure 3.5 on page 75 shows this setup.  75  76 Figure 3.5: How a single SDRAM controller drives two SDRAM chips.  3.7  Clock Generation  The design uses multiple clocks. The DE2-70 board provides 4 x 50.0MHz clocks and one 28.6MHz clock. Clocks of the necessary frequency and phase relationships had to be derived from these clocks for my design. This section explains how this was done.  
3.7.1  Method Used to Generate the Clocks  Possible methods for generating clocks 1. Write custom Verilog code to produce clock dividers. 2. Use the dedicated on-chip PLL (Phase Locked Loop) hardware blocks in the FPGA. I used option two as this was a far more flexible and effective approach. The PLL blocks can be integrated into the design by either using the MegaWizard tool to configure and instantiate them, or as an integrated part of a SOPC. Both would produce a functionally equivalent design. I integrated the PLL blocks through the SOPC Builder as my design contained a SOPC anyway and this allowed me to use the auto-generated clock distribution code created by the SOPC Builder, rather than having to write my own. Each PLL takes one clock as input and produces up to three output clocks from the input clock. The user can specify the frequency of each of the three output clocks independently. The user can also specify the phase relationship between the three output clocks of a specific PLL. E.g. they can specify that the rising edge of clock one must occur, say, 5ns before the rising edge of clock 2. This is useful when communicating with off-chip memory, where the delay in 77  the transmission of data across the wires on the board needs to be taken into account. The clocks used by this design are 1. 133MHz SDRAM clock, 2. 133MHz SDRAM clock, phase shifted -3.0ns, 3. 50MHz processor clock, 4. 25MHz DM9000A Ethernet interface clock, 5. 167MHz SSRAM clock, 6. 167MHz SSRAM clock, phase shifted -1.76ns, 7. 80MHz aligner clock. Three of the four available PLL’s are used to generate these clocks. The grouping of which clocks are generated by which PLL are shown in the list above by the blank lines between the groupings The 133MHz bus to the SDRAM has to be shifted -3.0ns relative to the bus to the SDRAM controller to account for the delay on transmitting the signal along the board. A shift of -1.76ns has to be applied to the bus to the 167MHz SSRAM clock for the same reason. The main data bus is clocked at 50MHz and the DM9000A Ethernet controller at 25MHz. The third PLL generates just one clock: the clock to the alignment pipeline. This is clocked nominally at 80Mhz but can be increased depending on the build configuration. This is explained later on, see section 4.11 on page 105.  78  Figure 3.6: The text FIFO writer custom SOPC component.  3.8  Text FIFO Writer Custom SOPC component  Altera provide a development tool called the “Component Editor”(Altera) to allow the user to design their own components which can be included into SOPC designs with the SOPC Builder. I used this tool to produce the text FIFO writer custom SOPC component and the rest of the custom SOPC components used. The text FIFO writer custom SOPC component is shown in figure 3.6 on page 78. The Avalon bus side of the component connects to the data bus. The text FIFO side of the component connects to the write side of the text FIFO. The alignment pipeline reads from the read side of this text FIFO. The text FIFO writer custom SOPC component can be thought of as an adapter from the Avalon data bus to the text FIFO. Note that not every signal in the Avalon bus is shown in the diagram, only the ones that are actually used are shown. When the processor writes to the address of the text FIFO writer  79  Figure 3.7: The seed register writer custom SOPC component. component, the 32-bit value writedata has the data the processor wishes to write displayed on it for one cycle, and the write N signal is de-asserted for one cycle. 
This asserts the o push FIFO signal for one cycle, writing the 32-bit word to the top of the text FIFO. The clock signal from the data bus just passes straight through the component into the clock signal of the text FIFO.  3.9  The Seed Register Writer Custom SOPC Component  The explanation of how the seed registers are written to and the associated diagram is in section 2.9 on page 57. Figure 3.7 on page 79 show the seed register enable bits writer custom SOPC component. This component works in almost an identical way to the text FIFO writer SOPC component but instead of pushing a whole 32-bit word to the seed registers with every write, it only pushes 1 bit, bit 0 from the 32-bit word on the data bus.  80  Figure 3.8: The seed register enable bits writer custom SOPC component.  3.10  The Seed Register Enable Bit Writer Custom SOPC Component  Figure 3.8 on page 80 show the seed register enable bits writer custom SOPC component. It works the same way as the seed register writer custom SOPC component.  3.11  The Control Flag Generator Module  A flag is a pulse on a 1-bit wire which lasts for one cycle. They’re used for control purposes. My design uses four flags. They can be triggered from the SOPC. The PIO-based method I implemented first shown in figure 3.3 on page 3.3 could potentially produce erroneously generated flags. The alternative method described in this section is the one I actually used in the final version because it’s better engineered and works consistently well. The control flag generator module consists of two components: a custom SOPC component and a custom written Verilog module. Figure 3.9 on page 82 shows the aligner control flag generator connected to the Avalon data bus via  81  Table 3.2: The flags which the control flag generator module can generate. Flag Reset aligner flag Pop results FIFO flag Set new nt coutner flag Ignore multi match for one cycle flag  Purpose Resets the alignment pipeline.  ID number 1  Pops the top result from the results FIFO. 2 Store the new PIO-specified value into the 3 nt counter. Ignore the multi match signal from the 4 alignment pipeline for one cycle.  its custom SOPC component adaptor. Each flag has its own clock domain crossing module associated with it. These use the well known “register and XOR synchronizer” technique. These are necessary because the flags are generated relative to the processor clock but in the FPGA design they need to be relative to the hardware they control. For three of the flags they need to be relative to the aligner clock. The fourth flag needs to be relative to the results FIFO clock. Table 3.11 on page 81 shows the flags generated and what each of them does. When the embedded software wishes to assert a flag it writes the 8-bit flag ID number to the control flag generator SOPC component. If it matches one of the fours hard-coded flag ID numbers then that flag is asserted. The clock domain crossing hardware translates the flag into the appropriate clock domain for where it’s used. Notice that the Verilog module for actually deciding which flag to assert is outside the SOPC component in its own custom-written Verilog module. I could have included it in the Verilog code of the component as just one module. 82  83 Figure 3.9: The aligner control flags module.  I didn’t do this because having an adapter component to a custom-written Verilog makes it easier to change the custom-written code than if it’s integrated into a custom SOPC component.  
3.12  The Parameter Readout Module  The compile-time-defined configuration parameters for the appliance are used in three places: 1. In the Verilog custom hardware. 2. In the embedded software that runs on the NIOS-II processor in the SOPC. 3. In the software which runs on the host workstation. These parameters include the number of seed registers in the appliance and the size in bits of each seed register, among other things. The obvious solution would be to simply define the same constants in the three different places, once in each project which used them. One problem with this is that three projects might end up with different values for the constants, which wouldn’t be able to happen if the constants are defined in just one place. Another problem is that it makes it impossible to use a single version of the embedded software binary or a single version of the workstation application binary - the user has to recompile them if they want to change these constants. The solution I took was to define them in just one place: in the Verilog custom hardware of the alignment appliance. The embedded software reads the constants out from the custom hardware at run time. The workstation software then retrieves the values of the constants from the embedded software at run 84  Figure 3.10: The configuration parameter readout module. time. This way the constants are defined in just one place (the Verilog custom hardware code) and the embedded software and workstation software don’t have to be recompiled to change the values of these constants. The other approach I considered was to have the user specify the appliance parameters on the command line of the workstation application. While easier to implement, this makes using the workstation software more complicated so I didn’t implement this. Figure 3.10 on page 84 shows this module. The 2 red boxes represent constants defined in the code. The SOPC can select which of the constants to display on the output by setting the value of the multiplexer with its 8-bit output PIO. The displayed constant can then be read into the embedded software by the SOPC with the 32-bit PIO.  85  3.13  The Multi-Match Hit Vector Readout Module  The multi-match hit vector readout module is used to read out the N-bit hit vector from the alignment pipeline when a multi match occurs. The reason we can’t do this with an N-bit PIO is because PIO’s have a maximum width of 32 bits, N is usually much larger than 32 in practice (N is the number of seed registers in the alignment appliance). It works in a similar way to the parameter readout module. Figure 3.11 on page 86 shows the hit vector readout module when configured for N=100. Note that the code for breaking the N-bit work into 32-bit subwords is auto-generated so it doesn’t need to be modified by the user when they change the value of N. Note that if N is not a multiple of 32 then one of the subwords will not be 32-bits long.  3.14  Getting Data From the Results FIFO into the SOPC  The results FIFO contains results encoded into 64-bit words. The first idea I had was to write a custom SOPC component to pop words from the top of this FIFO when the processor read from it. The problem with this is that the data bus is 32 bits so can’t read a 64-bit word. The second idea I had, and the one I used in the final implementation, was to simply implement 2 32-bit PIO’s to read the upper and lower half of the 64-bit result word. 
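Wired up, the readout amounts to routing the two halves of the FIFO's 64-bit output word onto two 32-bit input PIOs. A sketch with illustrative names (one half carries the sequence position, the other the comparison module ID):

module result_readout_sketch (
    input  [63:0] results_fifo_q,   // word visible at the read side of the results FIFO
    output [31:0] pio_result_lo,    // read by the processor as one 32-bit PIO
    output [31:0] pio_result_hi     // read by the processor as the other 32-bit PIO
);
    assign pio_result_lo = results_fifo_q[31:0];
    assign pio_result_hi = results_fifo_q[63:32];
endmodule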
I configured the results FIFO to be in “show ahead” mode where the contents of the top of the FIFO can always be read. After reading off 86  Figure 3.11: The multi-match hit vector readout module.  87  a word from the results FIFO, the processor uses the control flag generator to pop the word that has already been read from the results FIFO.  88  Chapter 4 Alignment Pipeline Development This chapter describes the design and implementation of the alignment pipeline.  4.1  Overall Pipeline Design  Figure 4.1 on page 89 shows the high level design of the alignment pipeline. Only the components in green are really part of the alignment pipeline. The blue components are part of the system but not the alignment pipeline in particular. They are shown as it would be difficult to see how the pipeline operated without them. The reference sequence is contained in the 64MB bank of SDRAM. The text FIFO is a hardware FIFO of 16,384 x 32-bit words. The query generator of the alignment pipeline continuously pops words from the text FIFO to keep its sliding window full. It consumes 1nt (2 bits) per cycle, so pops the text FIFO on every 16th cycle. The embedded software in the SOPC continuously monitors the fill level of the text FIFO and instructs the DMA to transfer segments of the reference sequence from the SDRAM into the text FIFO to  89  90 Figure 4.1: High level design diagram of the alignment pipeline.  keep the text FIFO full. The 56-bit query sequence output by the sliding window query generator is compared concurrently to every sequence stored in each seed register in the seed comparator module. Typically there will be several hundred of these, depending on the build parameters. Each of these seed registers will have been loaded with a seed from the seed dataset by the embedded software in the SOPC before the alignment run starts. If the seed batch loaded into the alignment pipeline is smaller than the number of seed comparison modules, then unused seed comparison modules are disabled by the embedded software before the alignment is started. At most positions of the reference sequence, zero of the seed comparators will produce a hit. At some positions there will be a hit from exactly one seed comparison module and at some positions there will be hits from multiple seed comparison modules. The priority encoder takes the hit vector output from the seed comparator and decides whether 0, 1 or multiple seed comparator modules are asserting their outputs. If there are none, no action is taken. If there is exactly one hit, then the ID number of the seed register which is generating the hit and sequence position it occurred at is encoded into a 64-bit word and pushed into the results FIFO. The controller SOPC pops this word from the other side of the results FIFO and eventually returns it to the host workstation via the Ethernet interface. If multiple seeds match at a particular sequence position, the alignment pipeline stalls and waits for the embedded software to read off the hit vector to be processed in software. This is explained in more detail later.  91  4.2  Design Decision: Allowing Multiple Identical Seeds in a Batch  It’s possible for there to be multiple identical reads in a set of short DNA reads. It’s also possible for two non-identical reads to have the same seed generated from them. For example, if we use the default “take the 28bp prefix” read to seed conversion method, then any two seeds with the same first 28bp will have the same seed generated from them. 
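A toy example, with made-up reads, makes the point (the workstation software performs this truncation in Java):

class SeedCollisionExample {
    public static void main(String[] args) {
        // Two different 32bp reads that share their first 28bp...
        String read1 = "ACGTACGTACGTACGTACGTACGTACGT" + "TTAA";
        String read2 = "ACGTACGTACGTACGTACGTACGTACGT" + "GGCC";
        // ...yield identical seeds under the "take the 28bp prefix" rule.
        String seed1 = read1.substring(0, 28);
        String seed2 = read2.substring(0, 28);
        System.out.println(seed1.equals(seed2));  // prints "true"
    }
}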
For this reason, the aligner be able to produce correct results when we have hits from multiple different seeds at a particular position of the query sequence. This caused a problem for the first version of the design of the alignment pipeline because it had no multi-match detection and readout hardware - if 2 or more hits occurred in a particular cycle only one of them would be recorded in the results FIFO. The first solution I thought of would be for the workstation software to identify all identical seeds in a batch in advance. It would maintain a record of which reads in the input dataset corresponded to that particular seed. That seed would then only be loaded into one comparator module of the alignment appliance. This would ensure that there were no identical seeds loaded into the alignment appliance at the same time, making multi-matches impossible. When a hit was returned to the host workstation, the host workstation would look up the multiple reads that particular seed was generated from and generate a hit for each. I didn’t implement this solution. A potential problem with it is that the user would use the alignment appliance with a different host workstation application to the one I wrote and be unaware that the appliance’s results needed pre-processing and post-processing to be able to detect multi-matches. It would be a reasonable assumption for the user to make that the hit results 92  returned from the appliance to the host workstation were correct without any further processing. The results FIFO is 64 bits wide. This comprises a 32-bit unsigned integer indicating the sequence position and a 32-bit unsigned integer indicating the ID number of the comparison module which detected a hit. I considered making the results FIFO wide enough so that when a multi-match occurs the results FIFO could store the whole hit vector (typically hundreds of bits wide, depending on build parameters) and the embedded software would read in this huge vector when popping the results FIFO. The embedded software could then examine the read vector bit-by-bit in software to identify which comparison module(s) caused the hit. This would work in principle but there are some problems with it: • The results FIFO is constructed out of embedded on-chip memory blocks in the FPGA. If we made the results FIFO have a word length of hundreds of bits, this could easily use up or exceed this low-availability resource. • Writing code which auto-generated a FIFO of variable width was difficult. It couldn’t be done with any of the built in language features of Verilog. This would make the code hard to maintain and potentially buggy. • The memory blocks out of which the results FIFO is constructed are spread out through the FPGA. Joining a large number of them together to produce a very wide FIFO would be wasteful of the FPGA’s internal resources. • Multi-matches are rare compared to single matches. The bit-by-bit analysis of the hit vector is slow to perform in the embedded software and  93  it would be better if it could be avoided for single matches and only used for multi matches. The third solution I thought of, and the one I actually implemented, was to modify the priority encoder so that instead of outputting either of the two options “no hits” or “at least one hit” at each sequence position it instead output one of the three options “no hits”, “exactly one hit” or “2 or more hits”. When no hit occurred at a particular sequence position, no word was pushed to the results FIFO. 
When exactly one hit occurred at a particular sequence position, the word encoding the sequence position and comparison unit ID number which generated the hit were pushed to the results FIFO. When a multi-match occurred at a particular sequence position the alignment pipeline stalls and waits for the controller to read out the hit vector word to perform bit-by-bit analysis on it.  4.3  Detailed Pipeline Design  Figure 4.2 on page 94 shows the detailed design of the alignment pipeline. This is the same alignment pipeline shown in figure 4.1 on page 89 except with more of the implementation details exposed. Clocked components are shown in green. All other components can be considered to be combinatorial (stateless) logic. The seed comparison module is not actually stateless but can be considered to be. This is explained in section 4.6 on page 97. Control signals are shown in red and data signals in black. All the input and output signals shown on the diagram connect to the SOPC controller module of the FPGA design, with the exception of stall. Stall 94  95 Figure 4.2: Detailed design diagram of the alignment pipeline.  is only generated internally to the alignment pipeline and only used internally in the pipeline. This diagram is not complete, several connections from the alignment pipeline to the SOPC controller have been left off, however, it should be detailed enough for the reader to see how it works.  4.4  The Text FIFO  The text FIFO is a dual-ported memory which smoothes the uneven flow of data from the SDRAM into the query generator sliding window. It’s constructed from embedded memory blocks in the FPGA rather than from the general purpose logic elements of the FPGA because this is a more efficient way to use the FPGA’s resources. The DMA controller which writes to it and the query generator window which read from it run at different speeds so the read side and write side of the text FIFO are connected to different clocks. The words of the text FIFO are 32 bits wide (16 nucleotides). It has a capacity of 16,384 words. This capacity was determined experimentally to be sufficient to never run out while the alignment pipeline was running but to not be so large as to be wasteful of FPGA resources. The SOPC controller has a dedicated PIO for reading the number of words of the text FIFO used. It needs to be able to do this to calculate the number of words to instruct the DMA to transfer into the text FIFO to top it up.  4.5  The Query Generator  The job of the query generator is to read in the reference sequence from the text FIFO and generate a 28nt (56 bit) query at each position of the reference  96  sequence. This is the same length as a seed register (56 bits). Both these lengths can be changed by changing a configuration parameter in the source code. The query generator uses a hardware version of the sliding window query generator described in section 1.3.3 on page 23. Instead of storing the window contents in a fixed size array in memory it’s stored in a hardware register. The shift and insert operation is produced in O(1) time by dedicated shift register hardware. The query generator contains three registers: 1. an 88-bit (56 bits + 32 bits) “ window ” shift register containing the window contents, 2. a 4-bit “ nt ctr ” register which counts up to 15 then resets to 0. 3. a 32-bit “ seq pos ” register which counts down from the sequence length to 0 The window register has a fixed length 32-bit buffer which the next word of the reference sequence text is held in. 
It also has a 56-bit length segment which the query is shifted into whose length is set at compile time by the seed register length parameter. Initialization: The SOPC Controller writes the length of the reference sequence in nucleotides into the seq pos register using the dedicated hardware in the query generator module. Iteration: On every cycle the window shifts 1nt with its built in shift register hardware. The seq pos register is automatically decremented by 1 on each shift of the window. The nt ctr register is automatically decremented by 1. When the nt ctr window reaches 0, all 16nt in the 32-bit word from the text FIFO output have been shifted into the window register, so the text FIFO is popped and the nt ctr register is reset to 15. 97  Termination: When the seq pos register reaches 0. This means the whole reference sequence has been shifted through the window register and the alignment is complete. The query generator module takes the stall signal as input and stalls and does nothing whenever the stall signal is asserted. More detail on how the stall feature works is given in section 4.9 on page 103.  4.6  The Seed Comparison Module  The seed comparison module takes as input a 56-bit query signal and generates as output an N-bit hit vector signal, where N is the number of seed register comparison units in the seed comparison module. If the query signal matches the contents of comparison module n then bit n will be 1 in the hit vector output word. Otherwise that bit will be 0 in the hit vector output word. The seed comparator module connects to the custom SOPC components for writing to the seed registers and the seed register enable bits. The way these modules work is explained in section 3.9 on page 79. The clock used by the shift register which writes to these registers is a different one from alignment pipeline clock. One seed register and comparator is shown in figure 2.9 on page 56. Figure 4.3 on page 98 shown how N of them are connected together in the seed comparison module. The seed comparison module is initialized by the SOPC controller writing the seed register and enable bit contents. While the alignment is running the contents of these registers doesn’t change.  98  Figure 4.3: The internals of the seed comparison module. When the alignment pipeline stall signal is asserted the generation of results must be suppressed. The obvious way to do this would be in the seed comparison module by inserting a multiplexer which selects the N-bit word consisting of all 0’s as the output value for hit vector when the stall signal is asserted. This isn’t implemented as it’s more efficient to put in a condition on the write signal of the results FIFO than to put in an N-bit multiplexor. This can be seen from the detailed design diagram of the alignment pipeline (p. 94). We can change the type of comparison the alignment appliance performs by changing the comparison hardware in the comparison modules. For example, we could substitute the 2-of-4 comparator for a 3-of-6 comparator or exact matching comparator.  99  4.7  The Priority Encoder  A priority encoder is a standard component which takes as input an N-bit word, decode , and generates from this: 1. An n-bit encode signal which is the unsigned integer of the bit position of the input which is a 1. 2. A 1-bit valid signal which is 1 if one or more of the N 1-bit inputs is 1. For a priority encoder which takes a word of N bits as input, the encode signal must have length log2 N bits. 
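As a point of reference, a plain priority encoder of this kind can be written behaviourally in a few lines; the sketch below uses illustrative names and is not the recursive, structural construction actually used, which is described next:

module priority_encoder_sketch #(
    parameter N    = 8,
    parameter LOGN = 3               // ceil(log2(N))
) (
    input      [N-1:0]    decode,
    output reg [LOGN-1:0] encode,
    output                valid
);
    assign valid = |decode;          // 1 if one or more input bits are 1

    integer i;
    always @(*) begin
        encode = 0;
        // Scan from the highest bit down so the lowest set bit wins;
        // the final assignment (truncated to LOGN bits) is the answer.
        for (i = N - 1; i >= 0; i = i - 1)
            if (decode[i])
                encode = i;
    end
endmodule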
When more than one bit of the input word is 1, encode takes the bit position number of the smallest active bit. A priority encoder can be constructed recursively as shown in figure 4.4 on page 100. By default a priority encoder has just the valid signal which is asserted when one or more of the inputs is 1. For my design I’d ideally like to have two outputs: 1. exactly one match which is asserted when exactly one (not 0 or multiple) of the input bits is 1. This is for pushing a result to the results FIFO when there is exactly one match of a seed to the query. 2. multi match which is asserted when two or more of the input bits are 1. This is for stalling the alignment pipeline when two or more seeds match to the current query value and indicating to the SOPC controller that a multi match has occurred so it can read off the hit vector for bit-by-bit analysis.  100  Figure 4.4: A priority encoder constructed recursively. 101  I modified the priority encoder to also include a multi match signal which detects when 2 or more of the inputs are 1. This can be used to generate the “exactly one match” signal as ( valid AND (! multi match )). This modified priority encoder is written as a recursively defined Verilog module. The Verilog compiler expands it out to the full circuit diagram during compilation. It can only construct priority encoders whose input decode signal has a width that is a power of 2. If the programmer tries to instantiate a priority encoder whose input width is not a power of 2 (100, say) then the Verilog compiler will instantiate a priority encoder whose width is the smallest power of 2 larger than the specified size (128 in this example) and pad the unused part of the input with 0’s (there would be 28 0’s in this example).  4.8  The Results FIFO  The results FIFO combines the encode value from the priority encoder with the query pos signal into a single 64-bit integer which encodes the hit. The results FIFO is written to if there is exactly one match from the seed comparison module i.e. if the valid signal is asserted but the multi match signal is not asserted. If the alignment pipeline is being stalled then this also prevents the results FIFO being written to. The logic which orchestrates this can be seen in the detailed design diagram of the alignment pipeline (figure 4.2, p. 94). The write signal of the results FIFO is only asserted if these conditions are satisfied. The other side of the results FIFO connects to the SOPC controller so it can read the hit results out of it and return them to the host workstation. This is explained in section 3.14 on page 85.  102  Figure 4.5: My modified priority encoder constructed recursively with the extra multi match signal. Note that “ multi ” is used as short for “ multi match ” to save space on the diagram.  103  Table 4.1: The reasons that the alignment pipeline can stall. Stall condition The text FIFO is empty.  Reason Popping the empty text FIFO would insert bad data into the query generator so we stall until the controller SOPC tops it up. The results FIFO is full. Pushing to the already-full results FIFO would cause hit data to be lost so we stall until the controller SOPC pops data from the results FIFO and creates space in it. The alignment pipeline has If the alignment pipeline kept running after the end of completed its run. the reference sequence it might generate spurious hits. A multi match has occurred. 
If a multi match occurs the alignment pipeline must stall until the controller SOPC can read out the hit vector and process it to identify the seed registers which caused the multi match. If the alignment pipeline continued to run the hit vector would change before this process could be completed, and the information would be lost.  4.9  Computing the Stall Signal  The alignment pipeline can stall. If the alignment pipeline is stalled on a particular cycle then the alignment pipeline does no useful work on that cycle. The more often the alignment pipeline stalls the lower its performance. However, the situation can arise where it’s desirable for the pipeline to stall because otherwise it would produce incorrect results. The four situations which cause the alignment pipeline to stall and the corresponding reason why it stalls in each of the four situations is given in table 4.1 on page 103. The effects caused by the stall signal being asserted are given in table 4.2 on page 104. The logic which generates the stall signal and causes the stalling effects can be seen in the detailed design diagram of the alignment pipeline (figure 4.2, p. 94).  104  Table 4.2: The effects of a stall event in the alignment pipeline. Effect Reason for effect Popping the text FIFO is To prevent the text FIFO being popped when query gendisabled. erator was not registering its output, corrupting the reference sequence by skipping a segment of it. Pushing the results FIFO is To prevent the same hit being pushed to the results FIFO disabled. multiple times if the write signal was asserted when a stall occurred. All the query generator reg- To prevent corruption of the alignment pipeline state. isters are frozen. All the pipeline registers are To prevent corruption of the alignment pipeline state. frozen.  4.10  Alignment Pipeline Global Reset  The alignment pipeline has a global reset signal which can be asserted by the controller SOPC. The mechanism used to do this has been explained in section 3.11 on page 80. The global reset signal is used by the controller SOPC to reset the alignment pipeline into a ready-to-use state before beginning the alignment process. The effects of the alignment pipeline global reset signal are to: • Clear the contents of the text FIFO. • Clear the contents of the results FIFO. • Reset all the query generator registers and all the pipeline registers. The alignment pipeline global reset signal doesn’t reset the seed registers or seed register enable bits to a default value, it leaves them unchanged. It would have been possible and desirable for the alignment pipeline global reset signal to do this. The reason I didn’t is because the seed registers and the 105  enable bits between them occupy a very large percentage of the FPGA’s registers (∼90% or more). The associated logic circuitry to allow them to be reset would be large and wasteful of the available resources in the FPGA. As a result the seed registers and seed register enable bits must be explicitly written to by the SOPC controller to set them to a default value. This isn’t a huge problem in practice, it just makes the method by which the alignment pipeline is reset more complicated. It also takes longer, but this doesn’t cause a problem in practice as the total time taken to reset the alignment pipeline is still tiny compared to the total time taken for the alignment pipeline to perform an alignment.  
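Pulling together the conditions of table 4.1 and the effects of table 4.2, the stall logic reduces to something like the following sketch (illustrative names; the real logic is part of the detailed design in figure 4.2):

module stall_logic_sketch (
    input  text_fifo_empty,     // popping now would feed garbage into the window
    input  results_fifo_full,   // pushing now would lose a hit
    input  run_complete,        // the end of the reference sequence has been reached
    input  multi_match,         // wait for the SOPC to read out the hit vector
    input  want_pop,            // query generator wants the next 32-bit word
    input  exactly_one_match,   // valid & !multi_match from the priority encoder
    output stall,
    output text_fifo_pop,
    output results_fifo_write
);
    assign stall = text_fifo_empty | results_fifo_full
                 | run_complete    | multi_match;

    // Popping and pushing are suppressed during a stall; the query generator
    // and pipe registers are frozen by gating their enables with ~stall
    // in the same way (not shown).
    assign text_fifo_pop      = want_pop          & ~stall;
    assign results_fifo_write = exactly_one_match & ~stall;
endmodule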
4.11  Increasing Clock Frequency with Pipelining  The performance of the alignment pipeline is directly proportional to its clock speed so ideally we want to have the clock speed as high as possible. Pipelining(Hennessy, Patterson, and Goldberg) is a technique used to increase the clock frequency at which a processor design will operate. The processor pipeline is broken into multiple separate pipe stages with registers between them. The clock frequency of a processor pipeline (or in this design, the alignment pipeline) is determined by the delay of its slowest pipe stage. The delay of a pipe stage is determined by the length of time it takes for the logic to change state in that pipe stage. The original alignment pipeline shown in figure 4.2 on page 94 consists of just one pipe stage. I broke the pipeline into three stages. The pipelined version of the alignment pipeline is shown in figure 4.6 on page 107. The divisions 106  between the pipe stages are shown by the brown dotted lines. Where the brown dotted lines cross signals that are pipelined, the dotted boxes around the intersection points represent the pipe registers that are inserted. The original one-stage alignment pipeline operated at a maximum clock frequency of approximately 40.0MHz. After the addition of the second pipe stage this maximum frequency was increased to 89.92 MHz. The addition of the third pipe stage increased the maximum clock frequency to 128.25 MHz. These clock frequency figures were obtained when building the appliance FPGA binary with 64 x 56-bit (28nt) seed registers. There would have been no point in breaking the alignment pipeline into more than three stages as there wouldn’t have been any substantial further speedup. The control signals such as stall were not pipelined, only the data signals.  107  108 Figure 4.6: The pipelined version of the alignment pipeline detailed design.  Chapter 5 Embedded Software Development This section describes the embedded software which was written for the Nios-II embedded processor in the SOPC controller of the alignment appliance. The embedded software is designed to run on the NIOS-II soft-processor, a library hardware component supplied by Altera. The processor which runs this embedded software is incorporated into the SOPC Controller, described in chapter 3 of this document.  5.1  What the Embedded Software Does  The embedded software performs the following functions: • Controls the DM9000A Ethernet interface chip for communication between the alignment appliance and the host workstation. • Reads configuration data from the alignment appliance hardware and uploads it to the host workstation software. • Writes the reference sequence text into the SDRAM after receiving it over the Ethernet link from the host. 109  • Tops up the text FIFO of the alignment pipeline with the reference sequence data by DMA. • Reads results from the results FIFO. • Returns packets of hit results to the host workstation over the Ethernet interface. • Reads the hit vector from the alignment pipeline when a multi-match occurs.  5.2  Overall Design  Choice of Language: I wrote the embedded software in C. The only other choice would have been assembler. Choice of Development Tools: I used Altera’s “Nios-II Software Build Tools for Eclipse” development environment. This is an Eclipse based IDE and C compiler supplied by Altera for use with the Nios-II processor. 
Choice of Operating System: It would have been possible to run an operating system on the Nios-II processor and then run my embedded software as an application running within that operating system. I decided not to do this and to instead run my embedded software on the NIOS-II processor without an operating system. I used the HAL (Hardware Abstraction Layer) provided by Altera. This is simply a set of auto-generated C and assembler functions which are statically linked to the user’s embedded software code at compile time. It can be thought of as a very lightweight operating system. A more feature-rich operating system may have provided functionality such as TCP/IP but this turned out not to be necessary anyway as the simple Ethernet packet 110  Table 5.1: The states the alignment appliance embedded software can be in. Name IDLE  Description The appliance is waiting for a command from the host workstation. The appliance is updating the reference text in the SDRAM as it receives a stream of packets from the host workstation containing a new reference sequence. The appliance is updating the seed batch buffer in main memory from the new seed data it’s receiving over the Ethernet interface from the host workstation. The appliance is performing an alignment.  RECEIVING REFERENCE TEXT  RECEIVING SEEDS  PERFORMING ALIGNMENT  acknowledgement method was sufficient flow control to make the implementation work. A TCP/IP stream between the appliance and workstation would have provided higher bandwidth but this wouldn’t have made much difference to the implementation in practice.  5.3  States and State Transitions  The alignment appliance is in one of four states at any one time. These four states are shown in table 5.1 on page 110. When the aligner powers up it automatically enters the “IDLE” state. Figure 5.1 on page 111 shows the allowed state transitions.  5.4  The Main Loop  The Nios-II processor supports interrupts but I decided not to use them. Instead all of the I/O devices are polled. Setting up and using interrupts 111  Figure 5.1: The allowed state transitions of the alignment appliance embedded software controller.  112  correctly make the code more complex to write and test without adding any functionality. At power up the program performs the initialization section once then enters an infinite loop performing the loop contents.  5.4.1  Initialization  The following things are done when the embedded software begins execution: • The appliance configuration parameters are read out of the alignment appliance hardware and stored into variables in the embedded software. • The DM9000A Ethernet interface is initialized. • Global variables are initialized. • The DMA device is initialized. • The performance timer is initialized.  5.4.2  Loop Contents  On each iteration of the main loop all packets waiting in the DM9000A Ethernet interface’s receive buffer are read out and processed. If the alignment appliance is in the PERFORMING ALIGNMENT state then the following additional things are done on every iteration of the main loop: • The text FIFO is topped up. • Check if the whole reference sequence has already been pushed to the text FIFO. if it has then disable the “stall on empty text FIFO” stall condition so that the aligner can run right to the end of the reference sequence.  113  • All the results waiting in the results FIFO are popped from it and processed. • If a multi match has occured then read out the hit vector from the alignment appliance and process it. 
• Check if the alignment pipeline has reported that the alignment run has finished. If it has then take appropriate action.  5.4.3  Shut Down  When the application exits the main loop it shuts down the DMA device, frees dynamically allocated memory and then stops execution. In practice this should never actually happen as the embedded software is designed to be an integral part of the alignment appliance that is running the main loop continuously whenever the appliance is powered up.  5.5  The Bit Manipulation Functions  A set of functions were written to carry out bit-level operations on unsigned integers. These are used in packet processing, I/O device communication and several other places in the embedded software. I wrote functions to carry out the following: • Extracting the value of a particular bit position from 8-bit, 16-bit and 32-bit words. • Inverting a boolean value stored in an 8-bit word. • Extracting an 8-bit subword from a 64-bit word. 114  • Extracting a 32-bit subword from a 64-bit word. • Extracting a 16-bit subword from a 32-bit word. • Extracting an 8-bit subword from a 32-bit word. • Extracting an 8-bit subword from a 16-bit word. • Concatenating four 8-bit subword from a 32-bit word. • Concatenating two 8-bit subword from a 16-bit word. • Concatenating two 16-bit subwords into a 32-bit word. • Concatenating two 32-bit subwords into a 64-bit word. • Padding a 16-bit word into a 32-bit word. • Padding an 8-bit word into a 16-bit word or a 32-bit word. • Swapping over the two 8-bit words in a 16-bit word.  5.5.1  Other Support Functions Written  A small number of other support functions needed by other parts of the code were written: • Compute the maximum of two integers. • Compute the minimum of two integers. • Perform integer division but round up instead of down. • Format a raw MAC address for human readable display.  115  5.6  Retrieving Configuration Parameters from the Alignment Appliance Hardware  This is done with the hardware already described in section 3.12 on page 83. The embedded software simply uses the PIO’s to select and read off each of the configuration parameters.  5.7  Initialization of Seed Registers and Comparison Module Enable Registers  As explained earlier, the seed registers and comparison module enable bits are not reset by the global reset signal in the alignment pipeline because the hardware to implement this would consume a lot of resources. Instead the embedded software has functions for writing default “blank” contents to these registers.  5.8  Design Decision: How the Reference Genome is Contained in the SDRAM Text Buffer  Figure 5.2 on page 117 shows the layout of the reference genome in the SDRAM. The memory bank is 32-bits wide (2 16-bit chips in parallel). Each byte has four nucleotides packed into it with 2b/nt encoding from right to left. The bytes of the sequence are numbered as shown in the diagram. This layout is created by software in the NIOS-II processor as it parses reference sequence 116  frames and writes them into the SDRAM. With this layout the DMA controller can shift the reference sequence from SDRAM into the sliding window a whole 32-bit (16nt) word at a time without any manipulation of the data. The NIOS-II processor can read from the SDRAM even though it only needs to write to it to update the reference genome - only the DMA controller actually needs to read from it. 
I made use of the fact the processor can read from the SDRAM by writing a function to perform debug output of the first and last 10 words of the SDRAM contents. This was very useful during development.  5.9  Other functions performed by the Embedded Software  Other functions performed by the embedded software include: • Buffering the seeds before they are loaded into the seed registers of the alignment pipeline. • Writing the currently buffered seed batch into the seed registers. • Writing to the seed enable registers. • Reading and writing to the PIO’s. • Writing to the seed registers. • Controlling the DMA device. • Generating flags with the aligner control flags generator module. • Initializing the seq pos register of the alignment pipeline. 117  Figure 5.2: The layout of the reference genome in the SDRAM of the appliance.  118  • Displaying the buffered seed data for debug purposes. • Starting and stopping the alignment process. • Reading results from the results FIFO. • Detecting and reading out a multi match from the alignment pipeline. • Hardware detection of when aligner has completed run. • Error message generation functions.  5.10  Developement of Driver Software for the DM9000A Ethernet Interface  5.10.1  Ethernet Interface Configuration  Choice of data rate: The DM9000A supports 10Mbps and 100Mbps data rates. The Ethernet interface on the workstation supports both 10Mbps and 100Mbps data rates. I used the 100Mbps data rate. I first implemented embedded software which used the 10Mbps mode as this was easier to write and then improved it to using 100Mbps. Choice of half-duplex mode or full-duplex mode: Both the DM9000A Ethernet controller and workstation Ethernet interface support both half-duplex and full-duplex. I implemented full-duplex. Auto-negotiation: I disabled the DM9000A’s auto-negotiation feature so I could configure it manually.  119  Ethernet frame padding: I enabled the Ethernet frame padding feature, so frames shorter than the minimum allowed length of 64 bytes would be padded to 64 bytes. Automatic CRC Checking: I enabled the automatic checking of Ethernet CRC’s (Cyclic Redundancy Check). This is the Error-detection scheme used by Ethernet. Automatic CRC Generation: I enabled the DM9000A’s CRC auto-generation feature. Ethernet Flow Control: I disabled both half-duplex flow control (aka. “backpressure”) and full-duplex flow control (aka. “802.3x”). The DM9000A supports both. Interrupts: The DM9000A can signal an interrupt to the processor when an Ethernet frame arrives. I chose not to use this functionality and instead wrote the embedded software so that the processor checks the DM9000A for packets periodically and reads them out of the DM9000A’s internal packet buffer when they arrive.  5.10.2  Method to Read from the DM9000A Ethernet Frame Receive Buffer  The processor queries the registers of the DM9000A to discover if there are any Ethernet frames buffered in its internal frame buffer. If there are each is read out and processed in turn. First the Ethernet frame is copied into the processor’s SSRAM, then it’s processed and then deleted from the SSRAM. After that the processor reads out the next frame from the DM9000A and so on until all the frames waiting in the DM9000A have been read out and processed.  120  The DM9000A performs checks on the packet and sets state registers accordingly. These checks identify things like whether the packet has a bad CRC, whether it’s a “runt packet” (i.e. smaller than allowed by the Ethernet specification). 
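In C, the packing itself is a few lines of shifting and masking. The sketch below is illustrative: the function names are not the ones in the embedded code, the A/C/G/T bit values are an assumed encoding, and the exact nucleotide-to-bit ordering within a word follows figure 5.2 rather than anything shown here:

#include <stdint.h>

/* One possible 2-bit encoding: A=00, C=01, G=10, T=11 (assumed). */
static uint8_t encode_nt(char nt)
{
    switch (nt) {
    case 'A': return 0x0;
    case 'C': return 0x1;
    case 'G': return 0x2;
    case 'T': return 0x3;
    default:  return 0x0;   /* ambiguities are resolved before upload */
    }
}

/* Pack 16 nucleotides into one 32-bit SDRAM word, filling each byte from
 * right to left, so the word can be DMA'd into the text FIFO unchanged. */
static uint32_t pack_16nt(const char *seq)
{
    uint32_t word = 0;
    for (int i = 0; i < 16; i++)
        word |= (uint32_t)encode_nt(seq[i]) << (2 * i);
    return word;
}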
If the packet does violate any of these requirements the embedded software deletes the packet from the buffer without attempting to process it. Further checks are performed: • If the “frame type” field from the Ethernet header is bad then the frame is deleted. • If the source MAC address is not that of the host workstation then the frame is deleted. • If the “magic number” of the frame is not the one expected from the host workstation then the packet is deleted. If the frame has passed all those then it’s frame identifier type is read out and the frame’s data is passed to the corresponding processing function for that frame type. There are nine frame types. These are listed in table 5.2 on page 121.  5.10.3  Method to Send an Ethernet Frame with the DM9000A Ethernet Interface Controller  Sending a packet is much more straightforward than receiving one. Each packet type is buffered in an array of bytes in the SSRAM of the processor’s data memory. I wrote a function which will transfer one of them to the frame transmission buffer of the DM9000A. When the program wants to send a packet 121  Table 5.2: The nine frame types that can be sent from the host workstation to the alignment appliance. Name FRAME NUMBER CTRL WS INFO PACKET FRAME NUMBER CTRL GO TO IDLE STATE FRAME NUMBER CTRL START ALIGNMENT FRAME NUMBER CTRL START RECEIVING TEXT FRAME NUMBER CTRL START RECEIVING SEED BATCH FRAME NUMBER CTRL FINISH RECEIVING SEED BATCH FRAME NUMBER CTRL FINISH RECEIVING TEXT FRAME NUMBER DATA TEXT FRAME NUMBER DATA SEEDS  122  Description Contains information about the workstation and requests information from the appliance. Instructs the appliance to transition into the IDLE state. Instructs the appliance to start an alignment. Instructs the appliance to start receiving a new reference sequence. Instructs the appliance to start receiving a new seed batch. Marks the end of a stream of packets containing a seed batch. Marks the end of a stream of packets containing a reference sequence. A packet containing reference sequence data. A packet containing seed data.  it simply passes a pointer to the array containing the byte sequence of that function to the function I wrote which copies it to the DM9000A’s transmission buffer.  123  Chapter 6 Development of Appliance Control Application for the Workstation This chapter describes the design and implementation of the appliance control application for the host workstation of the alignment appliance. The workstation software was developed in Java 1.6.0 with the Eclipse 3.4.0 development environment.  6.1  What the Workstation Software Does  The workstation software is used 1. to load a reference genome into the appliance’s SDRAM text buffer, and 2. to initiate an alignment and collect the hit results.  124  6.2  Choice of Library for Sending and Receiving Ethernet Frames  My first idea was to use the built in network code of the Java standard API to send and receive packets to the alignment appliance. However this turned out not to be possible because the core Java API doesn’t give access to low-level network data. I decided to use the Jpcap library(Jpc) for packet processing. Jpcap provides an easy-to-use wrapper for the WinPcap(Win) library. Jpcap’s calls to the WinPcap library do the actual hardware-level network processing.  6.3  Threads  The application has two threads: the main thread started by the “main” method of the application and the packet receiver thread. The packet receiver thread is responsible for handling packet reception. 
The main thread is responsible for packet transmission and the rest of the application.  6.4  Type of Packets Used  The first version used IP packets rather than Ethernet frames for communication between the appliance and the workstation. This is because it was easier to write the code for IP packets than Ethernet frames. Using Ethernet frames is more difficult as it isn’t as well supported by Jpcap. I found a messageboard posting online from someone who’d worked out how to do it. Using Ethernet frames instead of IP packets simplified the parsing code in the embedded software. Using raw Ethernet packets saved the 18-bytes per frame  125  header that IP packets require.  6.5  Flow Control  The packets sent between the host workstation and alignment appliance were initially given sequence numbers. I removed these as the packet loss on the dedicated Ethernet segment was zero anyway. I considered implementing TCP but this would have been a waste of effort as a much simpler method of flow control could be used. When sending a large number of packets, such as when the host workstation is sending the reference sequence to the appliance, each packet is simply acknowledged by the receiver before the next is sent.  6.6  Receiving Ethernet Frames  The packet receiver runs in its own thread, separate from the rest of the application. The packet receiver thread communicates with the main thread via shared variables. Received packets are discarded if any of the following applies to them: • the packet is not an Ethernet packet, • the packet is a broadcast packet, • the packet address didn’t match the address of the host workstation, • the packet wasn’t sent from the alignment appliance. The type of the received Ethernet frame is identified from it’s frame ID number. The packet data is read into an array and passed to the appropriate parsing function for that type of Ethernet frame. The four types of frames are 126  “acknowledgement” frames, “appliance information” frames, “hit results” frames and “alignment run completion” frames.  6.7  Parsing the Command Line Arguments  Command line argument parsing code was written using the String manipulation methods from Java’s core API. If the command line arguments contain “-h”, “-help”, “--h” or “--help” the built-in help text is displayed and execution is aborted. The command line arguments are parsed to determine • whether the user wants to upload a reference sequence or run an alignment, • what the MAC address of the alignment appliance is, • what the MAC address of the Ethernet adapter of the host workstation is which is attached to the alignment appliance, • whether the application should display debug output, • the path to the reference genome FASTA file if uploading a reference genome, • the path to the seed dataset ELAND file and hit results output file if performing an alignment. Code was written to display these command line settings to the user once they’d been parsed out of the command line.  127  6.8  Writing Unsigned Integer Classes for Java  The reference sequence and reads are stored on the host workstation in unsigned integer types using 2-bits-per nucleotide encoding. Java doesn’t support unsigned integer types natively. One option would have been to develop the workstation application in a language which does support unsigned integer types natively, such as C++. 
I didn’t do this because I’d have to abandon the code I’d already written in Java and because using C++ instead of Java would increase the time and effort required to produce the application. The solution I used was to emulate unsigned integer types using a standard technique: simply store an n-bit unsigned integer in the n least significant bits of a 2n-bit native Java signed integer and mask off the unused bits of the result after performing arithmetic or logical operations on them. To hide this implementation technique from the main body of the program and integrate the unsigned integers into the language I wrote a Java class to represent each unsigned integer type used in the program. I called the 8-bit unsigned integer class ubyte and used short (Java’s native 16-bit signed integer) for its’ underlying representation. I called the 32-bit unsigned integer class uword and used long (Java’s 64-bit signed integer) for its’ underlying representation. I wrote 8-bit and 32-bit unsigned types because these were needed in the main program. I could have written a 16-bit unsigned type but didn’t spend the time doing so because it wasn’t needed in the main body of the program anyway. Creating an unsigned integer longer than 32-bits would have required a different implementation technique because the longest unsigned integer provided natively by Java is 64-bits. These weren’t needed anyway. The advantage of using these emulated unsigned types rather than Java’s  128  native signed types is that the program is easier and faster to read, write and modify. There is an associated performance penalty for using these emulated unsigned integers rather than Java’s native signed integers: • Speed: Performing the bit-masking operations every time an operation is performed on an emulated unsigned integer incurs a run time penalty. • Memory usage: Using a data type whose actual length in bits is twice the length of the represented type doubles memory usage. Memory usage is also increased by the fact that objects have a larger representation at run time than native types because Java stores hidden bookkeeping information for objects that isn’t needed for native data types. The following methods were written for the two classes because they were needed in the main body of the program: • Adding/subtracting a uint / ubyte to/from a uint / ubyte . • Testing equality of a uint / ubyte with a uint / ubyte . • Comparing the size of a uint / ubyte to a uint / ubyte (done the standard way by extending java.lang.Comparable ). • For uint : extracting/inserting a ubyte from/into each of the 4 abutting ubyte positions. • Get the value in the “next size up” native Java signed integer. • Get the value reinterpreted as the same size native Java signed integer. • A constructor to take the value from the “next size up” Java signed integer  129  • A constructor to reinterpret the bit pattern of the same size Java signed integer as an unsigned integer. • A constructor which takes no arguments and initializes the new unsigned integer to 0. The first version of the constructors threw an error if the initialization value was too large for the unsigned integer. In the final version the constructors did not throw an error but simply masked off the top half of the word if an initialization value was too large. 
I made this change as I believe if a user passes in an initialization value that is too large that they will be doing this intentionally and the behavior that they would want would be for the top half of the initialization value to be masked off. Testing: uint and ubyte were first developed and tested in a separate Eclipse project. The code was fairly simple so to test them I simply wrote a small number of test cases. I ensured each line of code was invoked by at least one test case. I put uint and ubyte into an unsigned integers Java package and wrote a simple test harness as the main part of the Eclipse project. After I’d finished testing I then copied the unsigned integers package into my main Eclipse project so I could import the two unsigned integer classes where they were needed in the main program. uint and ubyte weren’t used initially in the Java application. After they were introduced, all places where unsigned words had been represented with native Java signed integers were instead represented with the new unsigned integer classes.  130  6.9  Retrieving Configuration Parameters from the Alignment Appliance  Regardless of whether the application is being used to upload a reference sequence to the alignment appliance or to perform an alignment with the alignment appliance, the application needs to have the configuration parameters from the appliance. The first step in the execution of the application is to retrieve these parameters. First the host workstation sends a “workstation information frame” to the appliance at the MAC address specified on the command line by the user. The appliance responds with an “appliance information frame” which contains • The number of seed registers in the alignment appliance. • The length in nucleotides of each seed register in the alignment appliance. • The maximum reference sequence length (in nt) that can be stored by the SDRAM in the alignment appliance. If a response is not received within a specified time (3s default) the workstation software assumes the MAC address specified by the user is incorrect or the alignment appliance isn’t working correctly and aborts execution.  131  6.10  How a Reference Sequence is Loaded into the Alignment Appliance  6.10.1  Loading the Reference Genome from Disk to Memory  The whole reference sequence is first loaded into memory and then transmitted as a sequence of packets to the alignment appliance. The reference sequence is packed into a sequence of ubyte s with 2b/nt encoding. These ubyte s are then stored in a java.util.Vector , a variable-sized array.  6.10.2  Transmitting the Reference Genome from Memory to the SDRAM on the DE2-70 Board  The application fills a sequence of Ethernet packets with reference sequence data according to a standard the embedded software in the alignment appliance knows how to interpret. These packets are sent to the alignment appliance and individually acknowledged. The alignment appliance writes them to its SDRAM buffer and updates its state variables accordingly.  6.11  How Reads are Uploaded into the Alignment Appliance  6.11.1  Loading the Reads from Disk to Memory  Reads shorter than 28bp in length are simply discarded. Reads longer than 28bp are truncated to their first 28bp. This is the standard read-to-seed conversion 132  method used by MAQ and other aligners. Each 28bp seed is converted into 2b/nt encoding and stored in a 7- ubyte array. These 7- ubyte arrays, each corresponding to a seed, are loaded into a variable-sized java.util.ArrayList vector. 
If the seed dataset is too large to fit into the number of seed registers available in the alignment appliance then it’s simply broken into multiple batches and multiple passes of the reference sequence are performed.  6.11.2  Packaging Seeds into Packets  Each seed packet contains a variable number of seeds from 0 to a maximum of 128. The value 128 was chosen because it is a power of 2 and 7*128=896, but 7*256=1,792 ¿ 1,504 is larger than the maximum data payload that can fit into an Ethernet packet. All the seed registers must be of the same length. This length of the seed registers can only be changed by recompiling the source code. Each seed frame is acknowledged by the embedded software when it’s received.  6.12  Packet Transmission and Reception While in Operation  Figure 6.1 on page 133 shows the pattern of transmission of frames between host workstation and alignment appliance while the alignment appliance is in operation. When the application runs the host workstation sends a “workstation information” frame to the appliance. The appliance then responds with an appliance information frame. If the user has configured the application to upload a reference genome then the new reference genome is uploaded to the  133  Figure 6.1: The transmission and reception of frames between the alignment appliance and host workstation.  134  appliance as a sequence of packets. If the user has configured the application to perform an alignment, then the seed dataset is loaded into the appliance a batch at a time and aligned. Each batch is loaded as a sequence of multiple packets in a similar way to the reference sequence. The workstation then sends a “start alignment” packet to the appliance to instruct it to start aligning. The appliance continuously sends hit results back to the host workstation as they are generated. Once the alignment pass is complete and all results are returned the alignment appliance sends an acknowledgement frame to the host workstation. This whole procedure is repeated for each batch until all the seeds in the dataset have been aligned.  135  Chapter 7 Development of an Application to Resolve Ambiguities in Reference Genome Files 7.1  Introduction to Ambiguity Resolution in Reference Genomes  NCBI provides its reference genome DNA sequences as FASTA files. Other formats are available but FASTA is established as the standard format for storing and distributing finished reference genomes. Usually the finished DNA sequence for each chromosome is packaged into a single FASTA file which internally consists of a sequence of separate contigs. Reference genomes are approximate: DNA sequencing errors, limitations of assembly algorithms and limitations of sequence finishing methods cause differences between the reconstructed reference genome sequence and the actual genome sequence.  136  A FASTA file consists of one or more DNA sequences. Each DNA sequence starts with a header line which may contain the name of the DNA sequence and a variable-length sequence of codes. FASTA files are easy for humans to read and write and for programs to parse and generate. They use more space than binary formats which store DNA sequences using 2-bit/nt encoding (as each code is represented by an 8-bit ASCII character). However, FASTA files are used because a finished reference genome in FASTA format is typically only a few gigabytes in size (approximately 3.0GB for the human genome). 
The additional factor of 4 saving in space usage of a binary reference genome format wouldn’t make the storage and transmission of reference DNA sequences much easier and wouldn’t be as easy for a human to read and write as a FASTA file. It would also be a little harder for programs to parse and generate. If short read sequence aligners use some other format for the input reference genome file (e.g. MAQ uses the .bfa “binary FASTA” format) then they include a software tool to convert a FASTA file to their non-standard reference genome file format as part of the software package (MAQ includes fasta2bfa to do this). Alternatively they may just take the FASTA files as input and convert to the alternative format they require internally. This is the approach taken by the FASTA parser I’ve written which converts the reference sequence into 2-bit/nt encoding before uploading it to the SDRAM of the alignment appliance. Each position of the reference sequence in the FASTA file is assigned a “code”. As you would expect, FASTA files support codes for each of the four bases (A, C, G, T). However, they also support “ambiguous codes” - characters which represent sequence positions which the author/generator of the file could not represent as a single base. Fasta supports an ambiguous code for every 137  possible pair (e.g. R represents G or A), every possible triple (e.g. B represents G, T or C) and the quadruple (e.g. N represents any of the four bases A, C, G or T). Before I decided how to deal with the ambiguous character codes in reference genome files I investigated whether or not they were actually used. If they weren’t used then I wouldn’t have to solve the problem of how to interpret ambiguous sequence. While they are a valid part of the FASTA format, I thought it may be possible that groups who generate reference genomes deliberately eschew their use so that users of the reference genomes don’t have to decide how to interpret these ambiguous characters. This turned out not to be the case. I downloaded a contig of human chromosome 1 and parsed it to look for ambiguous character codes with a small Java program I wrote for the purpose. The file contained a sequence of exactly 100,000 “N” (any of the 4 bases) codes. Evidently the ambiguous characters codes do appear in reference genome FASTA files, so I’d have to decide how to deal with them.  7.2  Design and Implementation of an Ambiguity Resolution Application  After discussions with two colleagues (S. Jones and N. Thiessen) we produced a list of four possible ways to interpret ambiguous codes in reference sequence files: 1. Delete ambiguous codes method: Simply delete the ambiguous character codes from the reference sequence. 2. Random substitution method: Substitute each ambiguous code for 138  one of the non-ambiguous bases it can represent using a uniform distribution. For example, the code N in the reference sequence would be substituted by either A, C, G or T with probability 1/4 of each of the four possible substitutions. Pairs and triples would be handled similarly. This method is used by the BWA aligner and probably the MAQ aligner. 3. Automatic mismatch method: An ambiguous characters in the reference is considered to mismatch with any character in a read. 4. Probability-based interpretation of the reference: Probability based handling of ambiguous characters in the reference genome. Methods 1 and 2 resolve ambiguities by constructing an alternative non-ambiguous sequence. 
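To make the random substitution method concrete, here is a minimal Java sketch. The class and method names are illustrative; a full resolver also covers the remaining IUPAC codes, lower-case input and the FASTA formatting rules discussed later:

import java.util.Random;

class AmbiguityResolutionSketch {
    // Candidate bases for a few of the IUPAC codes (R = G or A,
    // B = G, T or C, N = any of the four bases).
    static String candidates(char code) {
        switch (code) {
            case 'R': return "GA";
            case 'B': return "GTC";
            case 'N': return "ACGT";
            default:  return String.valueOf(code);  // A, C, G, T pass through
        }
    }

    // Substitute an ambiguous code by one of its candidates, chosen uniformly.
    static char resolve(char code, Random rng) {
        String c = candidates(code);
        return c.charAt(rng.nextInt(c.length()));
    }

    public static void main(String[] args) {
        Random rng = new Random(42L);  // a fixed seed makes runs repeatable
        System.out.println(resolve('N', rng));
        System.out.println(resolve('R', rng));
    }
}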
I first implemented the delete ambiguous codes method. This was the easiest to implement and, once up and running, I could then develop the code further to use a better method. This method would have been sufficient if reference genome files didn't contain ambiguous codes. Because reference genome files do contain ambiguous codes I decided to instead implement the random substitution method. This is the method used by the BWA aligner.

There were two options for how the ambiguity resolution could be applied:

1. as a separate application, or

2. integrated into the reference genome loader of the appliance application control program.

I chose option 1 because it allowed me to bypass the ambiguity resolution of other aligners. This would be necessary during testing to ensure both aligners used were aligning to exactly the same reference sequence. This is explained in section 7.2 on page 140.

I considered using the existing reference genome ambiguity resolution code in MAQ to save implementation time and effort. This turned out not to be possible as the ambiguity resolution code for MAQ is integrated into the binary and could not be run separately, so I implemented my own ambiguity resolution application from scratch. The first version of the ambiguity resolution code was integrated into the FASTA parser of the reference genome loader I wrote, and the final version was written as a free-standing application. Having the ambiguity resolver code as a separate free-standing application increases the complexity of the software package a little, but this isn't a problem.

Compatibility with BWA and MAQ's ambiguity resolution: The underlying method used by the ambiguity resolver I've written is the same as the method used by the ambiguity resolvers integrated into the FASTA parsers of MAQ and BWA. However, my ambiguity resolver doesn't produce identical results, as implementation-level details such as the random number generator used are different. It would have been possible to make the results identical, but it would have been time-consuming and difficult, as obtaining detailed information about how MAQ and BWA resolve ambiguities could only be done by reading their source code. A potential problem caused by this is that when performing correctness testing of my aligner by validating against results produced by, say, MAQ, MAQ will produce different results because it would have resolved the ambiguities in the same ambiguous reference sequence to a different non-ambiguous reference sequence. To prevent this potential problem, when I perform correctness testing I will always run my ambiguity resolution application on the raw reference sequence to produce a non-ambiguous reference sequence which both my aligner and the known-good aligner (MAQ, say) take as input. This bypasses any ambiguity resolution in the known-good aligner and makes sure that both aligners are aligning to an identical non-ambiguous sequence. The incorrect way of doing correctness testing is shown in figure 7.1 on page 141. The right way of doing it is shown in figure 7.2 on page 142.
The way shown in figure 7.1 would work if the two aligners had identical code for producing non-ambiguous reference sequences from ambiguous reference sequences, but they don't.

Figure 7.1: Correctness testing the wrong way. Notice that the two aligners are aligning to different reference sequences because their ambiguity resolvers work differently.

Figure 7.2: Correctness testing the right way. Notice that the two aligners are aligning to identical reference sequences.

Repeatability: When the ambiguity resolver is run twice on the same FASTA input file, the output files produced by the two different runs are always identical. Having this property was a deliberate design decision. A situation requiring it could potentially occur when I carry out correctness testing: if I disambiguated the reference genome, ran my aligner to produce results and deleted the disambiguated reference genome, I would need to be able to regenerate an identical disambiguated reference genome later to run the other aligner on. My first idea was to allow the user to specify the initialization seed for the random number generator on the command line, however I never actually implemented this. Instead, the seed for the random number generator is set as a constant in the code. It's simply initialized to an arbitrary constant Java long. This ensures that for a given input file the ambiguity resolver will always generate exactly the same output file.

Speed: Resolving the ambiguities in a FASTA file of n characters requires O(n) time and O(1) space. For realistic-sized reference sequences the runtime is fast enough for a user to use it interactively on a standard workstation.

Additional Features: Along with the ambiguities being resolved, additional processing is performed by the ambiguity resolution application:

• The multiple sequences in the input file are concatenated into a single sequence in the output file.

• The output file is formatted with exactly 80 characters per line, except possibly the final line, because the number of codes in the FASTA file may not be an exact multiple of 80. This is recommended but not required by the FASTA format(FAS).

• As required by the FASTA format description, lower case letter codes are accepted and processed the same way as their corresponding upper case codes(FAS).

• All character codes generated in the output FASTA file are upper case.

Random number generator: The requirement I had was to be able to produce uniformly distributed random variables with 2, 3 and 4 outcomes. These are used to disambiguate the 2-way, 3-way and 4-way ambiguous character codes respectively. The random number generator used in the implementation is the standard one from the Java library (java.util.Random).

7.3 Software Engineering Issues

I implemented the application in Java 1.6, the most recent version of Java at the time of implementation. No library code other than the Java standard library was incorporated into the implementation. I used the Eclipse 3.4.0 development environment(Ecl).

Figure 7.3: Internal structure of the ambiguity resolution Java application.

Figure 7.3 on page 144 shows the modules and interfaces in the ambiguity resolution application implementation. The application is composed of three modules, each implemented as a Java class:

1. A FASTA file parser. It parses an input FASTA file and outputs the result as a sequence of calls to the FASTA File Parser Output Interface I wrote.

2. An ambiguity resolver. It takes the sequence of calls to the FASTA File Parser Output Interface, performs ambiguity resolution with the random-substitution method and outputs the results as a sequence of calls to the FASTA File Parser Output Interface.

3. A FASTA file writer.

I also implemented two Java interfaces:

• A FASTA File Parser Output Interface. If a class implements this interface, it can take input from the FASTA file parser I wrote through a sequence of calls to that interface. It includes a method for each code in the FASTA standard, a method for a header line, a method for a blank line, a method for invalid codes and a method for gap characters.

• A FASTA File Writer Interface. The FASTA file writer module I wrote implements this interface and produces an output FASTA file from a sequence of calls to this interface. It includes a method to generate each character code, a method to generate a sequence header line and a method to generate the gap character.
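To illustrate the push-style dataflow just described, a parser output interface along these lines might look as follows. The method names here are hypothetical; the actual interface declares one method per FASTA code rather than a single generic code(char) method.

```java
// Illustrative sketch of a push-style parser output interface.
// Hypothetical names; the real interface has one method per FASTA code.
public interface FastaParserOutput {

    // Called once for each '>' sequence header line.
    void headerLine(String header);

    // Called for each blank line in the input file.
    void blankLine();

    // Called once per sequence code (A, C, G, T or an ambiguous code).
    void code(char code);

    // Called for each gap character.
    void gapCharacter();

    // Called when an unrecognised character is encountered.
    void invalidCode(char code);
}
```

A module such as the ambiguity resolver can then be driven by the parser as a sequence of calls, which matches the module structure shown in figure 7.3.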
7.4 Testing the Ambiguity Resolution Application

Things I specifically wanted to test:

1. If there are multiple sequence header lines in the input file, only 1 (the first) appears in the output file.

2. The sequence data in the output file is presented at 80 codes per line, except the last line, which may be shorter.

3. Each non-ambiguous code passes through the ambiguity resolution application unaltered.

4. Each ambiguous character code is replaced by one of the non-ambiguous codes that it's meant to be replaced by and not some other character.

5. Whether the application produces identical output when run on the same input file twice (it should). Even though the base substitution is random it should still be repeatable (explained in the text earlier).

I hand-wrote a small FASTA file, ambiguity resolver test input FASTA file.FASTA, with the following properties:

• It contains long stretches of non-ambiguous characters.

• It contains at least one of every ambiguous character.

• It contains 2 sequence headers.

I ran the ambiguity resolution application on this test file twice to produce two separate output files. Tests 1-4 were carried out by visually inspecting the contents of one of the output FASTA files with a text editor. Test 5 was carried out by comparing the two output FASTA files visually with a text editor. All 5 of the tests were passed. Because of the code-by-code way the ambiguity resolution application works, I think it's safe to infer that because the ambiguity resolver worked correctly on a small test FASTA file of 362 codes, it will work on a FASTA file of any size.

Chapter 8
Correctness Testing

The purpose of correctness testing is to determine if the alignment appliance produces correct hit results. In this section I explain how I demonstrated this. The build of the FPGA binary I used for correctness testing was configured to have 512 seed registers, each of length 28nt (56 bits). The alignment pipeline was clocked at 80.0MHz (i.e. 80.0Mnt/s), but could have been clocked at 100.0MHz without causing timing problems.

8.1 Choice of Test Reference Sequence

I downloaded the first contig of human chromosome 1 from NCBI. I decided that this was an acceptable reference sequence for testing because it was real genome sequence and included a substantial number of ambiguous character codes. I discarded all but the first 1Mbp of this sequence. I decided that 1Mbp was long enough for a realistic test, as a sequence of this size should still have the properties of whole human reference sequences.
8.2 Choice of Test Seed Dataset

The test seed dataset consists of 500 × 28bp synthetic reads in an ELAND file. The reads are generated by a Java application I wrote specifically for this purpose. The application scans a 28bp sliding window along the reference sequence. At each position a random number generator is used to decide whether or not the subsequence at that position should be selected from the reference sequence as a synthetic read. I configured the random number generator to select a read at approximately one in every 2,000 sequence positions of the reference sequence. Over the 1Mbp reference sequence this would be expected to result in the generation of approximately 1,000,000/2,000 = 500 synthetic reads. When I ran it the actual generated set of synthetic reads was 516 reads in size. Figure 8.1 on page 149 shows how the test reference sequence and test seed dataset were generated.

Figure 8.1: How the Test Datasets were Generated.

8.3 Generation of Correct Hit Results

I needed to generate a dataset of correct hit results for the test reference sequence and test seed dataset to validate the hit results generated by my hit finder against. The first solution I thought of was to use an existing aligner to produce correct results, then validate the results produced by the hit finder of my appliance against those results. The problem with this is that the existing aligners are usually packaged in such a way that you can't use just the hit finding phase without the seed extension phase. One possible solution to this would be to modify the source code of an existing aligner to output raw hits with no seed extension. I considered doing this for the MAQ aligner but after looking at the source code decided it would be too difficult.

The solution I finally decided on was to write a whole aligner in Java purely to produce correct results. The aligner could be very slow as long as it produced correct results. I reused the FASTA file parser code I wrote for the ambiguity resolution application for the correctness testing hit finder. The correctness testing hit finder stores the reference sequence internally as an array of characters. Each seed is stored in a java.lang.String. Queries are generated with the sliding-window method and seeds are compared to the reference sequence using the Java library's built-in string operations. It uses the 2-of-4 subsequence matching method so should generate identical results to the hit finder of my alignment appliance. No performance optimizations are made, but this doesn't matter as it only has to be fast enough to align ∼500 seeds to a 1Mbp reference once.

It wasn't possible to have the synthetic read generator application output the correct hit results as it generated the reads, because each read could potentially align to more positions than just the one it was originally taken from, especially since we are using inexact matching.

The criterion I used to decide whether or not the alignment appliance hit finder was working was whether or not it generates identical hit results to the correctness testing aligner. A potential problem with this is that if they both have the same bugs they could both generate the same wrong hit results. In practice this is extremely unlikely to happen as the implementations were written at different times, share no code and run on different types of hardware. Also, the correctness testing aligner I wrote was extremely simple, so it's unlikely to contain any bugs.
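To make the hit criterion concrete, the core of a correctness-testing hit finder of this kind can be sketched as below. This is an illustration of the 2-of-4 rule (a window is a hit for a seed if at least two of the seed's four 7nt subsequences match the window exactly) with hypothetical names; it is not the actual test code, which also reads the ELAND file and writes out the hit results.

```java
import java.util.List;

// Simplified sketch of a 2-of-4 sliding-window hit finder for correctness testing.
public class TwoOfFourHitFinder {

    private static final int SEED_LENGTH = 28;   // nt
    private static final int SUBSEQ_LENGTH = 7;  // 4 subsequences of 7nt each

    // True if at least 2 of the 4 subsequences of the seed match the
    // reference window starting at 'pos' exactly.
    static boolean isHit(char[] reference, int pos, String seed) {
        int matchingSubseqs = 0;
        for (int s = 0; s < 4; s++) {
            boolean subseqMatches = true;
            for (int i = 0; i < SUBSEQ_LENGTH; i++) {
                int offset = s * SUBSEQ_LENGTH + i;
                if (reference[pos + offset] != seed.charAt(offset)) {
                    subseqMatches = false;
                    break;
                }
            }
            if (subseqMatches) {
                matchingSubseqs++;
            }
        }
        return matchingSubseqs >= 2;
    }

    // Slides a 28nt window along the reference and reports every seed hit.
    static void findHits(char[] reference, List<String> seeds) {
        for (int pos = 0; pos + SEED_LENGTH <= reference.length; pos++) {
            for (String seed : seeds) {
                if (isHit(reference, pos, seed)) {
                    System.out.println(seed + " hits at position " + pos);
                }
            }
        }
    }
}
```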
Figure 8.2 on page 152 shows how the two hit results datasets were generated from the test datasets: the first from the alignment appliance and the second from the correctness testing aligner.

8.4 How the Two Sets of Hit Results are Demonstrated to be Identical

We need to be able to determine if the two sets of hit results are identical, as this is how we decide whether or not the alignment appliance is working correctly. I ran the alignment appliance on the test reference sequence and test seed dataset as shown in figure 8.2. I also ran the correctness testing hit finder on the same two datasets. Both produced a hit results dataset which contained 10,971 hit results. I copied the hit results from the alignment appliance and the hit results from the correctness testing hit finder into a spreadsheet. I set the spreadsheet up to report whether or not there were any differences between the two sets of results. There were no differences, so I consider the aligner to have worked correctly.

Figure 8.2: Producing hit finding results for correctness testing.

Chapter 9
Performance Measurement

The purpose of performance measurement is to produce "wall-clock time" performance measurements of the FPGA alignment appliance hardware.

9.1 Configuration of FPGA Binary Used for Performance Measurement

The build of the FPGA binary used for performance measurement had 768 × 28nt (56-bit) seed registers. The alignment pipeline was clocked at 100.0MHz but could have been clocked at up to 111.46MHz without causing timing problems.

9.2 Choice of Reference Sequence

Ideally we want to perform performance testing with a whole human genome as the reference sequence. This is because the finished product is intended to be used with a human reference sequence, so that would be a realistic test. This wasn't possible as the human reference genome of ∼3.0Gbp requires ∼750MB of reference sequence buffer space, and there is only 64MB of SDRAM on the board. Instead I used just human chromosome 1, the longest human chromosome. The length of the particular version of human chromosome 1 I used was 226,212,984nt. I used my ambiguity resolution application to resolve all ambiguities in it. This occupies 53.9MB and so comfortably fits into the 64.0MB of SDRAM on the DE2-70 board.

9.3 Choice of Read Dataset

I decided to use a read dataset that was the size of a single batch of reads, so 768 reads in this case. These were obtained from a subset of a real short DNA read dataset. The dataset was originally used for methylation analysis. All reads shorter than 28bp were removed from this dataset. All reads containing the "." ambiguous character were also removed. Then all reads other than the first 768 were removed.

9.4 Method of Measuring Performance

The run time of the aligner was measured using the "system clock timer" component provided by Altera. Code was written so that the embedded software started the timer when the alignment started and stopped the timer when the alignment finished. The amount of time elapsed is then simply read off and displayed on the debug output.

9.5 Performance Measurement Values

The 768 seeds were aligned to the 226.2Mbp reference sequence in 2.93 seconds. During the alignment, exactly one seed register matched the reference sequence at 137,730 sequence positions and multiple seed registers matched the reference sequence at 184 sequence positions.
The hit results generated were returned to the host workstation in a total of 1,381 results frames. The performance test was run multiple times and the run time differed by less than 0.01 seconds between runs.

Running at 100.0MHz, the absolute fastest the alignment pipeline can process the 226.2Mnt reference sequence is 226.2/100.0 = 2.26 seconds. The efficiency of the alignment pipeline is therefore (2.26/2.93) × 100% = 77%. The reduced performance is a result of the pipeline stalling, which occurs for the reasons given in section 4.1 on page 103.

9.6 Extrapolation of Performance Values to a Human-Size Reference Sequence

Our test reference sequence had length only 226.2Mbp, rather than the 3.0Gbp of the human reference genome, due to there being only 64MB of SDRAM on the DE2-70 board. Had sufficient memory been available we could have run the full human reference sequence. This is (3.0 × 10^9)/(226.2 × 10^6) = 13.3 times longer, so the run would have taken 13.3 times longer to complete. We can therefore assume the run for the whole human reference genome would have taken 13.3 × 2.93 = 39.0 seconds to align the 768-seed dataset. This corresponds to an alignment rate of 71,000 seeds per hour.

Chapter 10
Conclusion and Future Work

10.1 Comparison of FPGA-Based Hit Finder and Microprocessor-Based Hit Finders

10.1.1 Hardware Cost

FPGA-Based Hit Finder
The performance of the FPGA-based hit finder was measured in chapter 9 to be 71,000 seeds per hour. Using a per-FPGA cost of $200, this corresponds to a hardware cost of 71,000/200 = 355 (seeds/hour) per dollar.

Microprocessor Sliding Window-Based Hit Finder
Using the same alignment policy (first 28bp of the read used as seed, 2 mismatches allowed by the hit finder), MAQ can align 0.27M reads per hour to a human-sized reference on an estimated $4,000 server. These figures were obtained from (Langmead, Trapnell, Pop, and Salzberg). This corresponds to a hardware cost of 0.27M/4,000 = 67.5 (seeds/hour) per dollar. Note that this also includes the time to perform seed extension, whereas the figure for the FPGA hit finder is for hit finding only. However, seed extension is very fast compared to hit finding, so we can assume that virtually the whole runtime of MAQ is for performing hit finding.

Microprocessor BWT-Based Hit Finder
Using figures from (Langmead, Trapnell, Pop, and Salzberg), Bowtie can align 28.8M reads per hour to a human-sized reference on an estimated $4,000 server. This is with the default 28bp seed, 2-mismatch hit finding policy. We make the same assumption that virtually zero time is spent in hit extension. The corresponding hardware cost is 28.8M/4,000 = 7,200 (seeds/hour) per dollar.

10.1.2 Inexact Matching

Using the "n-of-2n" mismatch method described in section 1.3.3 (p. 30) with a hash table based software aligner multiplies the runtime of the aligner by the binomial coefficient $\binom{2n}{n}$. This quickly leads to enormous, unmanageable runtime requirements as n grows.
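To make the growth rate concrete, the first few values of this multiplier are:

\[
\binom{2}{1}=2,\quad \binom{4}{2}=6,\quad \binom{6}{3}=20,\quad \binom{8}{4}=70,\quad \binom{10}{5}=252,\quad \binom{12}{6}=924.
\]

For the 2-mismatch (n = 2) policy the factor is only 6, but by n = 6 the software approach already pays a factor of nearly a thousand.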
With the FPGA hit finder, the performance is proportional to the number of comparison units we can instantiate in the FPGA. The FPGA has a finite capacity; suppose we call this C and measure it in number of logic gates. If each comparison module requires g logic gates, then we can fit approximately C/g comparison modules into an FPGA. So doubling the resource requirements of a comparison module halves the performance of the hit finder.

A comparison module (see fig. 2.9, p. 56) consists of a 56-bit register, 56 1-bit comparators to check equality of each bit of the register with each bit of the query, and the logic to generate a hit signal from the outputs of the 56 1-bit comparators. When performing exact matching, the resource requirements for a comparison module are: 56 × 1-bit registers, 56 × 1-bit comparators and 27 AND gates. When performing 1-mismatch-of-2 matching, the resource requirements for a comparison module are: 56 × 1-bit registers, 56 × 1-bit comparators, 26 AND gates and 1 OR gate. When performing 2-mismatch-of-4 matching, the resource requirements for a comparison module are: 56 × 1-bit registers, 56 × 1-bit comparators, 25 AND gates and 6 OR gates. In general the resource requirements for the n-of-2n method are: 56 × 1-bit registers, 56 × 1-bit comparators, (27 − n) AND gates and $\binom{2n}{n}$ OR gates.

Notice that the number of OR gates does increase as $\binom{2n}{n}$, so asymptotically the resource requirements are the same as for the multi-hash-table method. However, because the OR gate requirement is very small compared to the rest of the resources required for each comparison module, the value of n can get quite large before the resource requirements are a problem. For the hash table method we can only use 2 or 3 mismatches before the number of hash tables required becomes enormous. Usually in the hit finder implementation, the registers are the limiting resource in the FPGA, not the logic.

10.2 Demonstration that the FPGA Hit Finder Produces Identical Results to MAQ

The FPGA-based hit finder uses the same 2-of-4 hit finding technique with a sliding window query generator that MAQ and ELAND use. I didn't explicitly test whether the hit results generated by my aligner are identical to the hit results generated by MAQ, as I cannot run just the hit finder of MAQ on its own. However, when I performed correctness testing of my aligner, I showed it produced identical results to the testing aligner I wrote, which also uses the same 2-of-4 hit finding method. I think it's safe to assume that the FPGA hit finder produces identical results to MAQ's hit finder. This is a desirable property for a hit finder to have because it means this new hit finder can be used as a drop-in replacement for MAQ's hit finder.

10.3 Ideas for Future Work

• Instead of storing a seed in each seed register, we could store a hash of the seed. We compute the hash of each seed on the workstation as we load the seeds into the seed registers. At each position of the reference genome we compute the hash of the query in hardware before we broadcast it to the comparison modules. This is the idea the Rabin-Karp algorithm(Karp)(Cormen) is based on. We can't use this method with the n-of-2n mismatch method, only the exact matching method, unless we carefully design our hashing function to work with the n-of-2n method. A hash function which may work well with this method is one based on "spaced seeds", an idea introduced by the PatternHunter search tool(Ma, Tromp, and Li)(Li,
Ma, Kisman, and Tromp).
It might be possible to build a data structure in an FPGA representing a segment of a reference sequence and then scanning the seed against it. If we had multiple FPGA’s on a board each indexing a different part of the reference sequence, a seed could be broadcast to all the chips on the board concurrently and compared against the whole reference sequence. • Use the fact that the FPGA consists of reconfigurable hardware to improve the hit finders performance. In the current implementation we compare a query sequence in parallel to hundreds of writable seed registers. Instead of comparing the query to writable seed registers we could compare it to fixed, constant logic. If each seed we were searching for was represented as a fixed constant the space in the FPGA would be used much more efficiently. This would allow more seeds to be stored in the FPGA at once, leading to improved performance. • Investigate using FPGA’s in the implementation of other kinds of bioinformatics algorithms such as protogenomic mapping, BLAST-type algorithms and others. These two examples are similar enough to short  161  read alignment that this work could be used as a starting point but different enough that a different approach would be required. • Investigate implementing the BWT-based short read alignment algorithm of Bowtie(Li and Durbin) and BWA(Li and Durbin) in an FPGA. This algorithm is fundamentally different to the sliding window hit finding algorithm so would require a completely different approach to be implemented in an FPGA. One possible approach would be to implement a soft-core processor in the FPGA with specially-designed functional units used in the BWT-based alignment algorithm. • We could investigate build an ASIC (Application Specific Integrated Circuit) version of this design. This would be based on actual custom logic chips rather than emulated custom logic in FPGA’s. An ASIC offers approximately 10 times the chip capacity and runs at approximately 10 times the clock speed as the same priced FPGA. The problem is that they can only be manufactured economically in huge quantities. It would be interesting to find out exactly what quantities those are and whether there is enough demand to make such a product feasible.  10.4  Future Work: Handling Ambiguous Characters in the Reference Genome in Hardware  In my implementation I wrote an application to disambiguate ambiguous reference sequences into non-ambiguous reference sequences which was  162  Figure 10.1: Hardware for Comparing a 4-bit Encoded Nucleotide with a 2-bit Encoded Nucleotide. described in chapter 7. An alternative way to deal with ambiguous reference sequences would be to preserve the ambiguous data at run time. This idea was originated by Dr. N. Malhis and developed by both of us. The method is to simply represent each position of the reference sequence as a 4-bit word rather than a 2-bit word. Each of the 4 bits corresponds to one of A, C, G or T. To represent an ambiguous character we can set the bits corresponding to which if the four bases it can represent. so “ambiguous A or T” would be represented by “1001”. We can then compare a 4-bit encoded ambiguous nucleotide from our query window with a 2-bit encoded non-ambiguous nucleotide from a seed register with the hardware shown in figure 10.1 on page 162. There are some drawbacks:  163  • The reference sequence now takes up twice the space in the SDRAM (as using 4b/nt rather than 2b/nt). 
There are some drawbacks:

• The reference sequence now takes up twice the space in the SDRAM (as it uses 4b/nt rather than 2b/nt).

• The query sequence is now twice the length, using up more routing resources in the FPGA for the broadcast bus.

• This comparator is larger than the 2b/nt-to-2b/nt comparator, making the comparison module larger and reducing the performance of the alignment appliance because we have fewer comparison modules.

I didn't implement this scheme due to lack of time, but it would have been interesting to do so. We could potentially also construct 4b/nt seed registers instead of our current 2b/nt ones. This would halve the number of seed registers in the appliance, halving performance. The comparison logic would also be larger, slower and more complex. There isn't much point in doing this, as ambiguous characters in seeds are not very useful anyway. Extending this approach even further, we could store actual probability values in hardware, although this would result in large, slow hardware for little or no real benefit.

10.5 Improvements that Could be Made to the Implementation

• Making it so that the alignment appliance can be run without a programming workstation. The FPGA configuration binary would have to be uploaded to the onboard flash of the DE2-70 board, and the FPGA would be configured to load this binary automatically and configure itself at startup. Because the FPGA contains embedded software, this software would also have to be stored in the DE2-70's onboard flash, and a boot loader would have to be written to load the program code into memory at power up.

• It would be possible to have seed registers in a single system that were of different lengths to each other. This was not implemented but easily could be; it would make the implementation slightly more complex. Some software aligners, such as ELAND, have the restriction that all the seeds loaded into them must have the same length.

• A hit extension stage could be added to the alignment appliance I wrote. A possibility would be to "bolt on" a seed extension phase to the workstation application I've already written, by finding an aligner which allows the user to invoke just the seed extension phase of the program. The seed extension phase could alternatively run on the embedded soft processor in the aligner controller, or on custom dedicated hardware in the FPGA.

10.6 Performance Scaling with Larger Chips

Since 1965 the number of transistors that can be fabricated on a chip of fixed size and cost has approximately doubled every two years. This phenomenon is known as Moore's Law(Moore et al.).

Until recently (around 2004) microprocessor designs became increasingly more complex with every iteration of Moore's Law to make use of the larger number of transistors that could be fabricated on a chip. Roughly speaking, every time the size of the chip doubled, the performance of the processor that could be fabricated on it doubled. Now, with every iteration of Moore's Law, the number of whole processor cores that are fabricated on a chip is roughly doubled. The number of transistors in each processor doesn't substantially increase, as processors are now so complex that more transistors won't substantially increase their performance.

Before microprocessors became multi-core, it was easy to take advantage of Moore's Law: new processors would run existing code faster. When microprocessors became multi-core it became harder to take advantage of newer, faster processors, as software has to be rewritten to take advantage of the larger number of cores available.
While the number of cores per chip doubles with every iteration of Moore's Law, the number of memory banks doesn't increase substantially, because it is physically limited by the number of pins that can fit on the chip, and this only increases slowly, nowhere near as fast as the rate at which the number of cores per chip increases.

How this Affects Sliding Window Software Aligners: The index data structure containing seeds used by each core is typically 64MB (2^24 × 32 bits), and there are typically 6 of them. The core has to have random access to this data structure. This is too large to fit into a processor's cache, so each processor must have access to main memory to use the sliding window hit finding algorithm. The consequence of this is that, as with the BWT-based algorithm, the performance of sliding window algorithms less than doubles with each iteration of Moore's Law.

How this Affects BWT-based Software Aligners: The BWT-based algorithm needs to be able to randomly access a multi-gigabyte index contained in main memory. The index is too large to fit into the on-chip caches of a microprocessor, which are typically only a few megabytes in size. For this reason the performance of a multicore chip will be limited by the number of memory banks connected to the chip. Even if there are tens of cores on a chip, each with its own cache of a few megabytes, only two of these cores will be able to run the alignment algorithm if there are only two memory banks available, and the rest of the cores will be wasted; their local on-chip caches are too small for storing the index in. Consequently the performance of the BWT-based alignment algorithm less than doubles with each iteration of Moore's Law.

How this Affects FPGA-based Hit Finders: With each iteration of Moore's Law, FPGA's double in size. My alignment appliance FPGA design should scale up to these larger chips easily: it can simply have twice the number of registers. It's hard to predict the effect on clock speed, but it should stay the same or probably increase, as the larger, newer FPGA's will be better designed and the signals are moving the same physical distance as in the older FPGA's.

When we have twice as many seed registers in the chip, we also require twice the amount of time to load data into those seed registers. This prevents the design from scaling linearly. However it isn't much of a problem in practice. The seed loading time is a tiny fraction (much less than 1%) of the time taken to perform an alignment pass, so many doublings have to occur before it becomes more than virtually zero. Even then, there are implementation techniques to get around this. For example, we could break the seed dataset into two halves and load both halves concurrently. Instead of loading them from off-chip memory we could load them from the high-bandwidth, small, on-chip distributed memories in the FPGA. So, in practice, the performance of the FPGA hit finder doubles with every iteration of Moore's Law.

Bibliography

"The Eclipse Website." 2010. Http://www.eclipse.org/.

"The ELAND Website." 2010. Http://bioinfo.cgrb.oregonstate.edu/docs/solexa/Whole%20genome%20alignments%20using%20ELAND.html.

"FASTA format definition." 2010. Http://www.ncbi.nlm.nih.gov/blast/fasta.shtml.

"Harvard Architecture Definition." 2010. Http://www.wordiq.com/definition/Harvard architecture.

"The JPcap Website." 2010. Http://netresearch.ics.uci.edu/kfujii/jpcap/doc/index.html.

"Website about Ross Freeman." 2010. Http://www.xilinx.com/company/history.htm.

"The WinPcap Website." 2010.
Http://www.winpcap.org/. Adjeroh, D., Bell, T., and Mukherjee, A. The burrows-wheeler transform: data compression, suffix arrays, and pattern matching. Springer-Verlag New York Inc, 2008.  168  Aho, Corasick M. J., A. V. “Efficient String Matching: An Aid to Bibliographic Search.” Communications of the ACM (1975). Aho, Sethi R., A. and Ullman, J. Compilers: Principles, Techniques and Tools. Addison Wesley, 1986. Altera. Quartus II Handbook Version 10.0, Volume 4: SOPC Builder. Altera, 2010. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. “Basic local alignment search tool.” Journal of molecular biology 215 (1990).3: 403–410. Baker, M. “Next-generation sequencing: adjusting to data overload.” Nature Methods 7 (2010).7: 495–499. Barrett, T., Troup, D.B., Wilhite, S.E., Ledoux, P., Rudnev, D., Evangelista, C., Kim, I.F., Soboleva, A., Tomashevsky, M., Marshall, K.A., et al. “NCBI GEO: archive for high-throughput functional genomic data.” Nucleic acids research 37 (2009).Database issue: D885. Baxevanis, A.D. and Ouellette, B.F.F. Bioinformatics: a practical guide to the analysis of genes and proteins. John Wiley and sons, 2001. Benson, DA, Mizrachi, IK, Lipman, DJ, Ostell, J., and Wheeler, DL. “GenBank, 2005.” Nucleic Acids Res 33 (????). Boyer, Moore J. S., R. S. “A Fast String Searching Algorithm.” Communications of the ACM (1977).  169  Brooksbank, C., Cameron, G., and Thornton, J. “The European Bioinformatics Institute’s data resources: towards systems biology.” Nucleic acids research 33 (2005).Database Issue: D46. Buhler, J.D., Lancaster, J.M., Jacob, A.C., Chamberlain, R.D., et al. “Mercury BLASTN: Faster DNA sequence comparison using a streaming hardware architecture.” Reconfigurable Systems Summer Institute (2007). Campagna, D., Albiero, A., Bilardi, A., Caniato, E., Forcato, C., Manavski, S., Vitulo, N., and Valle, G. “PASS: a program to align short sequences.” Bioinformatics 25 (2009).7: 967. Compton, K. and Hauck, S. “Reconfigurable computing: a survey of systems and software.” ACM Computing Surveys (CSUR) 34 (2002).2: 210. Cormen, Leiserson C. E. Rivest R. L. Stein C., T. H. Introduction to Algorithms: Second Edition. The MIT Press, 2003. Crochemore, M., Hancart, C., and Lecroq, T. Algorithms on strings. Cambridge Univ Pr, 2007. Dandass, Burgess S. C. Lawrence M. Bridges S. M., Y. S. “Accelerating String Set Matching in FPGA Hardware for Bioinformatics Research.” BMC Bioinformatics (2008). Ewing, B., Hillier, L.D., Wendl, M.C., and Green, P. “Base-calling of automated sequencer traces usingPhred. I. Accuracy assessment.” Genome research 8 (1998).3: 175. Gupta, S. “Hardware acceleration of hidden markov models for bioinformatics applications.” 2004. 170  Gusfield, D. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge Univ Pr, 1997. Hennessy, J.L., Patterson, D.A., and Goldberg, D. Computer architecture: a quantitative approach. Morgan Kaufmann, 2003. Herbordt, M.C., Model, J., Sukhwani, B., Gu, Y., and VanCourt, T. “Single pass streaming BLAST on FPGAs.” Parallel computing 33 (2007).10-11: 741–756. Hoang, D. and Lopresti, D. “FPGA implementation of systolic sequence alignment.” Field-Programmable Gate Arrays: Architecture and Tools for Rapid Prototyping (1993): 183–191. Jacobi, R.P., Ayala-Rinc´on, M., Carvalho, L.G.A., Llanos, C.H., Hartenstein, R.W., et al. “Reconfigurable systems for sequence alignment and for general dynamic programming.” Genetics and Molecular Research 4 (2005).3: 543–552. Jiang, H. 
and Wong, W.H. “SeqMap: mapping massive amount of oligonucleotides to the genome.” Bioinformatics 24 (2008).20: 2395. Jiang, X., Liu, X., Xu, L., Zhang, P., and Sun, N. “A Reconfigurable Accelerator for Smith–Waterman Algorithm.” Circuits and Systems II: Express Briefs, IEEE Transactions on 54 (????).12: 1077–1081. Jones, S.J.M., Laskin, J., Li, Y.Y., Griffith, O.L., An, J., Bilenky, M., Butterfield, Y.S., Cezard, T., Chuah, E., Corbett, R., et al. “Evolution of an adenocarcinoma in response to selection by targeted kinase inhibitors.” Genome Biology 11 (2010).8: R82. 171  Kapushesky, M., Emam, I., Holloway, E., Kurnosov, P., Zorin, A., Malone, J., Rustici, G., Williams, E., Parkinson, H., and Brazma, A. “Gene expression atlas at the European bioinformatics institute.” Nucleic Acids Research (2009). Karp, Rabin M. O., R. M. “Efficient Randomized Pattern-Matching Algorithms.” IBM Journal of Research and Development (1987). Knuth, Morris J. H. Pratt V. R., D. E. “Fast Pattern Matching in Strings.” SIAM Journal of Computing (1977). Kuon, I. and Rose, J. “Measuring the gap between FPGAs and ASICs.” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 26 (2007).2: 203–215. Langmead, B., Trapnell, C., Pop, M., and Salzberg, S.L. “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.” Genome Biol 10 (2009).3: R25. Lavenier, D., Xinchun, L., and Georges, G. “Seed-based genomic sequence comparison using a FPGA/FLASH accelerator.” International IEEE Conference on Field Programmable Technology (FPT), Bangkok, Thailand. Citeseer, 2006. Li, H. and Durbin, R. “Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform.” Bioinformatics (2009). ———. “Fast and accurate long-read alignment with Burrows-Wheeler transform.” Bioinformatics 26 (2010).5: 589.  172  Li, H., Ruan, J., and Durbin, R. “Mapping short DNA sequencing reads and calling variants using mapping quality scores.” Genome research 18 (2008).11: 1851. Li, I.T.S., Shum, W., and Truong, K. “160-fold acceleration of the Smith-Waterman algorithm using a field programmable gate array(FPGA).” BMC bioinformatics 8 (2007).1: 185. Li, Li Y. Kristiansen K. Wang J., R. “SOAP: Short Oligonucleotide Alignment Program.” Bioinformatics (2008). Li, M., Ma, B., Kisman, D., and Tromp, J. “PatternHunter II: Highly sensitive and fast homology search.” GENOME INFORMATICS SERIES (2003): 164–175. Ma, B., Tromp, J., and Li, M. “PatternHunter: faster and more sensitive homology search.” Bioinformatics 18 (2002).3: 440. Malhis, N., Butterfield, Y.S.N., Ester, M., and Jones, S.J.M. “Slidermaximum use of probability information for alignment of short sequence reads and SNP detection.” Bioinformatics 25 (2009).1: 6. Manku, G.S., Jain, A., and Das Sarma, A. “Detecting near-duplicates for web crawling.” Proceedings of the 16th international conference on World Wide Web. ACM, 2007, 141–150. Masuno, S., Maruyama, T., Yamaguchi, Y., and Konagaya, A. “Multiple Sequence Alignment Based on Dynamic Programming Using FPGA.” IEICE Transactions on Information and Systems 90 (2007).12: 1939.  173  McMahon, Peter Leonard. “Accelerating Genomic Sequence Alignment using High Performance Reconfigurable Computers.” 2008. Moore, G.E. et al. “Cramming more components onto integrated circuits.” Proceedings of the IEEE 86 (1998).1: 82–85. Navarro, G. “A guided tour to approximate string matching.” ACM computing surveys (CSUR) 33 (2001).1: 31–88. Needleman, S.B. and Wunsch, C.D. 
“A general method applicable to the search for similarities in the amino acid sequence of two proteins.” Journal of molecular biology 48 (1970).3: 443–453. Shumway, M., Cochrane, G., and Sugawara, H. “Archiving next generation sequencing data.” Nucleic Acids Research (2009). Smith, TF and Waterman, MS. “Identification of common molecular subsequences.” J. Mol. Bwl 147 (1981): 195–197. Sotiriades, E. and Dollas, A. “A general reconfigurable architecture for the BLAST algorithm.” The Journal of VLSI Signal Processing 48 (2007).3: 189–208. Stein, L.D. “The case for cloud computing in genome informatics.” Genome Biology 11 (2010).5: 207. Stevens, W.R. TCP/IP illustrated, vol. 1. Addison-Wesley, 1994. Sugawara, H., Ogasawara, O., Okubo, K., Gojobori, T., and Tateno, Y. “DDBJ with new system and face.” Nucleic Acids Research (2007).  174  Trapnell, C. and Salzberg, S.L. “How to map billions of short reads onto genomes.” Nature biotechnology 27 (2009).5: 455. Walter, C. “Kryder’s law.” Scientific American 293 (2005).2: 32. Xia, F., Dou, Y., Zhou, X., Yang, X., Xu, J., and Zhang, Y. “Fine-grained parallel RNAalifold algorithm for RNA secondary structure prediction on FPGA.” BMC bioinformatics 10 (2009).Suppl 1: S37. Ypma, T.J. “Historical development of the Newton-Raphson method.” SIAM review 37 (1995).4: 531–551.  175  
