UBC Theses and Dissertations

Automatic formal verification for scheduled VLIW code. Feng, Xiushan. 2002.

Automatic Formal Verification for Scheduled VLIW Code

by

Xiushan Feng

B.E., Harbin Institute of Technology, 1997
M.E., Institute of Computing Technology, Chinese Academy of Sciences, 2000

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia
August 2002

© Xiushan Feng, 2002

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

The University of British Columbia
Vancouver, Canada

Abstract

VLIW processors are attractive for many embedded applications, but VLIW code scheduling, whether by hand or by compiler, is extremely challenging. In this thesis, I extend previous work on automated verification of low-level software to handle the complexity of modern, aggressive VLIW designs, e.g., the exposed parallelism, pipelining, and resource constraints.
I implement these ideas in two prototype tools for verifying short sequences of assembly code for TI's C62x family of VLIW DSPs and Fujitsu's FR500 VLIW processor, and demonstrate the effectiveness of the tools in quickly verifying, or finding bugs in, several difficult-to-analyze code segments.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgments
Dedication
1 Introduction
  1.1 Research Overview
    1.1.1 Problem
    1.1.2 Solution
    1.1.3 Contribution
  1.2 Thesis Outline
2 Background
  2.1 Previous Work
  2.2 VLIW Processors
    2.2.1 Texas Instruments' TMS320C62x VLIW DSP
    2.2.2 Fujitsu's FR500
3 Verification Method
  3.1 Basic Symbolic Simulation Approach
    3.1.1 Symbolic Simulation
    3.1.2 Uninterpreted Functions
    3.1.3 Memory
  3.2 My Extensions
    3.2.1 Parallel Execution
    3.2.2 Predication
    3.2.3 Control Flow Analysis
    3.2.4 Branch Delay Slots
    3.2.5 Read-After-Write Delay Slots
    3.2.6 Resource Conflicts
    3.2.7 Condition Code
4 Examples
  4.1 TI TMS320C6x
    4.1.1 Software Pipelining Example
  4.2 Fujitsu FR500
    4.2.1 Two pieces of equivalent FR500 code
5 Limitations, Conclusions and Future Work
  5.1 Limitations
    5.1.1 Bit Reasoning
    5.1.2 Uninterpreted Functions and Axiom Handling
    5.1.3 Memory Size
  5.2 Future Work
Bibliography
Appendix A Two equivalent FR500 programs
  A.1 C Functionality
  A.2 convolv.asm (with no optimization)
  A.3 convol_three.asm (-O3)

List of Tables

2.1 C62x Processor Pipeline

List of Figures

2.1 Packet example
2.2 TI C62x Datapath Block Diagram
2.3 Branch Delay Slot Example
2.4 Fujitsu FR500 Datapath Block Diagram
2.5 Fujitsu FR500 Integer Execution Unit
2.6 Fujitsu FR500 Floating Execution Unit
3.1 A CFG Example
3.2 Register Model
4.1 Software Pipelining Example
4.2 Unpipelined Assembly Code
4.3 Software Pipelined Assembly Code

Acknowledgments

It has been a great privilege to be supervised by Alan J. Hu, who was and is an inspiration to me in many ways. Without his ideas and encouragement, my research would not have been nearly as successful.

Some of the material in this thesis was prepared in collaboration with Alan J. Hu as a paper for the ACM SIGPLAN Joint Conference LCTES/SCOPES'02. This research was supported in part by a research grant from Fujitsu Laboratories of America.

Xiushan Feng
The University of British Columbia
August 2002

To my parents, for their endless patience and support.

Chapter 1

Introduction

Very Long Instruction Word (VLIW) processors are attractive for many embedded applications because of their promise of very high performance without the power- and die-area-consuming instruction-scheduling logic of superscalar processors, and because backward object-code compatibility is typically unnecessary for embedded applications. In such applications, size and speed requirements can often be critical. In order to meet such requirements, compiled code often needs to be hand tuned, and smart compilers are needed to exploit the potential parallelism. Unfortunately, the exposed parallelism, pipelining, and resource conflicts in an aggressive VLIW design, coupled with the lack of interlocks, make generating and optimizing code for a VLIW processor extremely challenging.
The result is that bugs are easily introduced, or the code is excessively conservative and therefore sub-optimal for VLIW processors. My hypothesis is that formal verification techniques can help. In particular, a symbolic simulation tool that can decide equivalence of VLIW programs can help to find bugs and optimize code.

1.1 Research Overview

I extend previous work on verifying simple assembly code to the more complex assembly code of a VLIW processor. This work builds directly on the work of Currie et al. [7].

1.1.1 Problem

My method and Currie's both analyze programs at the instruction set architecture (ISA) level. The ISA level shows the functionality of individual machine instructions, but hides the hardware implementation details. This is the lowest level that a programmer sees. However, for a modern, aggressive VLIW design, the ISA level actually exposes much of the underlying microarchitectural complexity, making the verification problem much trickier. For example, a VLIW processor has multiple functional units allowing parallel execution, but limited resources place constraints on parallel instructions. And some VLIW processors may expose pipelining for certain operations. Such operations have several cycles of delay slots — the number of cycles required after the source operands are read before the results can be read. In order to correctly verify programs for such a processor, the verification tools have to handle this complexity.

1.1.2 Solution

My thesis develops formal verification tools to handle this complexity. Such verification tools can automatically verify the functional equivalence of two pieces of VLIW assembly code. One piece of code is supposed to be written by an expert or proven to be bug-free, while the other is generated by a potentially buggy compiler or optimizer, or is the result of hand-optimization. The verification tool tries to prove that the two programs are equivalent.
If they are not equivalent, the verification tool reports debugging information, which is useful for the user to find the bugs.

In order to handle the complexity that the ISA exposes, I introduce several innovations. First, I model the parallel execution of multiple functional units. To accurately model pipeline delays, I introduce a simplified model of the pipeline that allows us to compute the correct result and also verify the presence or absence of resource conflicts. I also model predicated execution precisely, which permits a very detailed and accurate analysis of resource conflicts.

In this thesis, my verification tools use symbolic simulation with uninterpreted functions. I use symbolic simulation for its flexibility. By using uninterpreted functions, my verification tools can handle much more complex instructions, which cannot be handled by traditional verification methods without space explosion. For example, the size of the Binary Decision Diagram (BDD) [4] for multiplication is exponential in the number of bits being considered [5]. I will explain these ideas in more detail in Chapter 2.

1.1.3 Contribution

The primary contribution of my work is a method for the formal verification of VLIW software. Obviously, the quality of code can be improved by verifying the correctness of optimizations performed by the compiler or by hand.

My approach is implemented in two prototype tools. Though limited in scope, they are able to verify examples of real-life code. One prototype verification tool targets the Texas Instruments TMS320C62x family of VLIW digital signal processors, and the other targets the Fujitsu FR500 VLIW processor. Run times on short code segments are negligible, yet the tools are able to verify and debug code optimizations and verify the absence of resource conflicts, including in a situation previously considered unanalyzable. These examples demonstrate the practical usefulness of my ideas.
1.2 Thesis Outline

Chapter 2 gives a brief overview of background that helps to show the gap between my problem and previous work. Chapter 3 describes the challenges for my verification methods and my detailed solutions to them. To test the effectiveness of my ideas, I apply them to modified industrial and published examples, and demonstrate the capabilities of my tools and the bugs that they are able to find in these examples: Chapter 4 presents examples for the two processors I tested and gives verification results for each tool. Chapter 5 details some of the limitations of the tools, conclusions, and the things I would like to implement in the future.

Chapter 2

Background

2.1 Previous Work

This work builds directly on the work of Currie et al. [7]. In that work, the authors demonstrated the feasibility of automatic verification of low-level software via symbolic execution and an automated decision procedure. The verification task was to compare two short assembly-language routines for equivalence, assuming considerable similarity in control flow, as would occur, for instance, after hand-optimizing a performance-critical kernel. The verification method was demonstrated by building a verification tool targeting a very simple, 16-bit fixed-point digital signal processor. My thesis is the natural extension of Currie et al.'s work to a highly complex, modern VLIW processor.

Hamaguchi et al. [11] employed a similar approach to verify a high-level specification against a low-level implementation. Furthermore, their low-level implementation included a VLIW processor with assembly-level instructions. Subsequent work [10] enhanced performance with better heuristics. Their work addresses the harder problem of high-level-versus-low-level verification, whereas I consider only the problem of comparing two low-level programs.
However, their VLIW processor was a simple, academic design (4-wide, 2-stage pipeline, no unusual architectural features), whereas I tackle a high-end commercial VLIW. The emphasis of my work is on how to deal with the architectural features of VLIW processors, rather than on improving the efficiency of the verification approach, and I believe both directions of research are needed for widespread, practical impact.

Blank et al. [3] have recently presented a broad survey of the general verification paradigm (symbolic simulation/execution with uninterpreted functions) along with techniques for greatly improved efficiency. Unfortunately, I became aware of their work after my project was well underway, so I have not harnessed their ideas yet. In particular, I am building symbolic expressions for results as functions of the original inputs, and these expressions can blow up exponentially. In contrast, their approach introduces temporary variables for intermediate results, eliminating this expression-size blow-up entirely. To date, I have avoided the expression-size blow-up by some rewriting techniques, but their translation approach is likely the superior solution in general.

In the compiler-research community, Necula [17] has proposed a very similar approach to that of Currie et al., but targeting the verifier for use by the compiler between optimization passes. His work uses hints from the compiler and more sophisticated control-flow analysis, and was demonstrated by verifying the compilation of the GNU C compiler on itself. I believe that this translation-validation-during-compilation is a promising approach, and that my methods for dealing with the complexity of a high-performance VLIW processor could be integrated with Necula's methods for verifying large programs with complex control flow.
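The expression-size blow-up discussed above, and the intermediate-variable translation that avoids it, can be sketched with a toy illustration. This is my own Python sketch of the general idea, not code from any of the cited tools; all names are invented for the example.

```python
# Toy illustration: representing results purely as expressions over the
# original inputs can blow up exponentially, while naming intermediate
# results keeps the representation linear in the number of steps.

def expr_size(e):
    """Count the nodes in a nested-tuple expression tree."""
    if isinstance(e, tuple):
        return 1 + sum(expr_size(a) for a in e[1:])
    return 1

def direct_expansion(n):
    """Simulate n steps of "x := x * x" as one expression over the input."""
    x = "x0"
    for _ in range(n):
        x = ("MPY", x, x)          # each step doubles the expression size
    return x

def with_temporaries(n):
    """Same computation, but each step defines a fresh name t_i = MPY(prev, prev)."""
    defs, prev = [], "x0"
    for i in range(n):
        t = f"t{i}"
        defs.append((t, ("MPY", prev, prev)))
        prev = t
    return defs

print(expr_size(direct_expansion(10)))   # exponential: 2047 nodes
print(len(with_temporaries(10)))         # linear: 10 definitions
```

The translation approach pays for its compactness by making the decision procedure reason about the defining equations, which is the trade-off the survey addresses.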
2.2 VLIW Processors

Very Long Instruction Word (VLIW) describes a computer processing architecture in which multiple instructions are put into a very long instruction word. VLIW processors can then take apart the sub-instructions in the very long instruction word without further analysis and run them in parallel. In such architectures, the compiler is in charge of packing sub-instructions into the very long instruction words. Therefore, VLIW architectures execute multiple operations per cycle with more parallelism, simpler hardware, and higher performance, but require more compiler support.

Compared to superscalar processors, VLIW architectures move the work of instruction scheduling and maintaining data dependencies to the compiler. Using the compiler to perform this work has many advantages. First, the compiler can look at much larger windows of instructions than hardware can [20]. Because the compiler has fewer constraints on resources, it can use arbitrarily large software windows to look for parallelism in a big program. Second, the compiler has knowledge of the source code of the program, and source code typically carries more information for tracking program behavior; the compiler can take advantage of this information [20]. Meanwhile, because of smart compilers, VLIW processors can have simpler hardware. Fujitsu researchers estimate that to execute four instructions simultaneously, a VLIW architecture requires a 50% smaller circuit than the superscalar method [21].

Usually, a VLIW processor has the following characteristics [2]:

• Multiple independent instructions per cycle, packed into a single large "instruction word" or "packet"
• A large complement of independent execution units
• More regular, orthogonal, RISC-like instructions
• Large, uniform register sets
• Wide program and data buses

The two VLIW processors examined in this thesis (TI TMS320C6x and Fujitsu FR500) exhibit these characteristics.
Currently, VLIW processors are very popular, especially for DSP applications. Such applications need cheaper hardware, higher performance, and specific instructions such as those handling media data, and do not need to worry as much about compatibility as desktop microprocessors do. Therefore, in such applications, VLIW processors allow increasing performance far more easily [1].

Although VLIW architectures bring many benefits, they introduce challenges for compilers and optimizers as well. A VLIW compiler must find the parallelism in application programs and convert it into parallel streams of executable instructions. These challenges, together with the greater complexity of VLIW processors, easily create compiler bugs and make generating and optimizing code for VLIW processors extremely challenging. In Chapter 4, I will demonstrate a bug which appears in a published code-optimization example written by a recognized expert.

The following subsections present some details of the two VLIW processors considered in this thesis.

       STH   .D1   A5,*A8++[2]
    || STH   .D2   B5,*B8++[2]
    || SADD  .L1   A2,A7,A2
    || SADD  .L2   B2,B7,B2
    || SMPYH .M2X  B3,A3,B2
    || SMPY  .M1X  B3,A3,A2
    || [B1]  B     .S1   LOOP1
    || [B1]  SUB   .S2   B1,1,B1

Figure 2.1: Packet example [22, Example 6-1].

2.2.1 Texas Instruments' TMS320C62x VLIW DSP

Texas Instruments' TMS320C6x family of VLIW digital signal processors [22] is both commercially important and the epitome of this architectural style, so I have targeted this family for my first research prototype. In particular, I target the C62x family: 32-bit fixed-point DSPs that are code-compatible with the more powerful C64x fixed-point family and the C67x floating-point family.
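A key semantic point behind packets like the one in Figure 2.1 is that all instructions in an execute packet read their operands from the pre-packet register state, and only then are the results committed. The following is my own simplified Python sketch of that read-all-then-write-all semantics, not the thesis's simulator code; the instruction format and opcode set are invented for the illustration.

```python
# Sketch: every instruction in an execute packet reads old register values;
# all writes are committed together afterwards. (Toy model only; real C62x
# packets also carry unit assignments, predicates, and latencies.)

def execute_packet(regs, packet):
    """packet: list of (op, src1, src2, dst) executing in parallel."""
    ops = {"ADD": lambda a, b: a + b, "SUB": lambda a, b: a - b}
    writes = {}
    for op, s1, s2, dst in packet:
        writes[dst] = ops[op](regs[s1], regs[s2])  # reads see pre-packet state
    regs.update(writes)                            # commit all writes at once
    return regs

# A1 and B1 swap correctly because both reads happen before either write.
regs = execute_packet({"A1": 3, "B1": 5, "A0": 0},
                      [("ADD", "B1", "A0", "A1"),   # A1 <- old B1
                       ("ADD", "A1", "A0", "B1")])  # B1 <- old A1
print(regs["A1"], regs["B1"])   # 5 3
```

A sequential interpretation of the same two instructions would instead leave both registers equal, which is why a verifier must model the parallel semantics explicitly.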
In this subsection, I briefly highlight the salient architectural features that make verification difficult; I then present my solutions to these difficulties in Chapter 3.

The C6x processors are 32-bit VLIW digital signal processors. Instructions are grouped into "execute packets" of up to eight instructions, and these packets can be executed one per clock cycle. Figure 2.1 is a packet example, taken from the processor reference guide [22, Example 6-1]. It shows one packet with eight instructions which will be executed in parallel. The "||" characters signify that an instruction executes in parallel with the previous instruction. Figure 2.2 shows the datapath.

Figure 2.2: TI C62x Datapath Block Diagram. Registers are in two banks, with limited pathways connecting them (1X and 2X). The .L and .S units are integer ALUs with slightly different properties. The .D units are also integer ALUs, but designed for address calculation. The .M units are multipliers. LD and ST are load and store paths, with DA being data address paths. This figure is taken from the processor reference guide [22, Figure 2-1].

Of particular note are both the capability for very high performance and the striking non-orthogonality of the design. Functional units have specific capabilities, and there are limited resources for routing data among the functional units and register files. Careful code-tuning is imperative to achieve maximum performance. However, the processor has no interlocks — most potential conflicts in resource utilization are disallowed statically during code generation, but some cases produce undefined results. I will elaborate on this point below.

The other main architectural feature that both enhances performance and complicates code generation and verification is the pipeline. For maximum throughput, the processor is heavily pipelined, with instructions taking up to 11 cycles to complete.
Table 2.1 summarizes the pipeline. The pipeline has no interlocks, which simplifies the hardware (more performance at lower cost) but complicates the code. For most instructions, the operand read and result write occur in a single stage, so there are no hazards. However, multiply and load instructions have long latencies and require 1 and 4 delay slots, respectively. Instructions in these delay slots see the old value of the register. Multiple writes to a register in a single clock cycle are illegal, and in most cases, this can be detected during code generation.

Stage: Description
Program Address Generate: Determine address of the fetch packet.
Program Address Send: Send address of the fetch packet to memory.
Program Wait: A program memory access is performed.
Program Data Receive: Fetch packet received by CPU.
Dispatch: Determine the next execute packet in the fetch packet and send it to the appropriate functional units.
Decode: Decode instructions in functional units.
Execute 1: Evaluate predicates. Read operands. For load and store instructions, do address generation and write address modifications to register file. For branch instructions, affect the Program Address Generate stage. For single-cycle instructions, write results to a register file.
Execute 2: Load and store instructions send address (and data for store) to memory. Most multiply instructions complete.
Execute 3: Data memory accesses are performed. Store instructions complete.
Execute 4: For load instructions, data received by CPU.
Execute 5: For load instructions, data is written into a register.

Table 2.1: C62x Processor Pipeline. The processor pipeline is 11 stages deep. (TI literature refers to the pipeline stages as "phases", but I adopt standard terminology here.) Instructions are grouped into "execute packets" of up to eight instructions that proceed through the pipeline in parallel. Note that different instruction types write results with different latencies. This table is summarized from [22, Table 6-1].

Another artifact of the programmer-visible pipeline is the long branch delay. In particular, branches have five delay slots (i.e., unless we are already in the delay slots of a preceding branch, the next five execute packets following a branch always execute, regardless of whether the branch is taken or not), because the branch doesn't affect the Program Address Generation stage until it reaches the Execute 1 stage. Unlike many processors, branch instructions may appear in the delay slots of other branches. The results in that case may be unintuitive, but they follow naturally from the pipeline definition. Figure 2.3 shows an example of delayed branches.

;** Start: Start point of a program
    B   .S1  label1     ; a branch waits 5 delay slots to complete
    SUB .D2  B1, 1, B1  ; delay slot 1
    SUB .D2  B1, 1, B1  ; delay slot 2
    SUB .D2  B1, 1, B1  ; delay slot 3
    B   .S1  label2     ; delay slot 4, another branch waiting
    SUB .D2  B1, 1, B1  ; delay slot 5, control flow will now jump to label1;
                        ; also delay slot 1 for the second branch
    SUB .D2  B1, 1, B1  ; not executed, due to completion of the first branch
label1:
    SUB .D2  B1, 1, B1  ; delay slot 2 for the second branch
    SUB .D2  B1, 1, B1  ; delay slot 3 for the second branch
    SUB .D2  B1, 1, B1  ; delay slot 4 for the second branch
    SUB .D2  B1, 1, B1  ; delay slot 5 for the second branch,
                        ; control flow will now jump to label2
    SUB .D2  B1, 1, B1  ; not executed, due to completion of the second branch
label2:

Figure 2.3: An example of branch instructions which appear in the delay slots of other branches. To make the example simpler, I have made it a serial program, i.e., each packet contains a single instruction, either a SUB or a branch. Both branches occur, due to the delay slots of branch instructions. Of the packets after "label1", only the first 4 are executed, because the branch to "label2" is in the delay slots of the first branch.

To ease coding, the architecture provides a multicycle NOP instruction, which is equivalent to a series of n NOPs executed in sequence.

To complicate matters further, but also to allow very compact and efficient code, all instructions are predicated: each instruction in each execute packet specifies a register and a condition (equal-zero or not-equal-zero), and only executes if that register is zero or not zero. Predication is known to eliminate many branches, thereby increasing the size of basic blocks and the amount of instruction-level parallelism available [14]. Unfortunately, predication and the absence of pipeline interlocks mean that many register-write conflicts may or may not happen, depending on the values of registers at runtime. The processor manual specifically states that these situations cannot be detected, but that the result is undefined when they happen [22, Section 3.7.6]. I will demonstrate that these situations can sometimes be analyzed.
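The claim that some of these runtime-dependent write conflicts can be analyzed statically can be illustrated with a tiny satisfiability check: two predicated writes to the same register in the same cycle conflict only if both guard predicates can be true at once. This is my own toy Python model of that observation; the actual tool does the check symbolically with a decision procedure.

```python
from itertools import product

# Sketch: a write guard is (predicate_register, want_nonzero), i.e. the
# C62x-style "[B1]" or "[!B1]" condition. Two same-cycle writes to one
# destination conflict only if some valuation of the predicate registers
# makes both guards true. (Toy model; 0/1 valuations suffice because each
# guard only tests zero vs. nonzero.)

def guards_coexist(g1, g2):
    (r1, nz1), (r2, nz2) = g1, g2
    regs = sorted({r1, r2})
    for vals in product([0, 1], repeat=len(regs)):
        env = dict(zip(regs, vals))
        t1 = (env[r1] != 0) == nz1
        t2 = (env[r2] != 0) == nz2
        if t1 and t2:
            return True    # both writes can fire: potential conflict
    return False           # guards are mutually exclusive: provably safe

# [B1] and [!B1] writes to the same register can never both fire: safe.
print(guards_coexist(("B1", True), ("B1", False)))   # False
# [B1] and [B2] writes may both fire: a potential conflict to report.
print(guards_coexist(("B1", True), ("B2", True)))    # True
```

The first case is exactly the idiom the manual declares undetectable in general but which a precise model of predication proves safe.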
In sum, this architecture follows the VLIW philosophy and is optimized for maximum performance with minimal hardware, with no consideration for easy code generation or verification. The apparent complexity of the programmer's model, however, is not arbitrary, but the logical consequence of the exposed parallelism and pipeline. Accordingly, I claim that although the code is very error-prone for a human to read or write, it is relatively easy to create a simulator for the processor — even a symbolic simulator — that captures the correct semantics. The following chapter describes how I extended the basic verification approach to handle the challenges of this architecture.

2.2.2 Fujitsu's FR500

The FR500 is a general-purpose microprocessor tailored for embedded applications. It integrates system control and media processing with the highest performance among the embedded processors [16]. This series is specifically designed for the media-rich requirements of digital consumer products, such as laser printers, digital still cameras, digital televisions, and other multimedia devices. The FR500 is the first product in the FR-V family, which is Fujitsu's generic name for its VLIW architecture microprocessors [21].

The key feature of the FR500 series is a built-in media-execution unit that can simultaneously process eight 16-bit fixed-point MAC calculations. The unit provides a 63-instruction instruction set that can process media data such as MPEG streams, as well as filtering operations [16]. In addition to the media instruction set, the FR500 series also includes a set of integer instructions (including logic operations, load/store instructions, control instructions and compiler support), and a floating-point instruction set. The floating-point execution unit can perform four single-precision floating-point calculations simultaneously, thus accelerating geometry processing (coordinate calculation) of three-dimensional graphics [16]. Figure 2.4 shows the block diagram.
The FR500's 4-slot VLIW architecture enables concurrent processing of four 32-bit instructions, i.e., four 32-bit instructions are packed into one 128-bit very long instruction word and executed simultaneously. These four instructions are allocated to two integer instruction slots and two media-processing or floating-point instruction slots. The integer instruction slots (see Figure 2.5) can be used to load data for the floating-point or media-processing operations, which reduces the possibility of bottlenecks in the data supply to the floating-point or media execution units. The floating-point execution unit has a Single Instruction Multiple Data (SIMD) architecture, permitting two operations using one instruction, while the media execution unit has a double-SIMD architecture permitting simultaneous execution of four operations. Figure 2.6 shows the media execution unit.

Figure 2.4: Fujitsu FR500 Datapath Block Diagram. This figure is taken from [21, Figure 2].

Figure 2.5: Fujitsu FR500 Integer Execution Unit. This figure is taken from [21, Figure 3].

In addition, the execution units have two register files, each with 64 registers, that enable concurrent processing of operations in a very long instruction word. Such a design allows the FR500 to attain peak performance at 266 MHz of 532 MIPS integer, 1064 MFLOPS floating-point, and 4256 MOPS media processing. The performance is comparable to a high-end microprocessor, and the FR500 handles media-rich data as well as a DSP does [16].

Figure 2.6: Fujitsu FR500 Floating Execution Unit. This figure is taken from [21, Figure 3].

Chapter 3

Verification Method

The basic goal for my verification method is to verify that two short code segments compute the same result — for example, whether an optimized implementation matches a known-good one. My verification problem is analogous to equivalence checking of combinational circuits, which has been very successful in an industrial setting.

In this thesis, what I am attempting to prove is the equivalence of two assembly-language routines. The formal verification tool, which I will refer to as an "equivalence checker" in this thesis, first reads a specification of which inputs are initially equal and which outputs should be equal when the routines terminate. The equivalence checker has a simple model of the processor at the ISA level, and uses this model to symbolically simulate and check equivalence of the two routines. Due to the difficulty of software verification, I have to introduce some simplifying assumptions:

• The two routines have very similar control flow. This assumption is based on my verification target — verifying code tuning and optimizations.
• I ignore rounding, precision, and word length. My verification methods (symbolic simulation, uninterpreted functions) cannot model actual bit precisions.
• The control flow must be analyzable — no computed branches, no recursion, and loops must be unrollable. I also ignore interrupts.
• I do not model memory size limits or protection. The tool assumes all addresses are accessible.

Given these assumptions and limitations, the equivalence checker might declare inequivalent two routines that really have the same outputs, but it will not claim equivalence for two routines that do not. I will show some examples in Chapter 5 where the tool will incorrectly declare equivalence when the above assumptions do not hold.
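The overall flow just described — assume input equalities, symbolically execute both routines, then ask whether the outputs must be equal — can be sketched in miniature. The following Python stand-in is mine, not the actual tool: the real checkers hand the symbolic expressions to the Stanford Validity Checker, whereas here "equivalence" is just structural equality after normalizing the operand order of commutative operators.

```python
# Sketch of the equivalence-checking flow: symbolically execute two
# straight-line routines from shared symbolic inputs, then compare the
# resulting output expressions. (Toy stand-in for symbolic simulation
# plus a decision procedure; instruction format is simplified.)

def sym_exec(program, regs):
    """program: list of (op, src1, src2, dst); regs maps names to expressions."""
    for op, s1, s2, dst in program:
        a, b = regs[s1], regs[s2]
        if op in ("ADD", "MPY"):                 # commutative: canonical order
            a, b = sorted([a, b], key=repr)
        regs[dst] = (op, a, b)                   # build symbolic result term
    return regs

inputs = {"A1": "A1_Init", "A2": "A2_Init"}      # assumed-equal initial state
r1 = sym_exec([("ADD", "A1", "A2", "B1")], dict(inputs))
r2 = sym_exec([("ADD", "A2", "A1", "B1")], dict(inputs))  # operands swapped
print(r1["B1"] == r2["B1"])   # True: both yield the same symbolic term
```

A real decision procedure decides far more than structural equality, of course; the point here is only the shape of the pipeline from specification to symbolic outputs.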
3.1 Basic Symbolic Simulation Approach

This section introduces symbolic simulation, a formal verification method. My equivalence checkers are built upon this basic symbolic simulation approach and borrow the Stanford Validity Checker (SVC) as the decision procedure. The user initially specifies which registers and memory locations are equivalent before execution, defines any axioms, and declares which values should be equivalent after execution. The equivalence checker performs symbolic execution of the routines, and then passes the symbolic expressions for memory or registers to SVC for validity checking.

Among the research on equivalence checking of combinational circuits, Binary Decision Diagrams (BDDs) [4] are a very successful choice for representing the outputs. BDDs can be constructed, manipulated, and compared very efficiently. The drawback of BDDs is that, for many functions, their size can be exponential in the number of variables being considered. This motivates the development of tools which can operate at a higher level of abstraction. In this thesis, I use symbolic simulation with uninterpreted functions to represent registers and memory at the word level.

3.1.1 Symbolic Simulation

In symbolic simulation, variables are left as symbols that can represent any arbitrary value instead of being a specific instance of an integer or floating-point number. This allows the simulation tool to represent an entire class of values in a single simulation run. For example, symbolically simulating the following code yields the results shown in the comments for each step.

; Initial values:
;   A1 = A1_Init, A2 = A2_Init, A3 = A3_Init, A4 = A4_Init, B2 = B2_Init
ADD A1, A2, B1      ; B1 = (+ A1_Init A2_Init)
MPY .M1 B1, B2, B3  ; 1 delay slot for MPY, delayed write
NOP                 ; B3 = (* (+ A1_Init A2_Init) B2_Init)
ADD A3, A4, B2      ; B2 = (+ A3_Init A4_Init)
ADD B3, B2, B3      ; B3 = (+ (* (+ A1_Init A2_Init) B2_Init)
                    ;         (+ A3_Init A4_Init))

In the above example, all of the register values are symbolic expressions. Therefore, one simulation run contains the result of the code for all possible input values. To check equivalence of two pieces of code, a decision procedure takes the symbolic results and compares them.

Unfortunately, comparing some complicated computations, such as expressions involving multiplication, is undecidable in general. To solve this problem, the equivalence checker borrows the idea of uninterpreted functions.

3.1.2 Uninterpreted Functions

Uninterpreted functions are a powerful abstraction mechanism for formal verification. In this approach, meaningless function symbols replace actual computation. The function symbol is "uninterpreted" and can represent any function, similar to how a variable symbol represents any value. Therefore, verification can avoid the complex details of the actual computations. My equivalence checker divides the functions used in a program into three different kinds.

The first is constant propagation for traditional arithmetic operations. For example:

add gr1, gr2, gr3  ; current values of registers gr1 and gr2 are
                   ; integers (e.g., 2 and 3, respectively)

In this example, the operands are all constants — integers. Therefore, the addition operation adds up the constants and updates the target register gr3 to 5.

The second is interpreted functions. Interpreted functions are functions whose meaning the equivalence checker knows.
For example, using an interpreted function such as "+", the equivalence checker knows that the symbol "+" denotes addition, so it can determine that "(+ 2 4)" = "(+ 1 5)", "(+ a b)" = "(+ b a)", and, more complicated, "(+ (+ a 2) 4)" = "(+ (+ a 4) 2)" (i.e., the equivalence checker knows that "+" is a commutative and associative operator).

The third is uninterpreted functions. Uninterpreted functions use meaningless function symbols to hide the actual operations on data. The equivalence checker doesn't interpret such functions. Instead, it takes them as symbolic expressions, and decides equality based on whether the arguments are equal. For example, if a = a' and b = b', then we know that (FUN a b) = (FUN a' b') for any function FUN. In general, the theory of uninterpreted functions with equality is decidable.

For example, here is an operation for the TI C6x: "SHL src2, src1, dst". It is an operation which shifts the src2 operand to the left by the src1 operand and puts the result in dst. The word-level equivalence checker cannot handle such kinds of bit operations. However, an uninterpreted function "SHL" can be used to abstract the bit computations. The equivalence checker doesn't need to know the meaning of SHL or what happens in this instruction. It takes SHL as a function symbol. When the checker wants to verify whether (SHL src2 src1) = (SHL src2' src1'), it claims true if src2 = src2' and src1 = src1', and false otherwise.

Applying this idea to the example in the previous subsection, the verification tool replaces the multiplication operator with a meaningless function symbol, for example, "MPY". After that, B3 will be updated to "(MPY (+ A1_Init A2_Init) B2_Init)" instead of "(* (+ A1_Init A2_Init) B2_Init)". However, after such replacements, multiplication loses basic properties such as commutativity. In order to handle commutativity, the equivalence checker needs to have the axiom that "(MPY A B) = (MPY B A)".
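The equality rule for uninterpreted functions, namely that the same symbol applied to pairwise-equal arguments yields equal results, can be sketched in a few lines of Python. This is only an illustration of the congruence idea; the tuple-based term representation and the helper names are mine, not SVC's.

```python
# Minimal sketch of uninterpreted-function terms and congruence-based
# equality. Hypothetical illustration, not the actual SVC implementation.
def fun(symbol, *args):
    """Build an application of an uninterpreted function symbol."""
    return (symbol,) + args

def equal(t1, t2):
    """Decide equality by congruence: same symbol, pairwise-equal args."""
    if t1 == t2:                      # identical variables or constants
        return True
    if isinstance(t1, tuple) and isinstance(t2, tuple):
        return (len(t1) == len(t2) and t1[0] == t2[0]
                and all(equal(a, b) for a, b in zip(t1[1:], t2[1:])))
    return False

a, b = "a", "b"
t1 = fun("SHL", fun("MPY", a, b), 4)
t2 = fun("SHL", fun("MPY", a, b), 4)
print(equal(t1, t2))                              # True: arguments match
print(equal(fun("SHL", a, 4), fun("SHL", b, 4)))  # False: a, b may differ
```

Note that this check is purely structural; a real decision procedure also canonicalizes terms, e.g., to apply axioms such as the commutativity of MPY.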
The initial values for registers and memory also use symbolic values. For example, in my equivalence checker, I use "init_mem" to represent the initial values in the memory.

It is obvious that by using uninterpreted functions the verification tool can reduce some complexity of the program. Certain mathematical computations needn't happen. Therefore, we can abstract datapath complexity.

3.1.3 Memory

Almost all computations need to get data from memory, and then save results to memory. Efficiently handling memory is important because symbolic expressions for memory tend to become too large. In symbolic simulation, memory is also introduced as a symbol. It is modeled by using the interpreted functions read and write. The expression

(read mem addressfoo)

returns the last value written to location addressfoo, and the expression

(write mem addressfoo valuebar)

returns the new memory where location addressfoo is updated to valuebar. The decision procedure knows that (read (write a b c) b) = c, so for example, the decision procedure can determine that if b ≠ d, then

(read (write a b c) b) = (read (write (write a b c) d e) b) = c

I should point out that my equivalence checkers use a single memory (only one memory addressing space) for each program, and all reads and writes are targeted to this memory.

Handling long programs or programs with big loop sizes generates very large expressions involving memory. To reduce the size of these expressions, I introduce some memory simplification methods. These methods are similar to those of Currie's thesis [6]. For example, if the equivalence checker knows that b ≠ d, it then knows

(read (write (write a b c) d e) b) = c

In general, if it can locate the address being read in the memory write history, and if it can prove that no other writes interfered, then it can replace the read expression by the last value written.
Another example is

(write (write a b c) b e) = (write a b e)

The equivalence checker scans its write history, and overrides the previous write if the address being written to has already been used. Using similar ideas, the equivalence checker uses some other methods to simplify expressions involving memory. Such simplifications can reduce the size of memory expressions considerably.

3.2 My Extensions

In this part, I will give some details of my approach, which is built upon the basic symbolic simulation methods.

3.2.1 Parallel Execution

The most obvious difference of a VLIW is the multiple instructions executing simultaneously. This is trivially handled in the symbolic simulator. I simply simulate each instruction* in an execution packet one-by-one, but I don't update the register file until after all instructions have read their operands. The resulting behavior is identical to parallel execution. For example, here is an execution packet of the TI C62x:

    LDW  .D2   *B5, B4
||  MPY  .M1X  B4, A0, A5
||  B    .S1   LOOP1
||  SUB  .S2   B1, 1, B1

The execution results for it are

0 cycle: cur_B1 = (- prev_B1 1)
1 cycle: cur_A5 = (MPY prev_B4 prev_A0)
2 cycle: no changes for this packet
3 cycle: no changes for this packet
4 cycle: cur_B4 = (read prev_mem prev_B5)
5 cycle: branch happens

where the cur_ and prev_ symbols are abbreviations for the current and previous expressions for the values of the registers. Note: cycle 0 denotes the cycle right after this example packet. "SUB" has no delay slots; therefore, the update of register B1 happens in this cycle.

* My current tool implements only a subset of common instructions, since the point of my research is to explore the verification technique rather than build a production tool.
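The deferred-update scheme described above can be sketched in Python. The packet encoding and names below are hypothetical; the point is only that reading all operands before committing any writes makes a sequential simulation behave like parallel execution.

```python
# Sketch of parallel execute-packet semantics, assuming registers are held
# in a dict mapping names to symbolic expressions. Illustrative only.
def execute_packet(regs, packet):
    """Simulate instructions one-by-one, but read all operands from the
    state as it was *before* the packet; commit all writes afterwards."""
    pending = {}
    for dst, op, srcs in packet:
        operands = [regs.get(s, s) for s in srcs]  # reads see old values
        pending[dst] = (op,) + tuple(operands)     # defer the write
    regs.update(pending)                           # commit all at once
    return regs

# A packet whose two instructions move B1 to B0 and B0 to B1 "in parallel";
# deferred writes give the same result as true parallel execution.
regs = {"B0": "B0_init", "B1": "B1_init"}
packet = [("B0", "MV", ["B1"]), ("B1", "MV", ["B0"])]
execute_packet(regs, packet)
print(regs["B0"])   # ('MV', 'B1_init')
print(regs["B1"])   # ('MV', 'B0_init'), not the new value of B0
```

Without the deferral, the second instruction would observe the first instruction's write, which is not what the hardware does.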
I have already mentioned that "LDW" and "B" have 4 and 5 delay slots respectively.

3.2.2 Predication

Three issues arise in dealing with predication: the basic functionality, conditional branching, and resource conflicts. The basic functionality of predication is easily expressible in my logic. Indeed, using the "ITE" (if-then-else) operator of SVC logic, the tool simply generates the expression that evaluates to the new value if the predicate is true, or the old value if the predicate is false. For example, the value of the register A3 after this instruction:

[B0] ADD .L1 A1, A2, A3

is

(ITE (/= cur_B0 0) (+ cur_A1 cur_A2) cur_A3)

where /= denotes not-equal. In the next four sections, I will describe how the equivalence checker handles conditional branching and resource conflicts.

3.2.3 Control Flow Analysis

The control flow analysis in my thesis is standard, using a control flow graph (CFG). I build the CFG similarly to what Currie et al. [7] did. My equivalence checker verifies programs using the control flow graph, but the approach is different from Currie's. I build a single expression for the outputs over all possible traces through the CFG and compare the whole expression for the two programs (see Figure 3.1), whereas Currie et al. transform the CFG into a tree and perform a depth-first traversal over the tree, symbolically simulating along the way. At each branch, Currie's method checks the branch condition. There are two possible outcomes which lead to equivalence for the two routines: the conditions are the same, or the conditions are complementary (the two branch directions of one routine need to be reversed in order to match the traces of the two CFGs). If the check fails (a return other than the above two), Currie's tool reports different control flows.
At the end of each trace, his tool checks the equivalence of the expressions for the two routines, and recursively does these checks until an invalidity is found or the simulation terminates. The detailed description of Currie's method can be found in his thesis [6]. The following is a short piece of C62x code showing how my procedure works.

[Figure 3.1 diagram: a CFG branching on Condition I and Condition II, with leaves expr1, expr2, and expr3]

Figure 3.1: A CFG Example. This figure shows a control flow graph, which has two branch instructions to test "Condition I" and "Condition II" respectively. "T" or "F" here means "True" or "False". For this example, the equivalence checker outputs a single expression such as "(ITE (Condition I) (ITE (Condition II) expr1 expr2) expr3)".

        SUB  .L2  B0, 1, B0
[B0]    B    .S2  L1
        NOP  5
        ADD  .L1  A0, 1, A0
        B    .S2  L_end
        NOP  5
L1:     ADD  .L1  A0, 2, A0
L_end:  ; finish

Remember that for branch instructions "B", we have 5 delay slots. According to my method of control flow handling, the final expression of A0 will be a single expression like this:

A0 = (ITE (/= (- previous_B0 1) 0)
          (+ previous_A0 2)
          (+ previous_A0 1))

whereas Currie's tool would get two expressions, and compare them with another routine based on the branch condition. In this example, if the equivalence checker can prove B0 is a constant integer or can be reduced to a constant integer, there will be no conditional branches; A0 will be a simple expression without introducing "ITE".

It's obvious that using a single expression for outputs has advantages. For some kinds of routines, my equivalence checker can prove equivalence while Currie's approach misses it.
Consider the following examples (using pseudocode):

routine 1:
  if condition_1 = true
    a := 1;
  else
    a := 2;
  endif
  if condition_2 = true
    b := 1;
  else
    b := 2;
  endif

routine 2:
  if condition_2 = true
    b := 1;
  else
    b := 2;
  endif
  if condition_1 = true
    a := 1;
  else
    a := 2;
  endif

Currie's approach reports that the two routines have different control flows, although they are equivalent. The same happens for routines like these:

routine 1:
  if a < b
    if c < d
      x := 1;
    else
      x := 2;
    endif
  else
    if c < d
      x := 3;
    else
      x := 4;
    endif
  endif

routine 2:
  if c < d
    if a < b
      x := 1;
    else
      x := 3;
    endif
  else
    if a < b
      x := 2;
    else
      x := 4;
    endif
  endif

My approach can find the equivalence for these routines. However, my checker uses a single expression for each register or memory output. This approach generates longer expressions to send to the decision procedure, which raises performance issues with SVC. I didn't do performance comparisons between these two approaches, because run times of the examples available to me have been negligible. What I care about more is getting better control flow graph matching.

As I described above, predicated branches are handled exactly as conditional branches. The only difference is the lengthy branch delay, which I will describe in the next subsection.

3.2.4 Branch Delay Slots

Because branches have five delay slots, there exists the possibility that other branch instructions will occur in those slots. In fact, this occurs frequently in tight loop kernels. For example, consider the following code fragment:

        B    .S1  Loop1
        MVK  .S2  10, B1
||      MVK  .S1  0, A0
Loop2:
        ADD  .L1  A0, 1, A0
|| [B1] B    .S1  Loop2
|| [B1] SUB  .S2  B1, 1, B1
        STW  .D1  A0, *A1++
        NOP  2
Loop1:
 [B1]   ADD  .L1  A0, 1, A0
|| [B1] SUB  .S2  B1, 1, B1
        NOP  5

In this example, the branch to Loop2 occurs in the delay slots of the branch to Loop1, as well as in its own delay slots on subsequent iterations. The net effect is that the code loops until register B1 reaches 0.
Although this example looks contrived, coding in this style is common.

I would like to avoid explicitly modeling the instruction packet fetch mechanism, which is actually somewhat more complex than I have described. Instead, I introduce a small queue for pending branches. Each branch instruction gets queued, along with a counter indicating how many delay slots remain for that branch. On each clock cycle, all counters are decremented. Before each instruction fetch, I check whether a branch at the head of the queue is ready to be taken.

3.2.5 Read-After-Write Delay Slots

As with branch delay slots, delay slots resulting from different instruction latencies are the result of the programmer-visible pipeline, and I handle them in a similar fashion. Conceptually, these delay slots simply delay the assignment of results to the register file. Accordingly, modeling the full pipeline can be avoided by using small queues to delay the writes to the register file.

More precisely, I extend the data structure used for register values. In the basic verification approach, each programmer-visible register is modeled by a variable that can hold the symbolic expression for the value of that register. Instead, I replace this variable with a queue of symbolic expressions. These queues are very short — for the TI C62x DSP, the longest latency is four delay slots for loads, so the queue is only five entries long (current value and four delay slots). For the Fujitsu FR500, there is no latency for any instruction; therefore, read-after-write delay slots pose no problem. (See Figure 3.2.) Whenever a result is written to a register file, the latency of the operation is checked in a table, and the result is written into the appropriately delayed queue entry. Reads from the register file return the current value. After every clock cycle, all the register-value queues advance one step; if there is no write to a register in that cycle, it keeps its current value.
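A minimal sketch of this queue-based register model, assuming a small table of delay slots and one pending-writes list per register. The class, table entries, and symbolic values are my own illustrative names, not the tool's; slot-0 writes within one packet would be conflict-checked elsewhere.

```python
# Sketch of the delayed-write register model: a current value plus a short
# queue of pending writes, one slot per remaining delay cycle.
DELAY = {"SUB": 0, "MPY": 1, "LDW": 4}   # delay slots for a few opcodes

class Register:
    def __init__(self, init):
        self.current = init           # value visible to reads
        self.pending = [None] * 4     # pending[d] lands d+1 cycles from now

    def read(self):
        return self.current

    def write(self, opcode, value):
        d = DELAY[opcode]             # look up the operation's latency
        if d == 0:
            self.current = value      # no delay slots: visible this cycle
        elif self.pending[d - 1] is not None:
            raise RuntimeError("resource conflict: two writes, same cycle")
        else:
            self.pending[d - 1] = value

    def tick(self):
        # advance one clock cycle; an arriving write replaces the current
        # value, otherwise the register keeps its old contents
        arriving = self.pending.pop(0)
        if arriving is not None:
            self.current = arriving
        self.pending.append(None)

b4 = Register("prev_B4")
b4.write("LDW", ("read", "prev_mem", "prev_B5"))
print(b4.read())              # prev_B4: the load result is not visible yet
for _ in range(4):
    b4.tick()
print(b4.read())              # ('read', 'prev_mem', 'prev_B5') at cycle 4
```

Attempting a second delayed write into an already-claimed slot raises an error, which corresponds to the resource-conflict check described below.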
[Figure 3.2 diagram: reads and writes flow through a Current Value variable, a Delay Lookup Table, a Conflict Checker, and a Pending Writes Queue holding (pred, value) Write Histories]

Figure 3.2: Register Model. This figure shows the data structures used to model each register. In the basic verification algorithm, only the "Current Value" variable would exist, and reads and writes would access and update this variable. To handle delay slots, I add a queue of pending writes. Reads still come from the Current Value, but values to be written are queued depending on their latency. The queue advances one position to the left at each clock cycle, updating the Current Value as appropriate. An attempt to write to a queue slot that is already full indicates a resource conflict. To handle predication with delay slots, I store write-histories in the pending writes queue, rather than values. Each write-history is a list of predicate-value pairs. It is easy to verify that at most one predicate in a write history can be true; otherwise there is a conflict. Since the simulator is symbolic, the "values" are actually symbolic expressions denoting the value as a function of the initial values.

3.2.6 Resource Conflicts

There are four pages of resource constraints described in the TI C62x processor reference guide [22, 3-17..3-20]. A complete verification tool needs to check all the constraints. However, the checks for some statically detectable resource conflicts are uninteresting from a research perspective, because they are straightforward checks of execution packets against various simple rules, and existing tools handle these checks already. Therefore, my equivalence checker only handles a subset of the resource constraints: those that cannot be checked statically. Such constraints may introduce undefined results for registers. The interesting case is when multiple instructions attempt to write to the same register.
This could happen because of multiple instructions within one execute packet, or because of instructions with different latencies trying to write in the same cycle. For example,

MPY .M1  A0, A1, A2
ADD .L1  A4, A5, A2

will result in both instructions trying to write register A2 simultaneously, because the multiply instruction has one delay slot. These register-write conflicts could still be checked statically during code generation, or in my tool by disallowing multiple assignments to the same queue position. The situation becomes more complex with predication. Indeed, the processor reference manual explicitly states that examples like:

   [!B1] ADD .L2  B5, B6, B7
|| [B0]  SUB .S2  B8, B9, B7

are legal if zero or one predicates are true, but will produce undefined results if more than one predicate is true, and that the presence or absence of a conflict cannot be detected [22, Section 3.7.6]. My work contradicts this claim; in most instances, these conflicts can be checked.

To handle this case, I extend the verifier further. Instead of a queue of symbolic expressions for each register, I have a queue of write-histories for each register. Each write-history is a list of all the (symbolic expression) values that are scheduled to be assigned to that register in that cycle, along with the predicate condition required for the assignment to take place. In the simplest cases, the write-history is either empty (no write to the register) or has a single, unpredicated expression. Whenever a new expression is scheduled to be assigned in a given cycle, it is first checked against all entries already in the write-history. The decision procedure determines if the predicate is mutually exclusive with all the other writes. If so, the tool has proven that this assignment is conflict-free; if not, the tool flags the potential conflict to the user. For example, consider the following code segment:

         ADD .L2  B1, B0, B2
         ADD .L2  B2, B0, B3
   [!B1] ADD .L1  A0, A1, A0
|| [!B2] ADD .S1  A0, A2, A0
|| [!B3] ADD .D1  A0, A3, A0

The first two ADD instructions initialize B1 through B3 to be equally spaced B0 apart. The next three ADD instructions are in a single execute packet and might conflict in writing to A0 in the same cycle. My tool flags this as a potential conflict, because B0 might be equal to zero. If I prepend an instruction guaranteeing that B0 is not zero:

         MVK .S2  1, B0
         ADD .L2  B1, B0, B2
         ADD .L2  B2, B0, B3
   [!B1] ADD .L1  A0, A1, A0
|| [!B2] ADD .S1  A0, A2, A0
|| [!B3] ADD .D1  A0, A3, A0

the tool verifies that a conflict cannot happen. This class of resource conflict was previously considered to be unanalyzable, but runtime on my examples was a tiny fraction of a second. Integrating the analysis I am proposing into a compiler should enable generating more efficient code with greater confidence in its correctness.

3.2.7 Condition Code

Unlike the TI C6x, the FR500 has condition codes. There are two types of condition codes for the FR500. One is the Integer Condition Code (ICC), used for integer instructions; the other is the Floating-point/media Condition Code (FCC), used for floating-point/media instructions. For each type of condition code, there are four banks: ICC0 through ICC3 and FCC0 through FCC3. Each bank has 4 flags. ICCn (n = 0..3) has N (Negative), Z (Zero), V (Overflow), and C (Carry). FCCn (n = 0..3) has EQ (equal), LT (less than), GT (greater than), and UO (unordered, i.e., the operands cannot be compared). There are more detailed descriptions in the FR500 instruction set manual [9].

Conditioned instructions are those instructions which are used together with condition codes. Such instructions have to explicitly specify which condition codes are used.
For example,

subicc gr14, #1, gr14, icc0
cmpi   gr14, #256, icc1
bge    icc1, #0, Label1

In this example, "subicc" (subtract), "cmpi" (compare) and "bge" (branch if greater or equal) are instructions using condition codes. They use icc0 and icc1 explicitly.(2)

When I implement the equivalence check, I treat each bank of condition codes as a register file, and each flag as a register. Therefore, for the above example, because the instruction "subicc" changes all the flags (page 3, FR500 manual [9]), the equivalence checker will update flags N, Z, V and C; e.g., the Z flag of icc0 will be updated to "(= (- gr14 1) 0)". The compare instruction "cmpi" modifies icc1 based on the result of "(cmpi gr14 256)". "bge" is a branch instruction; according to the FR500's instruction set manual [9], it will check "Not(N xor V)" of icc1. If that is true, the branch will be taken. If the equivalence checker cannot prove it is either true or false, both branch directions will be analyzed (see the section on control flow analysis in this chapter for details).

(2) I should mention that in the branch instruction "bge", there is a "#0". It is a hint bit. When there is a high probability of branching, the programmer or compiler should set the hint bit to 2. When there is little probability of branching, the programmer or compiler sets the hint bit to 0. The hint bit can be used for branch prediction.

Chapter 4

Examples

I have implemented my verification methods in two tools. In this section, I will give some examples of my tools and show verification results.

4.1 TI TMS320C6x

4.1.1 Software Pipelining Example

My first example demonstrates the verification of a non-trivial code optimization. The example is taken from an article, written by an expert on DSP code optimization, explaining how to optimize code for high-performance DSPs [18]. This article appeared in a trade magazine, and also on the magazine's website.
The most difficult example in the article demonstrates software pipelining a short loop, ostensibly targeting the C62x. Software pipelining is a powerful instruction scheduling technique that exposes additional parallelism in loops, thereby improving performance [15]. The basic idea is to rearrange the computation such that portions of different loop iterations execute at once, similarly to hardware pipelining. A prologue is required to start the pipelined computation, and an epilogue is required to "flush the pipeline" at the end of the computation.

When I found these examples, I was very happy, because it is not easy to find publicly available equivalent code segments. The article stated that the code was for the TI C62x family, which was perfect, because my tool was also targeted at the TI C62x. The C62x family are fixed-point DSPs. I did not notice that the code was actually floating-point code for the TI C67x family. In particular, the MPYSP instruction was not listed in the C62x instruction set, but I assumed that the MPYSP instruction was simply the fixed-point MPY instruction. I will return to this fixed-point/floating-point confusion later.

Figure 4.1 gives the desired functionality in C. Figure 4.2 gives partially optimized, but unpipelined assembly code, which is reasonably readable. Figure 4.3 gives the software pipelined code taken from the article. According to the article, the code was generated by the compiler if the input vectors are declared to be constants. This code is quite hard to read, but the article clearly explains the principles involved.

Intuitively, the @-signs in the comments indicate which iteration of the original loop is being processed by which instructions. For example, the first iteration of the unpipelined loop corresponds to the LDW instructions on lines 17 and 18 of Figure 4.3, followed by the SUB instruction on line 25, the branch on line 27, the multiply on line 31, and so forth. It is also crucial to remember the five branch delay slots. Another way to understand the code is to note that the loop kernel on lines 44-50 of Figure 4.3 performs all the same operations as the original, unpipelined loop kernel, but can do far more operations in parallel because different loop iterations are being processed at the same time. The performance improvement is that the loop kernel now runs in 2 cycles instead of 10.

for (i = 0; i < 100; i++)
{
    out[i] = input1[i] * input2[i];
}

Figure 4.1: Software Pipelining Example. This C code specifies the desired functionality. (Listing taken from [18].)

 1 _example1:
 2 ;**
 3         MVK   .S2   0x64, B0
 4
 5         MVC   .S2   CSR, B6
 6 ||      MV    .L1X  B4, A3
 7 ||      MV    .L2X  A6, B5
 8
 9         AND   .L1X  -2, B6, A0
10         MVC   .S2X  A0, CSR
11 L11:    ; PIPED LOOP PROLOG
12 ;**
13 L12:    ; PIPED LOOP KERNEL
14         LDW   .D2   *B5++, B4
15 ||      LDW   .D1   *A3++, A0
16         NOP   2
17 [B0]    SUB   .L2   B0, 1, B0
18 [B0]    B     .S2   L12
19         MPYSP .M1X  B4, A0, A0  ; I assume C62x MPY here.
20         NOP   3
21         STW   .D1   A0, *A4++
22 ;**
23         MVC   .S2   B6, CSR
24         B     .S2   B3
25         NOP   5
26         ; BRANCH OCCURS

Figure 4.2: Unpipelined Assembly Code. According to the article [18], this code was compiler-generated, and the compiler was unwilling to pipeline the loop because of possible aliasing between the inputs and the output. The correspondence with the C code is fairly clear.

 1 _example2:
 2 ;**
 3         MVK   .S2   0x64, B0
 4
 5         MVC   .S2   CSR, B6
 6 ||      MV    .L1X  B4, A3
 7 ||      MV    .L2X  A6, B5
 8
 9         AND   .L1X  -2, B6, A0
10
11         MVC   .S2X  A0, CSR
12 ||      SUB   .L2   B0, 4, B0
13
14 ;**
15 L8:     ; PIPED LOOP PROLOG
16
17         LDW   .D2   *B5++, B4   ;
18 ||      LDW   .D1   *A3++, A0   ;
19
20         NOP   1
21
22         LDW   .D2   *B5++, B4   ;@
23 ||      LDW   .D1   *A3++, A0   ;@
24
25 [B0]    SUB   .L2   B0, 1, B0   ;
26
27 [B0]    B     .S2   L9          ;
28 ||      LDW   .D2   *B5++, B4   ;@@
29 ||      LDW   .D1   *A3++, A0   ;@@
30
31         MPYSP .M1X  B4, A0, A5  ;     I assume C62x MPY here.
32 || [B0] SUB   .L2   B0, 1, B0   ;@
33
34 [B0]    B     .S2   L9          ;@
35 ||      LDW   .D2   *B5++, B4   ;@@@
36 ||      LDW   .D1   *A3++, A0   ;@@@
37
38         MPYSP .M1X  B4, A0, A5  ;@    I assume C62x MPY here.
39 || [B0] SUB   .L2   B0, 1, B0   ;@@
40
41 ;**
42 L9:     ; PIPED LOOP KERNEL
43
44 [B0]    B     .S2   L9          ;@@
45 ||      LDW   .D2   *B5++, B4   ;@@@@
46 ||      LDW   .D1   *A3++, A0   ;@@@@
47
48         STW   .D1   A5, *A4++   ;
49 ||      MPYSP .M1X  B4, A0, A5  ;@@   I assume C62x MPY here.
50 || [B0] SUB   .L2   B0, 1, B0   ;@@@
51
52 ;**
53 L10:    ; PIPED LOOP EPILOG
54         NOP   1
55
56         STW   .D1   A5, *A4++   ;@
57 ||      MPYSP .M1X  B4, A0, A5  ;@@@  I assume C62x MPY here.
58
59         NOP   1
60
61         STW   .D1   A5, *A4++   ;@@
62 ||      MPYSP .M1X  B4, A0, A5  ;@@@@ I assume C62x MPY here.
63
64         NOP   1
65         STW   .D1   A5, *A4++   ;@@@
66         NOP   1
67         STW   .D1   A5, *A4++
68
69         MVC   .S2   B6, CSR
70         B     .S2   B3
71         NOP   5
72         ; BRANCH OCCURS

Figure 4.3: Software Pipelined Assembly Code. If the inputs are declared to be const, the compiler performs software pipelining, resulting in much more efficient code. But, does this do the same thing as Figure 4.2? (Listing taken from [18].)

I ran my tool on this example, comparing the pipelined and unpipelined programs. The tool discovered that the pipelined code has a bug. In particular, the result of the first iteration is never written, and there is an extra result written at the end. The fix is to add an additional STW instruction to the packet at lines 38-39, and to delete the STW instruction at line 67. Runtime for my tool was less than a second, both for finding the bug and for verifying my fix.

I was surprised to find a bug in the article. To confirm my results, I asked a friend to run the code on real hardware. Unluckily, he only had a C6201 system available, so he made the same modification (MPYSP to MPY) as I did. His results match mine exactly, supporting my equivalence checker [12]. I reached the conclusion that there was a bug in the original code and wrote this in my paper [8]. It turns out that the code in the article is actually correct, but for the floating-point C67x, not the fixed-point C62x, as stated in the article. Brett L. Huber of Texas Instruments discovered my mistake:

  You claim to have found a bug in a published code optimization example. After examining the code in Figure 4.3, I have determined that the software-pipelined loop is in fact correct. I believe the problem is that you are assuming the MPYSP instruction takes two cycles; actually, it is a floating-point instruction taking four cycles. This instruction is only legal on the C67x, which is a floating-point variant of the C62x; this may explain the confusion [13].

I added the instruction MPYSP as an uninterpreted function with 3 delay slots, and ran the tool again. This proved that Huber's diagnosis was right; I had made a wrong claim in my paper. However, my equivalence checker still proved valuable. There was a bug in my modified C62x code, and the tool found it. When I added the MPYSP instruction to the checker (my checker targets the C62x, but for this example, introducing an uninterpreted function MPYSP doesn't cause any trouble), the checker proved the equivalence of the two code segments. This example clearly demonstrates how easily a bug crept into hand-tuned code, and how difficult it is to understand and verify VLIW code.
4.2 Fujitsu FR500

4.2.1 Two pieces of equivalent FR500 code

In this section, I will give two assembly programs for the FR500. I received these examples courtesy of Fujitsu Laboratories of America. The code comes from research there on optimizing compilers [19]. Two compilers, the fcc911s and fcc935s, used different combinations of optimization options (object code size minimization, speed, etc.) to generate assembly code from C programs. I used my tool to verify the equivalence of the generated code.

In the example presented here, there are two pieces of FR500 code, convolv.asm and convol_three.asm (code listings are given in Appendix A; the compiler generated convolv.asm without any optimization option and convol_three.asm with Level 3 speed optimization). The C program [A.1], which implements a convolution algorithm for the FR500, gives the functionality of the example program. Obviously, such kinds of examples are the targets of my equivalence checker. We do not need to know the details of the examples; we just want to check whether the optimizations performed by the compiler are correct.

The two programs, convolv.asm and convol_three.asm, contain 79 and 225 instructions (including labels), respectively. The runtime for my tool to verify equivalence was 2.6s, running on an Intel Pentium III 736MHz with 128MB RAM. Memory usage was about 12MB. My equivalence checker confirms the equivalence of the two pieces of code. This example demonstrates that such an equivalence checker can work with compilers to verify optimizations.

Chapter 5

Limitations, Conclusions and Future Work

This thesis has extended previous work on verification of low-level code to handle the complexity of modern, high-performance VLIW processors. My proof-of-concept tools demonstrate that my approach can easily find bugs or confirm correctness in situations that are extremely challenging without automated assistance. In the rest of this chapter, I will discuss limitations and future work. I will cover both limitations of the techniques I used and limitations of the decision procedure.
I will cover both limitations of the techniques I used and limitations of the decision procedure.

5.1 Limitations

In the previous chapters, I have mentioned some of the limitations. In this section, I summarize them and give some examples.

5.1.1 Bit Reasoning

As I have already mentioned, my method, which extends a basic symbolic simulation approach, works at the word level. That abstracts away some datapath complexity. However, the equivalence checker will therefore miss some inequivalences and some equivalences. For example, the equivalence checker for the TI C6x doesn't handle overflow at all, and the checker will miss the equivalence of, for example, (OR 0101 1010) and (OR 1111 0000), which are both equal to 1111.

For another example, consider the SHL instruction:

    SHL (.unit) src2, src1, dst

Description: The src2 operand is shifted to the left by the src1 operand. The result is placed in dst. When a register is used, the six LSBs specify the shift amount, and valid values are 0-40. When an immediate is used, valid shift amounts are 0-31. If 39 < src1 < 64, src2 is shifted to the left by 40. Only the six LSBs of src1 are used by the shifter, so any bits set above bit 5 do not affect execution. (Taken from [22, p. 3-113]; LSB is the Least Significant Bit.)

Because I didn't handle registers at the bit level, my equivalence checker will report inequivalence for two instructions such as

    SHL .S1 A0, 8, A2   ; where A0 has the hexadecimal value
                        ; xxD0 00F0
    SHL .S1 A0, 4, A2   ; where A0 has the hexadecimal value
                        ; xD00 0F00

Note: "x" here denotes any hexadecimal digit from 0 to F. But actually these two instructions are equivalent.

Conversely, due to overflow, my equivalence checker can report equivalence for two routines which are not equivalent. For example, consider two computations, (A + B) - C and (A - C) + B. The operators "+" and "-" are interpreted. Therefore, the equivalence checker will report equivalence for these two computations.
However, if A + B overflows whereas (A - C) + B does not, the two computations will have different results.

5.1.2 Uninterpreted Functions and Axiom Handling

In the equivalence checker, uninterpreted functions greatly reduce datapath complexity, but, due to their limitations, they also introduce errors in some cases. For example, (MPY a (b*n)) should be equal to (MPY (a*n) b) for all n. However, it is impossible to give such an axiom scheme to the decision procedure, because that would require infinitely many axioms. The equivalence checker will report inequivalence in such cases, although the computations are equivalent. Fortunately, the equivalence checker is conservative, except for situations such as overflow: by the definition of uninterpreted functions with equality, two applications of an uninterpreted function are equivalent if and only if all the arguments are equal.

5.1.3 Memory Size

I assume that there are no memory size limits or protection, and that all addresses are accessible. Therefore, there are possible cases in which a memory access is invalid, but the verification tool won't notice it. For example, in order to save a register's data to a temporary address, two programs might use different addresses: one valid, the other invalid. After some computations, both programs read the data back into the register. On an actual system, the program that accesses the invalid address will report an error and exit. But my equivalence checker can't detect such invalid accesses and reports that the values read into the registers are equivalent for both programs.

5.2 Future Work

The most obvious direction for future work is to continue development on my tool to fully support the C6x and FR500 architectures. This entails better control-flow analysis and support, as well as implementing all instructions and checks for static resource conflicts. Eventually, such a tool should be integrated into compilers or integrated development environments.
The direction of future development depends largely on the techniques available for dealing with control flow. If I am able to find effective means for analyzing programs, then it is possible to develop a highly useful stand-alone verification tool. If control flow analysis remains a difficult challenge, then the more likely path for future impact is to integrate the verification into the compiler, which has more information available to it about the control flow of the program.

In the long term, the broad trend in compiler technology is towards ever more sophisticated program analysis to enable better and better compilation. The verification style I am proposing is, in some sense, just a more detailed, careful analysis of the code. My work shows that such analysis is becoming feasible and useful. In the future, such techniques may become standard components of optimizing compilers.

Bibliography

[1] VLIW Questions by Peter Wayner, November 1994. http://www.byte.com/art/9411/sec12/art1.htm.

[2] Berkeley Design Technology, Inc. VLIW Architecture for DSP: A Two-Part Lecture, 1999.

[3] Claudia Blank, Hans Eveking, Jens Levihn, and Gerd Ritter. Symbolic simulation techniques — state-of-the-art and applications. In International Workshop on High-Level Design, Validation, and Test, pages 45-50. IEEE, 2001.

[4] R. E. Bryant. Graph-based algorithms for boolean function manipulation. IEEE Transactions on Computers, C-35(8):677-691, August 1986.

[5] R. E. Bryant. On the complexity of VLSI implementations and graph representations of boolean functions with application to integer multiplication. IEEE Transactions on Computers, 40(2):205-213, February 1991.

[6] David W. Currie. A tool for formal verification of DSP assembly language programs, 1999. MSc thesis, University of British Columbia.

[7] David W. Currie, Alan J. Hu, Sreeranga Rajan, and Masahiro Fujita. Automatic formal verification of DSP software. In 37th Design Automation Conference, pages 130-135.
ACM/IEEE, 2000.

[8] X. Feng and A. J. Hu. Automatic formal verification for scheduled VLIW code. In Joint Conference on Languages, Compilers, and Tools for Embedded Systems / Software and Compilers for Embedded Systems, pages 84-92. ACM, 2002.

[9] FUJITSU Ltd. FR-V Family, FR500 Series Instruction Set Manual, May 2001. Ver. 1.9 2001.5.

[10] Kiyoharu Hamaguchi. Symbolic simulation heuristics for high-level design descriptions with uninterpreted functions. In International Workshop on High-Level Design, Validation, and Test, pages 25-30. IEEE, 2001.

[11] Kiyoharu Hamaguchi, Hidekazu Urushihara, and Toshinobu Kashiwabara. Symbolic checking of signal-transition consistency for verifying high-level designs. In Formal Methods in Computer-Aided Design: Third International Conference, pages 455-469. Springer, 2000. Lecture Notes in Computer Science Vol. 1954.

[12] Jizhong Han. Private Communication, January 4, 2002.

[13] Brett L. Huber. Private Communication, August 27, 2002.

[14] Wen-Mei W. Hwu, Richard E. Hank, David M. Gallagher, Scott A. Mahlke, Daniel M. Lavery, Grant E. Haab, John C. Gyllenhaal, and David I. August. Compiler technology for future microprocessors. Proceedings of the IEEE, 83(12):1625-1640, December 1995.

[15] Monica S. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Conference on Programming Language Design and Implementation, pages 318-328. ACM SIGPLAN, 1988.

[16] FUJITSU Ltd. FR500 Series VLIW Embedded Microprocessors: Merging the art of multimedia with the advanced computing technology. FR500 Series, 2001. We accessed the on-line article at http://edevice.fujitsu.com/fj/CATALOG/AD07/07-00029/index_e.html.

[17] George C. Necula. Translation validation for an optimizing compiler. In Conference on Programming Language Design and Implementation, pages 83-94. ACM SIGPLAN, 2000.

[18] Rob Oshana. Optimization techniques for high-performance DSPs. Embedded Systems Programming, March 1999.
We accessed the on-line article at http://www.embedded.com/1999/9903/9903osha.htm.

[19] Sreeranga P. Rajan. Private Communication, April 7, 2002.

[20] Philips Semiconductors. An Introduction to Very-Long Instruction Word (VLIW) Computer Architecture. www.semiconductors.philips.com/acrobat/other/vliw-wp.pdf.

[21] Takao Sukemura. FR500 VLIW-architecture High-performance Embedded Microprocessor. FUJITSU SCIENTIFIC & TECHNICAL JOURNAL, 36(1):31-38, June 2000. http://magazine.fujitsu.com/us/vol36-1/paper06.pdf.

[22] Texas Instruments. TMS320C6000 CPU and Instruction Set Reference Guide, October 2000. Literature Number SPRU189F.

Appendix A

Two equivalent FR500 programs

A.1 C Functionality

    /*
     * A convolution algorithm.
     */
    #define NTAPS 16
    #define SIZE 256
    #define WIDTH 16
    #define ANS 16128

    #ifdef TIME_SOFTWARE
    #include "../../../include/time.h"
    #endif

    int main()
    {
        int A[SIZE];
        int B[SIZE];
        int taps[NTAPS];
        int checksum;
        int i;

    #ifdef TIME_SOFTWARE
    #include "../../../include/time.begin"
    #endif

        for (i = 0; i < NTAPS; i++)
            taps[i] = (1 << WIDTH) - 1;

        for (i = 0; i < SIZE; i++)
            A[i] = (1 << WIDTH) - 1;

        for (i = 0; i < (SIZE - NTAPS); i++) {
            B[i] = A[i]    * taps[0]  + A[i+1]  * taps[1]  +
                   A[i+2]  * taps[2]  + A[i+3]  * taps[3]  +
                   A[i+4]  * taps[4]  + A[i+5]  * taps[5]  +
                   A[i+6]  * taps[6]  + A[i+7]  * taps[7]  +
                   A[i+8]  * taps[8]  + A[i+9]  * taps[9]  +
                   A[i+10] * taps[10] + A[i+11] * taps[11] +
                   A[i+12] * taps[12] + A[i+13] * taps[13] +
                   A[i+14] * taps[14] + A[i+15] * taps[15];
        }

        checksum = 0;
        for (i = 0; i < SIZE; i++)
            checksum += A[i] & 0x3f;

    #ifdef TIME_SOFTWARE
    #include "../../../include/time.end"
    #endif

    #ifdef PRINTOK
        printf("RESULT: Checksum %d\n", checksum);
    #endif

        if (checksum == ANS) {
    #ifdef PRINTOK
            printf("RESULT: Success!\n");
            printf("RESULT: Return-value %d (%x)\n", 0xbadbeef, 0xbadbeef);
    #endif
            return 0xbadbeef;
        }
        else {
    #ifdef PRINTOK
            printf("RESULT: Failure!\n");
            printf("RESULT: Return-value %d (%x)\n", 0xbad, 0x0bad);
    #endif
            return 0xbad;
        }
    }

A.2 convolv.asm (with no optimization)

    .program convol
    .library "lib935.lib"
    _main:
    L_47:
        subi.p    sp,#16,sp
        sethi     #hi(65535),gr11
        setlo.p   #lo(65535),gr11
        setlos    #63,gr12
        mov       sp,gr10
    L_55:
        sti.p     gr11,@(gr10,-1024)
        subicc    gr12,#1,gr12,icc0
        sti       gr11,@(gr10,-1020)
        sti       gr11,@(gr10,-1016)
        sti       gr11,@(gr10,-1012)
        addi.p    gr10,#16,gr10
        bge       icc0,#3,L_55
    L_57:
        setlos.p  #127,gr11
        mov       gr0,gr10
        cmpi.p    gr11,#4,icc0
        mov       sp,gr12
        blt       icc0,#0,L_95
    L_91:
        ldi.p     @(gr12,-1024),gr5
        addi      gr12,#8,gr13
        ldi       @(gr12,-1020),gr12
        andi      gr5,#63,gr33
    L_65:
        ldi.p     @(gr13,-1024),gr6
        add       gr10,gr33,gr7
        andi.p    gr12,#63,gr10
        addi      gr13,#8,gr8
        ldi.p     @(gr13,-1020),gr9
        add       gr7,gr10,gr14
        andi.p    gr6,#63,gr10
        subi      gr11,#1,gr15
        ldi.p     @(gr8,-1024),gr34
        add       gr14,gr10,gr35
        andi.p    gr9,#63,gr10
        addi      gr8,#8,gr13
        ldi.p     @(gr8,-1020),gr12
        add       gr35,gr10,gr10
        subi.p    gr15,#1,gr11
        andi      gr34,#63,gr33
        cmpi      gr11,#4,icc0
        bge       icc0,#3,L_65
    L_99:
        add.p     gr10,gr33,gr36
        andi      gr12,#63,gr5
        add.p     gr36,gr5,gr10
        subi      gr11,#1,gr11
        mov       gr13,gr12
    L_95:
    L_97:
        ldi.p     @(gr12,-1024),gr37
        ldi       @(gr12,-1020),gr39
        subicc.p  gr11,#1,gr11,icc1
        addi      gr12,#8,gr12
        andi.p    gr37,#63,gr38
        andi      gr39,#63,gr40
        add       gr10,gr38,gr10
        add.p     gr10,gr40,gr10
        bge       icc1,#3,L_97
    L_93:
        setlos    #16128,gr41
        cmp       gr10,gr41,icc2
        bne       icc2,#3,L_69
    L_68:
        sethi.p   #hi(195935983),gr10
        setlo.p   #lo(195935983),gr10
        bra       L_70
    L_69:
        setlos    #2989,gr10
    L_70:
        mov.p     gr10,gr8
        addi      sp,#16,sp
        ret
    .end

A.3 convol_three.asm (-O3)

    .program convol
    .library "lib935.lib"
    _main:
    L_47:
        ..FUNCTION "fp", "fp", 1040
        ..CONFIG S, B
        ..SYMBOL "A", V, 0x80000005, A, -1024
        ..ARRAY 4, 1, 255, 0
        ..SYMBOL "B", V, 0x80000005, A, -1953789045
        ..ARRAY 4, 1, 255, 0
        ..SYMBOL "taps", V, 0x80000005, A, -1953789045
        ..ARRAY 4, 1, 15, 0
        ..SYMBOL "checksum", V, 0x80000005, R, "gr11"
        ..SYMBOL "i", V, 0x80000005, A, 0
        subi.p    sp,#16,sp
        ..LINE 0, 33
        sethi     #hi(65535),gr11
        ..LINE 0, 33
        setlo.p   #lo(65535),gr11
        ..LINE 0, 33
        setlos    #63,gr12
        ..LINE 0, 33
        mov       sp,gr10
    L_55:
        ..LINE 0, 34
        sti.p     gr11,@(gr10,-1024)
        ..LINE 0, 33
        subicc    gr12,#1,gr12,icc0
        ..LINE 0, 34
        sti       gr11,@(gr10,-1020)
        ..LINE 0, 34
        sti       gr11,@(gr10,-1016)
        ..LINE 0, 34
        sti       gr11,@(gr10,-1012)
        ..LINE 0, 33
        addi.p    gr10,#16,gr10
        ..LINE 0, 33
        bge       icc0,#3,L_55
    L_57:
        ..LINE 0, 57
        setlos.p  #63,gr13
        ..LINE 0, 56
        mov       gr0,gr11
        ..LINE 0, 58
        cmpi.p    gr13,#5,icc0
        ..LINE 0, 57
        mov       sp,gr10
        ..LINE 0, 58
        blt       icc0,#0,L_99
    L_95:
        ..LINE 0, 58
        addi.p    gr10,#16,gr12
        ..LINE 0, 58
        ldi       @(gr10,-1024),gr33
        ..LINE 0, 58
        ldi.p     @(gr10,-1020),gr34
        ..LINE 0, 58
        ldi       @(gr12,-1024),gr35
    L_65:
        ..LINE 0, 58
        andi.p    gr33,#63,gr33
        ..LINE 0, 58
        ldi       @(gr10,-1016),gr5
        ..LINE 0, 58
        add.p     gr11,gr33,gr6
        ..LINE 0, 58
        ldi       @(gr10,-1012),gr7
        ..LINE 0, 58
        andi.p    gr34,#63,gr11
        ..LINE 0, 58
        addi      gr12,#16,gr10
        ..LINE 0, 58
        add.p     gr6,gr11,gr8
        ..LINE 0, 58
        andi      gr5,#63,gr11
        ..LINE 0, 58
        add.p     gr8,gr11,gr9
        ..LINE 0, 58
        andi      gr7,#63,gr11
        ..LINE 0, 58
        ldi.p     @(gr12,-1020),gr14
        ..LINE 0, 58
        add       gr9,gr11,gr15
        ..LINE 0, 58
        subi.p    gr13,#1,gr36
        ..LINE 0, 58
        ldi       @(gr10,-1024),gr33
        ..LINE 0, 58
        andi.p    gr35,#63,gr11
        ..LINE 0, 58
        ldi       @(gr12,-1016),gr37
        ..LINE 0, 58
        add.p     gr15,gr11,gr38
        ..LINE 0, 58
        ldi       @(gr12,-1012),gr39
        ..LINE 0, 58
        andi.p    gr14,#63,gr11
        ..LINE 0, 58
        addi      gr10,#16,gr12
        ..LINE 0, 58
        add.p     gr38,gr11,gr40
        ..LINE 0, 58
        andi      gr37,#63,gr11
        ..LINE 0, 58
        add.p     gr40,gr11,gr41
        ..LINE 0, 58
        andi      gr39,#63,gr11
        ..LINE 0, 58
        ldi.p     @(gr10,-1020),gr34
        ..LINE 0, 58
        add       gr41,gr11,gr11
        ..LINE 0, 58
        subi.p    gr36,#1,gr13
        ..LINE 0, 58
        ldi       @(gr12,-1024),gr35
        ..LINE 0, 58
        cmpi      gr13,#5,icc0
        ..LINE 0, 58
        bge       icc0,#3,L_65
    L_103:
        ..LINE 0, 58
        ldi.p     @(gr10,-1016),gr42
        ..LINE 0, 58
        andi      gr33,#63,gr33
        ..LINE 0, 58
        ldi.p     @(gr10,-1012),gr44
        ..LINE 0, 58
        add       gr11,gr33,gr43
        ..LINE 0, 58
        andi.p    gr34,#63,gr3
        ..LINE 0, 58
        ldi       @(gr12,-1020),gr47
        ..LINE 0, 58
        add.p     gr43,gr3,gr45
        ..LINE 0, 58
        andi      gr42,#63,gr32
        ..LINE 0, 58
        add.p     gr45,gr32,gr46
        ..LINE 0, 58
        andi      gr44,#63,gr4
        ..LINE 0, 58
        ldi.p     @(gr12,-1016),gr7
        ..LINE 0, 58
        add       gr46,gr4,gr5
        ..LINE 0, 58
        andi.p    gr35,#63,gr36
        ..LINE 0, 58
        ldi       @(gr12,-1012),gr9
        ..LINE 0, 58
        add.p     gr5,gr36,gr8
        ..LINE 0, 58
        andi      gr47,#63,gr3
        ..LINE 0, 58
        add.p     gr8,gr3,gr14
        ..LINE 0, 58
        andi      gr7,#63,gr4
        ..LINE 0, 58
        subi.p    gr13,#1,gr6
        ..LINE 0, 58
        add       gr14,gr4,gr15
        ..LINE 0, 58
        andi.p    gr9,#63,gr37
        ..LINE 0, 58
        addi      gr12,#16,gr10
        ..LINE 0, 58
        add.p     gr15,gr37,gr11
        ..LINE 0, 58
        subi      gr6,#1,gr13
    L_99:
    L_101:
        ..LINE 0, 58
        ldi.p     @(gr10,-1024),gr33
        ..LINE 0, 58
        ldi       @(gr10,-1020),gr35
        ..LINE 0, 58
        ldi.p     @(gr10,-1016),gr37
        ..LINE 0, 58
        ldi       @(gr10,-1012),gr39
        ..LINE 0, 57
        subicc.p  gr13,#1,gr13,icc1
        ..LINE 0, 57
        addi      gr10,#16,gr10
        ..LINE 0, 58
        andi.p    gr33,#63,gr34
        ..LINE 0, 58
        andi      gr35,#63,gr36
        ..LINE 0, 58
        add.p     gr11,gr34,gr11
        ..LINE 0, 58
        andi      gr37,#63,gr38
        ..LINE 0, 58
        add.p     gr11,gr36,gr11
        ..LINE 0, 58
        andi      gr39,#63,gr40
        ..LINE 0, 58
        add       gr11,gr38,gr11
        ..LINE 0, 58
        add.p     gr11,gr40,gr11
        ..LINE 0, 57
        bge       icc1,#3,L_101
    L_97:
        ..LINE 0, 68
        setlos    #16128,gr41
        ..LINE 0, 68
        cmp       gr11,gr41,icc2
        ..LINE 0, 68
        bne       icc2,#3,L_69
    L_68:
        ..LINE 0, 73
        sethi.p   #hi(195935983),gr10
        ..LINE 0, 73
        setlo.p   #lo(195935983),gr10
        ..LINE 0, 73
        bra       L_70
    L_69:
        ..LINE 0, 80
        setlos    #2989,gr10
    L_70:
        mov.p     gr10,gr8
        addi      sp,#16,sp
        ..CONFIG E
        ..LINE 0, 82
        ret
        ..CONFIG E
    .end

