UBC Theses and Dissertations
High-level cycle-accurate specification of microprocessors Chang, Felix Sheng-Ho 2001

High-Level Cycle-Accurate Specification of Microprocessors

by

Felix Sheng-Ho Chang

B.Sc., University of British Columbia, 1999

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia

August 2001

© Felix Sheng-Ho Chang, 2001

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Abstract

This thesis introduces a new specification style for processor microarchitectures. My goal is to produce very simple, compact, but cycle-accurate descriptions that can be automatically simulated efficiently, in order to enable early exploration of different microarchitectures and their performance. The key idea behind my approach is that one can derive the difficult-to-design forwarding and stall logic completely automatically. I have implemented a specification language for pipelined processors, along with an automatic translator that creates cycle-accurate software simulators from the specifications. I have specified a pipelined MIPS integer core in my language. The entire specification is less than 300 lines long and implements all user-mode instructions except for coprocessor support.
The resulting, automatically-generated, cycle-accurate simulator achieves over 240,000 instructions per second simulating MIPS machine code. This performance is within an order of magnitude of large, hand-crafted, cycle-accurate simulators, but my specification is far easier to create, read, and modify.

Contents

Abstract ii
Contents iii
List of Tables vi
List of Figures vii
Acknowledgments viii
Dedication ix

1 Introduction 1
1.1 Motivation 1
1.2 Project 2
1.3 Contributions 3

2 Background 4
2.1 Processor Microarchitecture 4
2.2 Related Work 6
2.2.1 Hawk 7
2.2.2 Sleipnir 9
2.2.3 UPFAST 10
2.2.4 LISA 11
2.2.5 Loosely-Coupled Pipelined Circuits 12
2.2.6 TRAC 13

3 PSL: A Processor Specification Language 16
3.1 Design Objectives 16
3.2 Specification Framework 17
3.3 Language Description 22

4 Implementation of a Simulator-Generator 25
4.1 Design Goals 25
4.2 Data Structures and Algorithms 26
4.3 Simulation Engine 27
4.4 Optimization Trade-offs 31
4.5 Future Optimization Possibilities 32

5 Example: MIPS R2000 34
5.1 Specification of MIPS R2000 35
5.1.1 Fetch Stage 35
5.1.2 Decode Stage 37
5.1.3 ALU Stage 39
5.1.4 Memory Stage 41
5.1.5 WriteBack Stage 43
5.1.6 Exceptions 44
5.2 Results 45

6 Conclusions and Future Work 48

Bibliography 50

Appendix A PSL Language Reference 52
A.1 Introduction 52
A.2 Keywords and Identifiers 52
A.3 Constants 53
A.4 Containers 54
A.5 Expressions 55
A.5.1 Unary Operators 55
A.5.2 Additive and Multiplicative Operators 56
A.5.3 Shift Operators 57
A.5.4 Comparison Operators 57
A.5.5 Bitwise Operators 57
A.5.6 Logical Operators 58
A.5.7 If-Then-Else Operator 58
A.5.8 Container References 58
A.5.9 Syscalls 59
A.6 Statements 59
A.7 Stages 61

Appendix B PSL Grammar 63

Appendix C MIPS R2000 Specification 69

List of Tables

Context diagrams 55

List of Figures

3.1 Context diagrams 20
4.1 Pseudo-code for the main simulation function 28
4.2 Pseudo-code for a reference to container C 29
4.3 Pseudo-code for a reference to container C' 30
4.4 Pseudo-code for a reference to container C$ 30
5.1 Steps needed to simulate Dhrystone 46

Acknowledgments

I thank my advisor, Alan Hu, for helping me explore an idea — an idea that turned into a long but fulfilling research project. Furthermore, I have been blessed with invaluable advice from many individuals, especially Mark Greenstreet, whose suggestions have helped me stay on track. Finally, my gratitude goes out to Brian Winters, who generously helped me get my writing started; my thesis would not have been completed otherwise. This research was supported in part by the Natural Sciences and Engineering Research Council of Canada.

FELIX SHENG-HO CHANG

The University of British Columbia
August 2001

Dedicated to my parents.

Chapter 1

Introduction

1.1 Motivation

Processor design — whether for mainstream commodity products or application-specific processor cores — involves numerous difficult trade-offs between competing goals and constraints. These trade-offs span from instruction set architecture issues to low-level circuit implementations. This thesis focuses on the microarchitectural level, the highest level at which the performance effects of pipelining, stalls, functional unit latency, and similar design decisions become visible.

The performance effects of microarchitectural decisions are hard to predict. How much performance do we lose if we stall one cycle for an unusual instruction combination instead of adding an expensive bypass path? How much performance do we gain with a faster functional unit? How does performance change if the typical software workload changes in some way?
In practice, one relies heavily on cycle-accurate simulators to determine the number of clock cycles needed to execute samples of benchmark software, combined with rough timing and circuit models to estimate the clock frequency and area of a given design.

Unfortunately, writing a cycle-accurate simulator for a given microarchitecture is time-consuming and error-prone, especially since the resulting simulator must run fast enough to simulate non-trivial software. Therefore, simulation can be used to explore only a few design options; other possibilities can be considered only vaguely, using intuition and back-of-the-envelope calculations.

Our hypothesis is that efficient cycle-accurate simulators can often be generated rapidly from a simple, compact, but cycle-accurate style of specifying microprocessors. If our hypothesis is true, it will enable microprocessor designers to do early exploration of different design options.

1.2 Project

To test the hypothesis, we introduce a novel specification style tailored expressly to describing processor microarchitectures. The specification style allows the designer to think in terms of pipeline structure, functional unit latency, and whether or not a bypass path is provided. An automated tool can then infer all of the bypass and stall control logic, which is normally one of the most time-consuming parts of a cycle-accurate processor model to design.

To test the usefulness of our approach, we have embodied our specification style in a simple language (PSL) for describing pipelined processors. We have also written an automatic translator that reads descriptions in this language and generates a cycle-accurate simulator for the processor. The key questions to answer are (1) whether the specification style enables simpler, more compact descriptions, and (2) whether the generated simulator is fast enough to be used for meaningful performance evaluation.
If the answer to both questions is affirmative, then we have a useful tool for early-stage microarchitectural design exploration.

1.3 Contributions

The work presented in this thesis makes the following contributions to microprocessor specification and simulation research:

• A novel style of specifying cycle-accurate processor models, along with an explanation of our rationale for this approach to microprocessor specification.
• A prototype language, PSL, that embodies this specification style.
• A simulator generator that automatically generates a cycle-accurate simulator from a microprocessor specification written in PSL.
• The specification of a non-trivial microprocessor (the MIPS R2000) using PSL.

Chapter 3 of this thesis explains the specification style and the prototype language PSL. Chapter 4 details the simulator generator. And Chapter 5 presents the specification of the MIPS R2000 as an example of using PSL.

Chapter 2

Background

2.1 Processor Microarchitecture

A microprocessor can be described at many levels of abstraction. Each level corresponds to a different point of view and allows exploration of a different set of design choices. From an assembly programmer's point of view, a processor accepts and executes instructions one at a time. At this level (often referred to as the ISA level), the implementation details of the microprocessor are not visible: only the effect of executing instructions.

At the other end of the spectrum, a very low level of abstraction is the layout level, where a processor is described by a collection of rectangles specifying the desired structures on each layer of the fabrication process. At this level, there are enough details to determine the actual die size and the electrical properties of transistors, interconnect, and other devices. These important factors figure strongly in the manufacturing cost. The layout can also be analyzed to determine the optimum clock speed supported by the design.
Between these two levels, many intermediate levels of abstraction have been proposed, named, and utilized in real-world designs. Below the ISA level is a level often referred to as the microarchitecture level, where the resources of the processor and their organization are described. The next lower level is often referred to as the Register Transfer Language (RTL) level, where the operation of the processor is described by detailed pseudo-code statements using RTLs. (RTL is a simple machine-independent language used to describe the changes to processor registers.) An even lower level is known as the gate and logic level, where the exact logic gates used by the processor are specified.

This thesis focuses mainly on the microarchitecture level. Microarchitecture can be defined as follows:

"Microarchitecture refers to the set of resources and methods used to realize the architecture specification. The term typically includes the way in which these resources are organized as well as the design techniques used in the processor to reach the target cost and performance goals. The microarchitecture essentially forms a specification for the logical implementation." [13]

At this level, layout information is not available, but the specification is detailed enough to allow cycle-accurate simulation of the processor. Combined with an estimate of the final clock speed, simulation at this level can determine the number of clock cycles required to execute sample programs and thereby estimate the real-world performance of different microprocessor designs.

The ISA level is too high-level to include most processor design decisions. Intuitively, the microarchitecture level is the highest level of abstraction that still exposes the implementation details (such as pipeline depth). Many design decisions made at this level directly affect the cost and performance of the processor.
If these design choices lead to undesirable consequences, it is often difficult to undo those consequences at lower levels of abstraction. A typical design cycle therefore goes as follows:

1. Specification at the ISA level.
2. Specification at the microarchitecture level.
3. Cycle-accurate simulation, in conjunction with intuitive estimation of hardware complexity and performance.
4. Make alternative design choices, then go back to step 2.
5. Specification at lower levels...

Conventional specification methods require verbose descriptions that take a long time to write and a lot of work to modify. Furthermore, efficient simulators usually cannot be generated automatically from the specification, making it necessary to spend a significant amount of time hand-crafting a simulator from a specification. Since steps 2 and 3 are often repeated many times during the design cycle, we decided to focus our work on the microarchitecture level and on ways to speed up specification and automate simulator generation at that level.

2.2 Related Work

Because of the need to simulate a processor before it is built, processor simulators have been around for decades and are too numerous to cite here. To place our simulator in the context of other simulators, we emphasize the distinction between an instruction set simulator, a cycle-accurate simulator, and a logic simulator (including cycle-based logic simulation).

An instruction set simulator simulates a processor at the instruction set architecture level, completely ignoring how the processor is implemented. Because no processor details are modeled, these simulators provide the fastest simulations (several million instructions per second) and are mainly useful for early software development before the processor is completed.
A logic simulator simulates the actual hardware implementation (e.g., at the gate level) and is primarily used to validate the correctness of the logic-level implementation. However, it runs much more slowly (e.g., a few thousand instructions per second). Cycle-accurate simulators fall somewhere in between instruction set simulators and logic simulators, providing intermediate simulation speed and accuracy. The simulator must model the processor accurately enough to compute the number of cycles needed to execute a program, but does not model the combinational logic. Simulation speed must be fast enough to run non-trivial benchmark code to evaluate processor performance.

Cycle-accurate simulators are traditionally hand-coded for a particular microarchitecture. The process of hand-coding the simulators is both time-consuming and error-prone. Therefore, various projects have been launched to develop techniques for automating the generation of microprocessor simulators. The following sections describe these related projects in order from high-level to low-level, each with a brief description and a comparison with PSL.

2.2.1 Hawk

Hawk [9] is an executable processor specification language designed for rapid prototyping and simulation. Since it is embedded in the functional language Haskell, it is very flexible: specifications can be written at various levels of abstraction (such as the ISA level and the cycle level).

A processor is specified in Hawk by joining components together using wires. The Hawk library comes with an extensive selection of pre-built components (such as register files) and different wire types (with different delay properties). Custom components and wires can also be written by describing their behavior using a Haskell expression. To combine these components into an executable specification, Hawk treats each component as a function that maps clock cycles into values.
Essentially, each component generates an infinite list of values (one per clock cycle). Each value of a component is allowed to rely recursively on the component's own past values, or on the values of other components. (Haskell's lazy semantics make it possible to execute such a specification of multiple infinite lists of values.) Simulation is then performed by feeding the specification into Haskell and querying the values of various components at different clock cycles.

One useful domain-specific feature in Hawk is the grouping of values into a transaction. A transaction is basically a tuple of values that can obtain additional values as it moves from component to component. It provides a natural way to model instruction contexts in a typical microprocessor: each instruction can carry a transaction that receives more and more refined values and states as it moves through the processor.

By allowing add-ons written in a functional language, Hawk is far more flexible and dynamic than PSL. However, the flexibility comes at the price of performance. Haskell programs are normally interpreted, which is orders of magnitude slower than compiled execution. Even experimental Haskell compilers generate code that is several times slower than equivalent hand-written C code, unless extensive hints and annotations are used to reduce run-time checking and evaluation. The added effort would nullify the advantages of using Hawk in the first place.

2.2.2 Sleipnir

Sleipnir [6] is an instruction-level simulator generator that exploits domain specificity by understanding the notion of instructions and instruction formats. Under Sleipnir, a microprocessor is specified by a list of resources (such as registers and memory locations) and a list of instructions. Each instruction is specified by its recognition bit pattern and a section of raw C code that should be executed when the microprocessor executes the given instruction.
The Sleipnir simulator then works as follows:

    while (1) {
        Identify the current instruction using the list of bit patterns.
        Execute the C code associated with the identified instruction.
    }

Since the instructions are simulated one at a time, Sleipnir is well suited for ISA-level simulation. It is sometimes possible to approximate cycle-accuracy by inserting raw C code to calculate the effects of delays and stalls, but doing so requires effort comparable to that of writing a hand-crafted simulator. For complex microprocessors, Sleipnir simulations are effectively restricted to the ISA level.

By using native C code for simulation, and by optimizing the simulator for the ISA level, Sleipnir boasts a higher simulation speed than most related simulator generators. According to the statistics cited in [6], the author's instruction-level specification of the integer portion of the MIPS R2000 consists of 700 lines and executes at 5 million instructions per second on a 250 MHz MIPS R10000.

Since Sleipnir, unlike PSL, is mainly designed for ISA-level simulation, Sleipnir is well suited for early software development, but less suited for early exploration of microarchitectural design choices.

2.2.3 UPFAST

UPFAST [10] is an integrated processor specification system that can automatically generate a low-level simulator, an assembler, and a disassembler based on a processor specification.

A processor specification using UPFAST basically consists of three components: the artifacts (such as registers and memory locations), the pipeline stages, and the instructions. Each instruction is specified by its instruction bit pattern, its assembly instruction syntax, and the RTL statements it should perform when inside different stages.
The simulator then works as follows:

    while (1) {
        For each stage S {
            Execute the RTLs associated with the instruction in S
        }
        Move instructions forward in the pipeline if possible.
        Fetch a new instruction if the first stage is vacant.
    }

To facilitate rapid specification of microprocessors, UPFAST provides some domain-specific statements like stall (which prevents the instructions in the following stages from proceeding forward) and freeze (which blocks the entire pipeline).

UPFAST is the closest related work to PSL in its model of simulation and its approach to processor specification. However, unlike in PSL, all forwarding and hazard detection in UPFAST has to be detected and performed explicitly. For example, data forwarding requires each instruction to check the value fields in the neighboring stages and then explicitly fetch them to replace the instruction's own outdated values. And hazard detection requires each instruction to check for possible hazards and then issue the stall command explicitly.

This low-level nature is reflected in the size of the processor description. According to [10], the author's specification of the integer portion of the MIPS R2000 consists of 5000 lines and executes at 200K cycles per second on a 200 MHz Pentium Pro.

2.2.4 LISA

LISA [11] is an extensive processor specification system that allows specifications written in LISA to be automatically translated into a simulator, an assembler, and a disassembler. Simulation can be done at the ISA level, the cycle level, and even the phase level (where each cycle is divided into sub-cycles). Unlike PSL, one of LISA's goals is to be general enough for vastly diverse architectures; there are therefore fewer built-in assumptions in LISA, making specifications in LISA more verbose.

A microprocessor specification in LISA consists mainly of the hardware resources and the processor operations.
While specifications in LISA can also include additional information (like the assembly-language syntax that may be used to automatically generate an assembler), these features are not directly related to processor simulation and therefore will not be described here.

In LISA, each processor operation is described by its syntax (for assemblers and disassemblers), its behavior (a block of RTL-like statements), and an optional activation section that describes the timing behavior of the operation. LISA's timing model is based on discrete control steps; processor designers can use these control steps to represent instructions, clock cycles, and even clock phases of the processor. Simulation then proceeds as follows:

    while (1) {
        For each activated operation: execute its associated RTLs.
        Activate more operations if possible.
    }

LISA also supports domain-specific commands like stall (which blocks the activation of following stages) and flush.

However, in order to be general enough to accommodate vastly different architectures, LISA does not contain built-in support for automatic forwarding and automatic stalling. For example, mandatory stalling has to be explicitly coded in each operation's activation condition. Therefore, specifications in LISA are necessarily more verbose than those in PSL.

2.2.5 Loosely-Coupled Pipelined Circuits

Marinescu and Rinard [8] propose a style of specification based on decoupling pipelined circuits into modules that are loosely connected by FIFO queues. The specification can then be automatically translated into Verilog.

Basically, a pipelined circuit is specified by a set of global resources, a collection of modules, and the queues that connect the modules. Each module is specified by a set of rewrite rules, using the notation of Term Rewriting Systems [3]; each rule modifies the current machine state.
The semantics are that during each clock cycle, an enabled rule is arbitrarily chosen and executed, repeatedly, until there are no more enabled rules. Then, at the end of the clock cycle, the module may insert items into its output queue. Furthermore, if there is no space in the output queue of a module, then the entire module is disabled, with all of its rewrite rules disabled.

Verilog generation is performed by symbolically evaluating each module's rewrite rules into a single transition expression, and then converting these transition expressions into Verilog. The queues are converted into inter-module buffers that are readable only at the beginning of clock cycles. To optimize the size and speed of the synthesized circuit, the relaxation algorithm given in [8] can be used to identify rules that can be executed earlier or concurrently without breaking the semantics. By relaxing the enabling conditions of some rules, the transition expressions can be simplified, reducing the final circuit size.

The emphasis of this approach is on facilitating modular design. By describing the inter-module buffers as queues, designers can concentrate on building each individual module without worrying about the interaction with other modules. However, unlike PSL, this specification style is not specifically designed for microprocessors and thus lacks some of the domain-specific features contained in PSL. While automatic stalling is provided when output queues fill up, there is no automatic forwarding and no other automatic stalling. For example, in the sample pipelined processor specification cited in [8], the operand-fetch stage is described by code that explicitly forwards data from a later module by searching for conflicting WRITE instructions in the later module's input queue and reading the updated value if found. Therefore, specifications written in this style are more verbose than in PSL.
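The queue-coupled execution model described in this section can be sketched in ordinary code. The following is a minimal illustration only, not the authors' actual system: the module names, rules, and queue capacity are invented, and Python stands in for their Term Rewriting System notation.

```python
from collections import deque

# Sketch of loosely-coupled pipeline modules connected by bounded queues.
# A module is disabled outright when its output queue is full -- the
# automatic back-pressure stalling described above.

QUEUE_CAPACITY = 4  # illustrative bound

class Module:
    def __init__(self, name, rule, out_queue):
        self.name = name
        self.rule = rule            # per-item rewrite: item -> new item
        self.in_queue = deque()
        self.out_queue = out_queue  # downstream queue (or final sink)

    def step(self):
        if len(self.out_queue) >= QUEUE_CAPACITY:
            return                  # stall: no space downstream
        if self.in_queue:
            self.out_queue.append(self.rule(self.in_queue.popleft()))

# Two toy modules: the first doubles a value, the second adds one.
results = deque()
adder = Module("add1", lambda x: x + 1, out_queue=results)
doubler = Module("double", lambda x: 2 * x, out_queue=adder.in_queue)

doubler.in_queue.extend([1, 2, 3])
for cycle in range(6):              # simulate a few clock cycles
    for m in (adder, doubler):      # step later stages first, freeing space
        m.step()
# results now holds [3, 5, 7]
```

Because each module only checks its own queues, the two modules are written independently of each other, which is exactly the modularity benefit claimed for this style.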
2.2.6 TRAC

Hoe and Arvind [5] propose an operation-centric style of specification using the notation of Term Rewriting Systems [3], and introduce TRAC as a synthesis tool that translates a specification in this style directly into Verilog.

Essentially, a machine is specified as an initial machine state and a set of rules that rewrite the current state into the next state. The state of a machine is the combination of all visible values in the model. For example, given a 5-stage pipelined microprocessor like the MIPS R2000, the machine state could be modeled by the combined state of the registers, memory locations, and latch values in each stage.

To specify how the machine evolves from cycle to cycle, transition rules need to be specified. Basically, each rule is a guarded rewrite command that transforms the current state into the next state given suitable preconditions. For example, the fetching of instructions could be modeled by a rule that transforms (PC, QUEUE, IMEM) into (PC+1, QUEUE:IMEM[PC], IMEM), where PC is the program counter, QUEUE is the list of fetched instructions, and IMEM is the instruction memory. Simulation then proceeds as follows:

    while (1) {
        Apply all non-conflicting enabled rules to the current state.
    }

Hoe and Arvind present several theorems in [5] that can be used to optimize the specification. For example, one theorem determines when multiple enabled rules can be applied simultaneously and yet achieve a result that is equivalent to atomically applying each rule in some order. By utilizing these optimization analyses, TRAC is able to go directly from specification to Verilog code without too much penalty. According to the statistics cited in [5], the automatically synthesized MIPS integer core is only 26% larger than the equivalent core generated from hand-written RTL.

TRAC's emphasis is on feasible and efficient synthesis.
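The guarded-rewrite model can be illustrated with the fetch rule above, which rewrites (PC, QUEUE, IMEM) to (PC+1, QUEUE:IMEM[PC], IMEM). This is a hedged sketch in ordinary code rather than TRAC's Term Rewriting System notation, and the concrete state layout is invented for illustration:

```python
# State is a tuple (pc, fetched_queue, imem).  A rule is a (guard,
# transform) pair; the rule is enabled when its guard holds of the state.

def fetch_guard(state):
    pc, queue, imem = state
    return pc < len(imem)                 # still instructions to fetch

def fetch_transform(state):
    pc, queue, imem = state
    return (pc + 1, queue + [imem[pc]], imem)

rules = [(fetch_guard, fetch_transform)]

def step(state):
    # One clock cycle: apply every enabled rule.  With a single rule there
    # are trivially no conflicts; TRAC's theorems address the general case.
    for guard, transform in rules:
        if guard(state):
            state = transform(state)
    return state

state = (0, [], ["addi", "add", "sw"])
for _ in range(5):                        # the fetch rule disables itself
    state = step(state)                   # once imem is exhausted
# state[0] == 3 and state[1] == ["addi", "add", "sw"]
```

A real pipeline model would add rules for the other stages, and the interesting work — deciding which enabled rules may fire together in one cycle — is exactly what TRAC's optimization theorems provide.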
While PSL specifications also contain enough information for Verilog synthesis, our higher-level semantics would require more optimization effort to generate efficient Verilog. To achieve feasible and efficient synthesis, TRAC descriptions are basically at the RTL level, but with a more natural syntax to describe the detailed interaction between hardware components. In contrast with PSL, there are no built-in features to automate stalling, forwarding, or even the movement of instructions through the processor. Every feature of the hardware is specified explicitly. This ensures efficient hardware synthesis, but it requires much lower-level specifications than PSL, making TRAC less suitable for early exploration of different design choices.

Chapter 3

PSL: A Processor Specification Language

This chapter explains the specification style and its design objectives. We have embodied this specification style in a prototype language called PSL, which is introduced in Section 3.3. The detailed PSL reference and its grammar can be found in Appendices A and B.

3.1 Design Objectives

Our main objective is to allow rapid specification of processors and automatic generation of processor simulators. This means the specification style must be high-level enough to allow concise and intuitive specifications that are both easy to write and easy to modify. At the same time, the specification style must be specific and detailed enough to allow automatic generation of cycle-accurate simulators. In order to satisfy both goals simultaneously, we have decided to sacrifice generality by optimizing the specification style only for microprocessors.

Being domain-specific allows the specification language to include built-in support for common microprocessor design features. This enables processor designers to describe a processor with sufficient detail for cycle-accurate simulation (meeting the second goal) while using intuitive and concise language elements (meeting the first goal).
In the following section, we present the conceptual framework of our specification style.

3.2 Specification Framework

Our entire approach hinges on two assertions:

Assertion 1 (Sequential Execution Semantics). Processors are designed to provide the same programmer-visible behavior as if individual instructions were executed completely, one at a time, one after another.

Assertion 2 (Producer Knows Best). The functional unit that produces (or fails to produce) a result is the clearest place in a description to control access to that result.

The Sequential Execution Semantics assertion means that a description style should, by default, enforce the normal execution semantics that software programmers expect and that processors typically provide. For example, if an instruction depends on the result of the preceding instruction, a simulator generator should determine automatically that a stall or bypass is necessary. In contrast, typical processor description languages require the model to describe explicitly all the stall and bypass conditions needed to enforce sequential execution semantics. Describing these conditions is error-prone and time-consuming, and needlessly complicates the description. Note that our specification style can handle small violations of this assertion — for example, our MIPS R2000 model in Chapter 5 accurately models delayed branches — but if a processor completely disregards sequential execution semantics, e.g., a completely non-interlocked VLIW processor, our specification style provides no advantage over existing methods.

The Producer Knows Best assertion means that the description code that determines which instructions have access to which results should be localized to the portion of the description that produces the result, rather than scattered throughout the description among all the possible places a result might be needed.
Other description styles typically require an explicit check of whether a result is available in any code that accesses that result. In our style, those checks are implicit and can be automatically inferred by a simulator generator. For example, consider the code sequence:

    addi $t0, $t1, 1
    add  $t0, $t0, $t2
    sw   $t0, 0($t3)

(MIPS assembly convention has the destination register as the first operand, except for store instructions.) In our description style, the add-immediate and add instructions would simply announce whenever their results were ready, and the Sequential Execution Semantics assertion would ensure that the add and store-word instructions get the proper data values, via forwarding or stalling, without any explicit checks for data readiness in the description.

More concretely, our description style requires specifying the functional units of the processor, what each functional unit does, and how instructions flow from one functional unit to the next (or are retired). Instructions carry with them their context, i.e., the instruction's view of the values of all machine state. The use of contexts is similar to other formal models of pipelined execution (e.g., [1, 9]) and greatly simplifies the semantics of the description. Functional units will rely on the newest available value for operands, automatically forwarding or stalling as necessary, unless otherwise specified. Results are written into the instruction's context, where they are available to subsequent instructions automatically. Before an instruction retires, its results are committed to a global context, unless the instruction was killed due to exceptions, misprediction, etc.

To support the Sequential Execution Semantics, we introduce the concepts of containers and access operators.
Containers are like variables and correspond to all programmer-visible machine state, such as registers, the program counter, condition codes, etc., as well as any additional bookkeeping information needed to process the instruction, such as the instruction register, the opcode, etc. An instruction's context is simply an assignment of values to containers. In addition to their normal values, containers can be assigned the special values UNAVAILABLE, which indicates that the given value is not yet available, and TRANSPARENT, which indicates that the value of the container should be taken from the preceding context. Typically, an instruction's context will initialize with all containers being UNAVAILABLE, and then all containers that the instruction won't modify will be changed to TRANSPARENT. When the instruction's result is available for forwarding to subsequent instructions, the result container will be changed from UNAVAILABLE to the result value. When the instruction is ready to retire, the result value will be written to the global context. (See Figure 3.1.) We call assignments to the current context announcements; we call assignments to the global context commitments. When a functional unit accesses the value of a container, the default semantics is to return the most recent non-TRANSPARENT value for that container. The value might be forwarded from a recent instruction or taken from the global context.

[Figure 3.1: diagram of the global context and the per-instruction contexts of the addi, add, and sw instructions, each showing the containers $t0, $t1, $t2, and $t3.]

Figure 3.1: The store-word instruction wants to execute and needs to access the values of the registers.
It gets the values from the most recent context with a non-TRANSPARENT value, so it gets $t0 from the preceding instruction, but $t3 from the global context, because the add instructions assigned TRANSPARENT to their $t3 containers (because they knew they didn't modify the value of $t3). IF, ID, EX, MEM, and WB refer to MIPS R2000 pipeline stages.

If the most recent value is UNAVAILABLE, the functional unit will stall. This default behavior automatically enforces sequential execution semantics. Each instruction always gets the most current value for its operands. Stalling occurs automatically if a needed result has not yet been announced. Forwarding occurs automatically if a needed result has been announced, but not yet committed. Precise exceptions are easy to implement because uncommitted values disappear when instructions are killed/retired. No description effort is spent implementing any of these effects. In unusual circumstances, we would like to bypass the sequential execution semantics, so we provide alternative access operators. To date, we have needed only four: Given a container c, the default access c denotes the most recent non-TRANSPARENT value of the container, starting from the present context; c' denotes the most recent non-TRANSPARENT value of the container, starting from the previous context; c$ denotes the most recent non-TRANSPARENT value of the container from the previous context after all functional units further in the pipeline have completed the current clock cycle; and c# denotes the value of the container in the global context. For example, the MIPS instruction set architecture specifies delayed branches — the instruction following a branch always executes, regardless of whether the branch is taken or not. Our specification style most naturally produces non-delayed branches in which the Fetch unit is stalled until the branch is resolved. To get delayed branches, we simply add a branch target container and a branch taken flag.
The fetch unit fetches instructions normally, unless the branch taken flag signals a branch. The branch taken flag is accessed using the # access operator, so the fetch unit doesn't stall even though the branch decision isn't available yet. (The functional unit that resolves the branch commits the decision to the global context.)

Note that existing description styles make the delayed branch case easy to describe, as it is a natural artifact of the pipelining, but make normal execution hard to describe. Our description style makes the common case (sequential execution semantics) simple, at the cost of making the uncommon cases somewhat harder.

3.3 Language Description

To demonstrate these concepts, we have developed a prototype language, PSL, that embodies this specification style. This section contains a brief description of the language. The detailed language reference of PSL is available in Appendix A, and its grammar is listed in Appendix B. Before we begin describing the language, it is useful to introduce a sample specification using PSL. Below is a specification of a simple, fictitious processor:

    container Mem[0..0xFFFF], PC, Op;
    enum "raise":=1, "lower":=2;

    constructor Fetch {
        true: { Op <- Mem#[PC]; PC := PC'+1; goto ALU; }
    }

    stage ALU {
        Op == "raise" : { syscall(7,0,0); }
        Op == "lower" : { syscall(3,0,0); }
        true          : { retire; }
    }

The above specification describes a simple microprocessor with two stages: an instruction fetch stage and an ALU stage. The main memory holds exactly 65536 words, and there are two processor state variables (PC and Op). During each clock cycle, an instruction is fetched from the memory and the PC is incremented by 1. The previously fetched instruction then moves into the ALU stage where it will make one of two possible syscalls (based on the instruction opcode) and then retire. In general, a processor specification in PSL includes a list of containers and a list of stages.
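As an illustration of the cycle-by-cycle behavior, the sample two-stage processor above can be exercised by a hand-written C loop of the general kind our simulator-generator emits. This is only a sketch: the names (sim_step, OP_RAISE, and the counters standing in for the two syscalls) are invented here and are not part of PSL or of the generated code.

```c
#include <assert.h>

/* Hand-written sketch of a simulation loop for the two-stage example
 * processor above.  All names are invented for illustration. */
#define MEMSIZE 8
enum { OP_RAISE = 1, OP_LOWER = 2 };

static int Mem[MEMSIZE];   /* main memory (word-addressed)            */
static int PC;             /* global context: program counter         */
static int alu_op;         /* instruction latched in the ALU stage    */
static int alu_busy;       /* is the ALU stage occupied?              */
static int raises, lowers; /* stand-ins for the two syscalls          */

/* One clock cycle: evaluate the ALU stage first (end of the pipe),
 * then the Fetch stage, mirroring the back-to-front stage order used
 * by the generated simulator. */
void sim_step(void) {
    if (alu_busy) {                      /* ALU stage: act, then retire */
        if (alu_op == OP_RAISE)      raises++;
        else if (alu_op == OP_LOWER) lowers++;
        alu_busy = 0;
    }
    alu_op  = Mem[PC % MEMSIZE];         /* Fetch: Op <- Mem#[PC], with a
                                            wrap to keep the toy memory
                                            in bounds */
    alu_busy = 1;                        /* goto ALU                    */
    PC = PC + 1;                         /* PC := PC' + 1               */
}
```

Running four cycles over the three-instruction program {raise, lower, raise} retires all three instructions while the fetch of a fourth overlaps the last retirement, which is exactly the overlap the real specification expresses implicitly.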
Each stage has a name and a list of guarded commands. Furthermore, exactly one stage must be declared as a constructor stage. In the above example, there are three containers (Mem, PC, and Op) and two stages (the constructor stage Fetch and the non-constructor stage ALU). The simulation begins with a global context (where all containers are initialized to zero or some other default values) and a list of stages which are all vacant (meaning no instruction is in any of the stages). During each clock cycle, every occupied stage will execute all enabled commands using the given instruction's context. For example, if an enabled command refers to the container PC, then by default the value it uses is the value from the instruction's private context. Legal statements include announcement statements using the <- operator (which modify the values in the current instruction context), commitment statements using the := operator (which modify the values in the global context), the goto statement (which indicates where the instruction should go in the next cycle), and the retire statement (which indicates the instruction should exit the processor after this cycle). It is illegal for an instruction to execute both a goto statement and a retire statement in the same cycle. The semantics of multiple announcements and commitments is that these context modifications happen atomically in parallel; it is illegal to make more than one announcement or commitment to the same container during any given cycle. If an instruction encounters unavailable values when evaluating one or more of the guards, or if an enabled statement requires a value that is unavailable, then the instruction is deemed to have failed in the current cycle. That means none of its context modifications will happen, and it stays in its current stage regardless of any goto or retire statements.
On the other hand, if an instruction does not encounter any unavailable values during a cycle, but its destination stage is occupied in the next cycle, then this instruction is deemed to have stalled in the current cycle. That means all of its announcements and commitments do happen, but the instruction stays in the current stage. At the end of a clock cycle, all instructions that have an enabled retire command will exit the processor, while all instructions that have an enabled goto command will attempt to move into their destination stages. The movement of instructions happens atomically, with the instructions furthest in the pipeline having the top priority. This means given stages 1, 2, 3, 4, and 5 in a pipeline, if the instruction in stage 5 wishes to retire, then the instruction in stage 4 can move into stage 5 within the same cycle as the retirement. Furthermore, whenever the constructor stage would be vacant for the next cycle, a new instruction will be created and placed into the constructor stage just before the beginning of the next cycle. The newly created instruction has a per-instruction context where all containers are initialized to the special value UNAVAILABLE. All four access operators described in Section 3.2 are available in PSL. The syntax of the expressions is modeled after the C language. For the detailed language reference, please refer to Appendix A.

Chapter 4

Implementation of a Simulator-Generator

4.1 Design Goals

To test our specification approach and its embodiment in PSL, we have developed a simple proof-of-concept tool that automatically translates specifications into cycle-accurate simulators. There are two objectives for the simulator-generator: the generation of the simulator must be fully automatic, and the generated simulator should be as fast as possible.
The first goal is achievable because the PSL language has been defined with this goal in mind; that means, barring a few extreme cases (such as division by zero), the language has no implementation-dependent or undefined behavior. The second goal is achieved by employing aggressive compile-time analysis to eliminate redundant computation, and by optimizing the simulation engine for the most common cases. More explanation of the optimization techniques developed for this project can be found in Section 4.4. In the following sections, we present the prototype implementation of the simulator-generator.

4.2 Data Structures and Algorithms

There are three main data structures used in the generated simulator: the global array (which stores the global context of all containers), the context singly-linked list (which has one node for each live instruction currently in the simulator), and the future array (which stores the atomic parallel commitments during a cycle). The global array is simply an array that stores the global context of all containers during the simulation. It has the same number of elements as containers, and all elements are initialized to 0 before the simulation begins. The simulator can then optionally load an image file that contains initial values for the global containers. In our simulator engine, the context singly-linked list is used to store per-instruction context. It is basically a linked list of nodes, one node for each live instruction. As instructions are fetched in the constructor stage, a new empty node is allocated for the new instruction and inserted at the head of the list. As instructions retire, their nodes are removed from the list. So in effect, the nodes of the list are sorted in program order, with the head being the youngest instruction and the tail being the oldest. Conceptually, the per-instruction context is just an array that stores the values of all containers in the private context.
However, since most instructions only operate on a small subset of the list of containers, it is inefficient to store the per-instruction context as a large array. Instead, each node contains the full list of announcements in the context of its associated instruction. An empty node means this instruction has not made any announcements; therefore, all containers are regarded as UNAVAILABLE. Context modification is done by appending or removing an announcement from the list. As an optimization, each entry in the list can represent the value of multiple adjacent containers if they share the same value. During each cycle, an instruction that attempts to make announcements and commitments may fail when it encounters UNAVAILABLE values. When that happens, all the context modifications it makes in that cycle have to be canceled. That is accomplished by storing all per-instruction context modifications into the instruction node's temporary array, and all global context modifications into the future array, as each instruction executes. If the instruction can successfully execute all of its enabled commands, then the entries in the future array will be written to the actual global context, and the entries in the instruction node's temporary array will be written into the per-instruction context. Otherwise, the future array and the per-node temporary array will be purged, canceling all announcements and commitments by this instruction.

4.3 Simulation Engine

In order to be as portable as possible, the automatically generated simulation engine is a C source file that uses only standard C functions. The simulator source code then simply contains three major data structures (as explained in the previous section), an initialization function (which can read image files as initial values for global containers), and a main simulation function (which contains the simulation code for the entire processor).
The pseudo-code for the main simulation function is shown in Figure 4.1.

    while (true) {
        Erase all entries from the "future" array.
        For each stage that contains a live instruction {
            Remember the current size of the "future" array.
            Erase all entries from the instruction's "temporary" array.
            Evaluate all guards.
            If an unavailable value is encountered {
                Clean up.  Skip the rest of this stage.
            } Else {
                Execute all enabled statements.
                If an unavailable value is encountered {
                    Clean up.  Skip the rest of this stage.
                }
            }
        }
        For each entry in the "future" array {
            Commit the changes into the "global" array.
        }
        For each entry in each instruction's "temporary" array {
            Make the changes in the per-instruction context.
        }
        For each live instruction from oldest to youngest {
            If the instruction wishes to retire, remove its node from the
            "context" linked list.  Otherwise, move into the destination
            stage if vacant.
        }
        If the constructor stage becomes vacant {
            Create a new instruction with an empty "context" node.
            Insert the new node into the "context" linked list.
            Insert the new instruction into the constructor stage.
        }
    }

Figure 4.1: Pseudo-code for the main simulation function

One thing worth mentioning is how a reference to a container is evaluated at run time. Recall from Section 3.3 that there are four access operators in PSL, each corresponding to a different sequence of context search. The pseudo-code for the default C container reference is shown in Figure 4.2.

    For (node = ThisContext; node != NULL;
         node = node->next /* earlier instruction */) {
        For each entry (Container E, Value V) in node->AnnouncementTable {
            If C == E {
                If V is an integer, then the value of the entire
                expression is V.  Otherwise, if V is UNAVAILABLE,
                then the entire expression is UNAVAILABLE.
            }
        }
    }
    If a value still hasn't been found, then the GLOBAL context is queried.
Figure 4.2: Pseudo-code for a reference to container C

The second access operator, C', differs in that it searches for the most recent non-TRANSPARENT value starting from the previous context instead of the current context. The pseudo-code for the C' container reference is shown in Figure 4.3. The third access operator, C$, differs in that it waits for all later stages to be completed and then searches for values in both the per-instruction context and in the future and temporary tables. The pseudo-code for the C$ container reference is shown in Figure 4.4. The fourth and final access operator, C#, does not depend on any instruction context. Instead, it bypasses the per-instruction private context and directly queries the global context for the value. It is quite straightforward; therefore, no pseudo-code is listed here.

    For (node = ThisContext->next; node != NULL;
         node = node->next /* earlier instruction */) {
        For each entry (Container E, Value V) in node->AnnouncementTable {
            If C == E {
                If V is an integer, then the value of the entire
                expression is V.  Otherwise, if V is UNAVAILABLE,
                then the entire expression is UNAVAILABLE.
            }
        }
    }
    If a value still hasn't been found, then the GLOBAL context is queried.

Figure 4.3: Pseudo-code for a reference to container C'

    For (node = ThisContext->next; node != NULL;
         node = node->next /* earlier instruction */) {
        For each entry (Container E, Value V) in node->AnnouncementTable,
        node->TemporaryArray, and the future array {
            If C == E {
                If V is an integer, then the value of the entire
                expression is V.  Otherwise, if V is UNAVAILABLE,
                then the entire expression is UNAVAILABLE.
            }
        }
    }
    If a value still hasn't been found, then the GLOBAL context is queried.
    /* Since the stages are evaluated from the end of the pipeline to the
       front, all later stages would have completed.  Thus, querying the
       Temporary and Future arrays provides the desired most recent value. */

Figure 4.4: Pseudo-code for a reference to container C$

4.4 Optimization Trade-offs

Due to time constraints, we have not used any sophisticated techniques for accelerating processor simulation, such as predecoding instructions, cross compilation, or strength reduction. The simulation engine currently uses only simple data structures (like arrays and linked lists). The few optimization techniques we do use are listed below. The first optimization technique is basic constant folding and common subexpression elimination. These optimizations are performed by the parser as it reads in the specification file. More information about performing these simple optimizations can be found in [2]. The main additional optimization technique used in our prototype simulator-generator is the elimination of redundant computation by memoization at compile time. Since all commitments and announcements happen in parallel at the end of a clock cycle, the values of expressions do not change during a cycle (with the exception of expressions that use the $ access operator). Therefore, if the same expression appears more than once in a specification, the simulator should evaluate the expression only once per cycle, memoize the value into a variable pre-determined at compile time, and query the variable each time the expression needs to be evaluated again. To facilitate this, our prototype simulator-generator has a tunable parameter, N, that determines how many computed values can be memoized during a clock cycle. The trade-off here is that larger values of N allow more memoization and reduce redundant computation, but if the value of N exceeds the number of general registers available on the CPU running the simulation, then the excess values spill into memory and slow down the simulation.
In practice, a typical value of N that is roughly equal to the number of general registers seems to work well. Since the automatic generation of a simulator takes only a fraction of a second, and since the simulator-generator can report the number of required computational steps, it is recommended that multiple values of N be attempted to find the value that results in the fastest simulation.

4.5 Future Optimization Possibilities

As noted in the previous section, the prototype simulator-generator currently does not use any sophisticated techniques for accelerating processor simulation. Many future optimization approaches are possible. We believe the first future optimization should be to replace the linked-list traversal with a combination of compile-time data flow analysis (to locate the most recent value of a container at compile time rather than at run time) and, where that's not possible, a faster representation of the context (such as a hash table or a balanced search tree). The specification contains both data flow and control flow information, which should enable some amount of compile-time analysis of forwarding and stalling. To simplify data flow analysis, we should restrict the language slightly. Right now, the language allows an instruction to declare a container as a Don't-Care (TRANSPARENT) and then take away the Don't-Care flag later. Furthermore, an instruction can announce a valid value for a container, and then take away the value by announcing TRANSPARENT, UNAVAILABLE, or even a different integer value. This flexibility means an instruction can potentially modify any container at any clock cycle. However, this not only makes the simulation harder to optimize, but it also hampers future efforts at formal verification. We have found no need for this flexibility and believe the language should make all announcements and commitments irrevocable.
Looking at other related projects provides insight into additional optimization possibilities. For example, the Sleipnir [6] simulator uses a decode cache that holds decoded instruction information for later cycles. A similar approach could be adapted for our simulator, in that a retired instruction context could be retained for its next incarnation. Actual experiments are needed to determine whether this provides a positive impact on simulation speed or not. Another interesting optimization technique can be found in the LISA system. The idea is that quite frequently it is desirable to run a simulation of a small, fixed program for many clock cycles. This is true for much DSP code, and it is also sometimes true for microprocessor benchmarks. By taking into consideration a specific memory image, the simulator-generator can omit unused instructions, hardware resources, and bypass paths. Furthermore, control and data flow analysis can be performed on both the processor specification and the memory image to derive more container values and instruction stalling information ahead of time. By making this trade-off, the authors of LISA report the speed is sometimes improved by more than one order of magnitude. However, given our higher-level semantics, it is not clear if we can match the performance of simulators generated from more explicit specifications. For example, our simulator performs run-time hazard detection at every container reference, even though some hazards are never possible. On the other hand, it may be possible to eliminate some of these unnecessary computations through more careful analysis of the model. Furthermore, our simulator generator might be able to exploit domain-specificity to generate carefully optimized code for specific aspects of the simulator. Additional optimizations are an area for further research.
Chapter 5

Example: MIPS R2000

We have implemented the integer core of the MIPS R2000 [7] microprocessor, including all user-mode instructions except coprocessor support, floating point instructions, and non-aligned loads/stores. The model we present here does not implement exceptions, although they could be added easily. MIPS R2000 is a classic, 5-stage RISC pipelined processor. Extensive documentation is available, and the microarchitecture has been thoroughly studied, so it makes an excellent benchmark example. Although such a design is no longer leading edge, it is a very representative processor of its generation. Furthermore, such a design is typical of the application-specific cores used today, for which there are currently numerous design starts and for which early-stage microarchitectural exploration would be useful. We intend to extend our language to handle more aggressive microarchitectures (superscalar, VLIW, etc.) in the future.

5.1 Specification of MIPS R2000

Our entire processor specification is 290 lines long, of which 150 lines are devoted to the instruction decoder. The description starts by declaring containers for the main memory (Mem), 32 general-purpose registers (Reg), multiplication and division registers (Hi and Lo), program counter (PC), and a handful of instruction fields and temporary results. The R2000 has five pipeline stages, so the description also declares five pipeline stages. Each of these five stages will be explained in the following subsections.

5.1.1 Fetch Stage

The first pipeline stage is the instruction fetch stage:

    constructor FetchStage {
        PF# == 0: { INS <- Mem#[ (PC +4) >>L 2 ]; PC <- PC'+4; }
        PF# == 1: { INS <- Mem#[ (PN#  ) >>L 2 ]; PC <- PN#;  PF := 0; }
        true    : { Reg[0] <- tr; goto DecodeStage; }
    }

The first two lines basically attempt to read a new instruction from memory (the Mem array) into the instruction register (INS).
Intuitively, the instruction is fetched from either the next instruction in memory (PC + 4), or from the branch target address (PN) if the branch taken flag (PF) is set. However, in order to get the correct semantics, there are a few subtleties that need to be observed. The first issue involves the modeling of the system memory. In this specification, we have decided to ban non-aligned memory access. Given this restriction, it is natural to implement the memory as a word-addressed rather than byte-addressed array; that is, Mem[0] refers to the first four bytes and Mem[1] the next four bytes, etc. However, the program counter (PC) and the branch target register (PN) both store byte addresses. To compensate for the difference, the offsets are divided by 4 before being used as Mem array indices. Another memory simplification is that our specification does not accurately model self-modifying code. Therefore, instructions can be fetched using the # operator without stalling even when a memory write is in the pipeline. There is only one branch taken flag and one branch target register; they are global variables used by the branch instructions to override the default "program order semantics". That means all references to PF and PN must use the # operator, since the other three access operators are all bound by the program order semantics. In addition, since all references to PF and PN are done via the global context, all modifications should be done using commitment statements rather than announcements. On the other hand, the program counter can be considered a per-instruction value in that each instruction is fetched from a particular memory address. In the branch-not-taken case, the new instruction announces its PC value by adding 4 to the previous instruction's PC value. Here we see the need for a second access operator (the ' operator), which returns the most up-to-date value of a container in the context of the previous instruction.
The statement PC <- PC'+4 takes the previous instruction's PC value, adds 4 to it, then immediately announces it as the new PC value so that the instruction fetch at the next cycle does not stall. The third line of the FetchStage stage introduces the usage of the keyword tr (which represents the special value TRANSPARENT). In MIPS R2000, register 0 is always 0 regardless of the instruction that reads it. This fact is known before the instruction is even decoded; therefore, we can announce it in the fetch stage to ensure that all references to register 0 go directly to the global context without stalling. Finally, the goto statement is used to indicate the new instruction should attempt to go into the decode stage in the next clock cycle. Recall our language provides automatic stalling when needed. So here we do not need to check for stalling; if the current instruction in the decode stage is stalled, this newly fetched instruction won't be able to advance and will stall automatically.

5.1.2 Decode Stage

The most difficult and tedious part of the processor specification was the instruction decoder. Decoding instructions involves nothing more than extracting bit fields and pattern matching, but the decode stage encompasses over half of the entire description. Our research is not directed at instruction decoding, so this part of the model is simply a necessary evil. Fortunately, other researchers have developed high-level techniques for specifying instruction decoders [12], so a real, high-level processor specification language should combine that work with ours. The decode stage description consists of numerous cases such as this one:

    // ADD[U],SUB[U],AND,OR,XOR,NOR,SLT[U]
    (INS & 0xFC000030) == 0b100000: {
        Mem[..] <- tr;
        Reg[..INS[15..11]-1] <- Reg[INS[15..11]+1..] <- Hi <- Lo <- tr;
        T2 <- 0;
        V1 <- Reg$[INS[25..21]];
        Op <- INS[5..0];
        V2 <- Reg$[INS[20..16]];
        T  <- INS[15..11];
    }

The guard checks the opcode of the instruction register INS to see if it is one of the specified instructions. In this case, all these three-register instructions (ADD, ADDU, etc.) share a common instruction format and can be detected by a simple bitwise test. Furthermore, they share the same basic behavior:

1. These instructions do not modify the memory nor the multiplication and division registers (Hi and Lo). Therefore, we can immediately announce the special TRANSPARENT value for Hi, Lo, and the entire Mem array. Furthermore, we clear the T2 flag to indicate the Hi and Lo registers will not be modified.

2. These instructions only modify one of the 32 general purpose registers. In particular, the target register is always the one indicated by bits 15 to 11 in the instruction. Therefore, we can store the target register's identity into an internal variable (T) and announce the special TRANSPARENT value for the other 31 registers. Here the instruction register (INS) is referenced using the third access operator, the default operator, to query the current instruction context. Since INS's value is announced in the fetch stage already, reading it here does not cause stalling.

3. Since these instructions all get their operands from the same register locations, we can fetch the operands in this common command block. Here, the MIPS R2000 architecture requires that if an instruction in the ALU stage (the stage after the decode stage) produces a new value for an operand, that new value must be forwarded into the decode stage in the same cycle. This is accomplished in our specification by using the fourth access operator (the $ operator), which waits until all other stages are completed and then forwards the latest value from the previous instruction's context.

4. The difference among these instructions is the ALU operation they represent.
The choice of a specific ALU operation depends on bits 5 through 0 of the instruction register. Therefore, these bits need to be extracted and then stored in an internal variable (Op). The decode stage contains similar guarded command blocks for other instructions. It also contains the following goto statement

    true: { goto ALUStage; }

directing all instructions to proceed to the ALU stage.

5.1.3 ALU Stage

The ALU stage is the second longest stage, simply because there are many possible ALU operations. Here is the entire ALU stage description:

    stage ALUStage {
        Op=="ALU-SLL"   : { Reg[T ? T : 32] <- (V1 <<  V2); }
        Op=="ALU-SRA"   : { Reg[T ? T : 32] <- (V1 >>A V2); }
        Op=="ALU-SRL"   : { Reg[T ? T : 32] <- (V1 >>L V2); }
        Op=="ALU-MFHI"  : { Reg[T ? T : 32] <- (Hi); }
        Op=="ALU-MFLO"  : { Reg[T ? T : 32] <- (Lo); }
        Op=="ALU-MTHI"  : { HiTemp <- (V1); }
        Op=="ALU-MTLO"  : { LoTemp <- (V1); }
        Op=="ALU-ADD"   : { Reg[T ? T : 32] <- (V1 +  V2); }
        Op=="ALU-ADDU"  : { Reg[T ? T : 32] <- (V1 +  V2); }
        Op=="ALU-SUB"   : { Reg[T ? T : 32] <- (V1 -  V2); }
        Op=="ALU-SUBU"  : { Reg[T ? T : 32] <- (V1 -  V2); }
        Op=="ALU-AND"   : { Reg[T ? T : 32] <- (V1 &  V2); }
        Op=="ALU-OR"    : { Reg[T ? T : 32] <- (V1 |  V2); }
        Op=="ALU-XOR"   : { Reg[T ? T : 32] <- (V1 ^  V2); }
        Op=="ALU-NOR"   : { Reg[T ? T : 32] <- ~(V1 | V2); }
        Op=="ALU-SLT"   : { Reg[T ? T : 32] <- (V1 <S V2); }
        Op=="ALU-SLTU"  : { Reg[T ? T : 32] <- (V1 <U V2); }
        Op=="ALU-LUI"   : { Reg[T ? T : 32] <- (V2 << 16); }
        Op=="LINK"      : { Reg[T ? T : 32] <- (PC + 8); }
        Op=="ALU-MULT"  : { HiTemp <- (V1 *SH V2); LoTemp <- (V1 *SL V2); }
        Op=="ALU-MULTU" : { HiTemp <- (V1 *UH V2); LoTemp <- (V1 *UL V2); }
        Op=="ALU-DIV"   : { HiTemp <- (V1 %S  V2); LoTemp <- (V1 /S  V2); }
        Op=="ALU-DIVU"  : { HiTemp <- (V1 %U  V2); LoTemp <- (V1 /U  V2); }
        (Op & 512)==512 : { Offset <- (V1 + V2); }   // MEMORY
        true            : { goto MemoryStage; }
    }

The ALU stage description simply computes the desired operation and specifies that instructions proceed to the memory stage next. A few points are worth mentioning:

1.
Almost all ALU instructions use V1 and V2 as operands. All forwarding and stalling is implicit in the language and need not be spelled out.

2. MIPS R2000 ignores all writes to register 0. In our specification, we have declared a don't-care register (Reg[32]) that is never read. All writes to register 0 are then simply redirected to register 32.

3. To allow concise specification of the ALU stage, PSL provides a huge list of operators that perform most of the common ALU operations. For example, there are separate operators for signed and unsigned division (/S and /U, respectively). If the specification of a particular microprocessor requires an arithmetic operation that cannot be expressed by combining existing PSL operators, a new operator can easily be added to PSL.

4. There are fewer cases in the ALU stage than in the decode stage because many similar instructions (such as the register and immediate forms) have been folded into a common form by the decode stage. Since exceptions are not implemented, the with- and without-exception forms are folded as well. The branch instructions are handled entirely in the fetch and decode stages, and thus are not handled in the ALU stage.

5.1.4 Memory Stage

The memory stage accesses memory if necessary:

    stage MemoryStage
    {
        // The next stage to go to is the WriteBack stage.
        true: { goto WriteBackStage; }

        // Command block for each different read instruction.
        Op=="MEM-LW"                   : { Reg[T?T:32] <- Mem[Offset >>L 2];           }
        Op=="MEM-LH"  && (Offset&3)==0 : { Reg[T?T:32] <- Mem[Offset >>L 2][[31..16]]; }
        Op=="MEM-LH"  && (Offset&3)==2 : { Reg[T?T:32] <- Mem[Offset >>L 2][[15.. 0]]; }
        Op=="MEM-LHU" && (Offset&3)==0 : { Reg[T?T:32] <- Mem[Offset >>L 2][31..16];   }
        Op=="MEM-LHU" && (Offset&3)==2 : { Reg[T?T:32] <- Mem[Offset >>L 2][15.. 0];   }
        Op=="MEM-LB"  && (Offset&3)==0 : { Reg[T?T:32] <- Mem[Offset >>L 2][[31..24]]; }
        Op=="MEM-LB"  && (Offset&3)==1 : { Reg[T?T:32] <- Mem[Offset >>L 2][[23..16]]; }
        Op=="MEM-LB"  && (Offset&3)==2 : { Reg[T?T:32] <- Mem[Offset >>L 2][[15.. 8]]; }
        Op=="MEM-LB"  && (Offset&3)==3 : { Reg[T?T:32] <- Mem[Offset >>L 2][[ 7.. 0]]; }
        Op=="MEM-LBU" && (Offset&3)==0 : { Reg[T?T:32] <- Mem[Offset >>L 2][31..24];   }
        Op=="MEM-LBU" && (Offset&3)==1 : { Reg[T?T:32] <- Mem[Offset >>L 2][23..16];   }
        Op=="MEM-LBU" && (Offset&3)==2 : { Reg[T?T:32] <- Mem[Offset >>L 2][15.. 8];   }
        Op=="MEM-LBU" && (Offset&3)==3 : { Reg[T?T:32] <- Mem[Offset >>L 2][ 7.. 0];   }

        // Common commands shared by all memory writes.
        // Memory writes can be recognized by (Op & 520)==520.
        (Op & 520)==520: { Mem[..((Offset >>L 2)-1)] <- tr;
                           Mem[((Offset >>L 2)+1)..] <- tr; }

        // Command block for each different write instruction.
        Op=="MEM-SW"                   : { Mem[Offset >>L 2] <- V3; }
        Op=="MEM-SB" && (Offset&3)==0  : { Mem[Offset >>L 2] <-
            (Mem'[Offset >>L 2] & 0x00FFFFFF) | ((V3 & 0xFF)<<24); }
        Op=="MEM-SB" && (Offset&3)==1  : { Mem[Offset >>L 2] <-
            (Mem'[Offset >>L 2] & 0xFF00FFFF) | ((V3 & 0xFF)<<16); }
        Op=="MEM-SB" && (Offset&3)==2  : { Mem[Offset >>L 2] <-
            (Mem'[Offset >>L 2] & 0xFFFF00FF) | ((V3 & 0xFF)<<8); }
        Op=="MEM-SB" && (Offset&3)==3  : { Mem[Offset >>L 2] <-
            (Mem'[Offset >>L 2] & 0xFFFFFF00) | (V3 & 0xFF); }
        Op=="MEM-SH" && (Offset&3)==0  : { Mem[Offset >>L 2] <-
            (Mem'[Offset >>L 2] & 0x0000FFFF) | (V3 << 16); }
        Op=="MEM-SH" && (Offset&3)==2  : { Mem[Offset >>L 2] <-
            (Mem'[Offset >>L 2] & 0xFFFF0000) | (V3 & 0xFFFF); }
    }

Even though we do not model non-aligned accesses, we do model the MIPS R2000's ability to load and store half-words and bytes. We model the load instructions by reading the entire word and extracting the desired bits, sign-extending them if necessary using the [[..]] operator described in Section A.5.8. The store instructions are modeled by reading the entire word, modifying part of it, then writing the entire word back.
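The sub-word extraction arithmetic above can be sketched in ordinary Python. This is only an illustration, not part of PSL; the helper name is mine, and it assumes the big-endian byte numbering implied by the cases above (byte offset 0 occupies bits 31..24):

```python
def load_byte(word: int, byte_offset: int, signed: bool) -> int:
    """Extract one byte from an aligned 32-bit word, as the MEM-LB/MEM-LBU
    cases do: byte offset 0 is the most significant byte."""
    shift = (3 - byte_offset) * 8      # offset 0 -> bits 31..24, offset 3 -> bits 7..0
    b = (word >> shift) & 0xFF         # like the [..] form: zero-extended
    if signed and b & 0x80:            # like the [[..]] form: sign-extended
        b -= 0x100
    return b

word = 0x1234FE78
print(load_byte(word, 2, signed=False))   # 254 (0xFE, zero-extended, as MEM-LBU)
print(load_byte(word, 2, signed=True))    # -2  (0xFE, sign-extended, as MEM-LB)
```

The stores work the same way in reverse: mask the target byte out of the old word, then OR the new byte into position.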
In addition, the store instructions must remember to declare TRANSPARENT for the untouched words in order to prevent unnecessary stalling.

5.1.5 WriteBack Stage

The last stage is the WriteBack stage. It is where all register modifications are actually committed.

    stage WriteBackStage
    {
        true: { retire; }

        // If T is not zero, that means a new value for
        // register T has been announced.
        // Therefore, it needs to be committed.
        T: { Reg[T] := Reg[T]; }

        // If T2 is not zero, that means new Hi and Lo values
        // are available in HiTemp and LoTemp.
        // They must be announced and committed to Hi and Lo.
        T2: { Hi := HiTemp; Hi <- HiTemp;
              Lo := LoTemp; Lo <- LoTemp; }

        // If it's a memory write instruction, the announced
        // memory write must be committed.
        // In this specification, commitments to memory are delayed until
        // the final stage, so that flushing an instruction automatically
        // undoes all its effects on the global containers.
        (Op & 520)==520: { Mem[Offset >>L 2] := Mem[Offset >>L 2]; }

        Op=="SYSCALL": { Reg[2] := syscall(V1, V2, V3); }
    }

Note the use of commitment (the := operator) so that committed values affect future instructions, even after the current instruction is retired. In addition, since we have not implemented exceptions in our model, we need a place to implement the syscall interface; we have arbitrarily chosen to place it in the WriteBack stage. A custom handler can be supplied along with the simulator to allow statistics gathering, system I/O emulation, etc.

5.1.6 Exceptions

The model we present here does not implement exceptions, although they could be added easily. One approach is to use a new container (FlushFlag) to indicate whether the instructions in the pipeline should flush or not.
Every stage would check whether FlushFlag# is set, and if so, execute a retire command immediately. (The guard that checks the FlushFlag must use the # access operator because it is the only operator that can bypass the "program order semantics".) With this addition, an exception can be raised by setting the flag. At the next cycle, all instructions in the pipeline will flush; since none of the instructions makes it to the WriteBack stage, their announcements will be discarded, satisfying the requirements for precise exceptions.

5.2 Results

The above is the entire description of this processor. Except for the decode stage, we can see how our description style is vastly simpler than previous processor descriptions. The remaining question is whether we can generate efficient simulators from these descriptions.

We selected the Dhrystone synthetic benchmark, Version 2.1 [14], to provide a non-trivial integer workload to test the simulator. Dhrystone is too small for general-purpose processors, but it is still widely used to benchmark embedded processors. We chose the Dhrystone benchmark because we ran into compiler problems with larger benchmarks: namely, we were unable to get gcc to generate only MIPS R2000 code. With a benchmark as small as Dhrystone, we were able to manually substitute the unsupported instructions with equivalent R2000 instructions.

Because of our limited system-call interface, we omitted the initial log-file access of the Dhrystone benchmark, but included the rest of the benchmark. We used the gcc C compiler version 2.95.2, with options "-G 104857600 -mgpopt -mfp32 -mgp32 -s -static", to compile the benchmark into MIPS object code. The compiler, being more modern than the R2000, generated some instructions not supported on the R2000, so we had to substitute equivalent R2000 instructions manually.
(The steps we took to compile the Dhrystone binary on an SGI machine and then simulate it with our MIPS R2000 simulator are shown in Figure 5.1.)

Automatic generation of the simulator took less than a second. The resulting simulator was compiled using gcc version egcs-2.91.66 with the -O3 optimization option. We ran our experiments on a 733MHz Pentium III with a 256KB cache and 128MB of memory, running Linux version 2.2.16-22. We consistently obtained over 240,000 simulated cycles/second.

1) LINUX* make
   --> This builds the simulator generator "sim"

2) LINUX* ./sim < mips.psl > mips.c
   LINUX* cat main.inc >> mips.c
   LINUX* gcc -O3 -o mips mips.c
   --> This builds the MIPS R2000 simulator "mips"

3) SGI* gcc -G 104857600 -mgpopt -mfp32 -mgp32 -s -static -S -o dhry.s dhry.c
   --> This builds dhry.s

4) SGI* gas -o dhry.o dhry.s
   --> This builds dhry.o
   --> You may have to edit dhry.s manually beforehand to...
       A) Replace non-R2000 instructions with R2000 equiv. (Such as BEQL, DADDU)
       B) Replace non-gas pseudoinstructions. (Such as ".popsection")
   # For this version of the tweaked "dhry.c", there are 4 things to do:
   # A) Replace the only BEQL with BEQ
   # B) The next line has the only "BREAK". Replace it with NOP.
   # C) There are many ".popsection" ".section bss" pairs. Remove them.
   # D) The only remaining ".popsection" will be at the end. Remove it.

5) SGI* ld -o dhrystone.image dhry.o && strip dhrystone.image
   --> This generates the image file.
   --> It warns "missing _start". That's okay, since we moved "main()" to the top.

6) SGI* elfdump -R dhrystone.image
   --> Determine the initial "GP value" needed by this image.

7) SGI* elfdump -h dhrystone.image
   --> Determine how many segments the image file has, how long
       each segment is, and where each segment should be loaded to.
   # For example, it may say (paraphrased):
   #   .code    ShouldBeAt 0x400018,  IsAtFileOffset 0x500,  Length 0x30
   #   .rodata  ShouldBeAt 0x410000,  IsAtFileOffset 0x530,  Length 0x50
   #   .sdata   ShouldBeAt 0x410100,  IsAtFileOffset 0x580,  Length 0x10

8) LINUX* ./mips 2000 100 \       #Run 2000 cycles without DEBUG, then 100 with DEBUG
          3 29 0x500000 \         #Make R29 big enough to give a large enough stack
          3 28 0x414950 \         #Set R28 to the "GP value" shown in "elfdump -R"
          f dhrystone.image 0x400018 0x500 0x30 \   #Load EACH segment in "elfdump -h"
          f dhrystone.image 0x410000 0x530 0x50 \
          f dhrystone.image 0x410100 0x580 0x10

Figure 5.1: Steps needed to simulate Dhrystone

Direct comparison with other published results is difficult because of differences in modeling and in host-machine speed. As a very rough comparison, Onder and Gupta [10] reported 200K cycles/second for their UPFAST system simulating a cycle-accurate, 5-stage, complete MIPS on a 200MHz Pentium Pro; their description was 5782 lines long. Jeremiassen [6] reported 5.1M instructions/second for his Sleipnir system, simulating an instruction-level, integer-only MIPS core on a 250MHz MIPS R10000; that description was 700 lines long. Thus, our performance is within an order of magnitude of more carefully tuned simulators, and our descriptions are much smaller and simpler.

Chapter 6

Conclusions and Future Work

We have introduced a novel approach for specifying processor microarchitectures. The approach centers on two key ideas: enforcing sequential execution semantics automatically, and localizing access to results in the code that generates the values. We have demonstrated the potential of our approach by implementing a simple description language with an automatic simulator generator, and then creating a MIPS R2000 model in our language. Our description is vastly simpler and shorter than competing methods, and our simulation performance is within an order of magnitude of high-performance, purpose-built simulators.
We believe our approach will be very useful for early-stage microarchitecture design exploration.

The work presented in this thesis makes the following contributions to microprocessor specification and simulation research:

- A novel style of specifying cycle-accurate processor models, along with an explanation of our rationale for this approach to microprocessor specification.

- A prototype language, PSL, that embodies this specification style.

- A simulator generator that automatically generates a cycle-accurate simulator from a microprocessor specification written in PSL.

- The specification of a non-trivial microprocessor (the MIPS R2000) using PSL.

The most obvious future work is to improve the performance of our generated simulators. There is considerable literature on techniques to improve simulation performance, and we should be able to harness some of it. Ideas and future directions on optimization are explored in Section 4.5.

In addition, this thesis is not directed at instruction decoding. As a result, our tool does not facilitate concise specification of the instruction decoder. As noted in Chapter 5, the decode stage occupies more than half of our MIPS R2000 specification. We believe harnessing existing research on specifying instruction decoders [12] should greatly improve the ease of use of our system.

More generally, we believe our ideas extend easily to superscalar processors, and we are working on formalizing the concepts and implementing a tool chain. The main difference is more dynamic constructs for directing instructions to functional units.

It may eventually be possible to synthesize efficient implementations from our descriptions. This goal seems rather distant, but research on automatic pipelining is promising [4, 8, 5].
Because the logic required to enforce sequential execution semantics is currently extremely challenging to design correctly, being able to synthesize from a small, obviously correct description is a very attractive long-term target.

Bibliography

[1] Mark Aagaard and Miriam Leeser. Reasoning about pipelines with structural hazards. In Theorem Provers in Circuit Design, pages 13-32. Springer, 1994. Published as LNCS Vol. 901 in 1995.

[2] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.

[3] F. Baader and T. Nipkow. Term Rewriting and All That. Cambridge University Press, 1998.

[4] Soha Hassoun and Carl Ebeling. Architectural retiming: Pipelining latency-constrained circuits. In 33rd Design Automation Conference, pages 708-713. ACM/IEEE, 1996.

[5] James C. Hoe and Arvind. Synthesis of operation-centric hardware descriptions. In International Conference on Computer-Aided Design, pages 511-518. IEEE/ACM, 2000.

[6] Tor E. Jeremiassen. Sleipnir - an instruction-level simulator generator. In International Conference on Computer Design, pages 23-31. IEEE, 2000.

[7] Gerry Kane and Joe Heinrich. MIPS RISC Architecture. Prentice Hall, 1992.

[8] Maria-Cristina Marinescu and Martin Rinard. High-level specification and efficient implementation of pipelined circuits. In Asia and South Pacific Design Automation Conference, 2001.

[9] John Matthews, John Launchbury, and Byron Cook. Microprocessor specification in Hawk. In International Conference on Computer Languages, pages 90-101. IEEE, May 1998.

[10] Soner Onder and Rajiv Gupta. Automatic generation of microarchitecture simulators. In International Conference on Computer Languages, pages 80-89. IEEE, May 1998.

[11] Stefan Pees, Andreas Hoffmann, Vojin Zivojnovic, and Heinrich Meyr. LISA - machine description language for cycle-accurate models of programmable DSP architectures.
In 36th Design Automation Conference, pages 933-938. ACM/IEEE, 1999.

[12] Norman Ramsey and Mary Fernandez. The New Jersey machine-code toolkit. In USENIX Technical Conference, pages 289-302, January 1995. The current version is available at http://www.eecs.harvard.edu/~nr/toolkit/.

[13] Bruce Shriver and Bennett Smith. The Anatomy of a High-Performance Microprocessor: A Systems Perspective. IEEE Computer Society Press, 1998.

[14] R. Weicker. Dhrystone 2.1. SIGPLAN Notices, 23(8), August 1988. We downloaded it from ftp://ftp.nosc.mil/pub/aburto.

Appendix A

PSL Language Reference

A.1 Introduction

PSL is a prototype language developed to embody the specification style detailed in Chapter 3 of this thesis. This appendix explains the syntax and semantics of the language, and Appendix B contains the complete PSL grammar. A specification written in PSL consists of enumeration declarations, container declarations, and stage declarations. Further explanations are given in the following sections.

A.2 Keywords and Identifiers

An identifier is a case-sensitive sequence of letters, digits, and underscores whose first character is not a digit. For example, 10units is not a legal identifier, while buffer1 and Buffer1 are distinct legal identifiers. In PSL, identifiers are used as names for stages and containers. The stage names and container names exist in the same name space; that means the same identifier cannot be used to identify both a stage and a container. In addition, the following sequences of characters are reserved keywords that cannot be used as stage or container identifiers:

    constructor  container  enum  false  goto  na
    retire  stage  syscall  tr  true

The usage of these keywords is explained in the following sections.

A.3 Constants

An integer constant is represented by one or more digits.
If the first digit is 0, the entire constant is interpreted as octal; otherwise the constant is interpreted as decimal. Binary and hexadecimal integer constants can be specified by adding the prefix 0b or 0x, respectively.

Often it is more convenient to refer to integer constants by name. To allow this, a specification in PSL can contain enumeration declarations using the enum keyword, which assigns an integer constant to a string literal. While the same integer can be represented by many string literals, each string literal can represent only one integer constant. For example, the following segment declares the string literal "Dozen" to represent the number 12:

    enum "Dozen" := 12;

In addition, the keywords false and true are pre-defined constants that represent the integer constants 0 and 1, respectively. The special values UNAVAILABLE (see Section A.6) and TRANSPARENT (see Section A.5.8) can be specified by the keywords na and tr, respectively.

A.4 Containers

There are two types of containers in PSL: scalar containers and array containers. A container that holds exactly one value is called a scalar container and is declared by using the container keyword followed by a unique identifier. A container that holds multiple values is called an array container. An array container's declaration requires two additional parameters: the smallest and largest index values. For example, the following section of code declares a scalar container PC and an array container Table whose valid entries are 2 to 10:

    container PC;
    container Table[2..10];

The integer values held in a container always have exactly the same size as the native word size of the simulator's environment. Furthermore, an integer value can be interpreted in two different ways: as an unsigned integer or as a 2's-complement signed integer. The choice of interpretation depends only on the operator that operates on the value.
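This operator-determined interpretation can be illustrated with a short Python sketch (an illustration only, assuming a 32-bit word; the helper names are mine, not PSL's). The same bit pattern yields different results depending on which view the operator takes:

```python
WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1

def as_unsigned(word: int) -> int:
    """Interpret a word-sized container value as an unsigned integer."""
    return word & MASK

def as_signed(word: int) -> int:
    """Interpret the same bits as a 2's-complement signed integer."""
    w = word & MASK
    return w - (1 << WORD_BITS) if w >> (WORD_BITS - 1) else w

bits = 0xFFFFFFFE              # one container value...
print(as_unsigned(bits))       # 4294967294 under an unsigned operator (e.g. <U)
print(as_signed(bits))         # -2 under a signed operator (e.g. <S)
```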
During the simulation of a processor, there is a global context in which each container is associated with an integer value, as well as a per-instruction context for each instruction, in which each container is associated with either an integer value or one of the two special values (UNAVAILABLE or TRANSPARENT). These two special values are explained in later sections.

A.5 Expressions

This section describes the expression operators and their semantics. The precedence of all operators is listed in Table A.1, from highest precedence to lowest.

    Highest   + - ! ~  (unary operators)
              *SH *SL *UH *UL /S /U %S %U
              + -
              << >>A >>L
              <S <U <=S <=U >S >U >=S >=U
              == !=
              &
              ^
              |
              &&
              ^^
              ||
    Lowest    ? :

Table A.1: Operator precedence

A.5.1 Unary Operators

There are four unary operators: the unary + operator, the unary - operator, the unary ! operator, and the unary ~ operator.

The unary + operator does nothing; it is provided for symmetry with the unary - operator. The unary - operator treats its operand as a signed integer, and the result is the negation of that integer. The only exception is when the operand is the minimum negative integer, in which case the unary - operator does nothing.

The unary ~ operator treats its operand as an unsigned integer and complements every bit of the value. The unary ! operator returns 1 if its operand is equal to 0; otherwise the result is 0. In this case, it does not matter whether the operand is treated as a signed or unsigned integer.

A.5.2 Additive and Multiplicative Operators

The + (add) and - (subtract) operators compute the sum and difference of the operands, respectively. It does not matter whether the operands are treated as signed or unsigned integers.

The *SH and *SL operators treat the operands as signed integers and return the upper word and lower word of the product, respectively. The *UH and *UL operators treat the operands as unsigned integers and return the upper word and lower word of the product, respectively.
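The paired high/low multiply operators can be mimicked in Python (a sketch under the assumption of a 32-bit word; the function names are mine, not PSL's):

```python
WORD = 32
MASK = (1 << WORD) - 1

def to_signed(x: int) -> int:
    """Reinterpret a word-sized bit pattern as 2's-complement."""
    x &= MASK
    return x - (1 << WORD) if x >> (WORD - 1) else x

def mult_hi_lo(a: int, b: int, signed: bool):
    """Return the (high word, low word) of the double-width product,
    as the *SH/*SL (signed) or *UH/*UL (unsigned) operator pairs would."""
    if signed:
        p = to_signed(a) * to_signed(b)
    else:
        p = (a & MASK) * (b & MASK)
    p &= (1 << (2 * WORD)) - 1          # keep the 64-bit 2's-complement pattern
    return (p >> WORD) & MASK, p & MASK

print(mult_hi_lo(0xFFFFFFFF, 2, signed=True))    # (-1)*2: high 0xFFFFFFFF, low 0xFFFFFFFE
print(mult_hi_lo(0xFFFFFFFF, 2, signed=False))   # big unsigned product: high 0x1, low 0xFFFFFFFE
```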
The /S and /U operators treat the operands as signed and unsigned integers, respectively. The result is the quotient of the two numbers. If the second operand is 0, the result is implementation dependent.

The %S operator treats the first operand as a signed integer and the second operand as an unsigned integer, and computes the remainder of the two numbers. If the first operand is negative, or if the second operand is 0, the result is implementation dependent.

The %U operator treats both operands as unsigned integers and computes the remainder of the two numbers. If the second operand is 0, the result is implementation dependent.

A.5.3 Shift Operators

The << and >>L operators treat the operands as unsigned integers and perform a bitwise shift to the left and to the right, respectively. If the second operand is larger than the number of bits, the result of the expression is zero.

The >>A operator treats the first operand as a signed integer and the second operand as an unsigned integer. The result is the arithmetic shift to the right; that is, the incoming bit is 1 when the first operand is negative, and 0 when it is positive. If the second operand is larger than the number of bits, the result of the expression is 0 if the first operand is positive, and -1 if the first operand is negative.

A.5.4 Comparison Operators

The == (equal to), != (not equal to), < (less than), <= (less than or equal to), > (greater than), and >= (greater than or equal to) operators return 1 if the specified comparison is true, and 0 if it is false. For the == and != operators it does not matter whether the operands are treated as signed or unsigned integers. However, the other operators require a suffix (S or U) to specify whether the operands should be treated as signed or unsigned integers, respectively.

A.5.5 Bitwise Operators

The &, |, and ^ operators treat the operands as unsigned integers and compute the bitwise AND, bitwise OR, and bitwise XOR, respectively.

A.5.6 Logical Operators

The && operator returns 0 if at least one of the operands is equal to 0; otherwise it returns 1. The || operator returns 1 if at least one of the operands is not equal to 0; otherwise it returns 0. The ^^ operator returns 0 in two situations: when both operands are equal to 0, and when neither operand is equal to 0. Otherwise the ^^ operator returns 1.

A.5.7 If-Then-Else Operator

The ?: operator takes three operands. To evaluate the expression a?b:c, the operand a is evaluated first. If a is not equal to 0, the third operand c is ignored and the entire expression equals the value of b. On the other hand, if a is equal to 0, the second operand b is ignored and the entire expression equals the value of c.

A.5.8 Container References

A scalar container is referenced by its identifier. An array container is always referenced by its identifier and an integer index into the array. It is impossible to refer to multiple entries of an array as a single expression.

By default, a container reference uses the default access mode; an optional suffix can be attached after the identifier to specify one of three alternative access operators. Given a container c, the default access c denotes the most recent non-TRANSPARENT value of the container, starting from the present context; c' denotes the most recent non-TRANSPARENT value of the container, starting from the previous context; c$ denotes the most recent non-TRANSPARENT value of the container from the previous context after all functional units have completed the current clock cycle; and c# denotes the value of the container in the global context.

Furthermore, a range of bits can be extracted by using the [..] operator. Given a container reference e, the expression e[a..b] treats e as an unsigned integer and returns the range of bits from bit a to bit b inclusive, with bit b as the least significant bit. If the default word size is bigger than a-b+1 bits, the unspecified bits will be 0.

The [[..]] operator performs the same operation as [..], except that it sign-extends the result. Given a container reference e, the expression e[[a..b]] returns the same range of bits, but if the default word size is bigger than a-b+1 bits, the unspecified bits will be equal to bit a.

A.5.9 Syscalls

The generated simulator may interact with an outside environment via the syscall interface. Whenever the expression syscall(a,b,c) is evaluated, the three arguments are first evaluated to integers, then the syscall handler is invoked with the three integers as parameters. A custom syscall handler can be supplied to the generated simulator to provide additional capabilities such as statistics gathering, system I/O emulation, etc.

A.6 Statements

There are four types of statements in PSL: announcements, commitments, goto statements, and retire statements.

The announcement statement modifies the current instruction's context; it modifies neither the context of other instructions nor the global context. For example, the following guarded statement will announce PC to be 0 if the expression ExceptionRaised evaluates to a non-zero value:

    ExceptionRaised: { PC <- 0; }

The commitment statement is used to modify the global context shared by all instructions; it does not modify the context of any instruction. For example, the following guarded statement changes the value of Lo in the global context to 10:

    true: { Lo := 10; }

The goto statement is used to specify the destination stage that the current instruction should go into, if possible.
For example, the following guarded statement indicates that the instruction should move into the ALU stage before the next cycle if the expression OperandFetched evaluates to a value greater than 2:

    OperandFetched >U 2 : { goto ALU; }

The retire statement is used to indicate that the instruction no longer exists and that its instruction context should be discarded. For example, the following guarded statement indicates that the instruction should be discarded before the next cycle:

    true: { retire; }

During each clock cycle, each instruction evaluates all of its guard expressions and then attempts to execute all enabled statements. If an instruction encounters an UNAVAILABLE value during its evaluation of guards or execution of statements, the instruction is deemed to have failed, meaning it will discard all of the changes it made during this cycle, stall, and attempt to re-execute in the next cycle. However, if an instruction completes all evaluation and execution but its destination stage is occupied, the instruction will stall with all changes intact.

A.7 Stages

A stage is declared in PSL using the keyword stage followed by the identifier of the stage. Inside a stage's declaration is a list of guarded statements. For example, the following is a valid stage declaration:

    stage Compute
    {
        Flag' <U 40 : { Flag <- Flag' + 1; }
        !Exception  : { Ready <- true; goto Dispatch; }
        Exception   : { retire; }
    }

In addition, exactly one stage must be declared as a constructor, by using the constructor keyword instead of the stage keyword. For example, the following is a valid constructor declaration:

    constructor Fetch
    {
        true: { OpCode <- Mem#[PC']; PC <- PC' + 4; goto Decode; }
    }

Intuitively, a stage represents a typical pipeline stage in a processor. During each clock cycle, exactly zero or one instruction may be inside any given stage. If an instruction is present, all enabled statements in that stage are executed with respect to that instruction's context. If, during the execution, some of the expressions cannot be evaluated, the entire instruction stalls and none of the statements is executed. Otherwise, at the end of the cycle the instruction either moves into another stage (if the desired destination stage will be vacant in the next clock cycle) or stays in the current stage (if the desired destination stage will be occupied in the next cycle). Furthermore, whenever the constructor stage becomes vacant, a new instruction context (in which all containers are labeled UNAVAILABLE) is created and moved into the constructor stage before the beginning of the next clock cycle.

Appendix B

PSL Grammar

Below is the grammar for the prototype language PSL. In the thesis, nonterminal symbols are set in italic type and terminal symbols in teletype type inside boxes; in this plain-text rendering, terminal symbols are enclosed in double quotes. The three undefined terminal symbols (Identifier, IntegerConstant, and StringLiteral) are explained in Appendix A.

    file -> declaration file
          | declaration

    declaration -> enum-declaration
                 | container-declaration
                 | stage-declaration

    enum-declaration -> "enum" enum-entry enum-declarations

    enum-declarations -> ";"
                       | "," enum-entry enum-declarations

    enum-entry -> StringLiteral
                | StringLiteral ":=" IntegerConstant

    container-declaration -> "container" container-entry container-declarations

    container-declarations -> ";"
                            | "," container-entry container-declarations

    container-entry -> Identifier
                     | Identifier "[" IntegerConstant ".." IntegerConstant "]"
    stage-declaration -> "stage" Identifier "{" stage-body "}"
                       | "constructor" Identifier "{" stage-body "}"

    stage-body -> block
                | stage-body block

    block -> expression-right ":" "{" statements "}"
           | expression-right ":" "{" "}"

    statements -> statement
                | statement statements

    statement -> expression-left "<-" expression-right ";"
               | expression-left ":=" expression-right ";"
               | "goto" Identifier ";"
               | "retire" ";"

    expression-left -> Identifier
                     | Identifier "[" expression-right "]"
                     | Identifier "[" ".." "]"
                     | Identifier "[" ".." expression-right "]"
                     | Identifier "[" expression-right ".." "]"
                     | Identifier "[" expression-right ".." expression-right "]"

    expression-right -> expression-if-then-else
                      | expression-logical-or

    expression-if-then-else -> expression-logical-or "?" expression-right ":" expression-right

    expression-logical-or -> expression-logical-xor
                           | expression-logical-or "||" expression-logical-xor

    expression-logical-xor -> expression-logical-and
                            | expression-logical-xor "^^" expression-logical-and

    expression-logical-and -> expression-bitwise-or
                            | expression-logical-and "&&" expression-bitwise-or

    expression-bitwise-or -> expression-bitwise-xor
                           | expression-bitwise-or "|" expression-bitwise-xor

    expression-bitwise-xor -> expression-bitwise-and
                            | expression-bitwise-xor "^" expression-bitwise-and

    expression-bitwise-and -> expression-equality
                            | expression-bitwise-and "&" expression-equality

    expression-equality -> expression-inequality
                         | expression-inequality "==" expression-inequality
                         | expression-inequality "!=" expression-inequality

    expression-inequality -> expression-shift
                           | expression-shift "<S"  expression-shift
                           | expression-shift "<U"  expression-shift
                           | expression-shift "<=S" expression-shift
                           | expression-shift "<=U" expression-shift
                           | expression-shift ">S"  expression-shift
                           | expression-shift ">U"  expression-shift
                           | expression-shift ">=S" expression-shift
                           | expression-shift ">=U" expression-shift

    expression-shift -> expression-sum
                      | expression-shift "<<"  expression-sum
                      | expression-shift ">>A" expression-sum
                      | expression-shift ">>L" expression-sum

    expression-sum -> expression-product
                    | expression-sum "+" expression-product
                    | expression-sum "-" expression-product

    expression-product -> expression-unary
                        | expression-product "*SH" expression-unary
                        | expression-product "*SL" expression-unary
                        | expression-product "*UH" expression-unary
                        | expression-product "*UL" expression-unary
                        | expression-product "/S"  expression-unary
                        | expression-product "/U"  expression-unary
                        | expression-product "%S"  expression-unary
                        | expression-product "%U"  expression-unary

    expression-unary -> expression-post
                      | "+" expression-unary
                      | "-" expression-unary
                      | "!" expression-unary
                      | "~" expression-unary

    expression-post -> "syscall" "(" expression-right "," expression-right "," expression-right ")"
                     | constant
                     | "(" expression-right ")"
                     | expression-container-ref
                     | expression-container-ref "[" expression-right ".." expression-right "]"
                     | expression-container-ref "[" "[" expression-right ".." expression-right "]" "]"

    expression-container-ref -> expression-container
                              | expression-container "[" expression-right "]"

    expression-container -> Identifier
                          | Identifier "#"
                          | Identifier "'"
                          | Identifier "$"

    constant -> "na"
              | "tr"
              | "true"
              | "false"
              | IntegerConstant
              | StringLiteral

Appendix C

MIPS R2000 Specification

Below is the complete specification of the integer core of the MIPS R2000 [7] microprocessor, including all user-mode instructions except coprocessor support, floating-point instructions, and non-aligned loads/stores. The model we present here does not implement exceptions, although they could be added easily by following the suggestions in Section 5.1.6.

    //=========================================================================
    enum "ALU-SLL"  := 0,   "ALU-SRL"   := 2,   "ALU-SRA"  := 3;
    enum "ALU-MFHI" := 16,  "ALU-MTHI"  := 17,  "ALU-MFLO" := 18,  "ALU-MTLO" := 19;
    enum "ALU-MULT" := 24,  "ALU-MULTU" := 25,  "ALU-DIV"  := 26,  "ALU-DIVU" := 27;
    enum "ALU-ADD"  := 32,  "ALU-ADDU"  := 33,  "ALU-SUB"  := 34,  "ALU-SUBU" := 35;
    enum "ALU-AND"  := 36,  "ALU-OR"    := 37,  "ALU-XOR"  := 38,  "ALU-NOR"  := 39;
    enum "ALU-SLT"  := 42,  "ALU-SLTU"  := 43,  "ALU-LUI"  := 501, "NOALU"    := 502;
    enum "LINK"     := 503, "SYSCALL"   := 504;
    enum "MEM-LH"   := 545, "MEM-LB"    := 544, "MEM-LHU"  := 549, "MEM-LBU"  := 548;
    enum "MEM-SH"   := 553, "MEM-SB"    := 552, "MEM-SW"   := 555, "MEM-LW"   := 547;

    container Mem[0..10485760];
    container Reg[0..32];      // Reg[32] is Don't Care.
    container PN, PF;
    container PC, INS, T, T2, V1, V2, V3, Offset, Op, Hi,
L o , HiTemp.LoTemp; / /================================ = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = stage WriteBackStage 69 } t r u e : { r e t i r e ; } T: { Reg[T] := RegCT]; } T2: { Hi:=HiTemp; Hi<-HiTemp; Lo:=LoTemp; Lo<-LoTemp; } •p=="SYSCALL" : { Reg[2] := s y s c a l K V l , V 2 , V 3 ) ; } (Op fe 520)==520: { Mem[Offset » L 2] := Mem[Offset » L 2 ] ; > stage MemoryStage •C / / Memory Reads 0p== "MEM--LW" :{ Reg[T?T:32]<- Mem[Offset » L 2] } 0p== "MEM--LH" kk (0ffsetfc3)= =0:{ Reg[T?T:32]<- Mem[Offset » L 2] [C31. .16]] } 0p== "MEM--LH" kk (0ffsetfc3)= =2:{ Reg[T?T:32]<- Mem[Offset » L 2 ] [ [15 . . 0]] } 0p== "MEM--LHU" kk (0ffsetfc3)= =0:{ Reg[T?T:32]<- MemCOffset » L 2] [31. .16] > 0p== "MEM--LHU" kk (0f fset&3)= =2:{ Reg[T?T:32]<- Mem[Offset » L 2] [15. . 0] } 0p== "MEM--LB" kk (0f fset&3)= =0:{ Reg[T?T:32]<- Mem[Offset » L 2] [ [31. .24]] } 0p== "MEM--LB" kk (0ffsetfc3)= =1:{ Reg[T?T:32]<- MemCOffset » L 2] [ [23. .16]] } 0p== "MEM -LB" kk (0f fset&3)= =2:{ Reg[T?T:32]<- Mem[Offset » L 2 ] [ [ 1 5 . • 8]] } 0p== "MEM -LB" kk (0ffsetfc3)= =3:{ Reg[T?T:32]<- MemCOffset » L 2 ] [ [ 7. . 0]] } 0p== "MEM -LBU" kk (0f fset&3)= =0:{ Reg[T?T:32]<- MemCOffset » L 2] [31. .24] } 0p== "MEM -LBU" kk (0f fset&3)= =1:< ,Reg[T?T:32]<- MemCOffset » L 2] [23. .16] } 0p== "MEM -LBU" kk (0f fset&3)= =2:{ Reg[T?T:32]<- MemCOffset » L 2] [15. • 8] } 0p== "MEM--LBU" kk (0f fset&3)= =3:-( Reg[T?T:32]<- MemCOffset » L 2] [ 7. . 0] } / / Memory Wri tes (Op k 520)==520:{ Mem[. . ( (Of f set » L 2 ) - l ) ]< -Mem[( (0 f f set » L 2 ) + l ) . . 
]< - t r ; } Op=="MEM-SW": { Mem[Offset » L 2] <- V3; } Op=="MEM-SB" kk (Offset&3)==0: { Mem[0ffset » L 2] <- (Mem'COff set » L 2] k OxOOFFFFFF) I ((V3 fc 0 x F F ) « 2 4 ) ; } Op=="MEM-SB" kk (Offset&3)==l: { MemCOffset » L 2] <- (Mem* COff set » L 2] k OxFFOOFFFF) I ((V3 & 0 x F F ) « 1 6 ) ; } Op=="MEM-SB" kk (Offset&3)==2: { MemCOffset » L 2] <- (Mem'COff set » L 2] & OxFFFFOOFF) I ((V3 ft 0 x F F ) « 8 ) ; } Op=="MEM-SB" kk (Offset&3)==3: { MemCOffset » L 2] <- (Mem' [Offset » L 2] k OxFFFFFFOO) I (V3 k OxFF) ; } Op=="MEM-SH" kk (Offsetfc3)==0: •C 70 MemCOffset » L 2] <- (Mem'[Off set » L 2] & OxOOOOFFFF) I (V3 « 16); Op=="MEM-SH" fefe (0ffset&3)==2: { MemCOffset » L 2] <- (Mem' [Offse t » L 2] fe OxFFFFOOOO) I (V3 & OxFFFF); } } //.= t r u e : { goto Wri teBackStage; } stage ALUStage s \ Op=="ALU-SLL" •c Reg[T ? T 32] <- (VI « V2) } Op=="ALU-SRA" Reg[T ? T 32] <- (VI » A V2) } Op=="ALU-SRL" •c Reg[T ? T 32] <- (VI » L V2) } Op=="ALU-MFHI" •c Reg[T ? T 32] <- (Hi ) } 0p=="ALU-MFL0" { Reg[T ? T 32] <- (Lo ) } Op=="ALU-MTHI" { HiTemp <- (VI ) } 0p=="ALU-MTL0" { LoTemp <- (VI ) } Op=="ALU-ADD" { Reg[T ? T 32] <- (VI + V2) } Op=="ALU-ADDU" •c Reg[T ? T 32] <- (VI + V2) } Op=="ALU-SUB" { Reg[T ? T 32] <- (VI - V2) } Op=="ALU-SUBU" { Reg[T ? T 32] <- (VI - V2) } Op=="ALU-AND" i Reg[T ? T 32] <- (VI fe V2) } Op=="ALU-OR" •c Reg[T, ? T 32] <- (VI 1 V2) } 0p=="ALU-X0R" Reg[T ? T 32] <- (VI V2) } Op=="ALU-NOR" -c Reg[T ? T 32] <- "(VI 1 V2) } Op=="ALU-SLT" •c Reg[T ? T 32] <- (VI <S V2) } Op=="ALU-SLTU" { Reg[T ? T 32] <- (VI <u V2) } Op=="ALU-LUI" { Reg[T ? T 32] <- (V2 « 16) } Op=="LINK" i Reg[T ? 
T 32] <- (PC + 8) } Op=="ALU-MULT" Op=="ALU-MULTU" Op=="ALU-DIV" Op=="ALU-DIVU" { HiTemp<-(Vl *SH V 2 ) ; LoTemp<-(Vl *SL V 2 ) ; } { HiTemp<-(Vl *UH V 2 ) ; LoTemp<-(Vl *UL V 2 ) ; } { HiTemp<-(Vl '/.S V2) ; LoTemp<-(Vl /S V 2 ) ; } { HiTemp<-(Vl '/.U V2) ; LoTemp<-(Vl / U V 2 ) ; } (Op & 512)==512 : { O f f s e t <- (VI + V2) ; } / / MEMORY t r u e : { goto MemoryStage; } stage DecodeStage •C t r u e : { goto ALUStage; } (INS fe 0xFC00003F) == 12: / / SYSCALL { M e m [ . . ] < - R e g [ . . l ] < - R e g [ 3 . . ] < - H i < - L o < - t r ; T<-0; T2<-0; 0p<-"SYSCALL"; VI <- Reg$ [2]; V2 <- Reg$[4] V3 <- Reg$[5] (INS fe 0xFC000030) == OblOOOOO: / / ADD[U],SUB[U],AND,OR,XOR,NOR,SLTCU] 71 •c Meml"..] <- R e g [ . . I N S [ 1 5 . . 1 1 ] - l ] <- Reg[INS [15 . . 11]+1.. ] <- H i < - Lo<- t r ; T2 <- 0; VI <- Reg$[INS[25. .21]] ; Op <- INS[ 5 . . 0 ] ; V2 <- Reg$[INS[20. .16]] ; T <- INS[15. .11] ; } (INS & 0xFCOO003C) == 0: / / SLL,SRL,SRA { Mem[. . ] <- R e g [ . . I N S [ 1 5 . . 1 1 ] - l ] <- Reg[INS[15. . 11]+1. . ]<- H i < - Lo<- t r ; V K - R e g $ [INS [20. . 16] ] ; T2<-0; T < - I N S [ 1 5 . . 1 1 ] ; 0p<-(INSft3); V2<-INS[10 . .6 ] ; } (INS & 0xFC00003C) == 4: / / SLLV,SRLV,SRAV { Mem[. .] <- R e g [ . . I N S [ 1 5 . . 1 1 ] - l ] <- Reg[INS [15. . 11]+1. .] <- H i < - Lo<- t r ; VI <- Reg$[INS[20. .16] ] ; V2 <- (Reg$[INS[25. .21]] & O x l F ) ; K - I N S [ 1 5 . . 11] ; T2<-0; 0p<-INSft3; } (INS & OxFC00003C) == ObOllOOO: / / MULT,MULTU,DIV,DIVU T<-0; T2<-1; Op<-INS[5 . . 0 ] ; V K - R e g $ [ I N S [ 2 5 . .21]] ; V2<-Reg$[INS[20. .16]] ; R e g [ . . ] <-Mem[. . ]<- tr; } (INS & 0xFC00003F) == ObOlOOOl: / / MTHI •c V K - R e g $ [ I N S [ 2 5 . .21]] ; Reg[ . .]<-Mem[. .] <-Lo<-tr; T<-0; T2<-1; Op<-17; } (INS & 0xFC00003F) == O b O l O O l l : / / MTLO •C V K - R e g $ [ I N S [ 2 5 . .21]] ; Reg[ . .]<-Mem[. .] < -Hi<- tr ; T<-0; T2<-1; 0p<-19; } (INS & 0xFC00003D) == ObOlOOOO: / / MFHI.MFLO •C R e g [ . . I N S [ 1 5 . . 1 1 ] - l ] <- Reg[INS[15. 
.11]+1 . . ]<- Mem[ . . ]<- H i < - Lo<- t r ; T < - I N S [ 1 5 . . 1 1 ] ; T2<-0; 0p<-INS[5 . .0 ] ; } (INS fe OxFCOOOOOO) == 0x3C000000: / / LUI { Mem[. .] <- Reg[INS[20. .16]+1. . ] <- Reg[ . . I N S [ 2 0 . . 16]-1]<- H i < - Lo<- t r ; T < - I N S [ 2 0 . . 1 6 ] ; T2<-0; Op<-"ALU-LUI"; V2<-INS[15 . . 0 ] ; } x -/ / ADDI , ADDIU , SLTI , SLTIU , ANDI , ORI , XORI (INS fe OxEOOOOOOO)==0x20000000 fefe (INS & OxFCOOOOOO)!=0x30000000: •C Mem[..] <- Reg[INS[20 . .16]+1 . . ] <- R e g [ . . I N S [ 2 0 . . 1 6 ] - 1 ] < - H i < - Lo<- t r ; VI <- Reg$[INS[25. .21]] ; T <- INS [20 . . 16]; T2<-0; } (INS & OxFCOOOOOO)==0x20000000: { Op<-"ALU-ADD" ; V2 <- INS [ [ 1 5 . . 0 ] ] ; } (INS fe OxFCOOOOOO)==0x24000000: { Op<-"ALU-ADDU" ; V2 <- INS [ [15 . . 0 ] ] ; } (INS fe OxFCOOOOOO)==0x28000000: { Op<-"ALU-SLT" ; V2 <- INS [ [15 . . 0 ] ] ; } 72 (INS & OxFCOOOOOO)==0x2C000000: { Op<-" (INS fc OxFCOOOOOO)==0x30000000: { Op<-" (INS fc OxFCOOOOOO)==0x34000000: { Op<-" (INS & OxFCOOOOOO)==0x38000000: { Op<-" ALU-SLTU" ; V2 <- INS [ [15 . . 0 ] ] ; } ALU-AND" ; V2 <- INS [15. .0] ; > ALU-OR" ; V2 <- INS [15. .0] ; } ALU-XOR" ; V2 <- INS [15. .0] ; > (INS & 0xFC00003F)==0bl000: / / JR { PN := Reg$[INS[25. .21]] ; PF := 1; M e m [ . . ] < - R e g [ . . ] < - H i < - L o < - t r ; Op<-"LINK"; T<-0; T2<-0; } (INS & 0xFC00003F)==0bl001: II JALR { PN := Reg$[INS[25. .21]] ; PF := 1; Op<-"LINK"; T2<-0; K - I N S [15. . 11] ; Mem[. . ]< -Reg[ . .INS [15. . 11] -1 ]<-Reg[INS[15 . . 11]+1. . ] <-Hi<-Lo<-tr ; } (INS & OxFCOOOOOO)==0x08000000: / / J i PN := (PCftOxFOOOOOOO) I ( INS[25 . . 0] « 2 ) ; PF := 1; M e m [ . . ] < - R e g [ . . ] < - H i < - L o < - t r ; Op<-"LINK"; T<-0; T2<-0; } (INS & OxFCOOOOOO)==0x0C000000: / / JAL •c PN := (PCftOxFOOOOOOO) I (INS[25. .0] « 2 ) ; PF := 1; Mem[ . . ]< -Reg[ . . 30 ]< -Hi<-Lo<- tr ; Op<-"LINK"; T<-31; T2<-0; } (INS fc OxFCO10000)==0x04000000: / / BLTZ,BGEZ { PN:=PC+(INS[[15. . 0 ] ] « 2 ) + 4 ; Mem[. . ]<-Reg[ . . 
] < - H i < - L o < - t r ; Op<-"LINK"; PF:=(INS&0xlOOOO) ~~ (Reg$[INS[25. .21]] <S 0 ) ; T<-0; T2<-0; } (INS fc OxFCO10000)==0x04010000: / / BLTZAL.BGEZAL { PN:=PC+(INS[[15. . 0 ] ] « 2 ) + 4 ; Mem[. . ]<-Reg[ . . 30 ]<-Hi<-Lo<- tr ; Op<-"LINK" ; PF:=(INS&OxlOOOO) (Reg$[INS[25. .21]] < S 0 ) ; T<-31; T2<-0; } / / BEQ / BNE / BLEZ / BGTZ (INS fc OxFOOOOOOO)==0x10000000: / / B E q , B N E , B L E Z , B G T Z { PN:=PC+(INS[[15. . 0 ] ] « 2 ) + 4 ; Mem[. . ] <-Reg[ . . ] < -Hi<-Lo<-tr : Op<-"NOALU" : T<-0; T2<-0; } (INS fc OxFC00OOOO)==0xl0OOOOO0: { PF:=Reg$[INS[25. .21]]==Reg$[INS[20. .16]] (INS & OxFC00OOOO)==0xl4OOOOOO: i PF:=Reg$[INS[25. .21]] !=Reg$[INS[20. .16]] (INS & OxFC000OOO)==0xl80OOOOO: { PF:=Reg$[INS[25..21]]<=S 0' (INS fe 0xFC000000)==0xlC000000:'{ PF:=Reg$[INS[25. .21]] > S 0 (INS & OxEOOOOOOO)==0x80000000: / / LW, L B ( U ) , LH(U) 73 •c Mem[. . ] <- Reg[ . . INS[20. .16]-1] <- Reg[INS[20. . 16]+1. . ]<- H i < - Lo<- t r ; Op <- INS[31. .26] + 512; VI <- Reg$[INS[25. .21]] ; V2<-INS[[15. . 0]] ; K - I N S [ 2 0 . . 16] ; T2<-0; } (INS fc 0xEOOO0000)==OxA0000000: / / SW, SWB, SWH { Op <- INS[31. .26] + 512; VI <- Reg$[INS[25. .21]] ; V3 <- Reg$[ INS[20 . . 16 ] ] ; V 2 < - I N S [ [ 1 5 . . 0 ] ] ; R e g [ . . ] < - H i < - L o < - t r ; T<-T2<-0; } c o n s t r u c t o r FetchStage i PF# == 0 : { INS <- Mem#[ (PC+4) » L 2 ] ; PC<-PC'+4 ; } PF# == 1 : { INS <- Mem#[ (PN# ) » L 2 ] ; PC<-PN# ; PF:=0 ; } t r u e : { Reg[0]<- tr ; goto DecodeStage; } } II-74 
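The sub-word store guards in MemoryStage perform a read-modify-write on the containing word: the previous value of the word (Mem') is masked and the store data is shifted into the byte lane selected by the low address bits, with (Offset & 3) == 0 naming the most significant byte. The following is a minimal sketch of that merge in ordinary Python, not in the specification language; the function name and table are ours.

```python
# Byte-lane merge as in the MEM-SB guards of MemoryStage.
# Lane 0 is the most significant byte (big-endian numbering).
MASKS = {0: 0x00FFFFFF, 1: 0xFF00FFFF, 2: 0xFFFF00FF, 3: 0xFFFFFF00}

def store_byte(word, value, offset):
    lane = offset & 3
    shift = (3 - lane) * 8
    # Clear the target lane of the old word, then OR in the new byte.
    return (word & MASKS[lane]) | ((value & 0xFF) << shift)
```

The MEM-SH guards are the same idea with 16-bit masks and a shift of 16 or 0.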
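The branch guards in DecodeStage all compute the target as PC + (INS[[15..0]] << 2) + 4, where the double brackets denote sign extension of the 16-bit immediate. A sketch of that arithmetic in Python (the truncation to 32 bits is our assumption about the simulator's word size, and the function names are ours):

```python
def sign_extend16(imm):
    # INS[[15..0]]: reinterpret the low 16 bits as a signed value.
    return ((imm & 0xFFFF) ^ 0x8000) - 0x8000

def branch_target(pc, imm16):
    # PN := PC + (INS[[15..0]] << 2) + 4, truncated to 32 bits.
    return (pc + (sign_extend16(imm16) << 2) + 4) & 0xFFFFFFFF
```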
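The load/store decode rules fold the 6-bit major opcode directly into the Op enumeration by adding 512, which is why MEM-LW is 547 (opcode 35 + 512), and why the guard (Op & 520)==520 in MemoryStage and WriteBackStage singles out stores: 520 = 512 + 8, and every store opcode (40, 41, 43) has bit 3 set while no load opcode does. A small Python check of this encoding trick, written outside the specification language with names of our choosing:

```python
# DecodeStage: Op <- INS[31..26] + 512 for loads and stores.
def op_of(ins):
    return ((ins >> 26) & 0x3F) + 512

def is_store(op):
    # Guard used in MemoryStage/WriteBackStage: (Op & 520) == 520.
    return (op & 520) == 520
```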
