UBC Theses and Dissertations



OVNI-NET: A flexible cluster interconnect for the new OVNI Real-Time simulator. De Rybel, Tom, 2005.


Full Text


OVNI-NET: A flexible cluster interconnect for the new OVNI Real-Time simulator

by Tom De Rybel
Ind. Ing. Hogeschool Gent, 2002

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES (Electrical Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA

March 2005

© Tom De Rybel, 2005

Abstract

The original OVNI Real-time simulator employs one of the lowest-latency network solutions available today. However, with the advent of the MATE solution scheme, a new, more flexible interconnect solution became necessary. The current system is also limited in bandwidth and tied to the IDE hard-disk interface, a technology soon to disappear in favor of serial ATA. The new system presented here circumvents the normal PC standard ports and is built directly on top of the industry-standard PCI bus. In addition, the parallel connections between the nodes are replaced by high-speed full-duplex serial links, allowing for a system of larger physical size with a more manageable wiring solution. These enhancements yield an even lower latency while significantly boosting the throughput. They also allow the hardware to be used on non-PC platforms such as Sun, Apple, HP, and IBM workstations. A major change to accommodate the flexibility of MATE was the incorporation of an intelligent shared-memory hub, avoiding many of the latency issues introduced by switched packet network solutions. To further boost performance, no microprocessors were used in the network system. Instead, the required intelligence was distributed throughout the system and implemented locally in logic. This allows for highly parallel operation of the interconnect with an absolute minimum of overhead. The integration of the new interconnect system and the MATE solution scheme is now so tight that the network system itself became part of the solution algorithm.
However, due to the highly modular design approach and the extensive use of FPGA technology, the system can be modified at will. Very different solution schemes and enhancements such as hardware error control, digital and analog I/O, and other features can be implemented without electrical modifications, allowing the same physical hardware to be used in both a research and a production environment. The result is an economical, obsolescence-proof, flexible, high-speed, low-latency cluster interconnect system built around industry-standard components.

Acknowledgments

No project is ever the work of a single person, and this thesis is no exception. By working on the OVNI project, I was introduced to the research of many people. It was from that diverse pool of ideas that I would eventually find my own, the inspiration behind OVNI-NET. Yet mere exposure to knowledge in itself is not enough. There has to be guidance to see the full meaning of the words on the page. Here I was lucky enough to be able to cooperate with the people who originated them, who shared some of their time, experience, and wisdom to make those words clear and self-evident. In the mere two years of research that followed I was shown a great deal of elegant beauty. To those who created it, this page is offered as a token of appreciation. First of all I would like to thank my supervisor, Dr. Jose Marti, for his guidance, unconditional support, and the occasional kick in the butt. Also for recognizing when I didn't quite understand what was being explained and giving it another try. Repeatedly. Then there was Jorge Hollman, the author of the work my project started from, who not only encouraged me throughout and let me bounce ideas off him, but also gave me advice on UBC's many policies, inadvertently showing me how to be "creative" with them.
As in many projects before, also this one came to that feared point where I got stuck in a torrent of questions, culminating in that dreadful realization that I had to, well, ask for help. And help I received from all over the world! Within UBC, Dr. Michael Davis and Marvin Tom helped me tame Xilinx's WebPack, showing me once again that manuals do have a purpose, even if merely an obscure one. Then there was Marco Buffa from Italy, who guided me through the beginnings of writing a device driver on Linux and showed me the logic behind the wall of programming interfaces. Also Miha Dolenc and Tadej Markovic, the original authors of the PCI core I used, gave me some advice and, in their turn, showed me the wisdom embedded in the RTFM acronym. Again within UBC, Andrey Vlassov and Luca Filipozi shared some of their Linux experience, asking me questions whose convoluted answers resulted in slight "tweaks" of the design. But beyond direct technical help I would also like to mention David Chu Chong and Lance Armstrong, whose tremendous efficiency in processing my many "wanted last week" and "this is a rather weird part" orders showed actual feats of bending space-time. Not to mention their proverbial calm on the phone when dealing with suppliers, up to and way past the point where I would have assigned a hit squad. The logic analyzer I could borrow from Alan Prince and Tihomir Tunchev further helped me keep my sanity relatively intact. I would also like to apologize to my many colleagues in the lab, who besides humoring my humor also didn't destroy their computers too often or otherwise strain my nerves, which, I must admit, were rather tense every so often. Not to mention the incidental cases of serial, multi-language swearing at hapless pieces of equipment which sometimes decided to reply with smoke signals to my requests and then ended up rather more modular than originally envisioned.
I hope to learn some more bad language from you all so as to render those sessions less repetitive. Finally I'd like to thank my parents, without whom I wouldn't be here, and Doris Metcalf, the department's graduate secretary, without whom I wouldn't be in Canada, which sorta puts the first at odds with the latter.

Contents

1 Introduction
  1.1 Latency
  1.2 Built-in hardware abstraction
  1.3 Wiring
  1.4 Re-configurability
2 Requirements
3 Background
  3.1 MATE: Multi-Area Thevenin Equivalent
  3.2 MATE Connectivity
  3.3 Δt: Simulation clock
4 Implementation
  4.1 Overview
  4.2 Principle of operation
  4.3 The various busses
    4.3.1 The PC interface bus: PCI
    4.3.2 The hub bus: GTLP
    4.3.3 Node links: LVDS
    4.3.4 On the chip: WishBone
  4.4 PCI Core
    4.4.1 BIOS
    4.4.2 IRQ
    4.4.3 BAR
    4.4.4 WB Master/Slave interfaces
    4.4.5 Clock domains
  4.5 Hub bus implementation
5 Central solver board
  5.1 Hardware Architecture
    5.1.1 PCI
    5.1.2 FPGA
    5.1.3 Clock
    5.1.4 GTL+
  5.2 Software Core Architecture
    5.2.1 PCI Core
    5.2.2 Δt timer
      Architecture
      Config space
    5.2.3 WishBone Bus link
      Architecture
      Dynamic Discontinuous Address Translation
      Config space
6 Hub bus terminator
  6.1 Hardware Architecture
    6.1.1 Finding the bus impedance
    6.1.2 Implementation
7 Hub Bus Link
  7.1 Hardware Architecture
    7.1.1 Implementation
  7.2 Software Core Architecture
8 Hub Node Link Card
  8.1 Hardware Architecture
  8.2 Software Core Architecture
    8.2.1 Serial Links
      Protocol
      Architecture
    8.2.2 Dual-port memory
9 Serial Transceiver
  9.1 Hardware Architecture
    9.1.1 Power
    9.1.2 Matching
    9.1.3 CAT5 Cable alternative
    9.1.4 Notes
10 Node solver board
  10.1 Hardware architecture
    10.1.1 Serial transceiver interface
  10.2 Software Core Architecture
    10.2.1 Dual-port memory
    10.2.2 Serial links
      Config Space
    10.2.3 Skip timer
      Config Space
11 API
  11.1 ioctl
    11.1.1 IOCTL manpage entry
    11.1.2 Example use
  11.2 mmap
    11.2.1 mmap manpage entry
    11.2.2 msync manpage entry
    11.2.3 Example use
12 Results
13 Conclusions
14 Statements
Bibliography
15 Schematic diagrams and Board Designs
  15.1 Serial Transceiver
  15.2 PCI IO Card
  15.3 Hub Bus Terminator
  15.4 Hub Bus Link
  15.5 Hub Card
16 Tests
17 Photographs

List of Figures

3.1 Subsystem to MATE to OVNI-NET Mapping
3.2 OVNI-RTDNS vs OVNI-NET Connectivity
3.3 Multiple Δt
4.1 OVNI-NET Hardware Overview
4.2 Cycle Data Flow
4.3 MYRINET vs OVNI-NET
4.4 PC Bus Architecture
4.5 PCI Core
5.1 PCI Card
5.2 Δt timer
5.3 Data Formatter
5.4 Dynamic Discontinuous Address Translation
6.1 Tapped Backplane Bus
6.2 GTL+ Bus Terminator
7.1 Bus Link
8.1 Hub Node Link Card
8.2 Hub Module
8.3 Serial Protocol
8.4 Serial Links Unit
9.1 Serial Transceiver
10.1 Node Solver Core
16.1 Δt timer 2 μs pulses
16.2 Central node loop back test screen shot
16.3 Logic Analyzer screen shot

List of Tables

4.1 Bus Layout
5.1 Central Node timer config space
5.2 Central Solver links config space
10.1 Node Solver links config space
10.2 Node Solver skip timer config space
12.1 Single-Line point-to-point data transfer performance
12.2 Point-to-point bandwidth OVNI-NET advantage
16.1 Logic Analyzer signal order

Chapter 1 Introduction

The real-time simulation of large power systems poses many great challenges.
Even with highly optimized algorithms and models, a modern desktop computer is still insufficient to handle anything more than a few hundred nodes. More nodes are possible with large multi-processor machines, but this brings issues with the cost of the hardware, the operating systems, and the specialized administration skills required. Single- and double-processor PC hardware, on the other hand, is steadily becoming cheaper and ever more powerful. If it were possible to map a large problem over many commodity computers, an economical but powerful machine could be built. The research at the UBC Power Systems group, more specifically the work done on MATE [1, 2] and OVNI [3, 4, 5, 6, 7], proposes a method to spread a simulation over many machines. To be able to do this, we need a way to connect all of these computers together that is suitable for hard-real-time simulations. In this thesis we will investigate a novel platform for cluster interconnections based on FPGA (Field-Programmable Gate Array) technology that allows us to seamlessly map the MATE solution onto the underlying hardware at great efficiency. Even more, the proposed network hardware platform can be modified at will to accommodate very different systems as well, giving us the flexibility to implement many, and even mutually exclusive, network concepts on the same hardware with only a simple code download. As a result we can now mix and match interconnect topologies within the same system, optimizing local areas where, for example, a point-to-point approach would work better than the centralized-hub topology that MATE requires. By distributing and incorporating the required intelligence in the networking hardware itself, the computational nodes are fully transparent to the network. There is no software overhead to do the communications, further improving the latency. It is for this reason that the proposed system can be referred to as a "Distributed Network Processor".
Much like a mathematical co-processor, or the GPU (Graphics Processing Unit) on a modern video card, it takes most of the burden of the task away from the main CPU and performs it faster and better than the main system could, leaving it more time and resources for other tasks. Another advantage is that, from a software standpoint, the network does not exist, abstracting the whole cluster to a simple shared-memory pool. This simplifies the programming and avoids having to incorporate hardware-specific routines into the solver core, making the program more portable.

1.1 Latency

Technically, OVNI-NET's main strength is found in its non-additive latency. (Latency is how long it takes before the first bit of data becomes available at the receiving end; the total communication time is the latency plus the actual data payload.) Because the amount of transferred data in MATE is limited, latency becomes a major factor in the overall performance of the system. But that is not all. Due to the centralized solver architecture of MATE, the connection of this part to the rest of the system requires high bandwidth as well, since it carries all the data in the system. OVNI-NET offers that: at 528 Mbps it is fast. With this system we have designed a high-bandwidth, low-latency, re-configurable solution that can be tailored for a multitude of applications. Its shared-memory approach sets the system apart from most designs and makes it a perfect fit for MATE, both by design and philosophy.

1.2 Built-in hardware abstraction

In C, and less visibly in C++, pointers point to the location of variables in the computer's main memory. These pointers are thus nothing but location addresses of data. Suppose that OVNI, the implementation of MATE, is running on a single machine. In that case the simulation core is sharing one chunk of memory for its operations. The various virtual sub-systems all have their own section of this memory for their Thevenin equivalents. In fact, one could cleverly map these directly into the space reserved for the G connectivity matrix. Now cue OVNI-NET.
As far as the software is concerned, it still reads/writes its Thevenin equivalents to an assigned piece of memory in the nodes, and the central machine still reads/writes its virtual G matrix from its local memory. The difference is that all of these reserved blocks of memory are now on separate machines. OVNI-NET ties these all together in the shared memory of its hub. Through the built-in intelligence of the system, this shared memory is constantly kept up to date with the results in the various nodes and the central solver. As far as the individual machines that make up OVNI are concerned, there is no network. The communication hardware autonomously manages the Δt and all the copying of data from and to the machines. The simulation runs on the rhythm of the simulation interval, passed through hardware and software interrupts to the nodes and the central solver. There are of course some smaller measures that have to be taken, but in essence, when the core is implemented with the network concept in mind and the memory allocation for the system is done accordingly, the whole cluster becomes transparent. The result is total hardware abstraction, where a multitude of computers become one single virtual machine.

1.3 Wiring

In addition to the need for functional abstraction, the physical construction of a large computer cluster places a number of physical constraints on the wiring of the machine. It has to be tidy, function in less-than-ideal circumstances, and allow easy maintenance. To achieve this, the system uses differential serial data transfer to the computational nodes, employing a high-grade shielded twinax cabling solution that can be upgraded to fiber optics.
The connections require only two thin cables to be routed in addition to the normal PC supply and network connections. Also, all of these wires concentrate in the OVNI-NET network hub. This can be rack-mounted and results in a neat, easily accessible system.

1.4 Re-configurability

In a research environment, new ideas are, ideally, a matter of course. Unfortunately, classic hardware design is terribly inflexible. The use of microprocessors brought much of the flexibility back, but at a significant cost in latency and speed. These embedded computer chips have to run programs and are by nature sequential devices. In a situation where latency is of cardinal importance, their use is prohibitive. The use of very high speed parts certainly alleviates the issue to some extent, but it is still a sub-optimal solution to the problem. The network system needs intelligence, but the requirements posed upon it are limited. State machines are sufficient. These can be implemented in hard-wired boards, or in ASICs (Application-Specific Integrated Circuits). Both of these time-honored solutions are unusable due to their inherent lack of re-configurability. Even a small change in the desired operation requires a lengthy, expensive physical redesign. This does not really invite experimentation. A solution lies in the use of FPGAs. These re-programmable chips have almost all the benefits of real hardware without the most inhibiting drawbacks. The ability to change an entire design to something that bears absolutely no resemblance to the previous one certainly stands out. But another, frequently overlooked, property is that in hardware, either hard-wired or in these chips, everything runs in parallel. The possibility to have many "programs" run simultaneously is something that microprocessors can only truly achieve by having many of them working together. This true simultaneous multi-tasking is the technology that enables the OVNI-NET concept.
Data can be received and recorded in memory while another part of the chip keeps track of the Δt while yet another part is arbitrating on the PCI interface. All at the same time, without missing a clock or pausing one task while processing another. Thus the entire set of sub-programs runs at the hardware speed of the programmable chip. This is how OVNI-NET achieves PCI-bus-speed operation throughout the entire system without having to resort to expensive microprocessors and peripheral hardware. It is also a key element in its universal design concept. From the ground up, the system was designed with change in mind. The modularity allows topologies ranging from point-to-point to a fully switched-packet network (when the hub is expanded with a switch fabric). In the system presented in this thesis, the architecture is a non-standard star with a shared external memory pool. Almost any classic network topology is possible with OVNI-NET without having to design any new physical hardware. (The exception is a switched network, as the designed hub does not have a switch fabric, to reduce cost. This is the whole point: a switch is not an efficient solution for MATE and the added complexity is thus undesired. But a switch can be fairly easily designed if one is required.) Point-to-point, bus, star, token-ring, (non-simultaneous) message passing, one-by-one switching, and more are possible with the same hardware. This flexibility is a plus in industry as well, as it comes at no extra cost. It makes the installation obsolescence-proof, and allows for expansion and integration in existing infrastructure. The interface cards can be easily modified to accommodate a host of different functions and devices. Even most hardware design flaws can be fixed by a simple in-field re-programming of the boards, by the customers themselves.
As with software, optimizations and fixes can be distributed on a disk or through the Internet, allowing clients to keep their installation up to date with minimal down-time.

Chapter 2 Requirements

Hard-real-time power system solutions are very demanding of the underlying hardware. The problem is one of latency. Normal network hardware, such as Ethernet, has latencies on the order of 100 μs. For a typical real-time simulation, the time-step is 50 μs. As such, the communication alone would take at least twice the total available simulation time. Specialized network systems, offered by companies such as Myrinet and Dolphin, have much lower latencies (on the order of 3 μs) and look suitable at first glance. However, when put in the MATE environment, the picture starts to look very dim the moment more than a few machines are connected together, as the latency of these systems is additive. (They are switched-packet networks. Every time another machine is addressed, it takes 3 μs for this to happen, regardless of the data payload.) It quickly became clear that a new network concept was needed to fully exploit the power of MATE. A previous system [8, 5, 4] developed at the UBC Power Systems Group by Jorge Hollman et al. did offer exactly that, but was limited to point-to-point operation and a small number of links between the virtual sub-systems. The new MATE concept required many more links and a far higher configuration flexibility. As the existing method could not be expanded to the MATE requirements, a totally new concept was developed. We can list the new requirements as follows:

• Low latency: overhead in the μs range, non-additive.

• High bandwidth: due to the central-solver approach of MATE, all link data has to be exchanged with this machine. The more bandwidth we have, the faster this can happen and the lower the total communications time.
• Programmable simulation clock: real-time power system simulations need a reliable Δt signal to synchronize both the communications and the solution itself. The clock needs to support different simulation intervals on a per-node basis in order to support the solution concept of circuit latency [9, 10, 11].

• Flexible: OVNI is a research project. As such, it is in constant motion and the hardware should be able to adapt to the changing nature of the system.

• Easy to use: there should be enough abstraction from the hardware so that it becomes nearly transparent to the core programming. A suitable device driver and API (Application Programming Interface: an agreed-upon set of interfaces to the hardware, abstracting the details from its user) hides much of the complexity from the core and users.

• Affordable: to keep the price reasonable, only COTS (Commercial Off-The-Shelf) parts can be used.

The system presented in this thesis offers all of the above. With its 3 μs latency (without data payload), 528 Mbps bandwidth, and a data size of 1024 32-bit words per simulation node, it allows sizable systems to be simulated within the MATE environment in real-time.

Chapter 3 Background

In order to achieve high communications efficiency, we must first take a closer look at how the system is represented in MATE. This formulation is unique in that it can be almost transparently mapped to a customized interconnect system.

3.1 MATE: Multi-Area Thevenin Equivalent

At the heart of the OVNI simulator lies the MATE system representation. It segments a large simulation into smaller chunks (areas or sub-systems) which can be solved on individual solver nodes. These areas are then represented as Thevenin equivalents to the outside world. The connections between these sub-systems are represented as the links impedance matrix [2].
Due to this scheme, the internal complexity of the sub-systems can be hidden from the overall simulator. To solve for the entire system, the central machine only has to solve for the links impedance matrix connecting the different Thevenin equivalents, which is independent of the internal complexity of the individual subsystems. Using matrix notation, the sub-systems and links are formulated as a large (sparse) matrix. The sub-systems are organized around the diagonal and the links in columns (Fig. 3.1). Looking at the high degree of regularity, we can readily identify the areas that contain the information we want to communicate in the system. The design objective is to map these areas (containing the solved data) into the physical solver nodes joined by physical interconnects. The result is a very transparent and natural way of communicating within the system.

Figure 3.1: Subsystem to MATE to OVNI-NET Mapping
Figure 3.2: OVNI-RTDNS vs OVNI-NET Connectivity

3.2 MATE Connectivity

OVNI-RTDNS [5] employed point-to-point connections. Within the implementation, this allowed a maximum of two links per machine. As a result, the potential network connectivity was severely limited. In OVNI-NET, however, the links between sub-systems are abstracted. As such, there is no longer a limit (except for the maximum data size) on the number of links between sub-systems, allowing full graph connectivity if so desired. For example, the illustration in Fig. 3.2 shows the largest possible connectivity of OVNI-RTDNS with four nodes next to OVNI-NET. It has to be noted that several lines between sub-systems are also possible with MATE, but here only single lines are shown for clarity.
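Written out for two subsystems, the block structure sketched above takes the following form. This is an illustrative sketch only: the symbols are generic (subsystem matrices A_k, link connectivity columns p_k, link impedance matrix z, link currents i_l) and do not claim to reproduce the exact notation of [2].

```latex
% Subsystems on the diagonal, link columns on the border:
\begin{bmatrix}
  A_1     & 0       & p_1 \\
  0       & A_2     & p_2 \\
  p_1^{T} & p_2^{T} & -z
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ i_l \end{bmatrix}
=
\begin{bmatrix} h_1 \\ h_2 \\ 0 \end{bmatrix}
```

Each solver node eliminates its own unknowns v_k locally, leaving the central machine only the small reduced link system in i_l to solve. It is this bordered-block regularity that maps so directly onto physical nodes plus a central solver.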
Figure 3.3: Multiple Δt

3.3 Δt: Simulation clock

Besides communicating the Thevenin equivalent data, the simulation requires a steady clock, not only to synchronize the communications but also the simulation itself. In MATE, this all-important Δt signal is not necessarily the same for all individual sub-systems. Using the concept of circuit latency [11, 9, 10, 2], some solver nodes can be running at an integer multiple of the fastest reference Δt. As such, the simulation clock system poses some extra requirements on the network system. It is agreed that the system runs at the fastest Δt in the simulation and that slower nodes skip a number of cycles before again taking part in the simulation. All of this happens without introducing extra software overhead, since it is implemented as state machines in the networking system logic. In itself, the simulation interval timing system is a good illustration of the power of the FPGA design approach. A clock of this type is uncommon in normal computer network systems, yet on the OVNI-NET platform we can add something as invasive as this without much effort (Fig. 3.3).

Chapter 4 Implementation

4.1 Overview

To function as a research platform, the system was designed with a high degree of flexibility in mind. Part of the solution was found in a fully modular design approach. But that in itself is not enough to accommodate the mutability of an environment where new concepts requiring very different networking layouts are frequently tested. In a production system, too, the ability to drastically modify the machine easily and at little cost protects the investment, as new techniques can be implemented with a mere software update. The result of these concerns is a design split into self-contained sub-systems, each employing their own local control hardware with the ability to fully re-configure the circuits.
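The cycle-skipping behaviour of Section 3.3 is a good example of such locally implemented intelligence. In the hardware it is a state machine in FPGA logic; its behaviour can be sketched in a few lines of C (the struct and function names here are invented for illustration, not taken from the actual cores):

```c
#include <stdbool.h>

/* Hypothetical per-node skip timer: a node whose simulation interval
 * is an integer multiple of the fastest delta-t participates only
 * every 'multiple' ticks and skips the cycles in between. */
typedef struct {
    unsigned multiple; /* node delta-t as a multiple of the fastest delta-t */
    unsigned count;    /* ticks elapsed since the node last participated    */
} skip_timer;

/* Called once per fastest-delta-t tick; returns true when the node
 * should take part in the current simulation cycle. */
bool skip_timer_tick(skip_timer *t)
{
    if (++t->count >= t->multiple) {
        t->count = 0;
        return true;
    }
    return false;
}
```

A node with multiple == 1 participates on every tick of the fastest Δt; a node with multiple == 3 participates on every third tick and skips the two cycles in between, exactly the circuit-latency behaviour described above.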
For these reasons one could call OVNI-NET a distributed network processor: distributed, as the intelligence is spread out throughout the system where it is needed locally; network processor, due to the autonomous handling of the communication tasks. Throughout the design, all efforts have been made to prevent hard-wiring of any functionality. In the cases where that could not be avoided, the entire sub-system was constructed on a separate board that plugs into the respective master board. If modifications are required, the necessary investment and system down-time are confined to this board. However, despite the rigorous choice for flexibility, a classic software approach using microprocessors was avoided. A solution bridging the two extremes, the total changeability but slowness of software and the total rigidity but high speed of wired hardware, was found in FPGA technology. This technology offers hardware-like speed and latencies with software-like flexibility and changeability. When designing the system, the power and inherently segmented approach of the MATE concept surfaced throughout the network design itself. Indeed, the divide-and-conquer philosophy almost dictated the layout of the system, tying principle and design closely together. This reliance upon one another is part of the key to achieving high performance while maintaining low cost. However, due to the flexibility of the design, it can be readily modified to suit an entirely different system as well. This polymorphic property is the strength of the OVNI-NET design concept.

4.2 Principle of operation

MATE is a two-step solution process. First the sub-systems, mapped onto the solver nodes, compute their local results in parallel. When they are done, those results are communicated to the central solver, which solves the reduced system represented by the links subsystem.
The updated link currents are then communicated back to the sub-systems and the whole process starts again. This cycle has to be completed within the smallest Δt used in the system. When looking at the block diagram of the network (Fig. 4.1), one can see that its topology follows the general concept. Each of the node computers has an interface card that connects to a hub. This hub concentrates, or separates, the data to the central solver. The link to this machine is parallel for additional speed. The whole network system runs synchronously on the clock of the central solver. The hub design is the key to the latency performance of OVNI-NET. For MATE, we only need to communicate small amounts of data, but with strict latency requirements. Ideally, the central solver should be able to grab whatever data it needs from any of the node computers' memory with zero delay and vice versa. This, however, is impossible, if nothing else due to the limitations imposed by the speed of electrical signals in the medium. The next best thing would be a multi-processor design with a shared memory. Although certainly possible, this brings us into the realm of large servers and other supercomputers. Their cost is prohibitive and it also defeats the whole purpose of MATE, i.e., the ability to run, and expand, on cheap commodity hardware. Limited by the PC platform, a possible solution is the use of PCI cards and a shared memory (the hub). This is what OVNI-NET does. The design functions not so much as a classic network, but rather as a shared memory pool. The moving of data in the system is handled through autonomous memory-to-memory copy operations. Autonomous in the sense that the PCI cards employ their own intelligence and are capable of handling all network communication details at hardware speeds. As a result, we already have savings in latency, as there is less need for back-and-forth communication cycles between the networking system and the system CPU.
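The data movement just described, autonomous memory-to-memory copies through a shared pool, can be modelled in miniature. The following C sketch is purely illustrative: the hub layout, node count, and word counts are invented, and in the real system these copies are performed by the FPGA logic over the serial links, never by software.

```c
#include <stdint.h>
#include <string.h>

#define NODES 3 /* illustrative sizes only */
#define WORDS 4

/* Toy model of the shared-memory hub: each node owns a slot that the
 * serial links keep synchronized; the central solver sees all slots
 * as one contiguous block and writes its results back the same way. */
typedef struct {
    int32_t to_central[NODES][WORDS];   /* node -> hub -> central */
    int32_t from_central[NODES][WORDS]; /* central -> hub -> node */
} hub_memory;

/* Phase 1: every node's sub-system solution lands in the hub.  In
 * hardware all serial links transfer in parallel; modelled here as a
 * simple loop of memory-to-memory copies. */
void gather(hub_memory *hub, int32_t node_data[NODES][WORDS])
{
    for (int n = 0; n < NODES; n++)
        memcpy(hub->to_central[n], node_data[n], sizeof hub->to_central[n]);
}

/* Phase 2: after the central solve, the updated link data is
 * scattered back into each node's local memory. */
void scatter(hub_memory *hub, int32_t node_data[NODES][WORDS])
{
    for (int n = 0; n < NODES; n++)
        memcpy(node_data[n], hub->from_central[n], sizeof hub->from_central[n]);
}
```

The point of the model is the shape of the traffic, not the code: one gather, one central solve, one scatter per Δt, with no per-node polling or switching in between.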
For illustration, let us walk through the communication cycle (Fig. 4.2) as it happens during operation.

Figure 4.1: OVNI-NET Hardware Overview
Figure 4.2: Cycle Data Flow

First, all nodes in parallel transfer the just-completed sub-system solution. The system's CPU memory-maps this data onto the card's local memory. This is merely an abstraction, as the card will immediately transfer the data to the hub. All nodes can do this in parallel. The transfer takes place over their 528 Mbps serial links to the hub, the shared memory pool. This memory is accessible in fully parallel form. The main savings in latency are found in this hub part of the design. First, the links are always on. This means that the PLLs of the serial-link serializers/deserializers are constantly in lock and do not need to be relocked. For the parts we used, this implies about 1 μs per link. Second, there is no switching. The central node does not have to poll each machine individually, request data, and address the next. This means that the latency of the system is NOT additive in this stage. With Myrinet networks and the like (Fig. 4.3), one has to access each machine sequentially, adding up the latencies. Myrinet is designed for a wholly different purpose, where each machine has to be able to talk to each other machine directly. Hence the use of switch fabrics and switched hubs. This adds complexity, latency, and much cost to the system that is not required in our case. Our shared-memory approach avoids this complexity altogether and is perfectly tailored to the MATE concept,
allowing it to work at far better performance. In OVNI-NET the network system infers almost all required operations from the data and the state of the network system itself, allowing even dynamic change of data sizes without program intervention. This aspect, and the latency savings explained above, gives our design superior latency performance due to the inherent specialized structure of the whole. Yet, we are not locked in to this one custom application, as the design can be tailored to other situations by re-programming the FPGAs. This way our design is quite universal while still being highly specialized. If one were to design a switch in addition to the hub, virtually all common network topologies would be possible.

Figure 4.3: MYRINET vs OVNI-NET

After the data copy to the hub, the second stage of the first half of the communication cycle can start. Here all results are copied directly into the central solver's memory. Since all data has already been gathered, it can then be moved in one single block, without interruptions, at the full PCI bus speed. Here lies the bottleneck of the MATE concept when implemented on real hardware. There is only so much data one can move in a certain amount of time, restricting the otherwise unlimited expandability. This is a common issue with all systems based on a single central solver. However, as we will show later, this is not a paramount obstruction, since with modern machines this limit is so high that even systems of substantial size can be handled without problems. Also, the advent of the new PCI-X architecture will alleviate the problem even more when it becomes more readily available. For the second stage of the communications (from central to nodes) the system is symmetric in reverse. Here the symmetry in the hardware serves another important purpose.
When interfacing with real-world devices it is imperative that all of these run at exactly the same clock rate, to prevent ADC/DAC sample time mismatches. When care is taken to make all interconnects of equal length, the inherent symmetry and the centrally-controlled Δt logic will assure a very accurate synchronization between all parts of the system. In addition, in most cases the PCI cards can be modified and expanded with an analogue interface card, enabling stand-alone data-acquisition and interfacing, further reducing cost by eliminating a PC as mediator for real-world interfacing.

4.3 The various busses

OVNI-NET uses a number of very different bus topologies in its construction. Since these string all the individual parts together, we will here go into some detail as to their design and the reasons they were chosen for the design.

4.3.1 The PC interface bus: PCI

There are many ways to communicate with a PC using expansion hardware. However, from an architectural standpoint, they all come down to two main approaches. One is through a local bus that interfaces directly to the microprocessor. The venerable VLB (VESA Local Bus) is a good example of that. These methods offer the best possible speed for an expansion card. Unfortunately, local busses are not very stable in their design and vary from processor generation to processor generation. PCI and AGP, on the other hand, are busses that connect through a bridge to the CPU. From figure 4.4 we can also see that they can communicate with the main memory through the bridge without using the CPU. The bridge thus isolates the bus from the CPU and allows the CPU to perform other operations while the bus is busy. It is this bridge that allows us to be so efficient in memory-to-memory applications such as this one. Strictly speaking, AGP doesn't use a true bridge chip and can run at far faster speeds (ideally the CPU front-side bus rate) than the PCI.
It also has a tighter memory integration than the older PCI bus, as well as pipelining, but its technical implementation is troublesome at this point, certainly when compared to PCI. In many respects AGP is more of a local bus. PCI implementations (IP cores: Intellectual Property, software descriptions of hardware instead of physical chips) are relatively plentiful, including an open-source one [12, 13] that can be used without licensing costs. This cuts down our development time significantly while keeping costs down. AGP, for instance, does not have this level of support, and its implementation would have been prohibitively expensive and time consuming. For PCI we can use economic commodity FPGAs, while for AGP we would have to use far faster, and thus far more expensive, parts. Unfortunately, PCI is not the fastest bus around these days, but it is still sufficient for our application.

Figure 4.4: PC Bus Architecture

It has to be noted that virtually all other busses in a PC (IDE, USB, FireWire, printer port, serial ports, etc.) are implemented on top of the PCI bus. As such, using any of these will result in a performance hit due to the extra overhead. A common issue with AGP, and PCI, is that they will disappear in the near future when PCI Express is introduced. The problem is that the PC architecture is currently at a critical point where a lot of the older technologies are (finally) being retired. The new buses (USB, FireWire, PCI Express, SATA, etc.) are without exception serial in nature, to reduce the wiring and drive current requirements. As such it is unwise to invest in obsolete technologies while the new ones are poised to take over the field quickly.
However, PCI has been steady for many years, and it is very likely that it will continue to exist alongside PCI Express for some time to come, or at least until PCI Express hardware matures and becomes more universally available. In addition, PCI Express is partly compatible with PCI on the protocol level, which should make the transition easier. Again, due to our modular design approach, the moment PCI Express becomes available we will only have to partially re-design one board, while the majority of the core in the chip and the hardware behind the bus can remain virtually unmodified. This is one of the strengths of FPGAs: big design changes translate into relatively minor physical changes, which can be implemented in a minimum of time at a minimal cost.

4.3.2 The hub bus: GTLP

To connect the hub together and link it with the central node, we opted for a high-speed parallel bus, mainly for ease of implementation. Designing a high-speed serial backplane is an expensive undertaking, and debugging signals in the GHz range on a tapped transmission line is non-trivial without very expensive test equipment. As such, we decided to use a lower-speed (33 MHz) 32-bit parallel bus in this part of the design. The electrical length of the hub backbone bus is about one meter. Then there is the link with the central solver. All in all the electrical length can be up to two meters. Ordinarily, modern high-speed parallel busses would use reflected-wave switching approaches. However, this would extend the electrical length to four meters, which becomes a noticeable factor in our latency figures. Circumventing this issue, a combination of two techniques was chosen to limit this extra electrical overhead. One was to place the link to the central solver physically in the middle of the hub bus, effectively halving the bus backbone delay.
A second reduction was found in the use of a modern, low-voltage incident-wave bus design using Texas Instruments' Gunning Transceiver Logic Plus [14, 15]. This technology can deliver sizable drive currents on a terminated bus while still keeping the overall power consumption reasonable. The high-speed capabilities, easy single-ended implementation, and robust design of GTLP [16] made it ideal for our 64-wire bus design.

4.3.3 Node links: LVDS

Although we can get away with a massive parallel bus in the hub and for the central node, as they can be easily integrated in one device, running wide parallel busses to the nodes would not be ideal for reliability and cost reasons. As such, a far more elegant approach was sought in serial techniques. Here high-speed GHz-range signaling would be less of an issue, as we would only be dealing with single, untapped, and fully terminated transmission lines. Considering many techniques, we eventually settled on a specific part by National Semiconductor [17]. The device uses LVDS [18]. Although Low Voltage Differential Signaling is not meant for long distances, it proved reasonable enough, as our cluster is still physically small and the technology worked fine over the distances required to wire at least three 19" racks. If more distance is required, chips equipped with ECL (Emitter Coupled Logic) drivers can be substituted. LVDS [19, 20] uses a differential link, giving good noise immunity and signal quality, even over cheap CAT5 [21] Ethernet cable.

4.3.4 On the chip: WishBone

The used PCI core is really a PCI bridge. It converts the PCI bus protocol to a dedicated SoC (System On Chip) interconnect bus. As such, the choice to use the WishBone [22] bus inside the FPGAs was merely a result of using the OpenCores PCI core [12, 13]. However, the WB is a well-designed, light-weight interconnect bus offering all the flexibility we need for our design.
The open specification is also extensively documented and proved quite easy to use in our environment. As a matter of fact, the specification leaves the actual implementation entirely in the hands of the designer: it is merely a protocol specification, not a physical layer description. Given the mostly memory-bank-oriented architecture of our design, the choice was made to implement the WB as a classic address-decoded shared address/data bus. The address decoders are distributed in the design, avoiding cascading them as much as possible, to obtain high decoding speeds and as such high performance. At 32 bit, 33 MHz, the SoC bus is running at the same rate as the underlying PCI bus. However, it is also running in a different clock domain, but this synchronization is taken care of in the PCI core module.

Figure 4.5: PCI Core

4.4 PCI Core

OVNI-NET's functioning is highly dependent on a high-performance interface with the host computers. At the time of development, our only viable choice was the PCI bus. This is a rather complex beast to work with. A good solution was found in the excellent OpenCores PCI core, a very complete implementation that even supports DMA and bus mastering. (We are not exploiting these capabilities yet, but they are available.) Although the core is extensively described in its design document [12] and specification [13], we will here note a few of the functions that immediately apply to us and will be referenced throughout the following descriptions.

4.4.1 BIOS

The system's BIOS, through the PCI bus, automatically assigns valid address ranges and interrupt lines to the hardware plugged into it. This process requires a specific set of configuration registers and support logic in the boards.
Called "configuration space", this part of the board's memory holds identification information for the system, the assigned variables, as well as some other board-specific information. It is through this system that modern Plug-and-Play hardware prevents the old jumper dances encountered on the ISA bus.

4.4.2 IRQ

It has to be noted that PCI explicitly allows IRQs (Interrupt ReQuests) to be shared with other hardware, in an attempt to get around the very limited number of those in the PC platform. This poses some special demands on the device drivers, as they have to make sure that the interrupt was for them, and not for another board sharing the same interrupt line.

4.4.3 BAR

Besides the interrupts, the host computer has to communicate with the card. Since everything is memory-mapped, we have to find out where exactly the BIOS assigned the start of our board's memory map in the system. This happens through the mechanism of BARs, or Base Address Registers. BAR0 is the configuration space, and can be obtained from the operating system. In BAR0 we can read the location of the other BARs. In Linux we can request these from the OS as well, but it is nice to note that when assigned, all this information is in fact to be found in the board's configuration space. We use BAR1 for all board operations.

4.4.4 WB Master/Slave interfaces

WishBone [22] interaction happens in two ways. The PCI core has both a master and a slave WB interface. In our case, we purely use the master. This however limits us to operations initiated by the host computer. Our only way to control it from the board is through the IRQ line. However, for future expansion we will have to use the WB slave of the PCI core as well. Then we will be able to construct our own WB master and do DMA (Direct Memory Access) transactions fully controlled by the hardware itself. However, due to time constraints this was not implemented for now.
4.4.5 Clock domains

In the OVNI-NET system, the stability of the clock is very important, as it directly controls the solution step Δt, and as such any operations with external devices. Unfortunately, the PCI bus clock is not reliable enough for our purpose. It can be stopped, scaled, or switched from 33 to 66 MHz when another, high-speed board is selected. For this reason our board is equipped with a local crystal oscillator. This signifies that we have to run the PCI core in two clock domains: the PCI side on the PCI bus clock and the WB side on the board clock (Fig. 4.5). The designers of the PCI core have taken care of this and implemented circular buffers [12] to re-synchronize both clock domains. These are not very deep, so it is preferable that both domains run at closely related speeds, so that few wait stages occur as the buffers fill up. In our implementation they both run at the same speed.

4.5 Hub bus implementation

Although the hardware does not impose it, we do have to decide on a fixed layout for the 64 lines of the central hub bus. Eight of these, the node select lines, are somewhat fixed, as the hub boards have these hardwired due to a lack of pins on the FPGA. Beyond that, the clock lines are also fixed. That leaves 54 fully universal pins. We segmented these into two busses: a fully universal bus A and a partially-bound bus B. As we need a 32-bit data bus, and the GTL+ drivers used [23] group the lines in eight and can only assign direction per eight, we decided on the following bus layout:

Name     Description     Direction  Width  Bus  Line
DAT      Data Bus        inout      32     A    63-1
SEL      Node Select     out        8      B    47-33
ADR      Address Bus     out        10     B    63-49, 15-13
CLK      Clock           out        1      B    1
RST      Reset           out        1      B    3
WE       Write Enable    out        1      B    5
DT       Delta T         out        1      B    7
ERR_DT   Error Delta T   in         1      B    19

Table 4.1: Bus Layout

The line numbers increment by odd numbers: 1, 3, 5, 7, etc. All even line numbers are ground.
Directions are seen from the central node. All busses have their MSB on the left (little endian).

Chapter 5
Central solver board

Let us begin with the core of the OVNI-NET architecture, the central solver interface. This board plugs into the PCI interface of the host computer and is then linked to the hub backbone bus with two universal 32-bit wide busses. The signaling assignment of the busses is almost entirely defined by the user. (The bus clocks are exceptions and fixed; all other signals are freely assignable within the 8-bit bus granularity.)

5.1 Hardware Architecture

To save on development time and allow for last-minute changes, the PCI card was designed to be relatively universal. In OVNI-NET it is used for both the central solver and the node machines. The only difference is which parts are placed on it. As such, most of the following description will also apply later on. Also, various other hardware modules can be connected to it, such as ADC/DAC converters, if so desired. The entire design is based on an FPGA chip, and almost everything is wired to it. This way we can maximize the design's flexibility, as all peripheral connections are now directly influenced by the chip's programming.

Figure 5.1: PCI Card

5.1.1 PCI

Although specified for both 5V and 3.3V, most personal computers still use 5V signaling. This poses an issue with the Xilinx Spartan IIe FPGA used, which only allows a maximum of 3.3V on its inputs [24]. To adapt the chip to a 5V PCI bus, high-speed bi-directional level shifters are needed. They also have to accommodate the reflected-wave architecture of the PCI bus. A solution [25] was found in the use of bus switches. The particular device [26] used consists of single CMOS FETs (Complementary Metal Oxide Semiconductor Field Effect Transistors). By biasing the gate of these devices 0.7 volts lower than the supply voltage, their outputs saturate at about 3.3V, even when driven by 5V.
Also in reverse, the 3.3V of the FPGA is passed virtually unchanged. Since the Spartan IIe output stages can be configured for compliance with PCI levels, we still have enough drive capability to produce reliable signals on the bus. The nice thing about this simple, elegant solution is that it is fully bi-directional without a control signal, and that it will work with both 3.3V and 5V PCI busses, making the design universally suited for all common PCI bus implementations.

5.1.2 FPGA

To configure the chip at power-up, a dedicated FLASH memory was added to the board. This 2 Mbit device supports the Xilinx serial protocol and configures the FPGA in about one second. Programming this memory, and/or the FPGA, happens in-system through the on-board JTAG chain (Joint Test Action Group: an in-system test and programming port defined in the IEEE Standard 1149.1). Due to the high-speed operation of the chip, and the relative nastiness of the PC power busses, all required voltages are post-regulated on the board and heavily decoupled. The board itself, with its four-layer design, also helps to optimize the power supply to the FPGA. Large quantities of decoupling capacitors were implemented, resulting in a clean power supply capable of handling high impulse loads. This is very important, as the FPGA chip draws over an ampere of current at 2.5V when it starts up and re-configures all its logic cells [24].

5.1.3 Clock

The board runs in two different clock domains, separated and synchronized in the PCI core on the FPGA. One is found around the PCI bus, at 33 MHz. The other covers the rest of the circuit. Although most of that runs at 33 MHz as well, the oscillator is set for 66 MHz, as this frequency is required for higher-speed serial communications and was chosen for future enhancements, as well as to keep everything as universal as possible. To increase the drive capabilities of the oscillator and produce a clean clock, a zero-delay buffer [27] was used. This way we can provide multiple, mutually-insulated clock signals on different traces.
It has to be noted that a lot of care must be taken with these traces. They have to be length- and impedance-controlled [28], with proper termination, to prevent reflections on these transmission lines. In the design presented, termination was not done, and everything worked up to 33 MHz. However, this was the prime reason why operation at 66 MHz did not prove reliable, although the hardware is, in principle, capable of this. The FPGA itself produces the clock for the data bus, because the on-chip delays caused a shift in the clock. A better way of doing this would be to use an external PLL (Phase Locked Loop) clock synthesizer, which could then be set for the correct delay [27, 28]. FPGA-generated clocks are not very clean and have a significant amount of jitter. However, this has not proven to be problematic at this point.

5.1.4 GTL+

The parallel link to the hub has to convert the GTLP levels to the regular 3.3V LVTTL (Low Voltage Transistor-Transistor Logic) used by the FPGA. As these drivers need a control bit to set the direction, the FPGA has to keep track of the bus state in order to avoid multi-drive conflicts. Since this technology is open collector, or more correctly, open drain, a suitable pull-up and bus termination circuit is needed [18]. A full description of this follows later (chapter 6), when we describe the hub bus terminator module. Let us say now that the termination and the threshold voltage for the receivers are provided by a joint regulator, so that small variations in these voltage levels happen with respect to each other, and the bus can continue normal operations even in less than ideal circumstances when they occur [16].

5.2 Software Core Architecture

Beyond the hardware, the software core in the FPGA is another large part of the design.
Written in Verilog using Xilinx WebPack 6.2, care was taken to use as few Xilinx-specific elements as possible, to improve portability to other families of chips, such as those manufactured by Altera.

5.2.1 PCI Core

The design and usage of the PCI core itself is described in detail in its documentation [12, 13]. Due to the implementation as a bridge, it can be treated as a black box that interfaces to the host machine's PCI bus on one side, and the SoC WishBone bus on the other. For programming reference it is only important to know the location of BAR1 (Base Address Register 1). All addresses on the WishBone are with respect to this one. The information can be read from the core's configuration space through BAR0, and the required function calls are mostly provided by the operating system.

Figure 5.2: Δt timer

An interrupt line is also provided, which can be shared with other devices. As such, the device driver must resolve whether the interrupt originates from the OVNI-NET system or somewhere else. This can be done by checking the appropriate flag in the PCI core BAR0 configuration space.

5.2.2 Δt timer

Located on the central node card is the master Δt timer module (Fig. 5.2). For flexibility, it can be programmed in 1 μs increments with a maximum of about 71 minutes (2^32 μs) for each time step. Furthermore, the time step can be dynamically changed at runtime, and single-stepping is supported for debugging.

Architecture

All setup information is loaded into two specific registers through the WishBone interface. One holds the time step length, while the other is for timer control purposes, including start/stop, single-step, and reset. The reset bit is self-clearing in the next clock cycle, so there is no need to manually negate the bit after it has been asserted. This saves one PCI bus cycle and eases programming.
Support for manual clearing is provided to avoid errors if this were to happen. These registers are continuously read by the timer sub-system itself to determine the required action. The timer clock is provided by a scaler module that converts the 33 MHz WishBone clock into a 1 μs base Δt rate for the timer sub-system. By comparing the requested interval with a counter clocked by the timer base rate, the appropriate moment to assert the output is determined. This pulse lasts exactly one period of the 33 MHz WishBone bus clock. Additional logic takes care of the timer control.

Config space

Register Name                   Address  Bit   Function
TCR (Timer Control Register)    0x0      0     Timer stopped when set
                                         1     Single pulse when set
PCR (Period Control Register)   0x4      31-0  32-bit Δt timer period in μs

All registers are 32 bit; undefined bits are not used. Addresses given as required for C device driver code.

Table 5.1: Central Node timer config space

5.2.3 WishBone Bus link

The formatting and distribution of the data between the WishBone bus and the OVNI-NET shared memory hub happens in this module. By using our Dynamic Discontinuous Address Translation scheme (described below), the communications are optimized. It is here that the FPGA approach shines again, as the required operations are inferred online from the data as it passes along. Because everything is implemented in hardware, and thus has the ability to run fully parallel, we only had to introduce a single delay in the data flow to synchronize it with the control system, at a total penalty of only about 30 ns.

Architecture

The communications core is split into three major sections (Fig. 5.3). The first one is the configuration register unit. In our implementation it performs two functions. One is the reset control of the TX and RX units; the other is storing the total number of nodes in the cluster.
This number is used by both the transmit and receive units to determine how many data segments to expect, and as such properly synchronize their protocol generators. The reset control re-initializes the protocol generators to their initial state and is self-clearing. Let us next inspect the receiver module. This one uses the read requests from the WishBone interface and converts them to the DDAT (Dynamic Discontinuous Address Translation) format of the hub bus. For this, it generates the module select and bus address signals based on the replied data from the respective hub memory unit. It is of great importance that the hub memory is sequentially read out, starting from the first memory module upward. This is the normal operational modus required for OVNI-NET, and as such it does not pose an issue. It has to be noted that, due to the way the WishBone timing works, the receiver DDAT generator also requires a timing offset in the data flow, and as such also has to operate with the timeshift delay required in the transmit unit.

Figure 5.3: Data Formatter

The opposite happens in the transmit unit, again with the same limitation of sequential access. Due to the timing on the WishBone bus, we have to delay the data bus to leave enough time for the protocol generator to calculate the required address and module select signals based on that data. This takes one bus cycle. From the perspective of the host computer, each direction of data flow is accessed through a different memory segment through mmap (memory mapping) operations. Here we could optimize the design by implementing a DMA controller, which would fully relieve the system CPU of moving the data from memory to the PCI card. However, due to time constraints this is left for future development.

Dynamic Discontinuous Address Translation

The underlying architecture of OVNI-NET, the shared memory concept, invites a fully memory-mapped architecture.
However, for efficiency, and from the desire to dynamically allocate the data sizes required per node by inference, a special addressing scheme had to be devised. The situation is as follows: the central solver has a block of memory with node data. To optimize the operational speed, all node data blocks have to be lumped together in one continuous chunk.

Figure 5.4: Dynamic Discontinuous Address Translation

Each block has its length written in the first word, and the board knows the number of nodes (the only parameter that has to be actively configured). Now the central machine needs to map all of these node data blocks onto the correct solver nodes. To speed things up even further, by making the data blocks dynamic we do not have to transmit the full data size (1K words) if this is not required, and thus save communication time and bandwidth. For maximum efficiency, we need to do this with the least processor intervention possible. This means that we will have to communicate this data continuously, without further processing (e.g., lookup-table-based translation), to the shared memory. This allows us to use block operations on the PCI bus, which save time, and also enables DMA techniques, when implemented. This introduces the problem that the network hardware itself has to determine where exactly it has to map parts of this data stream into its shared memory. Each node card (which has two solver node shared buffers) is address-mapped in 1K word chunks. Thus, while operating on these memory segments, the system has to jump to the next block when all data for the previous one has been delivered. A special addressing scheme is thus needed to handle these dynamic and discontinuous operations. In our case we can make one simplification, however, and that is that all operations will always be sequential, never random access.
Although this changes little on the principle, it does make the implementation easier. By capturing the logic in state machines instead of programs, this dynamic allocation comes at virtually no extra overhead cost, while we can keep all communications optimized during the operation itself and never have to actively set parameters in the boards (except for one, the total number of nodes; although this too could be inferred from the data, it never changes at runtime in MATE, so we chose not to), saving the overhead these individual PCI bus operations would incur. The only data formatting, and from that, protocol convention in the system finds its origin here. To infer the data length, the very first word of a data segment has to be the length of that segment. This word is read by the hardware and used to configure all the buffers and address translation logic. As this happens at runtime, while the data is passing through the system, the cost of this convenience is at most one bus cycle. At 33 MHz this translates to a mere 30 ns. Given the tremendous savings on transmission time and making the system self-configuring, this small extra delay is well justified.

Config space

Register Name          Address   Bit  Function
NN (Number of Nodes)   0x80000   7-0  Total number of solver computer nodes in the system
CTL (Control)          0x80004   0    Reset link formatters when set
RX                     0x100000  N/A  Receiver data formatter from shared memory access
TX                     0x180000  N/A  Transmitter data formatter to shared memory access

All registers are 32 bit; undefined bits are not used. Addresses given as required for C device driver code.

Table 5.2: Central Solver links config space

Chapter 6
Hub bus terminator

With high-speed digital busses we find ourselves in the realm of radio-frequency engineering. Due to the high frequencies, wires become transmission lines, and issues such as reflection, termination, and impedance start to play a major role in the design.
6.1 Hardware Architecture

In order to allow all the node link cards to communicate, the hub requires some form of interconnect. Due to the high data speed and number of nodes in the system, a short, wide backplane bus has practical advantages. The location where the bus resides, the shielded, well-controlled environment of a metal rack enclosure, and the limited length (19" rack @ 48.26 cm) do away with many of the usual problems, such as noise sensitivity, even though the bus used is not differential. Also, the relative ease of use, avoiding complex serialization/deserialization with its necessary control logic and resulting latency, tipped the balance in favor of a parallel backplane. Serial links would still need an interconnect, and that one would have to be capable of handling microwave signals, while the bus used here runs at only 33 MHz. The disadvantage is again the multitude of contacts. The bus employed uses 128 connections, half of which are ground. Backplanes have matured over the last decades, and the now-available specialized connectors, such as the impedance-controlled AMP Z-Pack, are very reliable. Although we used simple header connectors in the prototype for budgetary reasons, a commercial/industrial implementation surely has to use more advanced connections and a real multi-layer controlled-impedance PCB (Printed Circuit Board) backplane. For our prototype, 90 Ω flat cable was used with success. When choosing a backplane, the next question is what signaling technology to use. The relative length and quite high signaling rate required here immediately discredit anything but the more advanced technologies. Indeed, driving a long, thus high-capacitance, bus at high speeds with normal logic requires powerful drivers and high currents. The problem has been recognized by the manufacturers, and they have responded with a multitude of novel approaches, all sharing a significant reduction of the bus voltages.
One of the more novel concepts, as used in the PCI² bus, is reflected-wave switching. Although effective and capable of working at high speed on relatively long buses, it is not that well suited here due to its inherent latency: one has to wait for the wave to bounce back from the unterminated end of the bus before switching occurs. Then there are a whole host of incident-wave switching technologies, almost all operating at voltages lower than 2 V. After much consideration two options were left: LVDS [18] and GTL+ [16]. The latter is an enhanced version of GTL³ designed by Texas Instruments, and can be made backward compatible with a number of older standards if so required.

While LVDS is fast and robust due to its differential and current-driven nature, it uses twice the number of wires of a single-ended technology, and needs matching networks in addition to the bus termination. These networks exist commercially in compact packages and can also be easily constructed from discrete parts. Still, bus LVDS seems something of an afterthought by the manufacturers, and the signals degrade rapidly with the number of plugged-in extension boards. Typically, up to twenty stubs (connections to the bus) can be handled [19, 20], which would be sufficient for our application.

A better technology for buses is GTL+⁴. This was specifically designed to drive backplanes and is more robust for our application. Like LVDS, it uses a terminated bus and incident-wave switching, but it is voltage based instead of current based. There are, however, some limitations on the impedance it can drive. Texas Instruments sells the devices in three categories [14]: one capable of driving 25 mA, one 50 mA, and one 100 mA. The part we used is a medium-drive device of 50 mA.

6.1.1 Finding the bus impedance

First, it is not practical to define one perfect impedance, as this depends on the number of cards inserted in the bus [16].
However, we can choose a reasonable value that will work in a wide range of situations. To obtain this impedance we first choose a backplane. The cable used here has an impedance of about 90 Ω. We feed the bus from one of the plug-in cards (Fig. 6.1), which sees two lengths of bus, each at 90 Ω.

²Peripheral Component Interconnect
³Gunning Transceiver Logic
⁴also known as GTLP

[Figure 6.1: Tapped Backplane Bus]

[Figure 6.2: GTL+ Bus Terminator]

This means that the card sees only half the bus impedance, 45 Ω, as the two sections appear in parallel. The situation becomes even worse with loading (adding more cards). In practice, we end up with a range of impedances from Z/2 to Z/3, or from about 45 Ω down to 30 Ω. The medium-drive chip used can handle loads down to 25 Ω, so this is still acceptable. For the end termination of the bus we have to use a resistor equal to the bus impedance, here compromised to the 91 Ω standard value. This is a bit on the low side, as the flat cable used as backplane is specified between 90 Ω and 100 Ω. Due to its unshielded nature, objects in the vicinity will influence the impedance (it drops when the cable lies closer to a ground plane).

6.1.2 Implementation

The GTL+ standard centers the bus around 1 V for the undefined state. Due to the open-drain outputs of the drivers, they can pull this voltage down to near ground for a logic zero. For a logic one, the bus itself requires active termination to the logic-one voltage level of 1.5 V. It is due to this topology that each module can enter high impedance, or tri-state, and the bus can be shared. The termination itself (Fig. 6.2) has to be seen as an AC termination. In the AC equivalent, all the bypass capacitors and the source⁵ appear as shorts to ground, thus leaving only the termination resistors visible to the higher-frequency signals of the bus. In this design we consistently chose point-of-load regulation.
Because we are using low voltages throughout the circuit with fairly high peak current demands, long traces would introduce far too much voltage drop (due to resistance and inductance) and complicate effective voltage regulation. In this part, too, the 1.5 V regulation is placed on the termination board itself and uses copper planes to connect to the pull-up resistors. These planes double as a low-value high-frequency bypass capacitor, a desired side effect.

To find the proper value of the bypass capacitors, we have to find how much charge is needed to preserve the supply voltage during switching:

Q = C V    (6.1)

Where:
Q = Charge
V = Applied Voltage
C = Capacitance

Differentiating this equation yields:

I = C dV/dt    (6.2)

C = I Δt / ΔV    (6.3)

From this we can compute a suitable value for the decoupling capacitor for one line of the bus as follows. The current supplied on the bus is 50 mA with our medium-drive outputs. For convenience we group the lines per four, thus requiring 200 mA. The rise/fall time of a switching event is on the order of 2 ns. We propose a maximum voltage droop of 1%, which for the 1.5 V bus results in a maximum droop of 0.015 V, or 15 mV.

C = (200 mA × 2 ns) / 15 mV = 26.667 nF

⁵The source can be considered near-ideal for lower frequencies as long as the internal feedback loop of the voltage regulator is working properly.

This is the minimum allowable capacitance. We used standard bypass capacitors of 82 nF. The choice of parts is also important: due to the high operating frequency, we want to use the smallest possible casing to reduce inductance. The 0204 form factor parts used are about the smallest still usable for manual placement. The dielectric is also of high importance, which for high-speed digital applications results in the use of ceramic devices [27], typically employing X7R or Y5V dielectrics.
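The two sizing results of this chapter, the loaded-bus impedance range of section 6.1.1 and the minimum bypass capacitance of equation 6.3, can be checked with a short calculation. This is only an illustrative sketch; the function names are ours:

```c
#include <assert.h>

/* Effective impedance seen by a card feeding the bus mid-span:
   two sections of characteristic impedance z0 in parallel (Z/2). */
double bus_impedance_seen(double z0) { return z0 / 2.0; }

/* Worst-case loaded impedance quoted in the text: roughly Z/3. */
double bus_impedance_loaded(double z0) { return z0 / 3.0; }

/* Minimum bypass capacitance from eq. 6.3: C = I * dt / dV. */
double min_bypass_cap(double i_amp, double dt_s, double dv_volt)
{
    return i_amp * dt_s / dv_volt;
}
```

For the 90 Ω flat cable this gives 45 Ω unloaded and 30 Ω fully loaded, both within the 25 Ω capability of the medium-drive parts; the capacitor case (200 mA, 2 ns, 15 mV droop) returns roughly 26.7 nF, comfortably under the 82 nF parts fitted.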
Chapter 7

Hub Bus Link

7.1 Hardware Architecture

In order to connect the central solver to the system, there were a number of possibilities. However, since the hub's bus was already parallel and this particular computer sits very close to it, we opted for a parallel link to this bus. Due to the short length of cable required, this would give us the best possible latency figures while easing implementation. This will be even more true when the computer is physically integrated with the bus. To reduce the delays even further, we opted to place the link at the physical center of the bus. This way we halve the transmission propagation times from the central node to the furthest solver nodes on the bus.

However, one cannot just connect a relatively long piece of wire to the center of a high-speed bus. A tap on a transmission-line bus, called a stub, has to be as short as possible so as not to disturb the continuity of the bus and cause excessive reflections. Thus we have to isolate the central node connection bus from the hub bus. This is accomplished by a set of buffers. Since there are no GTL+ to GTL+ buffers, we have to first convert to LVTTL¹ and then back to GTL+.

An added difficulty is the bi-directional and re-programmable nature of the bus. Thus a certain amount of intelligence is required on this link card to control the direction of the buffers. For this reason a small CPLD², a programmable logic device capable of simple state machines and general logic, was integrated on the board. It can sense the bus control lines and determine the required transmission state depending upon the logic programmed inside the chip. In the case of OVNI, this will set the state as requested by the central solver and propagate it to the hub bus. The whole card thus functions as a bus repeater.

¹Low Voltage Transistor Transistor Logic
²Complex Programmable Logic Device

[Figure 7.1: Bus Link]
However, a much different behavior could be implemented at will.

7.1.1 Implementation

In Figure 7.1, bus B is the control and miscellaneous bus, and bus A is the bi-directional data bus³. The GTL+/LVTTL drivers used require an external reference voltage. By setting this voltage, together with the pull-up voltage on the terminator cards, we can change the interface technology standard, for example for backward compatibility with older GTL parts or other low-voltage incident-wave technologies. Here we wish to operate in the more advanced GTL+ mode, and thus need a pull-up voltage of 1.5 V and a reference voltage of 1 V [16]. This reference is obtained from a separate voltage regulator, for stability and supply-noise rejection, followed by a simple voltage divider. The current requirements are very benign and are satisfied by the divider with a decoupling capacitor at each chip. This regulator also supplies the power to the on-board termination for the central node side.

The CPLD senses a number of bus B signals to determine the desired operation of the bus. The bus B level shifters (U$2 and U$6 in the schematic) have to be correctly set at start-up for this³ to work. In our case, these are hard-coded in the CPLD as one 8-bit out and one 8-bit in bus for the control signals. Once these are configured, the CPLD can sense all relevant signals, with some extra for future expansion, to determine the direction state of the other sub-busses. An added coaxial connection can be used for an external clock or as a high-speed measurement point to aid in debugging the system. The bus arbitration CPLD has access to the bus clock, both for transmit and receive, for synchronous operation if so desired. The CPLD and FPGAs have full control over the tri-state (high-impedance) mode of the bus drivers, so that an inactive card does not place any level on the bus.

³At least in the current OVNI-NET concept, but this can be re-programmed as desired.
The bus pull-up at the terminator will automatically keep everything in a defined state (high), even when a bus is fully inactive and all connected boards are tri-stated. By using a transition-sensitive, instead of a level-sensitive, clock, a low-to-high transition would then signify a clock event. Keeping the selected device off the bus (thus in tri-state) until all signals have been set up assures reliable operation.

The bus link card itself plugs into the hub bus at the center position and is connected to the central solver interface board with two short flat cables. For this reason it is best to combine the hub and the solver PC in one enclosure. This will improve reliability and keep every connection as short as possible. Another advantage is that the hub can now share the PC's power supply. The whole power system of the hub was designed with always at least one LDO⁴ regulator in series with an external supply voltage to improve power purity. In this way, sharing with a rather noisy PC supply (noisy due to the main board requirements) poses no problems. A boon is that modern computer mainboards can monitor the supply voltages, so a watchdog program could keep track of fluctuations and warn the operator when the supply quality becomes too low to assure reliable operation. This can help prevent brown-out-induced data corruption that may result in simulation errors.

7.2 Software Core Architecture

For this board the code is rather trivial. In OVNI-NET there is only one bus that dynamically changes its direction, and that is the data bus (BUS A). As such, we can permanently program all other busses in the desired direction, while we only need to sense the WE⁵ signal to determine the direction of the data bus. This requires only a simple inverter.
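Since only the data bus changes direction at runtime and the decision follows WE directly, the whole "program" reduces to one inverter. A toy C model of that behavior follows; the signal polarity (active-low WE) and all names are our assumptions, chosen for illustration only:

```c
#include <assert.h>
#include <stdbool.h>

/* Direction of the bi-directional data bus (BUS A).
   Assumed polarity: WE asserted (low) means the central solver is
   writing, so the repeater drives towards the hub bus; otherwise
   it drives towards the central solver. */
typedef enum { TO_HUB, TO_SOLVER } bus_dir_t;

bus_dir_t bus_a_direction(bool we_n)
{
    return we_n ? TO_SOLVER : TO_HUB;  /* the "simple inverter" */
}
```

In the CPLD this is literally one gate between the sensed WE line and the direction pins of the bus A level shifters.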
We hardly make use of the flexibility of the bus design, but this flexibility proved very handy during testing, where we would modify the bus behavior and sense/provide the most important control signals through the high-speed coaxial test port.

⁴Low Drop Out
⁵Write Enable

Chapter 8

Hub Node Link Card

These cards contain the actual hub logic. They handle the serial communication to the solver nodes and make up the shared memory system. Each module supports up to two nodes. The other end of this card plugs into the hub bus, which combines all of them together and places everything under control of the central solver machine through the bus link interface card.

8.1 Hardware Architecture

One of the main issues in this design is the clock. FPGAs are known to produce quite a bit of jitter. Given the strict requirements of the serial link modules, and the fact that several devices have to be supplied by the various clock signals, specialized buffers were used. These chips, Texas Instruments CDCVF2505 [29], are so-called zero-delay buffers [27]: they correct for their internal phase delay with a built-in PLL. By running separate, individually buffered clock lines, the signal quality improves quite a bit, as the loads are now isolated from each other, resulting in steeper clock transitions and thus less jitter. The individual clock traces can now be made to the required length without taps along the way, reducing reflections on the trace. The PLL of the clock buffer also provides a certain amount of filtering, as it requires a small amount of time to react to changes.

The shared hub bus is designed as a classic memory-mapped expansion bus. On our card (Fig. 8.1), the decoding of the address happens in a separate chip, a Xilinx XC9572XL CPLD. This is because we ran out of lines on the FPGA and wanted the ability to set the address in the field.
Using two hex switches per channel, the current design can resolve up to 254 addresses¹, and thus the same number of nodes. If expansion were required, it is possible by adding an address line to the address bus and some extra switches. However, with the current (and near-future) state of the art, 254 machines are more than the architecture can handle in real time, and this should be sufficient for the simulator for some time to come. The program in the CPLD resolves both the selected address and the broadcast address. The broadcast address selects all modules at once, and could be used to distribute the Δt and other control signals, such as an indication of whether all modules are ready for communication. This is important to verify whether the cluster is effectively running in hard real time and which solver nodes are being overloaded.

¹The addresses 0 and 255 are reserved.

[Figure 8.1: Hub Node Link Card]

Again the power supply required special consideration. Besides the usual high-speed logic requirements, the FPGA start-up demands a 700 mA surge [24]. To provide this energy, large buffer capacitances were incorporated. To aid the high-frequency response of the power supply system, large power planes were used, resulting in a four-layer PCB. An advantage of using multiple layers is that we can have signal traces on both sides of the board, and due to the internal power layers these signals are shielded from each other. This helps maintain a constant impedance along the traces, improving signal quality and reducing interference. To assure that the state machines in the FPGA and CPLD initialize properly at power-up, a separate reset chip was added. Also, four LEDs are available to give status information and aid in troubleshooting the device.

8.2 Software Core Architecture

Due to the modular nature of the hub, each board is merely a part of the whole.
The mutual integration happens through a memory-mapped architecture, where each board owns an address space. Because of this, we have to replicate some of the required functionality between the boards. On the other hand, this also allows each module to function independently, making the whole design more fault-tolerant, as the only things shared are the hub backbone busses. The FPGA used has enough resources to implement two communication channels per hub module (Fig. 8.2), making the whole design more cost-effective.

[Figure 8.2: Hub Module]

[Figure 8.3: Serial Protocol — fields SB0 (16 bits), SB1 (16 bits), SOD (4 bits), STAT (12 bits), DLEN (16 bits), DATA (DLEN × 2 × 16 bits)]

8.2.1 Serial Links

The interface with the solver nodes happens through high-speed serial links. This module handles the serial protocol and relays the data into the shared memory segment appropriate for the particular solver link. The serial receiver sub-unit extracts relevant control, status, and data words from the serial data stream, which are used to manage the data flow in the hub module. Conversely, the serial transmit unit integrates the control, status, and data information into a single stream. The built-in address generators assure that the serial interfaces can interface directly with the shared dual-port memory units.

Protocol

Due to our low-latency requirements, the serial protocol was designed to be light-weight and expandable. In its current implementation (Fig. 8.3), no error control was introduced. If desired, this function can easily be added and executed at hardware speed. The spare bits in the protocol can be used to flag different versions or add other functions. We first have two start words that contain only zeros. These are used to prompt the receiver state machine that data could be coming. A string of zeros can also indicate the rest state.
For the receiver to start, it requires a sequence of two zero words, followed by a word with the pattern 1010 as the four MSB. This is the start-of-data (SOD) sequence. Since the granularity of the serial link is 16 bits, we re-use the remaining 12 bits of this word to convey link and solver status information. This can, for example, contain the Δt and error flags. After the start and status words, we have a word whose 10 LSB contain the number of data words in this packet. A length of zero is allowed. The six MSB of this word are currently unused, as the maximum data size is 1024 32-bit words in the current implementation. These bits are free for other functions or future expansion and are otherwise ignored by the system. Then follows the actual data payload, the length of which is given by DLEN × 2 × 16 bits. This is because DLEN specifies the number of 32-bit words and the links' serdes² chips transfer the information 16 bits at a time.

[Figure 8.4: Serial Links Unit]

Architecture

Due to the serial nature of the data flow, state machines were used to implement the serial link data generators. They have individual ports for the DLEN and STAT information, as well as a memory-compatible bus that can read/write directly to a normal memory unit (Fig. 8.4). The reset flag from the hub bus is used to reset the state machines at boot time. Otherwise, the receiver is free-running, using sequence detection to synchronize its operations, while the transmit unit detects the moment all data has arrived in the shared dual-port memory before starting the transfer.

The way OVNI and MATE work, the communications exhibit long dead times while the nodes compute. These result in long zero sequences, which help keep the state machines sane. Normally, long zero sequences would cause the PLL units in the serdes chips to lose lock, giving rise to the common use of pseudo-random sequence scramblers or Manchester encoding.
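The framing just described can be captured in a few lines of C. This is only an illustrative model (names and helpers are ours; the real implementation is a state machine in the FPGA): two all-zero start words, an SOD word carrying 1010 in its four MSB plus twelve status bits, a length word with DLEN in its ten LSB, followed by DLEN × 2 sixteen-bit payload halves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SOD_PATTERN 0xA000u   /* 1010 in the four MSB */
#define STATUS_MASK 0x0FFFu   /* 12 link/solver status bits */
#define DLEN_MASK   0x03FFu   /* 10-bit count of 32-bit words */

/* Emit the four-word frame header into out[]; the DLEN * 2
   16-bit payload halves would follow. Returns the header length. */
size_t emit_header(uint16_t *out, uint16_t status, uint16_t dlen)
{
    out[0] = 0x0000;                                /* start word 1 */
    out[1] = 0x0000;                                /* start word 2 */
    out[2] = SOD_PATTERN | (status & STATUS_MASK);  /* SOD + status */
    out[3] = dlen & DLEN_MASK;                      /* word count   */
    return 4;
}

/* Number of 16-bit transfers needed for the payload. */
size_t payload_halfwords(uint16_t dlen) { return (size_t)dlen * 2; }
```

A zero-length packet is legal and consists of just these four header words. Note that the idle zero runs between packets carry no transitions of their own, which is what normally forces scrambling or Manchester coding onto such links.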
However, the chips we used [17, 21] have these provisions already built in and emit synchronization transitions in between our data bits. The whole process is transparent to the user. This was one of the main reasons these parts were selected, as it simplifies their use significantly.

²Serializer/Deserializer

8.2.2 Dual-port memory

Key to the design of the shared memory is a dual-port memory instantiation in the FPGA. This fast block memory [24] can operate synchronously with the rest of our system, thereby avoiding the need for wait states and access multiplexing. This results in very high performance. Furthermore, due to the dual-port access, we have some room for further optimizations (such as starting the data transfer after only part of the data has come in), which can save a couple of microseconds on the total communication time. More importantly, the dual-port architecture significantly simplifies the implementation of the serial link receivers. These run in a different clock domain, and the memory is used to synchronize them with the rest of the system. This is possible because both DP³ memory ports allow independent access. It has to be noted that one port is both read and write, while the other port can only be read.

With our chosen FPGA, there are enough DP block memory resources on the chip to implement a 512-word transmit and a 512-word receive buffer for each of the two channels on the hub card. This is a limitation of our design. However, it is possible to optimize the implementation, making use of the fact that MATE only requires half-duplex communications, and combine these two memory blocks into a single 1024-word-deep buffer. We could also use a larger FPGA with more memory. In the current implementation we used the individual transmit and receive memories, therefore limiting the theoretical data size of 1024 words per node (170 links) to 512 words (85 links).
While this is less than what the architecture can handle, it is still more than enough, as a typical simulation uses far fewer links, usually less than ten. Compared to the previous version of the network system in OVNI-RTDNS [8], this is a significant improvement, as that system only supported a maximum of two to four links per computational node.

³Dual Port

Chapter 9

Serial Transceiver

To link the many nodes to the hub, there are a number of physical issues that result from the tight constraints on the communication times and quality. The main problem is the speed of light itself, in that it is too slow. Indeed, in this system the propagation delay through the communication channels is already significant. Unfortunately, this constraint cannot be avoided with commercial electronics and physics.

Another issue is the desired high bandwidth. The easiest solution is a parallel bus. There are, however, two issues with that. Let us assume a reasonably-sized cluster. Thirty machines take a certain amount of physical space, and connecting them requires cable. With some clever stacking, two meters of cable will do. This, however, is already long, and sending signals over it quickly becomes unreliable. Thus we have to keep the actual signaling rate relatively low. At 1.06 Gbps this requires a 32-bit bus (bus speed ~33 MHz). To improve the noise immunity and prevent coupling from adjacent signal lines, we either have to use differential signaling or alternate signal and ground lines. Such a measure is required because computers are noisy, or could be installed in a noisy environment, and we want to avoid transmission errors as much as possible: re-transmission takes too much time in hard real time, and error-correction mechanisms take time and, depending upon the implementation, sometimes come at the cost of accuracy. In either case we end up with at least 64 wires for a reliable system. This is a lot.
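The wire-count argument is simple arithmetic and can be sketched as follows. The function names are ours, and the 33.125 MHz figure is our choice so that the product lands exactly on the quoted 1.06 Gbps (the text rounds the clock to ~33 MHz):

```c
#include <assert.h>

/* Aggregate throughput of a parallel bus: width times clock. */
double bus_throughput_bps(int width_bits, double clock_hz)
{
    return (double)width_bits * clock_hz;
}

/* Conductor count when every signal line is paired with a ground
   return (or a differential partner): the cable doubles in size. */
int wire_count(int data_bits)
{
    return data_bits * 2;
}
```

Thirty-two data lines at roughly 33 MHz reach the 1.06 Gbps target, and interleaving a ground per signal doubles that to the 64 conductors mentioned above.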
Those cables are not only expensive and bulky, but also prone to intermittent contacts with aging, due to the multitude of them required to build the cluster. Using more advanced signaling technologies, such as GTLP, the bus speed can be boosted on fewer wires, but it is still limited in distance. When the system becomes larger, the situation becomes worse. Parallel buses are thus not very well suited for our application.

A far more desirable solution is the use of serial communications, either over cable or fiber optics. Here we chose cable for budgetary reasons, but the transceiver is modular, so it can easily be replaced by a laser module if so required.

[Figure 9.1: Serial Transceiver]

In this project a fully-integrated serializer/deserializer was employed. The part, a DS92LV16 [17, 21] by National Semiconductor, takes in our data 16 bits at a time at 66 MHz and renders it into a 1.06 Gbps LVDS (Low Voltage Differential Signaling) [18] bit stream with embedded clock. The other half of the chip is a receiver that not only extracts the data but also recovers the clock signal from the received data stream, using a local oscillator as reference. A number of test and synchronization modes are built in, making it a convenient, reliable solution. We are running the chip at a lower speed than it is capable of (80 MHz, 1.25 Gbps) to allow for component aging and to leave more headroom to improve the noise immunity. Due to an issue with the clock distribution and termination in the current version of the design, we had to reduce the operational speed to 528 Mbps, using a 33 MHz clock signal. This reduction is only temporary, and in the next iteration of the design the issue will be addressed.

9.1 Hardware Architecture

The circuit consists of two major parts (Fig. 9.1). One is the chip, which converts the parallel data into a high-speed serial stream, and a high-speed serial stream back into parallel data.
This is done in full duplex, allowing simultaneous transmission and reception of data. The chip also has built-in PLLs and clock-extraction circuits. These are vital to ensure tight synchronization to the received data stream and to produce the internal ~0.5 GHz serialization clock signal. The second section is a differential impedance-matching network. This ensures a good match of the chip to the twinaxial transmission lines used, while helping improve the noise immunity. It also provides separation of the ground planes connected to the cable shields, to prevent ground loops. Although small, a design operating at these high signaling rates requires special attention. Many provisions were made to ensure reliable operation, even under adverse conditions.

9.1.1 Power

Let us begin with the power supply. Due to the extremely high switching speed within the chip and the co-existence of both analogue and digital circuits on one piece of silicon, power purity is not a simple matter. First of all, the board was given its own local voltage regulator. Due to the supply line inductances, it is desirable to place the regulation as close to the point of load as possible. In order to avoid the use of a multi-layer board, the power plane only extends under the chip itself and is fed with three short separate wires from the regulator. This multiple-path parallel feeding reduces the path inductance, provided the mutual coupling between the wires is kept at a minimum. This is done by routing them at different angles and spacing them as far as possible from each other. Next on the list is a solid ground, here accomplished by a ground plane on the underside of the board. To ensure low-inductance connections to the chip, many vias in very close proximity to the pins were used. Finally, copious decoupling capacitors, as recommended by the manufacturer, were employed. We used pairs of high-quality ceramic 0204-sized surface-mount parts.
These tiny devices have a very low self-inductance and are small enough to fit twenty-four of them directly underneath the chip. This allowed us to keep the connections as short as possible, further improving the high-frequency performance. Care was taken to place all of them directly on the small power plane under the chip, to help further reduce the negative effects of the wires feeding this plane.

9.1.2 Matching

The very high-frequency serial data stream has to be treated as any RF signal, and transmission-line theory has to be used. Indeed, at a ~0.5 GHz fundamental, this differential LVDS signal has more in common with microwave signals than with normal digital signals. First we had to select a suitable cable to run it over. Unfortunately, due to budgetary and practical reasons, the available options were greatly reduced. A good compromise was found in a high-quality 150 Ω twinaxial cable manufactured by Belden (part nr. 9182). This cable sports a reasonable attenuation for its size, good shielding, and tight control over the actual length of the signal pair. This pair has to be well matched in length because otherwise, due to propagation time differences, the differential signal phase shifts, and this causes signal degradation. The twinax used, with its signal wires twisted and double shielding, is rugged and has very good noise-rejection characteristics, making it suitable for a production environment.

In order to match the cable, we have to determine the output impedance of the transmitter in the chip. This is not given by the manufacturer because it changes as a function of the operating point of the driver. However, we can estimate it from the I/V characteristic [27]. The required information can be obtained from the IBIS simulation files available for the part. By plotting the curve and determining the slope, we can determine the dynamic impedance of the driver output stages.
This proved to be close to 8 Ω single-ended, a typical value for this type of circuit. To match it, we simply put a 68 Ω resistor in series. This brings the driver impedance close to the required 75 Ω single-ended, or 150 Ω balanced, as required for this cable. Again, this is a compromise: we chose a better match over higher signal strength. Since we value reliability, the compromise can be justified.

The receiver is easy to match. LVDS is in essence a current-signaling technology; we only have to terminate the transmission line in its characteristic impedance. The voltage dropped over the termination resistor is measured by the high-impedance input of the chip. Here, a small-size 0204 150 Ω surface-mount resistor was used. The choice of 0204 resistors has to do with the desire to have as little impedance discontinuity as possible in the connections. For this reason all tracks should be 75 Ω microstrip, and the components are chosen to have about the same width so as not to disturb the impedance too much. However, due to the thickness of the board, we could not use 75 Ω tracks (had we used a four-layer board, the dielectric would have been thinner and the required width possible). A compromise was reached by using short, thinner traces while still minimizing impedance jumps; hence the use of the 0204 package devices, since these are about the smallest readily available. Still, at only one centimeter length, the traces proved adequate for the purpose. When making a production design, though, it is wise to take extra care and, if possible, use a proper trace width. This will improve the match.

9.1.3 CAT5 cable alternative

The use of this device proved highly effective. During tests, an available length of unshielded CAT5 twisted-pair cable was tried. The three-meter-long piece still allowed reliable communication. This is even more impressive considering that CAT5 has a characteristic impedance of 100 Ω while our design is terminated in 150 Ω.
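The matching arithmetic of this section can be tied together in a short sketch (function names ours): the series match adds the 68 Ω standard-value resistor to the ~8 Ω driver, and a reflection coefficient quantifies the CAT5 experiment, where the 100 Ω line meets the 150 Ω termination.

```c
#include <assert.h>

/* Series (back-termination) matching: driver impedance plus the
   series resistor should approach the line impedance. */
double matched_source_ohm(double z_driver, double r_series)
{
    return z_driver + r_series;
}

/* Voltage reflection coefficient for a load z_load terminating a
   line of characteristic impedance z0. */
double reflection_coeff(double z_load, double z0)
{
    return (z_load - z0) / (z_load + z0);
}
```

The 76 Ω sum sits close to the 75 Ω single-ended target, and the CAT5 case reflects about 20% of the incident voltage; that such a mismatch still yielded a working three-meter link speaks to the robustness of LVDS, though a proper termination remains preferable.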
However, to use this type of cable in practice, we would strongly suggest modifying the termination to match, and using CAT6 cable instead, as it has far better performance, although still inferior to twinax due to higher losses and lower mechanical quality.

9.1.4 Notes

The use of a separate daughter board has a number of reasons. First, it allows different types of wiring to be used, depending upon circumstances, e.g., parallel, serial over copper, or serial over fiber. Since all interface boards can accept up to two units, it also allows specialized modules to be designed, for example to connect to other types of equipment. This allows the system to be used in a wide variety of set-ups without creating vendor lock-in.

Also on vendor lock-in: the serializer/deserializer used is proprietary to National Semiconductor, and at the time of writing there are no second-source suppliers for the part. Although National informed us regarding the future availability of the part, this is not a comfortable situation, due to the highly specialized nature of the component. Hence the separate board: if availability becomes an issue, a new plug-in board can be designed to replace the current one without having to re-design the entire interface card. This actually applies to all components used, which is another motivator behind the rigorous modularity of the design.

Using serial links in a low-latency environment brings up another issue. They use PLLs, which require a certain tracking and lock-in time before the communication channel becomes available. The lock time for this chip is 1 µs, and thus significant. For this reason we used a point-to-point approach where the links are always up. After lock is achieved (about 1 µs at start-up), the delay in the link chips themselves (without cable) is at most four 16-bit bus cycles, 0.12 µs. Clock purity is of utmost importance due to the high clock multiplication used in the chips.
Hence, good-quality crystal oscillators suitable for communications have to be used. As a consequence, the complex clock distribution on the mother board requires clock buffers and other measures to preserve the low-jitter properties.

Chapter 10 Node solver board

The last part in the system is the board that connects the solver nodes to the shared-memory hub using serial links. It is again a PCI card, similar to the one on the central solver, that now provides connection to two serial link units.

10.1 Hardware architecture

Since this unit essentially uses exactly the same board as the central solver, we refer to chapter 5 for the description and will only outline the differences here.

10.1.1 Serial transceiver interface

The board has two connectors, each supporting one serdes unit. Since these connectors share the same FPGA pins as the GTLP level shifter, these hub bus connectivity chips should be omitted, together with the hub bus connectors, when assembling the board, to prevent signal conflicts during operation. Otherwise the physical hardware is identical.

10.2 Software Core Architecture

Although the PCI core and the serial link formatter units are identical to those used in, respectively, the central solver board of chapter 5 and the hub node link card of chapter 8, there are a few differences in the operation of the unit.

Figure 10.1: Node Solver Core

10.2.1 Dual-port memory

The serial link units require the data to be sent and received continuously. However, the PCI bus does not guarantee constant data flow [30]. As such, it proved necessary to buffer all data for both transmit and receive in a dual-port memory bank from which the serial data can be consecutively streamed (Fig. 10.1). These memories are implemented as two 1024-word-deep dual-port block RAM units in the FPGA.

10.2.2 Serial links

These are essentially the same as in chapter 8.
There is now a set of configuration registers involved that needs to be set manually. The first is a self-clearing link reset flag; setting it resets the state machines in the link data formatters. It is generally a good idea to perform this reset before starting a simulation. The second word is special: it contains both local and remote status information. Of the thirty-two bits, the sixteen LSBs contain local status information, and of the upper sixteen, the twelve lowest contain remote status information. Using these bits we can evaluate not only the status of the serial links, but also Δt information and other important status information. Furthermore, these data words are accessible by the boards as well, helping to guide the sequencing of the communications cycle. When writing to this register, only the local status bits change, and only the ones not under control of the links core. If data is written to those, it will be overwritten by the link status at the next Wishbone bus cycle.

Config Space

Register Name    Address   Bit    Function
Link Status      0x18000   31-0   Local and remote link status
CTL (Control)    0x18004   0      Reset link formatters when set
RX               0x14000   N/A    Receiver data formatter dual-port memory access
TX               0x10000   N/A    Transmitter data formatter dual-port memory access

All registers are 32 bit; undefined bits are not used. Addresses given as required for C device driver code.

Table 10.1: Node Solver links config space

10.2.3 Skip timer

In order to support different Δt rates in the system, each node solver has a specially modified version of the Δt timer (section 5.2.2). The only difference is that instead of obtaining the clock from the WB bus and dividing it down to a 1 µs signal, we now use the system Δt to clock the timer. By doing this, the timer will only produce an output when a defined number of cycles has passed.
By choosing the system Δt to be the fastest one, we can now obtain every integer multiple of this simulation interval. Since this is a real-time simulator, we have to take worst-case timing into account. For this reason, all units will exchange data at every Δt, even if a node runs at a slower rate, effectively running the system at worst case all the time. However, during these cycles the previous versions of the data are communicated and the solver computer is not involved in the process, so this comes at no performance penalty. For soft-real-time systems, we could change this behaviour to skip communicating nodes that do not strictly require data exchange and so speed up the whole system. However, if the situation were to occur where all nodes have to communicate, the bus might then be saturated and the communications could exceed the time budget.

Config Space

Register Name                   Address  Bit    Function
TCR (Timer Control Register)    0x0      0      System Δt ignored when set, simulation stopped
                                         1      Single simulation run when set
PCR (Period Control Register)   0x4      31-0   32-bit Δt skip value

All registers are 32 bit; undefined bits are not used. Addresses given as required for C device driver code.

Table 10.2: Node Solver skip timer config space

Chapter 11 API

No piece of computer hardware is complete without a device driver and the API (Application Program Interface) provided by it. To illustrate how to interact with the hardware, and its high level of abstraction, an example device driver was written for the Linux 2.4 platform. It provides system calls to the various setup registers and takes care of all hardware initializations and address translations necessary to easily access the network system. The result is a clean, easy-to-use set of commands issued through IOCTL calls. All actual data transfers happen through MMAP instructions. Since the actual source code for both the driver and example programs is thoroughly documented, we refer to that for the details.
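The register layout in tables 10.1 and 10.2 maps naturally onto constants in a driver header. A minimal sketch (the macro and helper names here are ours, for illustration; the addresses are the Wishbone offsets from the tables, and the status-word split follows the description above):

```c
#include <stdint.h>

/* Node solver link registers (table 10.1) -- hypothetical names */
#define OVNI_NET_LINK_STATUS  0x18000u  /* bits 31-0: local + remote link status */
#define OVNI_NET_LINK_CTL     0x18004u  /* bit 0: reset link formatters when set */
#define OVNI_NET_RX_MEM       0x14000u  /* receiver dual-port memory window */
#define OVNI_NET_TX_MEM       0x10000u  /* transmitter dual-port memory window */

/* Skip timer registers (table 10.2) -- hypothetical names */
#define OVNI_NET_TCR          0x0u      /* bit 0: stop simulation, bit 1: single run */
#define OVNI_NET_PCR          0x4u      /* 32-bit delta-t skip value */

/* Split the 32-bit status word: the 16 LSBs are local status,
 * the lowest 12 of the upper 16 bits are remote status. */
static inline uint16_t link_local_status(uint32_t status)
{
    return (uint16_t)(status & 0xFFFFu);
}

static inline uint16_t link_remote_status(uint32_t status)
{
    return (uint16_t)((status >> 16) & 0x0FFFu);
}
```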
However, to illustrate the ease of use, let us examine the two most important operations in detail.

11.1 ioctl

Under Unix-style operating systems, calls to device drivers happen through a special system call named ioctl.

11.1.1 IOCTL manpage entry

IOCTL(2)                Linux Programmer's Manual                IOCTL(2)

NAME
    ioctl - control device

SYNOPSIS
    #include <sys/ioctl.h>

    int ioctl(int d, int request, ...);

DESCRIPTION
    The ioctl function manipulates the underlying device parameters of special files. In particular, many operating characteristics of character special files (e.g. terminals) may be controlled with ioctl requests. The argument d must be an open file descriptor.
    The second argument is a device-dependent request code. The third argument is an untyped pointer to memory. It's traditionally char *argp (from the days before void * was valid C), and will be so named for this discussion.
    An ioctl request has encoded in it whether the argument is an in parameter or out parameter, and the size of the argument argp in bytes. Macros and defines used in specifying an ioctl request are located in the file <sys/ioctl.h>.

RETURN VALUE
    Usually, on success zero is returned. A few ioctls use the return value as an output parameter and return a nonnegative value on success. On error, -1 is returned, and errno is set appropriately.

ERRORS
    EBADF   d is not a valid descriptor.
    EFAULT  argp references an inaccessible memory area.
    ENOTTY  d is not associated with a character special device.
    ENOTTY  The specified request does not apply to the kind of object that the descriptor d references.
    EINVAL  Request or argp is not valid.

CONFORMING TO
    No single standard. Arguments, returns, and semantics of ioctl(2) vary according to the device driver in question (the call is used as a catch-all for operations that don't cleanly fit the Unix stream I/O model). See ioctl_list(2) for a list of many of the known ioctl calls.
    The ioctl function call appeared in Version 7 AT&T Unix.

SEE ALSO
    execve(2), fcntl(2), ioctl_list(2), mt(4), sd(4), tty(4)

11.1.2 Example use

All available calls can be found in the header file test.h. As much as possible, the same calls were used for both the central solver and the node solver cards. In fact, the driver is the same for both boards; one just uses the different ioctl calls that pertain to the particular piece of hardware under investigation. Let us now look at an example. Suppose we wish to set the Δt timer to a 2 µs interval. Do note that a valid file descriptor (fd) has to exist for the device for this to work. The code required looks as follows:

void *data = (void*)0x00000002;
result = ioctl(fd, OVNI_NET_SET_TIMER_INTERVAL, &data);

Looking at the actual ioctl call, we first see the file descriptor, which defines the device to which we want to address the call. In Unix, all devices in the system are files. As such, all devices are addressed as files, and an open file descriptor is necessary. The second argument is the actual ioctl request. Technically, these are numbers assigned by the device driver writer. However, to make the intent of the call clear to the programmer, these calls are also defined in test.h (the header file for the device driver) under their full names, which there refer to the actual numbers used by the driver itself. The full list of calls can be found in the header. The last argument is the data we wish to write with, or read back from, the call. To know what to expect, it is advised to look at the device driver itself and the provided example program link_test.c. The argument is essentially a pointer, which is converted in the driver between user space and kernel space.

11.2 mmap

This powerful instruction allows us to map sections of system memory onto another piece of memory.
Since OVNI-NET is presented to the system as a memory-mapped device, this call also allows us to map data onto the network system. It is currently the way used to move data over the system, at least until full DMA is implemented.

11.2.1 mmap manpage entry

MMAP(2)                 Linux Programmer's Manual                 MMAP(2)

NAME
    mmap, munmap - map or unmap files or devices into memory

SYNOPSIS
    #include <sys/mman.h>

    #ifdef _POSIX_MAPPED_FILES
    void * mmap(void *start, size_t length, int prot, int flags, int fd, off_t offset);
    int munmap(void *start, size_t length);
    #endif

DESCRIPTION
    The mmap function asks to map length bytes starting at offset offset from the file (or other object) specified by the file descriptor fd into memory, preferably at address start. This latter address is a hint only, and is usually specified as 0. The actual place where the object is mapped is returned by mmap, and is never 0.
    The prot argument describes the desired memory protection (and must not conflict with the open mode of the file). It is either PROT_NONE or is the bitwise OR of one or more of the other PROT_* flags.

    PROT_EXEC   Pages may be executed.
    PROT_READ   Pages may be read.
    PROT_WRITE  Pages may be written.
    PROT_NONE   Pages may not be accessed.

    The flags parameter specifies the type of the mapped object, mapping options and whether modifications made to the mapped copy of the page are private to the process or are to be shared with other references. It has bits

    MAP_FIXED
        Do not select a different address than the one specified. If the specified address cannot be used, mmap will fail. If MAP_FIXED is specified, start must be a multiple of the pagesize. Use of this option is discouraged.
    MAP_SHARED
        Share this mapping with all other processes that map this object. Storing to the region is equivalent to writing to the file. The file may not actually be updated until msync(2) or munmap(2) are called.
    MAP_PRIVATE
        Create a private copy-on-write mapping. Stores to the region do not affect the original file. It is unspecified whether changes made to the file after the mmap call are visible in the mapped region.

    You must specify exactly one of MAP_SHARED and MAP_PRIVATE. The above three flags are described in POSIX.1b (formerly POSIX.4) and SUSv2. Linux also knows about the following non-standard flags:

    MAP_DENYWRITE
        This flag is ignored. (Long ago, it signalled that attempts to write to the underlying file should fail with ETXTBSY. But this was a source of denial-of-service attacks.)
    MAP_EXECUTABLE
        This flag is ignored.
    MAP_NORESERVE
        (Used together with MAP_PRIVATE.) Do not reserve swap space pages for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify this private copy-on-write region. When it is not reserved, one might get SIGSEGV upon a write when no memory is available.
    MAP_LOCKED
        This flag is ignored.
    MAP_GROWSDOWN
        Used for stacks. Indicates to the kernel VM system that the mapping should extend downwards in memory.
    MAP_ANONYMOUS
        The mapping is not backed by any file; the fd and offset arguments are ignored. This flag in conjunction with MAP_SHARED is implemented since Linux 2.4.
    MAP_ANON
        Alias for MAP_ANONYMOUS. Deprecated.
    MAP_FILE
        Compatibility flag. Ignored.

    Some systems document the additional flags MAP_AUTOGROW, MAP_AUTORESRV, MAP_COPY, and MAP_LOCAL.
    fd should be a valid file descriptor, unless MAP_ANONYMOUS is set, in which case the argument is ignored. offset should be a multiple of the page size as returned by getpagesize(2).
    Memory mapped by mmap is preserved across fork(2), with the same attributes.
    A file is mapped in multiples of the page size.
    For a file that is not a multiple of the page size, the remaining memory is zeroed when mapped, and writes to that region are not written out to the file. The effect of changing the size of the underlying file of a mapping on the pages that correspond to added or removed regions of the file is unspecified.
    The munmap system call deletes the mappings for the specified address range, and causes further references to addresses within the range to generate invalid memory references. The region is also automatically unmapped when the process is terminated. On the other hand, closing the file descriptor does not unmap the region. The address start must be a multiple of the page size. All pages containing a part of the indicated range are unmapped, and subsequent references to these pages will generate SIGSEGV. It is not an error if the indicated range does not contain any mapped pages.
    For file-backed mappings, the st_atime field for the mapped file may be updated at any time between the mmap() and the corresponding unmapping; the first reference to a mapped page will update the field if it has not been already.
    The st_ctime and st_mtime fields for a file mapped with PROT_WRITE and MAP_SHARED will be updated after a write to the mapped region, and before a subsequent msync() with the MS_SYNC or MS_ASYNC flag, if one occurs.

RETURN VALUE
    On success, mmap returns a pointer to the mapped area. On error, MAP_FAILED (-1) is returned, and errno is set appropriately. On success, munmap returns 0; on failure -1, and errno is set (probably to EINVAL).

ERRORS
    EBADF   fd is not a valid file descriptor (and MAP_ANONYMOUS was not set).
    EACCES  A file descriptor refers to a non-regular file. Or MAP_PRIVATE was requested, but fd is not open for reading. Or MAP_SHARED was requested and PROT_WRITE is set, but fd is not open in read/write (O_RDWR) mode.
            Or PROT_WRITE is set, but the file is append-only.
    EINVAL  We don't like start or length or offset. (E.g., they are too large, or not aligned on a PAGESIZE boundary.)
    ETXTBSY MAP_DENYWRITE was set but the object specified by fd is open for writing.
    EAGAIN  The file has been locked, or too much memory has been locked.
    ENOMEM  No memory is available, or the process's maximum number of mappings would have been exceeded.
    ENODEV  The underlying filesystem of the specified file does not support memory mapping.

    Use of a mapped region can result in these signals:
    SIGSEGV Attempted write into a region specified to mmap as read-only.
    SIGBUS  Attempted access to a portion of the buffer that does not correspond to the file (for example, beyond the end of the file, including the case where another process has truncated the file).

CONFORMING TO
    SVr4, POSIX.1b (formerly POSIX.4), 4.4BSD, SUSv2. SVr4 documents additional error codes ENXIO and ENODEV. SUSv2 documents additional error codes EMFILE and EOVERFLOW.

SEE ALSO
    getpagesize(2), mmap2(2), mremap(2), msync(2), shm_open(2), B.O. Gallmeister, POSIX.4, O'Reilly, pp. 128-129 and 389-391.

Linux 2.3.51                2000-03-25                MMAP(2)

11.2.2 msync manpage entry

MSYNC(2)                Linux Programmer's Manual                MSYNC(2)

NAME
    msync - synchronize a file with a memory map

SYNOPSIS
    #include <unistd.h>
    #include <sys/mman.h>

    #ifdef _POSIX_MAPPED_FILES
    #ifdef _POSIX_SYNCHRONIZED_IO
    int msync(const void *start, size_t length, int flags);
    #endif
    #endif

DESCRIPTION
    msync flushes changes made to the in-core copy of a file that was mapped into memory using mmap(2) back to disk. Without use of this call there is no guarantee that changes are written back before munmap(2) is called. To be more precise, the part of the file that corresponds to the memory area starting at start and having length length is updated.
    The flags argument may have the bits MS_ASYNC, MS_SYNC and MS_INVALIDATE set, but not both MS_ASYNC and MS_SYNC. MS_ASYNC specifies that an update be scheduled, but the call returns immediately. MS_SYNC asks for an update and waits for it to complete. MS_INVALIDATE asks to invalidate other mappings of the same file (so that they can be updated with the fresh values just written).

RETURN VALUE
    On success, zero is returned. On error, -1 is returned, and errno is set appropriately.

ERRORS
    EINVAL  start is not a multiple of PAGESIZE, or any bit other than MS_ASYNC | MS_INVALIDATE | MS_SYNC is set in flags.
    EFAULT  The indicated memory (or part of it) was not mapped.

CONFORMING TO
    POSIX.1b (formerly POSIX.4)

SEE ALSO
    mmap(2), B.O. Gallmeister, POSIX.4, O'Reilly, pp. 128-129 and 389-391.

Linux 1.3.86                1996-04-12                MSYNC(2)

11.2.3 Example use

From the device driver ioctl calls we can obtain the base address of our card. Using this address and the Wishbone addresses provided in this thesis, we can then use mmap to address the memory regions of our choice. We do have to note that, due to Intel architectural constraints, the minimum page size (granularity of the data) we can resolve is 4 KB. With 32-bit data words, this resolves to 1024 words. This is one of the reasons why we chose this as our memory size: it is more than enough for our system, while we avoid problems with the PC trying to address out-of-bounds memory areas on the PCI cards. To obtain a memory map we now use:

buf_write = mmap(0, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, address);

Here, buf_write is a sufficiently sized buffer and length is the page size. The two flags that follow (the first being a combination of two flags) configure the operation in a way that is compatible with our cards.
The combined flags say that both read and write operations are allowed on this data segment (we would typically use one or the other). The last flag is chosen so that the memory segment is only updated when we explicitly ask the system to do so, through an unmap or msync operation. Then there are the file descriptor and the actual address we wish to operate on. This call alone is not enough. To actually write the data, we now need to prepare the data in the buf_write buffer and call the msync or unmap operation:

msync((int*)address, length, MS_SYNC | MS_INVALIDATE);

Here, address and length are again the same as above. Specifying a shorter length has no use, as it will be rounded up in the kernel to the next higher page size anyway. The two flags cause the write operation to block until the write is completed and all other references to the old data are invalidated, thus making sure we are always working with current data. For reading data, the operations are similar.

Chapter 12 Results

At the time of writing, the system was still under development. As such, all results in this chapter are partial and refer to non-optimized hardware. As much as possible, we have attempted to present realistic values based on preliminary measurements of the system. In the current implementation, the maximum number of three-phase links per solver node is set at 84 (6 32-bit words per link) and can be readily expanded to twice as many, pending optimizations. It has to be noted that the total number of links in the system is limited by the bandwidth of the central node. Although ample, this does pose a limitation on the maximum system size in the current implementation of OVNI using MATE. However, future use of the full capabilities of PCI (66 MHz, 64 bit vs. the 33 MHz, 32 bit used now) or upgrading to PCI Express can drastically alleviate this limitation.
Also, the issue with the serial links, when resolved, will significantly enhance the results, as these links are currently operating at half speed. To estimate the efficiency of the PCI bus, we used a logic analyzer to investigate the data produced on the Wishbone bus. From these observations we saw that, using mmap operations, roughly 2/3 of the data cycle was actually used. Re-programming the latency and grant timers of the PCI bus can further enhance the performance, even without using DMA [30]. Table 12.1 shows the performance of different interfaces, including OVNI-RTDNS and OVNI-NET, for the particular case of point-to-point communication. Even though there is already an important advantage in communication time for the point-to-point case, the full advantage of the new hardware interface is seen when full connectivity is used between the solver nodes of the simulator, and when the full bandwidth of the hardware is exploited. The comparison of available bandwidth is seen in table 12.2.

I/O Interface   Bandwidth (Dell P-MMX 233 MHz)   Bus Width   Time for 3-phase Transmission Line
EPP/ECP         0.4 MB/s                         8           57.6 µs
PCI-EPP/ECP     1 MB/s                           8           24 µs
OVNI-RTDNS      3.3 MB/s                         16          7 µs
OVNI-NET        88 MB/s                          32          4 µs

Table 12.1: Single-Line point-to-point data transfer performance

Transmission lines   OVNI-RTDNS     OVNI-NET
1 3-phase            7 µs           4 µs
2 3-phase            14 µs          4.4 µs
84 3-phase           not possible   37.6 µs

Table 12.2: Point-to-point bandwidth OVNI-NET advantage

It has to be noted that all the OVNI-NET transmission times in these tables are for the complete OVNI-NET system, including the hub, running on a non-real-time version of Linux (standard RedHat 9 kernel) in user space. The OVNI-RTDNS system ran on the real-time PharLap RTX operating system. As such, OVNI-NET is significantly disadvantaged until the conversion of the driver to RTX is completed.
From the results in table 12.2 we can clearly see the advantage of OVNI-NET the moment we take more than one three-phase link into account, as would be the case in a real simulation. Due to the high bandwidth and the very low, non-additive latency, we leave the previous system far behind. It has to be noted that these figures are for our current system. When fully optimized, we can double the performance of the data payload transmission: the base latency will not be influenced much, while the data communication time will be halved. The total transmission time in OVNI-NET depends on the total number of links in the system (with a maximum of 84 links per node and a maximum of 243 nodes in the entire system), plus an additional 30 ns of overhead for each solver node, and can be computed as follows:

• Latency from solver node to PCI card: 1.5 µs
• Largest data block in the system: data size ÷ (2/3 × 33 Mwords/s)
• Sum of all the data in the system: total data size ÷ (2/3 × 33 Mwords/s)
• Protocol overhead: 30 ns per solver node
• Latency from the PCI card to the receiving solver node: 1.5 µs

As such, these point-to-point tables also reflect the communication time for a full-fledged MATE system, since they only exclude the protocol overhead for the multiple solver nodes involved.

Chapter 13 Conclusions

OVNI-NET greatly enhances the flexibility of the OVNI Real-Time Power System Simulator while maintaining its outstanding performance and full real-time capabilities. The addition of a shared-memory hub and distributed intelligence allows the MATE concept to be directly mapped onto the cluster interconnect itself, providing the ideal environment for this network solution scheme. Due to the adaptability of the system, it can be readily fitted to the older OVNI-RTDNS system if so desired by omitting the hub, for even lower latency, allowing it to profit from the enhanced interconnect performance.
In this thesis we have discussed the following contributions:

• Provided a new, faster and more flexible cluster interconnection system custom-tailored to MATE while still preserving full flexibility, allowing it to be re-used for other purposes and leaving the opportunity to make changes at a later stage to reflect any new research on the algorithm side of the OVNI simulator.
• Devised a highly efficient, non-additive-latency architecture that, while being cost effective, can still give better performance than industry leaders.
• Devised an efficient way to dynamically map data on the shared memory and infer the operational parameters at runtime.
• Demonstrated a processor-less distributed network processor architecture capable of operating at full synchronous speed with the data transfer.

Further work can be done to increase the bandwidth of the system and optimize the data flow. These enhancements, together with the implementation of DMA, can at least double the bandwidth and further lower the latency.

Chapter 14 Statements

Contrary to popular belief, Unix is user friendly. It just happens to be very selective about who its friends are. -Kyle Hearn

... "fire" does not matter, "earth" and "air" and "water" do not matter. "I" do not matter. No word matters. But man forgets reality and remembers words. The more words he remembers, the cleverer do his fellows esteem him. He looks upon the great transformations of the world, but he does not see them as they were seen when man looked upon reality for the first time. Their names come to his lips and he smiles as he tastes them, thinking he knows them in the naming. -Roger Zelazny, "Lord of Light"

Most things make sense when you look at them right. It's just sometimes you have to look really, really cockeyed.
-Florence Ambrose, Freefall

My take on all this is pretty simple: in a country where it is considered a normal, sane and fun recreational activity to strap two greased sticks to your feet and throw yourself down the side of a frigging mountain, nobody has the right to call *my* minor peccadilloes "unsafe." -Nathan J. Mehl

"I don't want to achieve immortality through my works. I want to achieve it by not dying." -Woody Allen

"There is a theory which states that if anyone ever discovers exactly what the Universe is for and why it is here, it will instantly disappear and be replaced by something even more bizarre and inexplicable. There is another theory which states this has already happened." -Douglas Adams

Bibliography

[1] J. Marti, L. Linares, J. Calvino, H. Dommel, and J. Lin, "OVNI: an object approach to real-time power system simulators," in Power System Technology, 1998. Proceedings. POWERCON '98. 1998 International Conference on, vol. 2, pp. 977-981, 1998.
[2] J. Marti, L. Linares, J. Hollman, and F. Moreira, "OVNI: Integrated software/hardware solution for real-time simulation of large power systems," in 14th Power Systems Computer Conference, (Seville), PSCC'02, 2002.
[3] J. Marti, L. Linares, R. Rosales, and H. Dommel, "OVNI: a full-size real time power system simulator," in International Conference on Digital Power System Simulators, (Montreal, Quebec, Canada), pp. 1309-1317, ICDS'97, 1997.
[4] J. Hollman, J. Marti, and J. Calvino Fraga, "Implementation of a real-time distributed network simulator with pc-cluster," in Parallel Computing in Electrical Engineering, 2000. PARELEC 2000. Proceedings. International Conference on, pp. 223-227, 2000.
[5] J. Hollman and J. Marti, "Real time network simulation with pc-cluster," Power Systems, IEEE Transactions on, vol. 18, no. 2, pp. 563-569, 2003.
[6] L. Linares, A Real Time Simulator for Power Electric Networks. M.A.Sc. Thesis, Vancouver: University of British Columbia, 1993.
[7] L. Linares, OVNI (Object Virtual Network Integrator): A New Fast Algorithm for the Simulation of Very Large Electric Networks in Real Time. Ph.D. Thesis, Vancouver: University of British Columbia, 2000.
[8] J. A. Hollman, Real Time Distributed Network Simulation with PC Clusters. M.A.Sc. Thesis, Vancouver: University of British Columbia, 1999.
[9] L. Linares and J. Marti, "Sub-area latency in a real time power network simulator," in International Conference on Power System Transients, (Lisbon), pp. 541-545, IPST'95, 1995.
[10] F. Moreira, J. Hollman, L. Linares, and J. R. Marti, "Network decoupling by latency exploitation and distributed hardware architecture," in International Conference on Power System Transients, (Rio de Janeiro), IPST'01, 2001.
[11] A. Semlyen and F. de Leon, "Computation of electro-magnetic transients using dual or multiple time steps," IEEE Transactions on Power Systems, vol. 9, pp. 1274-1281, 1993.
[12] M. Dolenc and T. Markovic, PCI IP Core Design Document rev 0.1. OpenCores.org, 2002.
[13] M. Dolenc and T. Markovic, PCI IP Core Specification rev 0.6. OpenCores.org, 2002.
[14] Texas Instruments, SCEB005 Increase the Speed of Parallel Backplanes 3x with GTLP, 2000.
[15] S. Blozis, SCEA019 Texas Instruments GTLP Frequently Asked Questions. Texas Instruments, 2001.
[16] P. Forstner and J. Huchzermeier, SCBA015A Fast GTLP Backplanes With the GTLPH1655. Texas Instruments, 2001.
[17] National Semiconductor, DS200143 DS92LV16 16-Bit Bus LVDS Serializer/Deserializer - 25-80 MHz, 2002.
[18] J. Brunetti and B. Von Herzen, XAPP230 The LVDS I/O Standard v1.1. Xilinx, 1999.
[19] J. Brunetti and B. Von Herzen, XAPP231 Multi-Drop LVDS with Virtex-E FPGAs v1.0. Xilinx, 1999.
[20] S. Kitanovska and D. Schultz, XAPP243 Bus LVDS with Virtex-E Devices v1.0. Xilinx, 2000.
[21] National Semiconductor, DS92LV16 Design Guide, 2002.
[22] Opencores.org/Silicore, WISHBONE System-on-Chip (SoC) Interconnection Architecture for Portable IP Cores rev. B3, 2002.
[23] Texas Instruments, SCES292D SN74GTLPH16945 16-Bit LVTTL-To-GTLP Bus Transceiver, 1999.
[24] Xilinx, DS077 Spartan-IIE 1.8V FPGA Family: Complete Data Sheet, 2003.
[25] Quality Semiconductor, AN-11A Bus Switches Provide 5V and 3V Logic Conversion with Zero Delay, 1998.
[26] Integrated Device Technology, IDTQS3861 QuickSwitch Products High-Speed CMOS 10-Bit Bus Switch with Flow-Through Pinout, 2000.
[27] E. Buterbaugh et al., Perfect Timing, A Design Guide for Clock Generation and Distribution. Cypress Semiconductor, 2002.
[28] D. Aldridge and T. Borr, AN1939 Clock Driver Primer - Functionality and Usage. Motorola, 2001.
[29] Texas Instruments, SCAS640B CDCVF2505 3.3V Phase-Lock Loop Clock Driver, 2002.
[30] T. Shanley and D. Anderson, PCI System Architecture, 3rd edition. Addison Wesley, 1995.

Chapter 15 Schematic diagrams and Board Designs

15.1 Serial Transceiver

[Schematic sheet: Transceiver]

15.2 PCI IO Card

[Schematic sheets]

15.3 Hub Bus Terminator

[Schematic sheets]

15.4 Hub Bus Link

[Schematic sheets]

15.5 Hub Card

[Schematic sheets]
Chapter 16 Tests

During a loop-back test run of the central node card, we programmed the Δt timer for a 2 µs interval. The oscilloscope picture of the timer waveform (Fig. 16.1) shows the accuracy of the signal. The exchanged data can be seen in the screen shot (Fig. 16.2) of the central computer. Note that the data sequence is repeated twice because the bus-emulator loop-back module used in the test had only 256 words of memory; the duplication is the result of the address-space wrap-around introduced by this module. A logic analyzer screen shot (Fig. 16.3) completes the set, showing some of the signals on the control bus according to Table 16.1. It has to be noted that, due to the limited number of inputs on the logic analyzer, we could only observe partial busses.

Trace   Signal Name
A7      WE
A6      SEL2
A5      SEL1
A4      SEL0
A3      ADR3
A2      ADR2
A1      ADR1
A0      ADR0

Table 16.1: Logic analyzer signal order

Figure 16.1: Δt timer 2 µs pulses

[Terminal output shown in Fig. 16.2: the device is opened, the Δt timer is configured for 2 µs periods, the links are reset, the number of nodes is set to three, the hub-link TX and RX memory regions are mapped to user space, and the test data sequence 0x11111111 through 0xaaaaaaaa appears twice.]
Figure 16.2: Central node loop-back test screen shot

Figure 16.3: Logic analyzer screen shot

Chapter 17 Photographs

