Design and Implementation of a High-Speed Data Logging Facility for a Dual DSP System

By Erika Yedid Barba
B.Sc. (Electronics and Communications Engineering), Instituto Tecnologico y de Estudios Superiores de Monterrey, 1995

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Applied Science in THE FACULTY OF GRADUATE STUDIES (Department of Electrical and Computer Engineering)

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
March 2002
© Erika Yedid Barba, 2002

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Electrical and Computer Engineering
The University of British Columbia
Vancouver, Canada
Date: April 2002

ABSTRACT

This thesis has two main components: implementation and system evaluation. In the implementation phase, we port an existing real-time operating system, called ORTS, from a single-processor version to a multiprocessor version, and integrate a DMA mechanism for data transfers, which had never been exploited in previous ORTS versions. The previous ORTS versions have two major disadvantages: they cannot sample at frequencies higher than 1 kHz, and they lose large amounts of data when data is transferred between the DSP board and the host. With the aim of achieving higher sampling frequencies and improving DSP-host communication performance, ORTS was ported to a dual DSP processor board called Daytona, based on Texas Instruments TMS320C67 DSP processors running at 167 MHz. The DSP processor itself supports DMA transfers within its local memory map. In this thesis, the C67 DMA has been integrated with the real-time kernel mainly to reduce the duration of the system calls that write or read data buffers (called links in ORTS) for burst transfers. A DMA algorithm and a mechanism called "tracking flags" were developed to control DMA access among all processes and allow multitasking, without locking the DMA controller to any semaphore. In the evaluation phase, we first demonstrate the existence of two previously unknown problems in ORTS that were preventing periodic processes from running periodically at the defined frequency, and provide the information necessary to fix them. Second, we present the static analysis for bounding the WCET of data transfers for the previous ORTS version, where no DSP instruction parallelism or DMA transfers were in play. We extend the previous analysis by presenting a method for dealing with the microarchitecture of the TI TMS320C67 processor, whose VLIW technology leads to complex instruction parallelism, and a method for bounding DMA transfers.

Contents
Abstract ii Contents iii List of Tables vi List of Figures vii List of Acronyms viii Acknowledgments ix 1.
Introduction 1 1.1 Real-time systems 1 1.2 The real-time operating system 1 1.2.1 The real-time kernel 2 1.3 Real-time operating systems classification 2 1.3.1 Small proprietary kernels: commercial offerings and homegrown kernels 2 1.3.2 Real-time extensions to commercial timesharing OS 3 1.3.3 Research operating systems 4 1.4 Predicting worst-case performance 5 1.4.1 Benchmarking and monitoring 5 1.4.2 Static WCET analysis 6 1.5 ORTS, and other proprietary real-time kernels 7 1.5.1 ORTS: a real-time OS for control applications 7 1.5.2 Other proprietary real-time kernels 8 1.6 Previous research done in ORTS 10 1.6.1 ORTS for Daytona board 11 1.7 Previous research in multiprocessor real-time systems 11 1.8 Research component 15 1.9 Contributions 16 1.10 Thesis organization 18 2. Porting Considerations 20 2.1 Main software structure of ORTS 20 2.1.1 System-level communication 20 2.1.2 Inter-process communication 21 2.2 ORTS system operation 23 2.3 Hardware considerations 25 2.3.1 C3x platform 26 2.3.2 Dual processor board C67 26 iii 2.3.2.1 Hurricane DMA transfers 28 3. Design Considerations 30 3.1 ORTS C3x version 30 3.2 Dual ORTS 31 3.2.1 Host-DSP communication 32 3.2.1.1 First approach (Hurricane DMA transfers) 32 3.2.1.2 Second approach 35 3.2.1.3 First proposed solution for the implementation of host-DSP communication 36 3.2.1.4 Final solution to the implementation of the host-DSP communication 39 3.2.1.5 Hurricane DMA tradeoff 42 3.2.2 Local process communication 44 3.2.3 Internodes process communication 44 3.3 Dual ORTS system operation 45 3.4 Memory distribution 49 4. Process Periodicity Tests 50 4.1 Periodic processes 50 4.1.1 Measuring execution time and periodicity 51 4.2 Sampling frequency tests: non-logging case 52 4.2.1 Observations 55 4.3 Sampling frequency tests: data logging case 55 4.3.1 Observations 57 4.4 Interpreting the results 57 4.5 Problems of implementation in ORTS 58 4.5.1 Periodic processes timing flaws 58 4.5.2 Timer interrupt handler flaw 61 4.6 Final Observations 63 5. C67 DMA 64 5.1 C67 DMA controller 64 5.2 DMA algorithm 66 5.2.1 Main problem 66 5.2.2 Addressing DMA algorithm main problem 67 5.2.3 DMA algorithm first approach 70 5.2.4 DMA algorithm second approach 71 iv 6. Static WCET analysis 74 6.1 WCET analysis for the C32 ORTS version 74 6.1.1 The scheduler 77 6.1.2 The timer interrupt handler 80 6.1.3 Software preemption 81 6.1.4 Semaphore P 82 6.1.5 Semaphore V 84 6.1.6 System calls to write direct local links and buffered PC links 85 6.1.6.1 Writing to direct local links 86 6.1.6.2 Writing to buffered PC links 91 6.2 Important timing considerations 94 6.3 Microarchitecture analysis for the TMS320C67 processor 97 6.3.1 TMS320C67 pipeline operation overview 97 6.3.2 Pipeline analysis 98 6.4 Bounding DMA effects on a program 102 6.5 ORTS kernel optimization 105 7. Scaling issues 107 7.1 Multiple DMA channels 107 7.2 Multiple DSP processor nodes 108 8. Future work 109 9. 
Conclusions 110 References 112 Appendix A 116 V List of Tables 3.1 Host data transfer rates to SSRAM and SDRAM 36 3.2 first proposed solution: Host-DSP performance for the first approach 37 3.3 First proposed solution: Host-DSP performance for the second approach 38 3.4 Host-DSP performance for the first approach at a sampling rate of 10kHz 40 3.5 Host-DSP performance for the first approach at a sampling rate of 5kHz 41 4.1 Non-logging test 1 53 4.2 Non-logging test 2 53 4.3 Non-logging test 3 54 4.4 Non-logging test 4 54 4.5 Data-logging test 1 55 4.6 Data-logging test 2 56 4.7 Data-logging test 3 56 4.8 Data-logging test 4 57 5.1 "memcpy" function: Transfer rates SDRAM-SSRAM 65 5.2 C67 DMA: Transfer rates SDRAM-SSRAM 65 5.3 System call duration to write a PC/Hurricane link. Transfer: SDRAM-SSRAM 72 5.4 System call duration to write a Local link. Transfer: SDRAM-SDRAM 72 5.5 System call duration to write an Internode link. Transfer: SDRAM-DPRAM 73 6.1 External switching time between accesses by different requestors 104 vi List of Figures 1.1 Shared memory multiprocessor system structure 1.2 Multiprocessor configuration in ORTS 2.1 PC link structure 2.2 ORTS system operation 2.3 Single processor DSP board architecture 2.4 Dual processor board architecture 2.5 Internode DMA transfer 2.6 DMA transfer to and from the PC 3.1 First approach: Hurricane links allocation for host-DSP communication 3.2 First approach: Hurricane links control header allocation 3.3 Second approach: PC links allocation for host-DSP communication 3.4 Internode communication 3.5 Dual ORTS system level communication 4.1 Case A. Large execution time (10 kHz sampling frequency) 4.2 Case B. Small execution time (10 kHz sampling frequency) 4.3 Case C Large execution time (5 kHz sampling frequency) 4.4 Process running one time-quantum later 6.1 Control flow graph 6.2 WCET to write a local link 6.3 WCET to write a buffered PC link 6.4 Pipeline operation 6.5 Pipeline operation for symbolic code in block A 6.6 Pipeline operation for symbolic code in block B 6.7 Pipeline operation for combined block A and B 6.8 Control flow graph with associated timing effect 5 s 6.9 Concurrent execution of DMA and a sequence of E-cycles vii Acronyms and definitions ALU Arithmetic logic Unit C31 TMS320C31 Texas Instruments microprocessor C32 TMS320C32 Texas Instruments microprocessor C3x TMS32C3xx Texas Instruments microprocessor family C67 TMS320C67 Texas Instruments microprocessor CNC Computer Numeric Controller CPU Central Processing Unit Daytona Dual DSP board, from Spectrum Signal Processing DEC 21153 PCI-PCI bus bridge, from Digital Equipment Corporation DMA Direct memory Access DPRAM Dual Port Random Access Memory DSP Digital Signal Processor EDF Earliest deadline first HOAM CNC Hierarchical Open Architecture Multiprocessor CNC HPI Host Port Interface ISA Industry Standard Architecture MAL Manufacturing Automation Laboratory ORTS Open Architecture Real-Time Operating System OS Operating System OSACA Open System Architecture for Controls within Automation PCI Peripheral Component Interconnect PEM Processor Expansion Module PMC PCI Mezzanine Connector RAM Random Access Memory RTOS Real-time Operating System RTS Real-time Support SDRAM Synchronous Dynamic Random Access Memory SSRAM Synchronous Static Random Access Memory TI Texas Instruments UBC University of British Columbia VLIW Very Large Instruction Word WCET Worst-Case Execution time viii Acknowledgments I would like to thank my supervisors Prof. Dr. M.R. Ito and Prof. Dr. 
Yusuf Altintas for their support. I give especial thanks to Ahmet Gurcan for his so friendly, sincere and unconditional help he always offered me. As always, I give thanks to my parents and to my loving husband, Edgar, for all their support and encouragement. ix Introduction 1.1 Real-time systems Real-time systems are systems in which response time is critical to the correctness of performance; "In real-time computing the correctness of the system depends not only on the logical result of the computation but also on the time at which the results are produced" [6]. Responses that come too early or too late are as undesirable as incorrect computational results. Real-time systems must meet several properties. First is liveness; the system must be able to respond to every internal/external event so that deadlocks, lockouts and starvation do not occur. Second are safety and reliability; the response provided by the system must conform to authoritative specifications of its behavior. Finally, timeliness is important; the system must respond within required, rigid time constraints. Real-time systems can be classified as "hard real-time systems" or as "soft real-time systems" [34]. Hard real-time systems must always meet their deadlines, otherwise the overall system is considered to have failed. Thus, the collective timeliness of hard real-time tasks is binary; that is, either they all will always meet their deadlines (in a correctly functioning system), or they will not (the system is infeasible). Failure to meet deadlines may cause catastrophic results. Hard real-time systems conflict with the operation of time-sharing systems since it is uncertain how long an operation will take; thus these two systems must not be mixed. Determinism and predictability of the system's evolution is crucial in hard real-time systems. The relevant future behavioral characteristics and execution environment of the system must be known a priori in order to always meet deadlines. A hard real-time system can also handle non real-time and/or soft real-time tasks, as long as they do not interfere with any hard real-time tasks. Soft real-time systems are a less restricted type of real-time system. In soft real-time systems, missing some deadlines, by some amount, may be acceptable. The service of a soft real-time system can also be delivered late within an upper limit of tardiness. They do not support deadline scheduling, and therefore they are risky to use for critical real-time tasks. 1.2 The real-time operating system Basic features found in RTOS include process management, memory management, inter-process communication and synchronization, shared memory, and time access. Process management is responsible for creating and destroying processes, for the interaction of processes with signal events, for setting priorities, and for scheduling of concurrent processes (CPU allocation). Memory management is responsible for allocating and deallocating physical memory. Inter-process communication is responsible for data exchange, for deadlock and livelock detection and 1 handling, and for synchronization between processes through mechanisms such as semaphores, mailboxes and event flags. Shared memory allows processes on the same processor unit or on different processor units to communicate. Time access allows the execution of processes to be controlled by time. 
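To make these building blocks concrete, the sketch below shows how they are typically combined in a periodic control process. It is illustrative only: every function name here (create of the process aside, sem_wait, sem_signal, read_adc, write_shared, wait_for_next_period) is a hypothetical stand-in for such primitives, not the ORTS API.

    /* Illustrative only: a periodic process built on generic RTOS primitives.
       All function names are hypothetical, not taken from ORTS. */
    #include <stdint.h>

    extern void    sem_wait(int sem_id);               /* inter-process synchronization */
    extern void    sem_signal(int sem_id);
    extern int32_t read_adc(int channel);              /* external data source */
    extern void    write_shared(int link, int32_t v);  /* shared-memory communication */
    extern void    wait_for_next_period(void);         /* time access: block until the next tick */

    void control_process(void)
    {
        for (;;) {
            int32_t sample = read_adc(0);   /* acquire one sample per period */
            sem_wait(1);                    /* protect the shared buffer */
            write_shared(0, sample);        /* pass the sample to a consumer process */
            sem_signal(1);
            wait_for_next_period();         /* timeliness: release the CPU until the next period */
        }
    }

In a hard real-time setting, the interesting question is not whether this loop runs, but whether every iteration is guaranteed to complete before wait_for_next_period is reached; that is the predictability concern developed in the rest of this chapter.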
The main objective in real-time operating system design, as precisely stated in [4] is to establish a reliable, preemptive kernel with sufficient priority levels, and the notion of an accurate, real-time clock. 1.2.1 The real-time kernel The real-time kernel is the core of the real-time OS. It consists of the most vital low-level functions performed by the OS, which are CPU management and allocation. The real-time kernel is responsible for dividing the available CPU time among all active processes, including user processes and system processes, according to a fair system policy (CPU schedule) that assures the processes meet their time constraints. Thus the kernel creates a virtual CPU for every active process, which responds to several events with acceptable latencies predefined by each process. In order for processes to interact, some synchronization and communication primitives must also be provided by the real-time kernel. Switching the CPU from one process to another is called a context switch. For this task it is required to save the state of the old process and load the new process. Context switching is pure overhead, since the system does no other useful work while switching. "Context switch overhead is non-trivial but often not a major factor in performance; scheduling policies, process partitioning (...) are often more critical to obtaining good performance" [9]. 1.3 Real-time operating systems classification Three ways of classifying real-time operating systems exist: small proprietary kernels, which include commercially available kernels and homegrown kernels; real-time extensions to commercial timesharing operating systems; and research kernels [10,11], 1.3.1 Small proprietary kernels: commercial offerings and homegrown kernels Real-time operating systems falling into this category are often used for small-embedded applications that must be fast. The objective of small proprietary kernels is to achieve high performance, in terms of average response time to external events. Examples of commercial kernels include QNX (QNX software), pSOS (Integrated Systems), VxWorks (Wind River 2 Systems), VRTX (Microtec Research), LynxOs (LynuxWorks), OSE (Enea OSE systems) and OS-9 (Microwave systems). ORTS, which is studied in this thesis, is an example of a homegrown kernel. Homegrown kernels are usually designed to suit specific applications and have no commercial exposure. Small proprietary kernels have been effectively used in small-embedded applications such as instrumentation, intelligent peripherals and many areas of process control. Deadlines are relatively easy to meet for simple applications; however, as the application becomes more complex, demonstrating that the timing constraints are met becomes very difficult. This is because designers base their implementations entirely on mechanisms such as mailboxes, queues, semaphores, and event flags, which although designed to be fast, do not necessarily help meet time constraints. In order to reduce the overhead incurred by the kernel and to guarantee fast execution, the associated real-time operating system functionalities are kept to a minimum, thus minimizing the size of the kernel itself. Moreover, the context switching is designed to be fast, a maximum latency is given to respond to external interrupts, and the intervals during which interrupts are disabled are minimized. 
On the other hand, timing requirements are met by providing bounded execution times to primitives, such as intertask communication and synchronization, by maintaining a real-time clock, by providing time-outs, and by providing a priority schedule mechanism and real-time queuing. 1.3.2 Real-time extensions to commercial timesharing OS In this category, non-real-time operating systems are converted to real-time versions. Examples of these operating systems include extending UNIX to RT-UNIX, or POSIX to RT- POSIX, or Lynux to RT-Lynux. These real-time extensions have greater functionality, and increasingly better visual interfaces and software development environments. Furthermore, since they are based on standards, they are very easily portable. On the other hand, so many functionalities and services derived from traditional OS result in larger code sizes and too much overhead, which is often not tolerable for real-time applications. For example, virtual memory management, which is often part of the OS, is too overhead-prone to guarantee sampling rates in the range of kHz [13]. Moreover, the real-time versions of commercial operating systems are generally less predictable than proprietary kernels, because they mainly emphasize speed rather than predictability [10]. The main problems when converting a non-real-time OS to a RTOS are excessive overhead and excessive latency in responding to interrupts. These problems are mainly due to the non-preemptability of the kernel and because internal queues are FIFO; that is, the mechanisms they include are not designed with real-time applications in mind. 3 Another important problem is that RTOS, as extensions of time sharing operating systems, take control of resource management, and decide who should be granted a resource based on the best average case performance, not on real-time constraints. Real-time operating systems such as RT-UNIX or RT-POSIX or RT-MATCH have been recommended for use in real-time applications where missing a deadline has no consequences [10]. 1.3.3 Research operating systems There exist misconceptions in the real-time computing world, which seriously affect the evolution of next-generation real-time systems. Some real-time computing misconceptions, as stated in [6], include the idea that advances in supercomputer hardware will take care of real-time requirements, that real-time computing is equivalent to fast computing, or that real-time programming is just assembly coding, priority interrupt programming, and device driver writing. Research Operating Systems play an important role in putting an end to these misconceptions and designing new operating system paradigms for hard real-time applications. As explained above, there is a lack of predictability in both proprietary kernels and extensions to commercial time-sharing operating systems. It is the job of research operating systems to come up with solutions to these problems and open up new avenues in RTOS research. 
Trends in current research include [10,14] the following: • Modeling and verification of systems that are subject to timing constraints, • Hardware and software co-design in RTOS, • Emphasizing predictability at kernel level and application level, • Developing support for fault tolerance, • Providing support for multiprocessor and distributed real-time systems, • Expanding the number of features in a kernel and providing more complex features without degrading the responsiveness of the system, • Portability of real-time applications due to the advent of new processors and programming languages. Examples of research operating systems include Dreams [13], Spring [15], and CHAOS [16]. 4 1.4 Predicting worst-case performance In hard-real-time systems, failure to meet deadlines may cause catastrophic results. It is therefore highly recommended that the worst-case performance of the system be known. Worst-case execution time (WCET) analysis is of great importance for the successful implementation of a hard real-time system. The purpose of worst-case execution time analysis is to determine the worst, possible execution times for a piece of code before using it in the system; that is, the behavioral characteristics and execution environment of the system must be known a priori in order to always meet deadlines. The goal of WCET analysis is to generate a safe (i.e. guaranteed not to underestimate the execution time) and tight (i.e. provide low overestimation) worst-case execution of a program. WCET is a fundamental part of scheduling theory. 1.4.1 Benchmarking and monitoring Traditionally, WCET analysis has been achieved through testing of the system. Testing is done by running the program and making time measurements to reproduce the worst-case scenario. Programs normally have too many program paths, thus making testing a difficult task. As observed in [12], "measurement of execution times is a trial and error process where no guarantees can be given since exhaustive testing/measuring seldom can be performed." Two common measurement methods are benchmarking and monitoring. Benchmarks measure the performance of individual operations, or only parts of them. Two approaches used for benchmarking are fine-grain benchmarks and application-oriented benchmarks. The first approach, which is the most common one, is used to evaluate the efficiency of hardware and software interaction for the most frequently used services (e.g. task switching time, task preemption time, and interrupt latency). The second approach is used to evaluate RTOS at a much higher level by considering the number of deadlines met or missed, and the point at which the system starts breaking down. These two approaches, as observed in [23], are not adequate for real-time applications that demand a deep knowledge of the RTOS predictability to guarantee meeting timing requirements. They are more focused on providing average performance than worst-case measurements. Different metrics proposed for benchmarking response time to external events, intertask synchronization and resource sharing, and intertask data transferring can be found in [23], Proposed benchmarks for interrupt processing time, context switching time, and multiprocessing time for DSP real-time systems can be found in [24]. Monitoring, on the other hand, is used to gather information about a running system. 
It repeatedly measures the latency times of the system (for several possible combinations of input data values) until it is possible to place the system into the worst-case several times. As explained above, these kinds of techniques are exhausting, and there is no guarantee that the worst-case scenario 5 is achieved. Metrics and results for benchmarking and monitoring performance in commercial real-time operating systems can be found in [25]. 1.4.2 Static WCET analysis A recent technique of WCET analysis is to analyze the program without actually running it; that is, by using static analysis. The most common focus for static analysis is the assembler code or object code of a program. "Unlike measurements, static analysis can generate a safe bound on the worst possible execution time to be obtained" [26], The process of statically generating a WCET of a program can be divided into static WCET analysis of basic blocks (or low-level analysis), and static WCET analysis of tasks (also referred to as high-level analysis or program path analysis). The low-level analysis deals with the amount of time taken to execute a straight-line segment (basic block) of machine code. Architectural factors of the processor, such as caches that affect the duration of instructions, and pipelines that introduce parallelism between the execution of successive instructions, is an important consideration in this analysis. The high-level analysis deals with the number of iterations in loops, branching, infeasible paths, and call instructions around the basic blocks. This analysis mainly focuses on determining what sequence or path of instructions will be executed in the worst-case scenario. In the program flow analysis the programmer's help (or a good knowledge of the source code) is often needed to retrieve important information about the structure of the program (e.g. loop bounds). Given the results from the low-level and the high-level analysis, the WCET is finally calculated. When performing static WCET analysis (at the low and high level), it is assumed that the program execution is uninterrupted (no preemption or interrupts), and that tasks are never blocked. Blocking, preemption and interrupts, and static knowledge of the tasks such as periods, synchronization, and communication between tasks are considered in much higher level analysis, called the schedulability analysis. Much research has been done in the area of static analysis of the WCET. Aljifri, Pons and Tapia [27] propose a high level approach for statically calculating the WCET of a program by identifying its feasible paths (paths that can be taken during execution) from the assembler code, without listing or finding all possible paths (feasible and infeasible). This technique guarantees tighter WCET analysis since only feasible paths are considered. Lindgren, Hansson, and Thane [12] combine static high level analysis and measurements of execution times to calculate the WCET of a program. They propose a method for measuring the worst-case execution time for simple processor architectures that do not integrate cache effects and/or pipelining, directly from the high level analysis of the program. In contrast to another method which calculates WCET with pure monitoring, this method makes use of the control flow of the program to reduce the number of measurements to a set of independent paths or basic blocks. Li, Malik, and Wolfe [28] address 6 both program path analysis and microarchitecture analysis of WCET of a program. 
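In generic notation (this is the standard implicit path enumeration formulation that such methods build on, not an equation copied from [28]), an analysis of this kind bounds the WCET as

    \mathrm{WCET}_{\text{bound}} = \max \sum_{i=1}^{N} c_i \, x_i

where c_i is the worst-case execution time of basic block i, x_i is the number of times that block executes, and the x_i are constrained by the structure of the control flow graph and by any path information supplied by the programmer.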
They propose a method in which WCET is computed by summing the products of the execution counts of the basic blocks in the program and their corresponding execution times. Their starting point is the construction of a control flow graph directly taken from the source code. Path information provided by the programmer is used to enable the elimination of infeasible paths. This method is expanded to cover the microarchitecture modeling of cache analysis and simple pipelining. More complex pipelining analysis is addressed by Engblom and Ermedahl [29]. They propose a method for low level WCET analysis that considers pipelined processors. Their method is based on the implicit path enumeration technique in which basic blocks (at the object code level) are assigned execution times and execution counts, which represent the number of times the basic blocks have been executed in the worst-case execution of the program. Pipeline is modeled by analyzing instruction overlaps within and across basic blocks, and by integrating a new timing variable, called the timing effects, associated with sequences of blocks. There is not always a direct correspondence between the source code of a program and its object code. This is due to transformations/optimizations performed by the compiler. Optimizing compilers are necessary to make a program small enough or fast enough, given a hardware platform. When optimization occurs the structure and execution characteristics of a program change; for example, loops are unrolled, statements removed, and functions inlined. Engblom, Ermedahl, and Altenbernd [26] propose a technique called co-transformation that maps information from the source code, where information about the program execution is more evident, to the object code. Their approach depends on specific information from the compiler about how a certain program is transformed, and how the compiler implements its transformations. This mapping is necessary before statically calculating the WCET of optimized programs. 1.5 ORTS, and other proprietary real-time kernels 1.5.1 ORTS: a real-time OS for control applications Real-time systems have been extensively used for a variety of real-time control applications, ranging from command-and-control systems, robots, and nuclear plants, to space shuttles and aircraft avionics. It is fundamental that the systems on which real-time control applications run be predictable and reliable, and that the systems be well open to modifications. ORTS designed and developed in the Manufacturing Automation Laboratory (MAL), University of British Columbia (UBC), is a real-time preemptive operating system running on a DSP board enhanced with Windows NT as the user application. The use of DSP processors as a supported target for real-time operating systems increases the responsiveness of real-time applications, such as process control applications, which require intensive computing. DSPs specialize in high speed arithmetic 7 (they usually have hardware to perform additions and multiplications in parallel within a single instruction), but most importantly, DSPs are capable of dealing with multiple sources of data from the real world (coming in and out of the system) without disrupting its performance. 
"Even though the performance in the current microprocessors is rapidly increasing, computation-intensive tasks often need to be carried out with a digital signal processor (DSP) to enable real-time execution of the applications" [8], ORTS is used for process control and monitoring of systems, such as machine tools, robots and XY-tables. Traditional control systems were "closed" to users, preventing them from integrating their customized applications or expanding the capabilities of their machines. ORTS has been developed to provide true open capabilities to give the end user a readily programmable and expandable platform for developing control systems for their machines with no need to modify the whole source code of the system. ORTS allows the integration of new modules or functions developed in the C language, and can be used as a modular operating system for progressively developing real-time processing, and process control applications. The C language (C-compiler) improves code reliability, software maintainability and portability [8]. These characteristics are essential in an open architecture real-time system. 1.5.2 Other proprietary real-time kernels Most commercial proprietary kernels can provide the same basic real-time features implemented in ORTS, which include semaphores, queues, priority inheritance, preemptive time-sliced schedulers, 8 priorities, shared memory, and interrupt handlers for external events. They can even improve some of these features by providing, for example, 256 priority levels without degrading the performance of the system, a prioritized interrupt handler not only for one (as in ORTS) but for several external interrupts, and hard real time responses to time critical events [30,31,33]. More importantly, any commercial proprietary kernel provides execution times of the real-time operating system they offer, which are at least, representative of the system's average behavior (very few RTOS come with execution times obtained through tests carefully designed to be representative of the system's worst-case behavior [11]). Following, we briefly present information about four widely used proprietary real-time operating systems. VxWorks is the most widely adopted real-time operating system in the embedded industry. It is a fully preemptive kernel, with priority-based round robin scheduling with 256 priorities, fast deterministic context switching, and with the capability of handling an unlimited number of tasks. It is used in mission-critical applications, including interplanetary exploration (used on the 8 Pathfinder mission to Mars). The VxWorks real-time operating system is found in a multitude of industries including aerospace, networking, medical, Industrial, digital imaging, transportation and multimedia. In the industrial market, the area ORTS was designed for, VxWorks is used in applications such as test and measurement equipment, robotics, CNC equipment, and process control systems, among others. VxWorks is compatible with numerous industry standards, and available on all popular CPU platforms [30]. LynuxOS is an open source hard real-time embedded operating system, developed from the ground-up with high performance, deterministic hard-real time response in mind. Even as the tasks running on the system increase massively, LynuxOS remains deterministic. As a hard-real time operating system, LynuxOS meets the needs of applications that must perform without fail in a wide range of mission-critical environments. 
LynuxOS support a variety of processor platforms, but does not support DSPs [31]. OSE, a fault-tolerant real-time operating system, is a fully preemptive kernel with priority-based scheduling, optimized to provide high rates of data throughput and low latency. OSE was designed and coded for safety critical applications, which led it to be the first commercial real-time operating system to be certified for safety, according to the International electrotechnical commission (IEC). OSE provides excellent deterministic performance; that is, all execution times in the OSE real time kernel are deterministic and are not affected by the size of the application, memory consumption, or by the number of processes. The OSE real-time operating system supports high performance DSP-DSP communication and complex CPU-DSP systems configurations without the additional overhead offered for less performance-dependent applications. OSE supports a variety of processor platforms including Texas Instruments DSPs: TMS320C54X, TMS320C625x, TMS320C67x, TMS320C55x, TMS320C64x. [33] Virtuoso, from Wind River, is a real-time operating system that helps developers create real-time digital signal processor (DSP) and multiprocessor applications with maximum performance and minimum overhead, while managing the complexity of inter-process communication. Virtuoso includes a small footprint kernel optimized for DSP, and an advanced virtual single processor (VSP) architecture for multicore distributed applications, as well as tools to enable customers to rapidly design, debug, and deploy a variety of applications with processing-intensive requirements. The Virtuoso design features two kernels to optimize performance: a high-level kernel, which is a multiprocessor microkernel with preemptive scheduling for application-level programming, and a low-level kernel, which is an ultra-high-performance kernel for driver scheduling with minimal overhead. Virtuoso is optimally designed for multiprocessor applications and applications that combine DSPs and CPUs. Virtuoso can communicate with VxWorks RTOS 9 through the host application, thus combining 2 leading RTOS solutions for memory-constrained and general-purpose-processing applications. Virtuoso is found in a variety of applications and markets including aerospace, data acquisition, image processing, industrial automation, radar and sonar, and process control, among others. Supported targets include Texas Instruments TMS320C6x and TMS320C4x DSPs. Supported hosts include Windows NT and Windows 2000. [32], 1.6 Previous research done in ORTS ORTS has been in continuous development. It was developed by Erol [1,5] in 1996 from its predecessor system HOAM-CNC, developed by Newell [2] in 1993. HOAM-CNC addresses some of the problems of closed CNC systems. However it is missing important real-time features, such as a preemptive architecture, which is an important feature for developing and executing time-critical tasks. ORTS addresses the important real-time features missing in its predecessor, HAOM-CNC, and improves the openness of the architecture. ORTS was initially implemented to run in a Spectrum PCC31 DSP board at 33Mhz, later it was ported to the Spectrum PCC32 DSP board at 50 MHz, and in 1999 it was ported to the Spectrum Indy at 60Mhz. The hardware selection was homogenous, in essence, using a single processor and a shared memory-based architecture between host and DSP. As for the porting task, the latest two versions follow the same design considerations as the first version. 
With the aim of enabling control applications defined for one standard to be used on another open architecture controller, in the year 2000 Otkunc [3] developed a communication interface for ORTS with the European Open System Architecture for Controls within Automation Systems controller (OSACA). Later that same year, Gurcan [4] developed a programming interface between ORTS and Simulink (a software package used for modeling, simulating and analyzing dynamical systems), that helped build new modules for ORTS. The latest version of ORTS (Indy version), was not able to run at sampling frequencies higher than 1kHz, and several data were lost in the communication between the host and the DSP. With the aim of achieving a higher sampling frequency and improving DSP-host communication performance, a new DSP board called Daytona, based on a Texas Instruments (Tl) DSP processor at 167 MHz, was acquired. Gurcan [4] attempted to port single processor ORTS version to the dual processor board, Daytona. The implementation considered only one processor, and followed the same principles as the single processor versions by communicating the DSP processor and the host via the Daytona's shared memory, which is no longer intended for host-DSP communication. 10 1.6.1 ORTS for Daytona board Spectrum's DSP boards, including the Daytona board, have been used for a variety of applications that require high speed signal processing, such as industrial control, radar, sonar, test and measurements, and imaging. The manufacturer provides the device drivers to handle the hardware in the Daytona board. They also provide a programming guide with examples of how to use the different functions that integrate the device drivers, which are followed in this thesis. The manufacturer offers different options for implementing communication between the host and the DSP, however, the best and most effective way to use the board strongly depends on the kind of application the designer is developing. What may be the most convenient implementation for one application may be the least convenient solution for another kind of application. Previous versions of ORTS support a single processor DSP board with a shared, memory-based architecture between the host and DSP for host-DSP communication, and an SSRAM for DSP processes communication. The DSP board to which ORTS is now being ported, Daytona, integrates two DSP nodes, each with SSRAM and SDRAM, and no longer consists of a shared memory between the host and the DSP nodes. The Daytona board also provides a new transfer mechanisms called Hurricane DMA for host-DSP communication. In this thesis we explore different approaches to achieve the most effective communication between host and DSP considering memory latencies, user's applications1, and some ORTS characteristics, such as data link structures. 1.7 Previous research in multiprocessor real-time systems Computing requirements have been gradually surpassing the power of processors, and multiprocessor systems are becoming the choice for today's applications in search of increased performance. Real-time systems are gradually embracing multiprocessing to improve system throughput. As stated by Rajkumar [17], "The speedups possible in multiprocessors are particularly attractive for real-time systems where additional computing power is in general desirable". New generation applications are becoming larger and more complex, and the need to implement them in multiple processors rather than in single processors is imperative. 
Examples of these applications include automated manufacturing systems, defense systems, automotive, avionics, and spacecraft control systems [18]. The need for multiprocessing brings new challenges to the design and development of new techniques for realizing timing constraints in multiprocessor systems. [Footnote 1: In this thesis we make reference to a single user, made up of a group of students from the Manufacturing Automation Laboratory at UBC, that uses ORTS.] Very few results obtained from uniprocessor systems can be generalized to multiprocessor systems. For example, developing scheduling algorithms for multiprocessor systems is more difficult than for uniprocessor systems. As observed in [18], bringing in additional processors adds a new dimension to the scheduling problem. The scheduling of tasks from one processor to another (load balancing) is fundamental to guaranteeing that deadlines are met. Two approaches used for load balancing, as described in [14], are the "Focused Addressing" approach and the "Bidding" approach. In the first approach, tasks that cannot be guaranteed to meet their deadlines are switched to another processor which, based on past accumulations of loading information, has an available surplus of time. This approach is not reliable since it is based on out-of-date information. With the second approach, when tasks cannot be guaranteed to meet their deadlines on the current processor, other processors are asked to bid for the right to run that task. Here, bids are made based on the surplus of time each processor has at the moment of the bid. Binding tasks to multiple processors falls into one of two modes: dynamic binding or static binding [17]. Given m processors, in the dynamic binding mode a task can be executed on any of the m processors (tasks are dynamically bound to processors). Implementing scheduling algorithms designed for uniprocessor systems on a multiprocessor with dynamic binding is no longer efficient. Rajkumar [17] has proven, for example, that the rate-monotonic algorithm for uniprocessor systems works badly for multiprocessors with dynamic binding. In this case, schedulability of a task needs to be addressed by implementing a scheduling algorithm designed for multiprocessor systems. Different scheduling approaches and techniques for multiprocessor systems are discussed in [18,19,20]. On the other hand, in the static binding mode, a task can execute on only one processor, called its host processor (tasks are statically bound to processors). To determine schedulability of a task in the static mode, it is necessary to consider only its computational requirements and its preemption by higher priority tasks that execute on the same processor. That is, the problem of solving schedulability of the m processors can be decomposed into m uniprocessor problems. Well-known scheduling algorithms for uniprocessor systems, such as rate-monotonic, can be efficiently implemented in a multiprocessor system with static binding [17]. In ORTS, a subset of all the user's tasks is statically bound to one processor node of the Daytona board while the other subset is statically bound to the other processor node (no switching of tasks between nodes is needed). This way, for each of the processor nodes in the Daytona board, a scheduling algorithm for uniprocessor systems can be implemented. That is, there is no need to implement or design a scheduling algorithm for multiprocessors to solve schedulability of tasks in dual ORTS.
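As an illustration of how the per-node check reduces to a classical uniprocessor test (ORTS uses a prioritized time-sliced scheduler, so this is an example of the kind of test that becomes applicable under static binding, not a test ORTS itself performs): for n independent periodic tasks with worst-case execution times C_i and periods T_i assigned rate-monotonic priorities on a single node, the classical Liu and Layland sufficient condition is

    \sum_{i=1}^{n} \frac{C_i}{T_i} \;\le\; n\left(2^{1/n} - 1\right)

so, for example, two such tasks on one node are guaranteed schedulable whenever their combined utilization does not exceed about 0.83.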
A prioritized, preemptive and time-sliced uniprocessor scheduler is implemented on each processor node of the Daytona board. Communication among tasks is another important requirement in real-time systems. This communication is accomplished by sharing physical or logical resources; synchronization mechanisms are therefore necessary to control access to these shared resources. The use of synchronization primitives such as semaphores and monitors is a very common practice and a simple solution to enforce mutual exclusion. Mutual exclusion protects the consistency of the shared data by allowing only one process or task at a time to access the shared data. Mutual exclusion can lead to uncontrolled priority inversion, in which a high priority task can be blocked for an unbounded time by a lower priority task. Priority inversion becomes even more critical in multiprocessor systems. Many efforts have been made in the development of new synchronization mechanisms for shared memory multiprocessor real-time system configurations [17,21,22]. A typical representation of this configuration is presented in Figure 1.1.

[Figure 1.1. Shared memory multiprocessor system structure: n processor nodes, each with its own local memory, connected through an interconnection bus to a shared memory.]

In the configuration presented in Figure 1.1, each processor node has its own local memory, which is used for communication among all tasks bound to the same processor. Multiple processors are connected together via a backplane bus along with a shared memory, used for intertask communication among tasks bound to different processor nodes. To solve synchronization between processors, Rajkumar [17] proposes a shared memory synchronization protocol, which is an extension of the uniprocessor priority ceiling protocol. Resources guarded by local semaphores (within a specific processor) are accessed using the uniprocessor priority ceiling protocol, which is implemented locally in each processor node. For intertask communication between processors, a global semaphore residing in the shared memory space is proposed. Global semaphores do not follow any particular protocol except for some important and well studied priority considerations, namely that global critical sections execute at pre-specified high priorities and that semaphore queues are priority-ordered. This synchronization protocol minimizes the schedulability loss due to the blocking of tasks by other tasks running on different processors. ORTS differs from previous research in that synchronization between processors (CPU-DSP or DSP-DSP) has been implemented using the simple consumer-producer mechanism. This mechanism has been used since the first version of ORTS. Consumer-producer synchronization works efficiently due to the unidirectional nature of the buffers, which allows consumers to access one location of the buffer's control header (the tail) to check for new data, and producers to access a different location of the control header (the head) to check whether there is space to write new data. Multiple consumers or multiple producers of the same resource are synchronized by semaphores within each processor or node.

[Figure 1.2. Multiprocessor configuration in ORTS: two processor nodes, each with its own local memory, connected through a dual port RAM (DPRAM).]

An abstraction of the shared memory multiprocessor architecture used to implement ORTS is shown in Figure 1.2. Observe that it is slightly different from the configuration shown in Figure 1.1.
The main difference is that the processor nodes are connected via a dual port RAM instead of a bus. The dual ports allow concurrent access by both processors to the DPRAM. In this thesis, the DPRAM is divided into unidirectional buffers; that is, for each buffer, if a task is defined to be a consumer in one processor, the corresponding task on the other processor is necessarily defined as a producer. If a task requires data to be read, it is blocked until the corresponding task on the other processor node writes data to the corresponding buffer, and wee versa. 14 1.8 Research component This thesis is divided into two phases: 1. Porting ORTS from its single processor version to a dual processor version, which is the implementation phase of the thesis, 2. Performance tests and WCET analysis of the RTOS, which is the evaluation phase of the thesis. The research components in the implementation phase are these: • To improve the communication between the PC and DSP board. This implies exploring and evaluating the different techniques used to implement the communication between PC and DSP to propose and implement a solution that guarantees minimum, or zero, sample losses for transfer rates as high as 10khz. • To improve transfer rates for large data transfers between memories. This implies proposing and designing a mechanism for implementing the Tl TMS320C67 DMA controller, with the aim of reducing system call duration to read and write data to the links for large transfers. This thesis also explores single and multiple channel options for implementing the DMA mechanism that better suits ORTS in both current applications (for small data transfers), and future applications that might require transferring large amounts of data. The research components in the evaluation phase are these: • To analyze the performance of periodic processes to determine the source of their non-deterministic behavior. This thesis proves, numerically and graphically, the existence of 2 timing flaw conditions in ORTS that prevent periodic processes from running periodically at the defined frequency, and provides the necessary information to fix the problems. • To conduct a WCET analysis of data transfers and kernel structures of ORTS. This implies exploring previous research done in the area of static analysis. In this thesis we use existing techniques to conduct the WCET static analysis of a program with non pipelined processors, and propose techniques for conducting a good approximation of the worst-case performance of a program when pipelined processors and cycle stealing DMA mechanisms are considered. • To determine the reason(s) why ORTS for Daytona does not run at frequencies faster than 10 kHz as stated by previous researchers. This implies 1) analyzing the behavior of the kernel in terms of execution times and CPU utilization, to determine how much overhead is introduced by the system. This overhead directly impacts the maximum frequency at which 15 processes can be executed in ORTS. 2) it implies accomplishing the research component mentioned above, in which we demonstrate the existence of previously unknown timing flaws in the ORTS kernel that affect the frequency at which periodic process are executed. 3) Providing an explanation as to why only improvements in the execution time per instruction (due to a faster clock) are evident, even when a DSP processor with VLIW architecture is used, which allows software pipelining. 
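To make the second implementation goal above concrete, the sketch below contrasts a plain memcpy-style link write with a DMA-assisted one, in which the CPU only programs the transfer and can run other processes until completion. The dma_submit and dma_wait wrappers are hypothetical names introduced for illustration; they are neither the actual C67 DMA register interface nor the ORTS system call.

    /* Illustrative sketch of a DMA-assisted buffer write (wrapper names hypothetical). */
    #include <stddef.h>
    #include <string.h>

    int  dma_submit(void *dst, const void *src, size_t nbytes);  /* program one DMA channel */
    void dma_wait(int handle);                                    /* block until the transfer completes */

    /* CPU-driven copy: the writing process occupies the CPU for the whole transfer. */
    void link_write_cpu(void *dst, const void *src, size_t nbytes)
    {
        memcpy(dst, src, nbytes);
    }

    /* DMA-driven copy: the CPU sets up the transfer and can be rescheduled meanwhile. */
    void link_write_dma(void *dst, const void *src, size_t nbytes)
    {
        int h = dma_submit(dst, src, nbytes);  /* e.g. an SDRAM-to-SSRAM burst */
        /* ... the kernel may dispatch other processes here ... */
        dma_wait(h);                           /* synchronize before the link header is updated */
    }

    /* Stubs so the sketch is self-contained; a real implementation would program the
       C67 DMA channel instead of calling memcpy. */
    int dma_submit(void *dst, const void *src, size_t nbytes)
    {
        memcpy(dst, src, nbytes);
        return 0;
    }
    void dma_wait(int handle) { (void)handle; }

The benefit only appears for transfers large enough that the setup cost of programming the DMA channel is smaller than the CPU time saved, which is why the thesis evaluates both small and large transfer sizes.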
1.9 Contributions As will be seen in the following chapters, this thesis provides several contributions ranging from ORTS system operation diagrams that provide a friendly guide to the operating system for new ORTS developers, to a WCET analysis of data transfers and kernel structures of ORTS. The importance of this thesis and its research component goes beyond the simple original objective of porting a RTOS to a more complex architecture. In fact, as the result of the WCET analysis and the performance tests conducted to ORTS, this thesis disproves some misconceptions that the user and previous ORTS researchers have regarding real-time systems. The two most relevant of such misconceptions are 1) ORTS is capable of supporting hard real-time applications on more powerful execution platforms [3], and 2) that timing constraints (mainly faster sampling frequencies) are automatically met using faster processors, as stated by the user along the implementation of the newly ported version. First, hard real-time applications are not equivalent to fast computing, but to predictability. A more powerful platform minimizes the average response time of a program (process), but does not guarantee the predictability hard real-time systems are characterized for. ORTS has been under development for the past 5 years, but no worst-case performance tests had ever been performed that can show the behavioral characteristics of the system. Therefore, ORTS cannot be considered capable of handling hard real-time applications, unless strictly and carefully designed tests are performed to the system that prove otherwise. In this thesis we present an analysis that provides a first insight into the ORTS kernel in terms of worst case execution times. This analysis provides important guidelines for conducting an extensive static WCET analysis of the ORTS kernel, which is necessary to guarantee that any hard real-time application executes timely, reliably and safely. This thesis also presents the research necessary to extend the WCET analysis when software pipelining and DMA transfers are considered. Second, the use of faster processors helps improve the system throughput, however, this does not imply that timing constraints will be automatically met. There is still much work to do on the ORTS kernel to achieve higher sampling frequencies and better system responsiveness before 16 exploring faster supercomputer hardware. This thesis shows three aspects of the ORTS kernel that need improvement: • The ORTS kernel is too overhead-prone; it takes up too much processor time to execute context switching and semaphores, consequently reducing the available computing time for the application. For the previous version of ORTS, in the worst-case scenario, the process dispatch latency time was found to utilize approximately 72% of the CPU time, semaphores take up 42% of the CPU time to execute, and approximately 87% of the time needed for a system call to write data to a memory location, is dedicated to execute operations other than the data transfer itself. Alleviating the effects of this large source of overhead is not completely up to a faster processor, but to better kernel design. • The ORTS kernel presents 2 implementation flaws which cause periodic processes to execute at frequencies both faster and slower than that defined (this is not acceptable for real-time systems). 
In this thesis we provide measurements, a description, and graphical representations of these two flaws, and provide the information necessary to fix the problem, which exists in all versions of ORTS, including those still in use (e.g. Indy versions of ORTS). The use of a faster processor cannot counteract flaws in the real-time system. • The ORTS kernel, in its actual form, does not scale fully to the TMS320C67 processor. That is, the use of the C67 DSP processor running at 167 MHz, helps improve speed per instruction cycle, compared to the C67 DSP running at 60 MHz; however, the ORTS kernel (main source of overhead) does not take full advantage of the processor's software pipelining (parallelism). The way the ORTS kernel is coded violates important conditions a program should meet in order to be software pipelined. To fully take advantage of the software pipelining featured in the new processor, the ORTS kernel needs to be redesigned. Other contributions of this thesis include the following: • The implementation of the C67 DMA mechanism for data transfers between memory locations. This mechanism enhances the system and opens up the possibility of achieving larger data transfers while maintaining good performance, which otherwise could be crudely degraded using the simple "memcpy" function. Although the user nowadays handles small data transfers (around 50 to 60 words), this implementation is ready for future applications that require performing transfers of large amounts of data. 17 • The discovery of a problem of implementation in one of the device drivers provided by the DSP board manufacturer. This problem is explained in appendix A. Thanks to the implementation in this thesis, the manufacturer is now aware of the problem for the first time. It now depends on the manufacturer to fix it, release a new version of the software, and provide patches or new software to all customers that use Daytona boards, including the Manufacturing Automation Laboratory (MAL). 1.10 Thesis organization This first chapter introduces the most important features of real-time systems and gives a brief background of ORTS, the system being ported. Further, the research component and contributions of this thesis are outlined in this chapter. The second chapter explains the main software structure and operation of ORTS, and explains the difference between the single C3x processor boards supported by the previous ORTS versions versus the dual C67 processor board, Daytona. This chapter contains a study of the most relevant characteristics, features, and the operation of ORTS, needed to port the system. Daytona offers a new range of features that widen the possibilities for designing new methods of communication between PC and DSP, communication between DSP nodes, and communication between processes running on the same DSP. The third chapter evaluates these different possible designs, and focuses on the evaluation and comparison of two proposed approaches for the communication between the host and the DSP nodes. Some benchmarks are presented to support what is stated in each of the approaches. Based on the evaluations, the final design for dual ORTS is selected, and the final system operation and memory distribution of the newly ported system is presented. Chapter four presents the tests performed to the system. These tests focus on evaluating the periodicity at which processes run, for a given' sampling frequency. 
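A measurement of this kind can be reduced to timestamping each activation of the periodic process and comparing successive differences against the nominal period. The following minimal sketch records the worst deviation observed; read_timer_us is a hypothetical free-running microsecond timer, not an ORTS call.

    /* Illustrative periodicity check for a nominally periodic process. */
    #include <stdint.h>

    extern uint32_t read_timer_us(void);   /* hypothetical free-running timer, in microseconds */

    #define NOMINAL_PERIOD_US 100u          /* 10 kHz sampling frequency */

    static uint32_t last_activation;
    static uint32_t max_jitter_us;

    void on_activation(void)                /* called at the start of every period */
    {
        uint32_t now = read_timer_us();
        if (last_activation != 0u) {
            uint32_t period = now - last_activation;
            uint32_t jitter = (period > NOMINAL_PERIOD_US)
                                  ? period - NOMINAL_PERIOD_US
                                  : NOMINAL_PERIOD_US - period;
            if (jitter > max_jitter_us)
                max_jitter_us = jitter;     /* worst deviation observed so far */
        }
        last_activation = now;
    }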
Numerical results and graphical representations are obtained that prove that timing flaws in ORTS are affecting the performance of the system. Chapter five explains the implementation details of the C67 DMA as a mechanism to transfer data between communication links. A mechanism called "tracking flags" and two DMA algorithms are discussed in this chapter. 18 Chapters six presents an analysis for bounding the WCET of the system calls to write data to local and PC links for the previous ORTS version. Next, it proposes two methods for extending the previous analysis to the current ORTS version (dual ORTS), which handles complex instruction parallelism and DMA transfers. Chapter seven discusses the issues of scaling ORTS in the number of DSP processor nodes, and issues regarding scaling the C67 DMA implementation to multiple DMA channels. Chapters eight and nine present future work and conclusions respectively. 19 Chapter 2 Porting Considerations Porting a system to a new platform requires a good knowledge of the system in both software and supported hardware, as well as a good knowledge of the new platform itself. To better understand the implications of porting ORTS to Daytona, it is important to have some basic information about ORTS. The first part of this chapter explains the main software structure of ORTS, followed by a description of the system level operation, and concludes with a comparison of the platforms ORTS has been previously implemented on and Daytona, the new dual processor board to which ORTS is now being ported. 2.1 Main software structure of ORTS ORTS is composed of a real-time kernel and a host application. The real-time kernel runs on a TI processor-based DSP board. It basically performs CPU scheduling and resource allocation, and provides all the synchronization and communication primitives that allow the user DSP processes to interact. The real-time kernel supports a prioritized, preemptive and time-sliced scheduler. The host application, on the other hand, runs on Windows NT and is based on WinNT threads. It serves as an interface between the user and the real-time kernel. It runs the user PC processes, interprets the user inputs through a script file, provides a Generic DSP driver which is the base class for every DSP board supported by the system, and provides a DSP device driver for DSP dependent operations [5]. Both the real-time kernel and the host application are part of the same system, so as a system they need to communicate. There are two levels of communication between the DSP side and the host side of ORTS: system level communication and inter-process communication2. The following two subsections describe these two levels of communication between the host application (host side of ORTS) and the real-time kernel (DSP side of ORTS). 2.1.1 System-level communication System level communication, as the name suggests, is a low-level communication between the host side and the DSP side of ORTS, and it is exclusively for system-related operations. For this type of communication, both the DSP side and the host side of ORTS integrate a so-called 2 In this thesis, we make use of the term "inter-process communication" to refer to the communication between processes anywhere in the ORTS system; namely, host-DSP, DSP1-DSP1, DSP2-DSP2, orDSP1-DSP2. 20 Remote Subsystem. 
On the DSP side, this subsystem consists of a non-terminating aperiodic process called the CommandManager, the purpose of which is to enable the host to control operations, such as creating a process, terminating a given process, running a process, and creating communication links for processes communication. These operations are not under the user's control, and that is why they are considered to be system-related operations. The host side, on the other hand, provides the Remote DSP Control Interface, the main task of which is to send commands to the CommandManager to control the operations mentioned above. Three Control links (shared buffers) make it possible for the CommandManager and the Remote DSP Control Interface to communicate; therefore, the Control links are the first links created in ORTS and are created on startup. Each of these links has a specific purpose: • The CommandLink is used by the Remote Control Interface in the host side to send commands to the CommandManager in the DSP side; • The ReplyLink is used by the CommandManager to acknowledge the host after execution of a command; • The PrintLink is used to display text on the PC screen. 2.1.2 Inter-process communication Unlike System-level communication, inter-process communication makes it possible for user processes to interact and exchange information amongst each other. This type of communication is completely under the individual user's control. The way user processes communicate is through communication links (shared buffers). Different from the Control links, these links are created after the host interprets a user script file. In the script file the user determines how many communication links are needed and the characteristics of each. Communication links can be classified this way: • Local links are used to communicate processes running on the same DSP processor; • PC links are used to communicate processes running on the host to processes running on the DSP (Control links are classified as a kind of PC link), and • Internode links are those through which processes running on different DSPs processors communicate. All links are unidirectional, and can have multiple producers and consumers. Concurrent access is avoided by using semaphores. Links can be of three kinds: buffered, direct or asynchronous. Buffered links are circular buffers with a user-defined size. Consumers can only read from a 21 buffered link when the buffer is not empty; producers can only write buffered links when the buffer is not full. Direct links are the same as buffered links with a buffer size of one. Asynchronous links are of size one and do not force any synchronization between producers and consumers. In asynchronous links, the producer may override the data several times before the consumer reads it, and the consumer may read the same data several times before new data is produced in the buffer. Internode links and PC links integrate, as part of each link, a control header for consumer-producer synchronization. This header contains information about the following: • The number of elements written in the link; • the index of the next element in the link to be read; • the type of link; • the number of items in the links; • the size of each item, and • the base address of the link data block. Every time an element in a link is read or written, the first two indexes in the control header are read. In this way the producer knows if there is enough space for writing, and the consumer knows if there is new data to read. 
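To make this concrete, the sketch below shows one plausible C rendering of the control header and of the producer/consumer checks just described. The field names, types and the circular-buffer interpretation of the two indexes are illustrative assumptions, not the actual ORTS definitions.

    /* Field names, types and the circular-buffer interpretation of the two
     * indexes are illustrative assumptions, not the actual ORTS definitions. */
    typedef struct {
        volatile unsigned int writeIndex;  /* number of elements written (next write slot) */
        volatile unsigned int readIndex;   /* index of the next element to be read         */
        unsigned int          linkType;    /* buffered, direct or asynchronous             */
        unsigned int          numItems;    /* number of items in the link                  */
        unsigned int          itemSize;    /* size of each item, in words                  */
        unsigned int          dataBase;    /* base address of the link data block          */
    } LinkHeader;

    /* Producer side: is there room to write one more item without
     * overwriting unread data?                                              */
    static int link_has_space(const LinkHeader *h)
    {
        return ((h->writeIndex + 1) % h->numItems) != h->readIndex;
    }

    /* Consumer side: is there new data to read? */
    static int link_has_data(const LinkHeader *h)
    {
        return h->writeIndex != h->readIndex;
    }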
After writing, the producer updates the number of elements written in the link; after reading, the consumer updates the index of the next element in the link to be read. Figure 2.1 shows this link structure.

[Figure 2.1: PC link structure, showing three consecutive header/link pairs (HEADER 1, LINK 1; HEADER 2, LINK 2; HEADER 3, LINK 3)]

Local links make use of counting semaphores to determine whether a link can be written or read, and do not make use of this structure.

2.2 ORTS system operation

The main software structure of ORTS is composed of several parts that can change when ORTS is implemented on different hardware platforms; however, the system core operation is hardware independent. In other words, the way ORTS operates remains unchanged no matter what the hardware is. To better explain how ORTS operates, a step-by-step description is given of what takes place from the startup of the system to the point when the DSP processes are commanded to run. It is important to note that ORTS system operation makes use of the system-level communication explained in section 2.1.1. ORTS operation is explained using Figure 2.2 as a model.

[Figure 2.2: ORTS system operation. Host side: Generic DSP driver (create Control links, create PC links), script-file interpreter, Remote DSP Control Interface (initiate PC links, create Local links, create and run DSP processes). DSP side: DPRAM and SSRAM, DSP code running, create and run the CommandManager, initialize Control links, synchronization and communication primitives]

Step 1. On startup, the system priority is to create (allocate in memory) the Control links that allow the host application and the real-time kernel to communicate. The Generic DSP driver, in the host application, is responsible for creating the Control links. First, the Generic DSP driver creates a Control Block in the PC memory for each Control link. Then, it allocates space for these links in a designated memory on the DSP board (DPRAM) through the DSP device driver. The Generic DSP driver is also responsible for generating the control header and for writing it at the start address of each PC link. Up to this point, only the host application is running. On the DSP side, executable code needs to be downloaded to the DSP board in order to start execution.

Step 2. The host downloads the real-time kernel and the user processes to the DSP board through the DSP device driver. The code in the DSP board automatically starts running and a series of handshakes between the host application and the DSP code takes place. Handshaking is necessary to ensure the system runs properly. After handshaking, the DSP code runs freely. On the DSP side of ORTS, once the code is running, some initializations take place and the CommandManager is created. Right after its creation, the CommandManager makes calls to the real-time kernel to initialize the Control links. Initializing a link means locating it in the designated DSP board memory, recording its location, and creating lists with the necessary information for future use of the link. Initializing a link is only necessary when the host is responsible for allocating the links in memory; otherwise, the DSP would not know the location of the links in memory. Just after initializing the Control links, the CommandManager loops waiting for the host commands to arrive. By this point, the system has already been initialized and is ready to receive and execute the user input.

Step 3.
The user introduces a script file that defines the DSP and PC processes to be created, and the communication links through which those processes will communicate. The script file interpreter, in the host application, interprets it line by line. The interpreter makes calls to the Generic DSP driver and/or to the Remote DSP control Interface, depending on the kind of definition it encounters in the user script file. 24 When a PC link definition is found, the interpreter calls the Generic DSP Driver to create the link. The Generic DSP Driver creates PC links in the same way it creates Control links (refer to step 1). PC links are allocated in the same designated DSP board's memory as Control links. Just after the Generic DSP Driver creates the PC link, the interpreter calls the Remote DSP Control Interface to command the CommandManager in the DSP side of ORTS to initialize the new PC link. When a Local link definition is found, the interpreter directly calls the Remote DSP Control Interface to command the CommandManager in the DSP side to create the link. Local links are directly created by the DSP, which allocates space for the link in a designated DSP board's memory for Local links (SSRAM), and creates a list with control information for the future use of the link. For the creation of DSP processes, the interpreter makes a call to the Remote DSP Control Interface to command the DSP side to create them. Creating a DSP process involves creating the process stack, creating a process control block with information, such as the process name, process priority, process state, process stack, process stack pointer, process frequency, and so forth, and lining the process up in a "ready to run" list. After all DSP processes are created, a second command is sent to the CommandManager to run the processes. PC processes, on the other hand, are locally created and run by the host application. On the DSP side of ORTS, the synchronization and communication primitives are used for controlling access to shared resources, thus making it possible to read from and write to the Control links, and successfully establish communication between the Remote DSP control Interface and the CommandManager. The synchronization and communications primitives are also used to mediate access to the communication links between DSP processes. 2.3 Hardware considerations This subsection makes a comparison between the single C3x processor boards supported by the previous ORTS versions versus the dual C67 processor board, Daytona. Daytona is a more robust platform than the previous ones, and supports new functionalities such as DMA mechanisms for transferring data between memories, which are also introduced in this subsection. 25 2.3.1 C3x platform ORTS has been ported to three different TI C3x based DSP boards, namely the Spectrum PCC31 at 33Mhz, the Spectrum PCC32 at 50 MHz, and the Spectrum Indy at 60Mhz. These three DSP boards, although integrating different TI digital signal processors, share basically the same hardware architecture. This architecture is shown in Figure 2.3. Basically, 2 different types of memory are present: a 1K X 32 Dual Port RAM (DPRAM) and one or two 128K X 32 banks of SSRAM. The architecture is ISA bus operational, integrated with a DSP~LINK interface for interconnecting a peripheral I/O board. 
In this architecture, the Dual Port RAM is specifically designed for host-DSP communication; one of its ports is directly connected to the DSP processor bus, and the other port of the DPRAM is connected to the ISA bus of the host. The dual port allows fast information exchange between the host and the C3x without disrupting the processing of either device.

[Figure 2.3: Single processor DSP board architecture (C3x DSP, DSP-LINK interface, DPRAM and SSRAM on the ISA bus)]

2.3.2 Dual processor board C67

Porting ORTS to Daytona's dual processor board has great advantages over porting to the previous single processor boards, mainly in improved processing time. Daytona consists of two DSP nodes, each one equipped with a TI TMS320C6701 floating-point DSP running at 167 MHz. This architecture uses multiple execution units operating in parallel to execute multiple instructions during a single clock cycle. The C6701 executes up to eight 32-bit instructions per cycle. The device's core CPU consists of thirty-two 32-bit general-purpose registers. There are two multipliers and six arithmetic logic units (ALUs) in one CPU. This processor is one of the first off-the-shelf DSPs to use advanced very long instruction word (VLIW) technology to achieve increased instruction-level parallelism [38]. Figure 2.4 shows Daytona's DSP architecture.

[Figure 2.4: Dual processor board architecture [40]. Nodes A and B (C6x), each with 128Kx32 SSRAM and 4Mx32 SDRAM; a shared 8Kx32 DPRAM and PEM site; Hurricane A and Hurricane B bridges to the local PCI bus; Hurricane X host port interface; DSP-LINK interface; PCI-PCI bridge to the host PCI bus]

As shown in Figure 2.4, the DSP nodes in the Daytona board integrate a DPRAM; however, unlike in the previous C3x boards, it is not designed for communication between host and DSP, but for communication between DSP nodes. Moreover, a third type of memory, SDRAM, is integrated as part of each node. Each processing node is connected to a local PCI bus through two bridges called "Hurricane A" and "Hurricane B". The local PCI bus, on the other hand, interconnects the processor nodes. It is important to note that "Hurricane A" and "Hurricane B" can only access the SSRAM in their corresponding node. The local PCI bus interconnects with the host through a DEC 21153 chip, which provides the PCI-to-PCI bridge from the local PCI bus to the host PCI bus, enabling DSP-host communication. "Hurricane X", on the other hand, is used as the interface between the C6x Host Port Interface (HPI) bus and the local PCI bus. The nodes have identical sets of resources, except that node A also supports a DSP~Link3 I/O interface.

2.3.2.1 Hurricane DMA transfers

"Hurricane A" and "Hurricane B" do more than bridge the local PCI bus and the DSP bus; they also integrate a DMA controller for performing DMA transfers between memories. Different types of Hurricane DMA transfers are possible:
• DMA transfers between different DSP nodes (internode DMA transfers), and
• DMA transfers between DSP nodes and the host.
Each of these types of DMA transfers is presented below.

Internode DMA transfers. Figure 2.5 shows a DMA transfer from the SSRAM of one node to the SSRAM of the other node, performed through the local Hurricane chips.

[Figure 2.5: Internode DMA transfer [40]]

DMA transfers between nodes and host. This type of DMA transfer is performed exclusively between the SSRAM of either node A or node B and the host RAM. The transfers are performed via the node's Hurricane and the PCI-PCI bridge to access the host; see Figure 2.6.
[Figure 2.6: DMA transfer to and from the PC [40]]

Chapter 3 Design Considerations

To better understand the implications of porting ORTS to Daytona, this chapter first describes the implementation details of ORTS for a C3x platform. Second, the implications of porting ORTS to Daytona are presented. Daytona offers a new range of features that widen the possibilities for designing new methods of communication between PC and DSP, communication between DSP nodes, and communication between processes running on the same DSP. These new possible methods of communication are evaluated to determine the appropriate implementation for dual ORTS. Following this evaluation, we describe how the ORTS system operation, explained in Chapter 2, is modified to support the new implementation. The chapter concludes by describing the memory distribution for dual ORTS.

3.1 ORTS C3x version

Software design depends to some extent on the platform on which the application will be run, and ORTS is no exception. The first version of ORTS was designed to run on a single processor board with a shared memory-based architecture between the host and the DSP (refer to Figure 2.3). In this architecture, a DPRAM is the only way the DSP and the host can communicate. The DPRAM is also referred to as 'shared memory' since it is shared between the two subsystems. Besides the DPRAM, a second type of memory, an SSRAM, is available on the DSP board. Unlike the DPRAM, the SSRAM is attached exclusively to the DSP local bus, thus serving the DSP's purposes only. The memory allocation for both communication and control links largely depends on the architecture of the selected board. Since host-DSP communication, in both system-level and inter-process communication, can only be achieved through the DPRAM, every link between the host system and the DSP board, which includes Control and PC links, is allocated in the DPRAM. On the other hand, links that connect processes in the same DSP processor (i.e., Local links) are allocated in the SSRAM. This distribution of links is the only possible design when considering this architecture, and the system software was designed and implemented based on this distribution. The next two ORTS C3x versions are based on the same kind of DSP board architecture, thus following the same software design pattern as the first version. The main difference among these three C3x versions, as mentioned in subsection 2.3.1, is the processor. Due to the many similarities between DSP boards, the job of porting ORTS was reduced to the following tasks:
• on the DSP side of ORTS, only the processor-dependent operations, such as timers, interrupt handlers and memory maps, needed to be modified;
• on the host side of ORTS, a new DSP device driver needed to be created in order to support the unique functionality of the DSP board.
So far, porting ORTS had been a transparent task, first due to the openness of the system, and second due to the homogeneous hardware selection of DSP boards used for the first three ORTS versions [1,3,4,5].

3.2 Dual ORTS

In chapter two the architecture of the Daytona board was compared with the architecture of previous C3x boards.
The Daytona board has various enhancements the previous boards lack, which can be summarized as follows: • two processor nodes; • three different types of memories, which include DPRAM, SSRAM and SDRAM; • larger memory space; • host access to any of the DSP's C67 address space through the HPI; • host access to any of the node's SSRAM through the Hurricanes; • Hurricane DMA transfers between a node's SSRAM and the host RAM and, • Hurricane DMA transfers between SSRAMs in different nodes. These new features and enhancements widen the possibility of implementing new methods of communication between the host and the DSP, between DSP nodes, and even between processes running on the same DSP processor. Gurcan [4] attempted to port single processor ORTS version to Daytona. The implementation considered porting one DSP node exclusively. The communication between.the host and the DSP was achieved through the DPRAM (just as it was done in the single processor versions), however, in Daytona, the DPRAM is not intended to communicate host and DSP, but DSP nodes. Furthermore, ORTS demands large exchange of information between host and DSP. The DPRAM is a 32KB memory (4k x 64), the smallest memory in the board. If both DSP nodes were intended to communicate evenly with the host via the DPRAM, only 2k would have been available for each node. Moreover, DPRAM has also the lowest throughput rates among all the different memories in the Daytona board. Local links in ORTS are dynamically created with the "malloc" function. In the "Linker command file" used in Gurcan's implementation, every allocation with the "malloc" function were linked in the Internal variables (IVARS) section on on-chip memory. The IVARS section is 16KB, which is 31 even smaller than the DPRAM. With this implementation, local links were limited to a space less than 2k x 64, while SSRAM and SDRAM were completely idle. Neither local links nor PC links were implemented to handle double precision data (64 bits). As part of Gurcan's implementation, the creation of process' stack and the Interrupt handler, which saves and restores process information during preemption, were successfully ported, and are being fully used in this thesis. All other porting considerations were designed and developed from scratch. The porting process, as well as the new techniques in data transfer and possible design approaches for implementing host-DSP communication, local processes communication, and interNode processes communication are described in the following sections. 3.2.1 Host - DSP communication Host-DSP communication in previous ORTS versions is performed via the DPRAM, with no other option possible for communication. In Daytona, the main use of the DPRAM is to communicate between processor nodes, so it is not convenient to keep following the same implementation as previous versions for this type of communication. However, new options are possible, which include Hurricane DMA transfers between an SSRAM and the host RAM, or simple accesses from the host to any of the SSRAMs. Following, two different design approaches for host-DSP communication are studied. 3.2.1.1 First approach (Hurricane DMA transfers) The first approach takes advantage of the Hurricane DMA transfers between the SSRAM of any of the nodes and the host RAM by allocating the so-called Hurricane links in the host RAM. Hurricane links are a newly created link that communicate DSP processes with host processes using Hurricane DMA transfers. 
In this approach, every access performed by the host, to either read or write a link, is local; no accesses to the DSP local PCI bus need to be performed. The DSP, on the other hand, performs external DMA transfers from and to the host RAM to either read or write a link. This approach is mainly attractive because of the DMA's capacity to handle transfers of large amounts of data without CPU intervention. Furthermore, host reads and writes to the links are executed much faster locally than by having to access the DSP board. See Figure 3.1.

[Figure 3.1: First approach: Hurricane links allocation for host-DSP communication]

From this figure it can be observed that the control header for the Hurricane links is allocated in the host RAM along with the data. An immediate drawback to having the control headers in the host RAM is the overhead the application incurs when the DSP side of ORTS reads and/or writes the links' control header (refer to section 2.1.2). Performing a single data transfer to or from the host requires not one but three different DMA transfers:
• one DMA transfer to read the indexes in the control header;
• one DMA transfer to either write or read the block of data (the actual data transfer);
• one DMA transfer to update the control header.
A solution to this drawback is to allocate the data portion of the Hurricane links in the host RAM and allocate the links' control header in either the SSRAM or the SDRAM of the corresponding DSP node. This way, accesses to the links' control headers by the DSP side are performed through pointers locally in the DSP node. The host, on the other hand, now needs to access the DSP node's memory in order to read and update the header; however, accesses to data are still performed locally. With this solution we get rid of the extra overhead caused by performing Hurricane DMAs to read and update the links' header. See Figure 3.2.

[Figure 3.2: First approach: Hurricane links control header allocation]

One important characteristic of Hurricane DMA is that transfers are only possible through SSRAM (refer to Figure 2.8); therefore, data to be transferred to or from the host RAM must hold a location in SSRAM. Since data is dynamically created by the running user process, data to be transferred to the host RAM must first be transferred to a location in SSRAM before finally being transferred to the indicated Hurricane link in the host RAM. To read data from the host RAM, the same applies; that is, data in the host RAM must first be placed in the SSRAM and then read from the SSRAM before finally being transferred to the DSP process that requested it. This means that two transfers of the same data block must take place. In regard to memory space, although Hurricane links are not allocated in SSRAM, significant space in that memory is used for this partial allocation; thus, no DSP memory is saved whatsoever. Another important point to take into account is the implementation of a DMA algorithm for multitasking. By allocating the Hurricane links in the host RAM, every DSP process that needs to read or write a Hurricane link must make use of the Hurricane DMA. The Hurricane DMA has the ability to transfer data to and from the host RAM without processor intervention. With this capability, the Hurricane DMA can execute during context switching, and can continue executing the transfer of a specific process during another process's running period.
Therefore, an algorithm needs to be implemented in order to share this mechanism among all processes. Section 5.2 gives a detailed description of the algorithm used for implementing the DSP C67 DMA.

3.2.1.2 Second approach

The second approach proposed for implementing host-DSP communication is to allocate the PC links in the SSRAM and have the host access either node's SSRAM through the Hurricanes. It is important to make a distinction between the new Hurricane links, which connect DSP processes with host processes using Hurricane DMA transfers, and the PC links, which connect DSP processes and host processes but do not make use of the Hurricane DMA. With this approach, every host access to the PC links needs to be performed through the local PCI bus. As for the DSP side, every access to the links is performed locally on the DSP board and no Hurricane DMA transfers are performed. See Figure 3.3.

[Figure 3.3: Second approach: PC links (headers and data) allocated in the SSRAM of DSP A and DSP B, accessed by the host through Hurricane A and Hurricane B]

Allocation of the PC links in the SSRAM is selected over allocation in the SDRAM since host accesses to the SSRAM are done at a higher throughput rate than accesses to the SDRAM. See Table 3.1.

Test performed (transfer size: 0xFC00)      Rate [Mb/s]
Read SSRAM through Hurricane                2.73
Write SSRAM through Hurricane               24.61
Read SDRAM through host port                2.05
Write SDRAM through host port               2.30

Table 3.1 Host data transfer rates to SSRAM and SDRAM

The main drawback to this approach is that although host accesses to write to the SSRAM are performed at a relatively high rate, transfer rates to read from the SSRAM are too low, which means slowdowns for consumers and increased data losses. On the DSP side, every access from the DSP to the links is performed locally on the DSP board. In contrast with the previous approach, with this second approach every read or write of a link is performed directly by accessing the SSRAM; that is, no double transfers need to be performed. This way, system calls to read and write a link are shorter. Reduced system call duration for real-time applications (DSP processes) is the major advantage of this approach.

3.2.1.3 First proposed solution for the implementation of host-DSP communication

When host-DSP communication was first implemented, we considered that every read and write to the links had to be performed item by item. It is important to know that each item is given a size (item-size), which is specified in 64-bit words, and which determines the length of the data block (length of item) to be transferred between links. With this in mind there were not many apparent advantages of one approach over the other. On one hand, in the first approach the Hurricane DMA's main advantage of being able to handle transfers of large amounts of data without CPU intervention is not exploited at all. This is because the ORTS user defines item sizes of even a single data word, and rarely defines items greater than 50 words, which should definitely be considered a small-size transfer. Moreover, it is important to note that the overhead of setting up a DMA transfer for just a few bytes of data is higher than the transfer time itself; however, for larger blocks of data the setup time becomes negligible compared to the transfer time. Therefore, it is costly to use DMA mechanisms for small transfers.
For example, the difference in transfer time (Hurricane DMA read) between one 32-bit word and two 32-bit words is just 0.02396 µs. In percentages, transferring one 32-bit word is only 0.47% faster than transferring twice as much data. For this specific case, this time difference can be used as an estimate of the transfer time of a single 32-bit word without the function call overhead. In this case, we obtain that 99.53% of the transfer time of a single 32-bit word is pure overhead, and just 0.47% of the time is used for effective data. Transferring small blocks of data also has a direct impact on the performance of the system calls to read and/or write a link. If a DMA algorithm had to be implemented to control access to the DMA controller among different processes, this DMA algorithm would have to be executed even for a single data word transfer, thus considerably increasing system call duration, which directly affects the real-time performance of the system. The only noticeable advantage of using the first approach for implementing the host-DSP communication is that host reads and writes to the links are performed locally. This way, the host does not have to deal with arbitration, which means that reads and writes to the links are performed at higher rates. On the other hand, in the second approach, although the system calls to read and write a link are shorter for any transfer length (which strongly benefits the system), the host's low throughput rate when reading from the SSRAM is a major drawback. The following tables, Table 3.2 and Table 3.3, illustrate the performance of the host-DSP communication for both the first and second approaches, showing an estimated number of samples lost for a given number of transfers. The results were obtained by having the host read items from the corresponding link (a Hurricane link for the first approach, a PC link for the second) and log the data in a host file. On the DSP side of ORTS, a process periodically writes to the link at a frequency of either 5,000 or 10,000 Hz.

Approach                      No. transfers   Item size   Hurricane link size   Samples lost (lagging host)
First approach (5,000 Hz)     100,000         3 words     100                   20,894
                                                          500                    9,187
                                                          1000                   1,124
First approach (10,000 Hz)    100,000         3 words     100                   30,066
                                                          500                   28,490
                                                          1000                  27,985

Table 3.2 First proposed solution: host-DSP performance for the first approach

Approach                      No. transfers   Item size   PC link size   Samples lost (lagging host)
Second approach (5,000 Hz)    100,000         3 words     100            51,726
                                                          500            50,624
                                                          1000           48,725
Second approach (10,000 Hz)   100,000         3 words     100            72,384
                                                          500            71,695
                                                          1000           70,025

Table 3.3 First proposed solution: host-DSP performance for the second approach

From Table 3.3 it can be observed that nearly all the tests register a sample loss of more than 50% of the total transfers. This clearly shows the very poor performance of host-DSP communication using the second approach. In this second approach, the host experiences lag when reading the data from the PC link, causing the DSP process to constantly find the PC link full and then be forced to either sleep or simply skip the next sample to be written. As the sampling frequency increases, the host lags even more, causing an increase in the number of samples lost. The bottleneck occurs because the host has to negotiate access, first to the local PCI bus and then, through the Hurricanes, to the DSP bus, every time it needs to read data from the PC link. Negotiating the DSP bus extends the execution time of the process as well. Sample losses, for very low sampling frequencies, can be reduced to some extent by increasing the PC link size.
However, for high sampling frequencies, the SSRAM where the PC links are allocated is not large enough to significantly counteract this effect. host-DSP communication using the second approach cannot be, therefore, significantly improved for high frequencies. Table 3.2 shows the performance of the host-DSP communication using the first approach. As can be seen, the performance in this case is considerably better than the performance in the second approach. For example, for a sampling frequency of 5000Hz, sample losses are reduced to nearly 1% of the total transfers with a Hurricane link size of 1000. As the sampling frequency increases, the sample losses increase as well; however, unlike in the second approach, the Hurricane links can be defined to be significantly larger, to considerably improve the performance of host-DSP communication. The Hurricane link size is limited depending on how much RAM is available in the user's computer for allocation of the Hurricane links. 38 A major drawback to using the first approach is that if a DMA algorithm is implemented, the DMA algorithm will have to be executed even for a single data word transfer. This introduces excessive overhead and considerably increases the execution time of the processes, thus reducing the sampling frequency at which the processes can be executed. 3.2.1.4 Final solution to the implementation of the host-DSP communication To make host-DSP communication satisfactory, a new way of achieving data transfer was considered. That is, instead of performing data transfers item by item, the items are first stored in a buffer until a complete block of data, of a size specified by the user, is filled out. Then, the data transfer to or from the host is performed (buffer flushed). This makes possible to take complete advantage of the real strengths of Hurricane DMA. This new method of storing data before actually transferring it is totally permissible by the user applications. Inter-process communication between the host and the DSP was finally implemented using the first approach, which is undoubtedly the best choice, considering the new transfer method explained above. In this implementation, when a Hurricane link is defined to communicate a host process with a DSP process, the user is enabled to set up the following: • the buffer size for each Hurricane link, which determines how many items to store before the DMA transfer takes place; • the size of each Hurricane link in multiples of the buffer size ("n" times the buffer size) and, • the total size of the host RAM, which could be in the order of Megabytes, to be allocated for the Hurricane links. • The definition of a Hurricane link in the user script file is as follows: Link link name (HurrWr(5), Node_A, buffered, 1000, 5) In this line, HurrWr refers to a Hurricane link to which a DSP process writes. It is important to remember that links are unidirectional (refer to section 2.1.2), and therefore, two different links should be defined for reading and writing. For example, a HurrRd definition refers to a Hurricane link from which a DSP process reads. Next, the parameter inside the parenthesis refers to the size of the Hurricane link, which in this case is 5 times the size of the storing buffer. The last two parameters, 1000 and 5, refer to the buffer and item sizes respectively. 
Finally, the word "buffered" refers to one of the four different kinds of links that can be created (refer to section 2.1.2), and Node_A means that the Hurricane link connects a host process with a DSP process running on node A. The total size of the host RAM to be allocated for the Hurricane links is specified in a startup file. It is still possible for the user, if the application demands it, to perform transfers between the host RAM and the DSP processes item by item, by simply defining the right parameters in the link definition. For example, the following line specifies a Hurricane link size of 1000 and a buffer size of 1, which means that transfers to the host RAM will be performed item by item:

Link link name (HurrWr(1000), Node_A, buffered, 1, 5);

Tables 3.4 and 3.5 illustrate the performance of the host-DSP communication, showing an estimated number of samples lost for different buffer sizes and Hurricane link sizes. The results were obtained by having the host read items from a Hurricane link and log the data in a host file. On the DSP side of ORTS, a process is assumed to be writing periodically to the Hurricane link at a frequency of either 5000 or 10000 Hz.

No. transfers   Item size   Buffer size   Hurr. link size (n x buffer)   Lost (DMA transfer time)   Lost (lagging host)   Total samples lost
100,000         3           100           1                              5 each 100                 84,269                89,264
                                          3                                                         70,555                75,550
                                          5                                                         56,187                61,182
                                          10                                                        40,415                45,410
100,000         3           300           1                              15 each 300                64,918                69,903
                                          5                                                         22,800                27,785
                                          10                                                         3,847                 8,832
100,000         3           500           5                              24 each 500                 2,428                 7,204
                                          10                                                             0                 4,776
100,000         3           1000          2                              48 each 1000                9,444                14,196
                                          3                                                          3,250                 8,002
                                          5                                                              0                 4,752
100,000         3           3000          1                              142 each 3000               4,977                 9,568
                                          2                                                              0                 4,591
100,000         3           5000          1                              234 each 5000                 507                 4,953
                                          2                                                              0                 4,446

Table 3.4 Host-DSP performance for the first approach at a sampling rate of 10 kHz

No. transfers   Item size   Buffer size   Hurr. link size (n x buffer)   Lost (DMA transfer time)   Lost (lagging host)   Total samples lost
100,000         3           100           5                              3 each 100                 52,821                55,818
                                          10                                                        16,910                19,907
100,000         3           300           5                              6 each 300                  5,130                 7,128
                                          10                                                             0                 1,998
100,000         3           500           3                              11 each 500                 3,233                 5,422
                                          5                                                              0                 2,189
100,000         3           1000          2                              24 each 1000                  538                 2,914
                                          3                                                              0                 2,376
100,000         3           3000          1                              68 each 3000                  916                 3,114
                                          2                                                              0                 2,198
100,000         3           5000          1                              116 each 5000                   0                 2,204

Table 3.5 Host-DSP performance for the first approach at a sampling rate of 5 kHz

Tables 3.4 and 3.5 reveal two kinds of sample losses: those due to a lagging host and those due to the DMA function's transfer time. Sample losses due to a lagging host are caused when the host experiences lag while reading the data from the Hurricane link and the DSP process has no space in the link to write new data without overwriting unread data, which then forces it to either sleep or skip the sample that was to be written next. The second kind of sample loss, due to the DMA function's transfer time, is the result of having one buffer for both storing samples and performing the DMA transfers to the host RAM. As explained in the first approach, section 3.2.1.1, data to be transferred to or from the host RAM through the Hurricane DMA must hold a location in SSRAM. The buffer used to store the samples in the DSP is the same SSRAM location used to perform the DMA transfer. This means that while the DMA transfer is taking place, it is not possible to share the buffer with the DSP process to start storing more samples. Therefore, samples are lost between DMA transfers. The larger the buffer, the more samples are lost.
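To illustrate where this second kind of loss comes from, the sketch below shows a simplified single-buffer write path for a Hurricane link, following the store-and-flush scheme described in section 3.2.1.4. The structure fields and the hurr_dma_busy()/hurr_dma_start() calls are hypothetical stand-ins for the board-specific Hurricane DMA interface, not the actual ORTS implementation (which is covered in Chapter 5).

    #include <string.h>

    /* Hypothetical stand-ins for the board-specific Hurricane DMA interface. */
    extern int  hurr_dma_busy(void);
    extern void hurr_dma_start(void *dst, const void *src, unsigned int bytes);

    typedef struct {
        double       *buffer;    /* staging buffer in SSRAM                          */
        unsigned int  bufItems;  /* user-defined buffer size, in items               */
        unsigned int  itemSize;  /* 64-bit words per item                            */
        unsigned int  count;     /* items currently stored in the staging buffer     */
        void         *hostDest;  /* next free block of the Hurricane link (host RAM) */
    } HurrLink;

    /* Store one item; flush the whole block with a single Hurricane DMA once
     * the staging buffer is full.  Returns 0 on success, -1 if the sample is
     * lost.                                                                   */
    int hurr_link_write(HurrLink *lnk, const double *item, unsigned int *samples_lost)
    {
        if (hurr_dma_busy()) {
            /* The single SSRAM buffer is still being flushed to the host, so
             * this sample cannot be stored: it is counted among the losses
             * "due to the DMA's transfer time" in Tables 3.4 and 3.5.         */
            (*samples_lost)++;
            return -1;
        }

        memcpy(&lnk->buffer[lnk->count * lnk->itemSize], item,
               lnk->itemSize * sizeof(double));
        lnk->count++;

        if (lnk->count == lnk->bufItems) {
            hurr_dma_start(lnk->hostDest, lnk->buffer,
                           lnk->bufItems * lnk->itemSize * sizeof(double));
            lnk->count = 0;
        }
        return 0;
    }

With a second (ping-pong) staging buffer, storing could continue while the previous block is still in flight, which is precisely the remedy suggested as future work in section 6.2.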
Sample loss due to a lagging host can be reduced by increasing the size of the storing buffer and/or by increasing the size of the Hurricane link in the host RAM. This way, for sampling frequencies as high as 10 kHz, sample losses can be reduced to as low as zero. There is no rule for the efficient handling of the size of the storing buffer and the size of the Hurricane link; it depends on how the user makes use of the memory and how much memory s/he has left for increasing the buffer and link sizes. An important advantage is that the host RAM allocated for Hurricane links can be sufficiently large, in the order of megabytes, to compensate for any lack of memory in the SSRAM location, the total size of which is relatively small. If the buffer size is defined to be large, sample loss due to the DMA function's transfer time cannot be eliminated, unless a second buffer is used for storing data (see section 6.2). By implementing inter-process host-DSP communication with the first approach and by storing data before performing the DMA transfers, the performance of the host-DSP communication dramatically improves compared with the performance of the first proposed solution. Hurricane links are definitely the best choice for applications that require periodic transferring of large blocks of data to and from the host. With this new solution and the recommendations for future work explained in section 6.2, the total number of samples lost can be reduced to zero even for high sampling frequencies. As explained above, host-DSP inter-process communication was implemented using the first approach, but for host-DSP system-level communication, which is performed through Control links, a different approach is implemented. Unlike host and DSP processes, which can periodically interchange large amounts of data at high frequencies, system-level communication between the host and the DSP is sporadic, and occurs mainly on startup. Except for the Print link, which transfers messages of different lengths, the Command and Reply links exchange data of just a single word. In this case, sporadic single-item transfers are more appropriately handled by PC links using the second approach than by a DMA mechanism designed for burst transfers. Sporadic or aperiodic tasks, such as DSP messages to the console, are also appropriately handled by PC links. In this way, the DMA controller is always kept available for processes that fully take advantage of such mechanisms.

3.2.1.5 Hurricane DMA tradeoff

It was concluded in subsection 3.2.1.4 that Hurricane DMA is the most convenient way to connect periodic processes between host and DSP in Daytona; however, Hurricane DMA, like any other DMA controller, presents an important tradeoff that should be taken into consideration. That is, performing reliable transfers of large amounts of data between the host and the DSP while keeping sample losses to a minimum (or to zero with a double buffer) comes at a cost. On one hand, a DMA controller is an excellent mechanism for transferring large amounts of data without CPU intervention; on the other hand, data transfers are performed by stealing cycles from the CPU. This cycle-stealing operation retards the progress of the executing program and extends its execution time [7]. The amount of data that can be transferred between the host and the DSP through the Hurricane links can range from a few words to very large blocks of data.
For example, take the following: let us assume the "item-size" of an item to be 4 (refer to section 3.2.1.3). If we consider a block of 100 items, the Hurricane DMA will be transferring 400 words of 64 bits each, or 800 words of 32 bits each. If we consider a block of 2,000 items, the Hurricane DMA will be transferring 8,000 words of 64 bits each, or 16,000 words of 32 bits each. As can be observed, the amount of data that the Hurricane DMA can be asked to transfer by an application is very variable, ranging from a few hundred words or less to thousands of words. Therefore, it is very important to know in advance the behavior of the Hurricane DMA controller to give some predictability to the system. "Predictability, not speed, is the foremost goal in real-time-system design" [6]. For this purpose, it is important to know how the Hurricane makes use of the DSP bus to perform DMA transfers [41]. The Hurricane's utilization of the DSP bus is controlled by a so-called DSP bus bandwidth timer, which sets an upper limit on the time that the Hurricane can use the bus. Whenever the Hurricane owns the DSP bus, the timer starts counting up to a timer compare value. When the timer reaches that value, the Hurricane is forced to release the bus, and the timer begins to count down. The Hurricane does not reacquire the bus until the counter has reached zero. This operation mode cannot be classified as either of the two well-known DMA controller operating modes, burst mode and cycle-stealing mode. In burst mode, the DMA controller becomes the bus master when the bus is free, and holds the bus until all data is completely transferred. With the DSP bus bandwidth timer, the Hurricane also becomes a bus master, but only for a fixed amount of time, and when the time is up, it is forced to release the bus even if the transfer is not complete. In cycle-stealing mode, the DMA controller transfers data by stealing cycles from the executing program when the CPU is not using the bus. With the DSP bus bandwidth timer, the Hurricane acquires the bus when the counter reaches zero, whether the CPU is using the bus or not. In this way, the Hurricane DMA was calculated to be using approximately 8% of the DSP bus bandwidth in a cycle of roughly 13 µs (i.e., about 1.1 µs of bus ownership in every 13 µs). This means that the Hurricane DMA controller will not block the CPU for more than approximately 8.5 µs in every 100 µs. The latter is always true, whatever the size of the transfer is.

3.2.2 Local process communication

Processes running locally in the same DSP communicate through Local links. Local links are created by the kernel after the user defines them in the script file. In dual ORTS the kernel dynamically allocates space for these links in the SDRAM. The SSRAM is not a good choice because it already allocates space for PC and Hurricane links. Moreover, as explained in section 3.4, the user's executable code is allocated in SSRAM. On the other hand, because of the Daytona board architecture, the SDRAM has no dedicated purpose, unlike the SSRAM, and its memory capacity is 4M x 32, which is 32 times larger than the SSRAM. This makes SDRAM the best choice for allocating Local links.

3.2.3 Internode process communication

An important functionality that dual ORTS should comply with is multiprocessing. That is, dual ORTS should have the capability to support processes running on the different DSP processors that the DSP board integrates.
As a design requirement, ORTS runs an exact copy of the kernel in each processor, as a mirror system. With this configuration, each node runs independently from each other, but with a common link (dual-port RAM) that enables internode process communication. Processes in ORTS can only be statically bounded to one processor node, and the user determines the workload on each processor. Statically bounding processes helps lighten the workload of each processor and speed up the system; however, the system does not count with any tool or scheduling mechanism to help the user balance the workload properly, which can result in performance deterioration of one, or even both of the processor nodes. ORTS for Daytona can be seen as a two-single processor RTOS, rather than one-multiprocessor RTOS. The Internode link is a new type of link specifically created to communicate processes running on the same board, but in different DSP nodes. The Internode links are allocated in the shared dual port RAM, to which both nodes are tightly coupled. Internode communication is shown in Figure 3.4 Daytona differs from the single processor boards where all DSP processes were run and downloaded in the same processor. In Daytona, the user should determine which DSP process to 44 run in which of the two processor nodes. For this purpose the DSP code to be run on node A and the DSP code to run on node B are compiled and linked separately, and saved in different files so that the right code is downloaded to the right node. Figure 3.4 Internode communication [40] 3.3 Dual ORTS system operation In dual ORTS, the system core operation, explained in section 2.2, remains the same; however, due to the differences in hardware architecture between platforms, some of the software structures are modified to comply with multiprocessing and with the proposed implementations discussed in section 3.2. dual ORTS system operation is explained using Figure 3.5 as a model. Step l . In dual ORTS, the host application and the real-time kernel in both processor nodes need to communicate. Therefore, different paths of communication should exist between the host and processor node A and between the host and processor node B. For this purpose, on startup, two 45 set of control links are created. The Generic DSP driver is now modified to handle the creation of control links for both nodes. The Generic DSP driver creates a control header for each link and allocates the Control links in the SSRAM of the corresponding processor node, as proposed in section 3.2.1.4. A new DSP Device Driver is developed to enable communication to either of the processor nodes to which it is specified. Finally, host RAM is allocated for the Hurricane links. A different block of memory is allocated for each of the processor nodes. The address of each block is then stored in a memory location in the SSRAM of each processor node, respectively. HOST DSP NODE G E N E R I C DSP DRIVER Create: A/B A/B P C H U R R INTERNODE LINKS LINKS LINKS T V N O D E A/B SCRIPT-FILE I N T E R P R E T E R 4F INITIATE IPC / H U R R LINKS A /B C R E A T E L O C A L LINKS C R E A T E INT. N O D E LINKS N O D E A /B N O D E A/B INITIATE INT. N O D E LINKS C R E A T E D S P P R O C . R U N D S P P R O C . 
N O D E A /B N O D E P J N O D E A /B J \A/B R E M O T E D S P C O N T R O L I N T E R F A C E >A/B D S P S S R A M D S P D P R A M 3 D S P C O D E RUNNING C R E A T E C O M M A N D M A N A G E R R U N C O M M A N D M A N A G E R INITIALIZE I N T E R P R O C E S S O R LINKS S Y N C H & C O M M S S R A M r Figure 3.5 Dual ORTS system level communication 46 Step2 The host downloads the real-time kernel and the user processes of each node to the corresponding processor through the DSP device driver. The code in each node automatically starts running and a series of handshakes with the host application takes place. On the DSP side, in each node, the Command Manager is created and the control links are initialized. Just after the initialization of the Control links, the Command Manager in each processor node loops waiting for the host commands to arrive. Step3 At this point the system is ready to receive and execute the user script file. A new script file format is implemented that allows the user to determine the node in which that process is to be executed. The node's specification (i.e., Node_A or Node_B) is a mandatory parameter to be introduced by the user. In the same way, a node's parameters (i.e., Node_A, Node_B, or Internode) need to be defined in the link definition to indicate whether the link is to be allocated in node A, node B, or between nodes in the DPRAM (Internode). Links can be defined in dual ORTS as local links, PC links, or Hurricane links. The script file interpreter, on the other hand, needs to be able to handle these new parameters and provide the Generic DSP Driver and the Remote DSP Control interface with the correct node's information so that they can distribute the right instruction or command to the right processor node. The script file interpreter calls the Generic DSP driver and/or the Remote DSP control Interface, depending on the kind of definition encountered in the script file. The script file Interpreter directs every PC links, Internode link and Hurricane link found in the script file to the Generic DSP driver, which is now responsible for the creation of these three links. PC links, Internode links and Hurricane links are created in the same way except that PC links are allocated in the SSRAM, Internode links in the DPRAM and Hurricane links in the host RAM. Contrary to the PC and Internode links control header, which is included in the first addresses of each link, the Hurricane links control header is included not in the host RAM where the links are located, but in the SSRAM where the storing buffers are allocated. Immediately after a PC, Internode or Hurricane link is created, the script file interpreter calls the Remote DSP Control Interface to command the CommandManager, in the corresponding processor node, to initialize it. On the DSP side of ORTS, the kernel takes advantage of the existing function used to initialize PC links to initialize Internode and Hurricane links as well. A link 47 identifier is all that is needed to initialize the links in either the SSRAM for PC links or in the DPRAM for Internode links. In the case of Hurricane links, the storing buffer allocated in the SSRAM is initialized. When a Local link definition is found, the interpreter directly calls the Remote DSP Control Interface to command the CommandManager in the DSP side to create the link. 
Local links are directly created by the DSP, which allocates space for the specified link in the SDRAM of the corresponding processor node and creates a list with control information for the future use of the link. For the creation of DSP processes, the interpreter makes a call to the Remote DSP Control Interface to command the corresponding processor node to create them. After all DSP processes are created, a second command is sent to the CommandManager in both nodes to run the processes. PC processes, on the other hand, are locally created and run by the host application. At the beginning of this section it was explained that some software structures were modified due to the differences in hardware architecture between the single processor boards and Daytona. Some of these software modifications involve the following: In the DSP side of ORTS, • the processor dependent operations, such as timers, interrupts handlers and memory maps; • all the communication primitives, which include the creation and initialization of links, and system calls for communication between links; and in the host side of ORTS, • the DSP device driver and communication primitives ; • the Generic DSP Device Driver; • the Remote DSP Control Interface and; • the script file interpreter. Dual ORTS is user-hardware-independent, which means that the user can use the new platform without having to program hardware-specific code. The user's applications (processes) that have been used since the first ORTS version can be used in dual ORTS without any change. All kind of processes, which include aperiodic processes, periodic processes, group processes (which run several processes consecutively without synchronization among them during its time slice), and processes that run in the fast cyclic executive mode (disabling preemption), can be handled in dual ORTS. 48 3.4 Memory distribution ORTS real-time operating system has a size of 27K bytes, out of which, 13K bytes are the kernel itself. Runtime-support functions (RTS functions) occupy approximately 27K bytes, which equals the size of the whole real-time operating system. Placing RTS functions in off-chip memory has the advantage of saving valuable on chip memory, however it comes at a cost; the RTS functions will run much more slowly, and that is not acceptable for a real-time application. On the other hand, approximately 41K bytes of executable user code are already being programmed, and it is expected to gradually increase. There is no way that all the code, including the real-time kernel, RTS functions, and the user's code, can fit in the internal program memory of the TMS320C67 DSP, which consists of 64KB. The approach proposed for memory distribution is to keep the real-time operating system and the RTS in on-chip memory, and place the user's code in SSRAM, which has the fastest throughput rates of all other memories in Daytona board. As well as the user code, which is allocated in SSRAM, some space needs to be reserved in this memory for allocating the PC and Hurricane links buffers. In the "linker command file", the first 1C000h memory locations, which correspond to 112KB, are specifically reserved for user executable code. The following 64000h memory locations, or 400KB, are reserved for PC and Hurricane links' buffers specific purposes. This memory distribution can be modified, depending on the need for space of the PC and Hurricane links buffers and the user applications. With regard to SDRAM, there are 16MB (2M x 64) of available space. 
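As a minimal illustration of how Local links end up in SDRAM, the sketch below allocates a link's data block with malloc; because the ".sysmem" heap is linked into SDRAM (as explained next), the allocation automatically lands there. The structure and function are illustrative only, not the actual ORTS kernel code.

    #include <stdlib.h>

    /* Illustrative only: the real ORTS kernel structures differ.  Since the
     * ".sysmem" heap is linked into SDRAM, malloc() places both allocations
     * below in SDRAM.                                                        */
    typedef struct {
        double       *data;      /* circular buffer for the link items        */
        unsigned int  numItems;  /* user-defined link size, in items          */
        unsigned int  itemSize;  /* 64-bit words per item                     */
        /* counting semaphores for producer/consumer synchronization omitted  */
    } LocalLink;

    LocalLink *create_local_link(unsigned int numItems, unsigned int itemSize)
    {
        LocalLink *lnk = malloc(sizeof *lnk);
        if (lnk == NULL)
            return NULL;

        lnk->data = malloc(numItems * itemSize * sizeof(double));
        if (lnk->data == NULL) {
            free(lnk);
            return NULL;
        }
        lnk->numItems = numItems;
        lnk->itemSize = itemSize;
        return lnk;
    }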
Since every Local link is dynamically allocated, and the approach is to allocate them in SDRAM, the ".sysmem" section in the "linker command file" is linked to SDRAM and is given a heap size of 256K bytes. The user stack is dynamically allocated too, and since the run-time-support function malloc allocates from a global pool or heap, which is defined by the ".sysmem" section, the user stack is allocated in SDRAM as well.

Chapter 4 Process Periodicity Tests

Very interesting results appeared while performing tests on dual ORTS. Under certain circumstances, the periodicity at which a process is put to run does not match the sampling frequency defined for that process. Of great concern are those cases in which a process is periodically run at a higher frequency than indicated. This strongly suggests an inadequate operation of the periodic wake up times of the system. The first subsection describes how the code works for generating periodic processes and how timing is measured. The next two subsections present the results of the tests performed on periodic processes. These tests focus on evaluating the periodicity at which a given process runs, for a given sampling frequency. The last subsection describes how periodic wake up times and context switching work in ORTS and the flaws found in that implementation, and shows graphical representations of how those flaws affect the performance of the system.

4.1 Periodic processes

Periodic processes have a user-defined frequency. In order for a process to run periodically, the system function WaitPeriodicWakeUp must be called at the end of the body of every periodic process. This system function puts the process in the sleeping list until its period has elapsed. Once the period is completed, the scheduler puts the process back in its corresponding priority_level list (ready-to-run list), where it waits to be scheduled. The structure of a periodic process is shown next:

void PeriodicProcessName (void)
{
    while (1) {
        /* body of the process */
        WaitPeriodicWakeUp();
    }
}

The wake up time for periodic processes is explained as follows: every periodic process is given two values that are used for periodic wake ups, the "WakeUpPeriod" and the "WakeUpTime". The "WakeUpPeriod" of a process defines the number of time-quantums, or ticks (the basic unit of time), that have to elapse before waking up the process. The "WakeUpTime" of a process records the exact time the process should be woken up and put to run. The granularity of the "WakeUpTime" is defined by the granularity of the time_quantum. The time-quantum has a constant value of 1x10^-4 seconds (100 us), which is the granularity of the timer interrupt. Every time a periodic process goes to sleep (by calling the WaitPeriodicWakeUp system function), its "WakeUpTime" is computed from the "WakeUpPeriod" and the previous "WakeUpTime" of that process:

WakeUpTime(t+1) = WakeUpTime(t) + WakeUpPeriod

In the scheduler, the processes whose "WakeUpTime" is less than or equal to the current time are removed from the sleeping list and put back in their priority_level lists, where they wait to be scheduled. Processes that are put in the ready-to-run list when their WakeUpTime is less than the current time have already missed their deadlines.

4.1.1 Measuring execution time and periodicity

Periodicity tests were performed using a clock cycle counter on the DSP processor.
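The measurement procedure detailed in the following paragraphs can be pictured with the small C sketch below. It is only a sketch: ReadCycleCounter, DisplayReadout and CYCLES_PER_US are assumed names standing in for whatever cycle-counter facility and display path the DSP board actually provides, and the conversion constant simply assumes the nominal C67 core clock.

    /* Sketch of instrumenting a process body with the DSP cycle counter.       */
    /* ReadCycleCounter(), DisplayReadout() and CYCLES_PER_US are assumed       */
    /* names, not the actual ORTS or board-support API.                         */
    extern unsigned int ReadCycleCounter(void);
    extern void DisplayReadout(unsigned int microseconds);
    extern void WaitPeriodicWakeUp(void);
    #define CYCLES_PER_US 167u            /* assumed: C67 core clock of 167 MHz */

    void InstrumentedPeriodicProcess(void)
    {
        while (1) {
            unsigned int start = ReadCycleCounter();   /* first instruction of the body */

            /* body of the process */

            unsigned int stop = ReadCycleCounter();    /* last instruction of the body  */
            DisplayReadout((stop - start) / CYCLES_PER_US);

            WaitPeriodicWakeUp();
        }
    }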
The advantages of using a clock cycle counter are that measurements can be performed with a fine granularity (the C67 counter runs at the CPU clock rate of 7 ns), and that it is non-intrusive, meaning it does not disturb the execution of the software. For the tests it was necessary to measure both the execution time of the periodic process and its periodicity. To measure the execution time of a process, interrupts were disabled so that no other times or latencies were included as part of the total execution time. The cycle counter was then started at the beginning of the process (as the first instruction of the body of the process) and stopped just at the end of the process. The readout was displayed on the screen.

[Timing diagram: the counter is started at the beginning and stopped at the end of the process body, for a process running every period T]

To obtain the periodicity of a process, measurements were taken from the beginning of one period to the beginning of the following period of the process. The readouts were continuously displayed on the screen until the process was terminated.

[Timing diagram: the counter is started at the beginning of one period and stopped at the beginning of the next period, for a process running every period T]

Another technique used to measure the periodicity of a process involved an acquisition board and an oscilloscope. With this technique, a process was created to write periodically (at a defined frequency) to one of the DAC channels of the acquisition board. Sampling periods were obtained by connecting the oscilloscope to the corresponding ADC channel of the acquisition board and measuring the time between samples. The readouts of the DSP counter and the oscilloscope always matched.

4.2 Sampling frequency tests: non-logging case

The results are obtained by periodically running a DSP process at frequencies of 5,000 and 10,000 Hz. Logging data to the host is not considered; that is, neither PC links nor Hurricane links are used. Two different process execution times are considered.

Non-logging Test 1:
Time-quantum of the system: 100 us
Process frequency: 5 kHz
Process execution time: ~42 us

Process periodicity [ms]:
.200 .200 .200 .200 .200 .201 .200 .200 .200 .200
.201 .201 .200 .200 .200 .200 .201 .200 .200 .200
Table 4.1 Non-logging test 1

Non-logging Test 2:
Time-quantum: 100 us
Process frequency: 10 kHz
Process execution time: ~42 us

Process periodicity [ms]:
.100 .101 .100 .100 .100 .099 .100 .100 .099 .100
.100 .101 .100 .099 .101 .100 .100 .100 .099 .100
Table 4.2 Non-logging test 2

Non-logging Test 3:
Time-quantum: 100 us
Process frequency: 5 kHz
Process execution time: ~76 us

Period between running processes [ms]:
.200 .201 .201 .201 .201 .200 .199 .200 .200 .200
.201 .200 .201 .200 .200 .201 .200 .200 .200 .200
Table 4.3 Non-logging test 3

Non-logging Test 4:
Time-quantum: 100 us
Process frequency: 10 kHz
Process execution time: ~76 us

Period between running processes [ms]:
.093 .093 .094 .089 .094 .092 .093 .093 .092 .093
.093 .093 .089 .093 .094 .094 .093 .199 .094 .094
Table 4.4 Non-logging test 4

4.2.1 Observations

From these results, observe that the process whose execution time is 42 us can run periodically at 5 kHz and 10 kHz with minimal period variation (tables 4.1 and 4.2). On the other hand, the process whose execution time is 76 us can run periodically at 5 kHz with minimal period variation, as expected (table 4.3), but it runs at a higher frequency (approximately 8% higher) for a sampling frequency of 10 kHz (table 4.4).
Moreover, from table 4.4 we can also observe one sample whose period is roughly twice the nominal one, that is, a sample taken at half the sampling frequency.

4.3 Sampling frequency tests: data logging case

The results are obtained by periodically running a DSP process at frequencies of 5,000 and 10,000 Hz. Data is logged to the host RAM through Hurricane links. Two different process execution times are considered.

Data-logging Test 1:
Time-quantum: 100 us
Process frequency: 5 kHz
Process execution time: ~40 us

Period between running processes [ms]:
.199 .199 .200 .200 .199 .200 .200 .200 .200 .199
.200 .200 .200 .200 .200 .200 .200 .200 .199 .199
Table 4.5 Data-logging test 1

Data-logging Test 2:
Time-quantum: 100 us
Process frequency: 10 kHz
Process execution time: ~40 us

Period between running processes [ms]:
.099 .100 .100 .099 .100 .099 .099 .100 .100 .100
.100 .100 .099 .099 .100 .100 .101 .100 .099 .100
Table 4.6 Data-logging test 2

Data-logging Test 3:
Time-quantum: 100 us
Process frequency: 5 kHz
Process execution time: ~73 us

Period between running processes [ms]:
.199 .200 .200 .199 .199 .199 .199 .200 .200 .200
.199 .199 .200 .199 .200 .200 .200 .200 .200 .200
Table 4.7 Data-logging test 3

Data-logging Test 4:
Time-quantum: 100 us
Process frequency: 10 kHz
Process execution time: ~73 us

Period between running processes [ms]:
.097 .096 .096 .097 .097 .097 .096 .096 .199 .097
.098 .089 .096 .096 .097 .096 .097 .097 .097 .097
Table 4.8 Data-logging test 4

4.3.1 Observations

Tables 4.5 to 4.8 show the same behaviour as the previous tests; that is, the process whose execution time is smaller, 40 us in this case, can run periodically at 5 kHz and 10 kHz (tables 4.5 and 4.6). The process whose execution time is larger, 73 us, can run periodically at 5 kHz (table 4.7); however, it runs 4% faster when the sampling frequency is 10 kHz (table 4.8). A sample at a frequency of 5 kHz can also be observed in table 4.8.

4.4 Interpreting the results

When the execution time of a process gets close to the sampling period, one would expect any variation in the execution time to exceed that period, causing the process to miss its deadline, that is, to run at a slower frequency. The results of the previous tests show the opposite: as the execution time of a process gets closer to the sampling period, the observed frequency increases. This effect appears in both the non-logging tests and the data-logging tests. A process that runs ahead of time clearly indicates a failure in the handling of the periodic wake up times of the system.

4.5 Problems of implementation in ORTS

The framework used to generate periodic processes in ORTS is reliable; however, the wake up times of periodic processes and the hardware and software preemption are not handled properly. This is why we observe the unexpected results shown in sections 4.2 and 4.3, in which processes either run at faster frequencies and/or are delayed by one time quantum. These two problems are explained next.

4.5.1 Periodic processes timing flaws

In ORTS, when a periodic process is being created, it is given an initial "WakeUpTime" value before the process is fully created and put in the ready-to-run list. By the time the process is completely created and in the ready-to-run list waiting to be scheduled, its WakeUpTime is already less than the current time, which means that the process has already missed its first deadline.
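The mechanics behind this can be sketched in a few lines of C. The sketch below uses the names introduced in section 4.1 (WakeUpTime, WakeUpPeriod) plus assumed helper names (PCB, Current_Time as the tick counter, PutInSleepingList, Yield); it is not the actual ORTS code, only an illustration of why a stale initial value propagates.

    /* Sketch of the wake-up bookkeeping (names partly assumed, not ORTS source). */
    typedef struct {
        unsigned int WakeUpPeriod;   /* period, in time-quantums                  */
        unsigned int WakeUpTime;     /* absolute tick at which to wake the process */
    } PCB;

    extern volatile unsigned int Current_Time;   /* tick (time-quantum) counter   */
    extern void PutInSleepingList(PCB *p);
    extern void Yield(void);

    void WaitPeriodicWakeUp_sketch(PCB *p)
    {
        /* The next wake-up is computed from the PREVIOUS WakeUpTime, not from   */
        /* "now". If the initial value was assigned before the process was       */
        /* actually placed in the ready-to-run list, p->WakeUpTime can already   */
        /* be <= Current_Time here, so the scheduler releases the process again  */
        /* immediately: the stale value propagates from period to period.        */
        p->WakeUpTime = p->WakeUpTime + p->WakeUpPeriod;
        PutInSleepingList(p);
        Yield();
    }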
In this case, the just-created process is already holding a stale "WakeUpTime" value. After the process is put to run, it may happen that the newly calculated "WakeUpTime" is still less than the current time by many time quantums (remember that the WakeUpTime is calculated from the latest WakeUpTime value, which in this case is stale). Therefore, just after the process executes and is put in the sleeping list, since its WakeUpTime is less than the current time, the process is immediately put back in the ready-to-run list even before its period has completed. This causes consecutive wake ups to be performed at faster frequencies. Processes whose execution times are close to the sampling period are the most affected by this problem. In the following, we illustrate the effect of this implementation problem considering three different cases:
• Case A: a theoretical process with a defined period of 100 us (10 kHz) and a large execution time of ~72 us,
• Case B: a theoretical process with a defined period of 100 us (10 kHz) and a small execution time of ~40 us,
• Case C: a theoretical process with a defined period of 200 us (5 kHz) and a large execution time of ~72 us.

Case A

In Figure 4.1 we assume that the new process is given its initial "WakeUpTime" value sometime during time_quantum 3 (initial WUT = current time = 3); however, it is not until time_quantum 4 that it is finally created and put to run. By the time the process is put to run in time_quantum 4, it has already missed its first deadline, specified at time_quantum 3.

[Figure 4.1. Case A: large execution time (10 kHz sampling frequency). Legend: hardware context switching (~4 us), periodic process execution time (~72 us), software context switching (~8 us), idle process]

After the process is run for the first time, the new WakeUpTime is calculated with a stale WUT value (WUT = 3 + 1). Since the process completes execution before time_quantum 4 has elapsed, a software context switch still takes place during time_quantum 4. Since the new WakeUpTime is equal to 4, the process is put to run once again even before its period has completed. In this scenario the process runs at a frequency higher than that defined by the user.

Case B

In Figure 4.2 we consider the same case as explained above, but with a process execution time of ~40 us. With a smaller execution time, the WUT can get aligned to the current time-quantum count within the first or second context switch. In this case, the process runs twice in a single time quantum, reducing the sampling period from 100 us to 52 and 48 us. The WUT then gets aligned to the current time-quantum count by the second context switch; from then on, the process runs periodically at 10 kHz.

[Figure 4.2. Case B: small execution time (10 kHz sampling frequency). Legend: hardware context switching (~4 us), periodic process execution time (~40 us), software context switching (~8 us), idle process]

Case C

In Figure 4.3 we show a process with a defined frequency of 5 kHz. In this case, the process is supposed to run every two time quantums; however, when the process goes to sleep at the end of its first period, the "WakeUpTime" (WUT) accounts for only one time_quantum instead of two.
The scheduler then puts an idle process to run to complete the elapsing time-quantum; however, immediately after the time_quantum elapses, the process is put to run again, causing the first sample to be taken at twice the process sampling frequency. After the process runs for the second time, the WUT gets aligned with the current time-quantum, and the process then runs periodically at its sampling period of 200 us.

[Figure 4.3. Case C: large execution time (5 kHz sampling frequency). Legend: hardware context switching (~4 us), periodic process execution time (~72 us), software context switching (~8 us), idle process]

To fix the problem, the initial "WakeUpTime" value should first be set to the time-quantum at which the process is first put in the running list. A very simple solution to control any stale WakeUpTime that might occur during the course of the program is to make every WakeUpTime value that is less than the current time equal to it. This can be done directly inside the scheduler when processes are being woken up from the sleeping list. Refreshing a stale WakeUpTime value to the current time prevents cumulative delays.

4.5.2 Timer interrupt handler flaw

A second problem found in ORTS concerns the timer interrupt handler itself. Context switching in ORTS can be hardware interrupted (by the timer interrupt) or software interrupted. Hardware interrupts increment a time-quantum counter (tick counter), which indicates how many timer interrupts (or time-quantums) have elapsed since the initialization of the system. Software interrupts make use of a software interrupt flag, which is set to "1", to differentiate between the two kinds of interrupts. When the software interrupt flag is set, the time-quantum counter is not incremented. In this implementation, software interrupts (software preemption) use the same interrupt service routine as hardware interrupts. Moreover, that same interrupt service routine is the one that handles the software interrupt flag mentioned above. A representation of the code is shown next:

void Yield (void)
{
    Soft_Preempt = 1;                 /* software preemption flag */
    PreemptionInterruptHandler();
}

void PreemptionInterruptHandler (void)
{
    Push all registers;
    Statements;
    if (Soft_Preempt)                 /* flag != 0: software interrupt */
        Soft_Preempt = 0;             /* clear the flag */
    else
        Current_Time++;               /* tick (time-quantum) counter, serviced for the timer interrupt */
    Statements;
    Pop all registers;
}

The main problem of sharing the same ISR between software and hardware interrupts arises when software and hardware interrupts overlap, and the software interrupt flag is set to "1" just before the hardware interrupt takes place. In this situation, when the interrupt handler is executed, the time-quantum counter is not incremented, thus skipping the count of the current time-quantum. This scenario is most likely to happen when software interrupts take place close to the end of the elapsing time-quantum. Skipping a time-quantum count causes the processes to run one time-quantum later than expected. This effect is seen as lower sampling frequencies. The next figure shows this effect. Figure 4.4 shows a periodic process with a user-defined frequency of 10 kHz. On time_quantum 6, software preemption and hardware preemption overlap.
Since software preemption occurs first, it sets the interrupt flag, and the time-quantum counter is not incremented even though time_quantum 6 has already elapsed. One time quantum is effectively lost, and the process runs one time_quantum later.

[Figure 4.4. Process running one time-quantum later: software and hardware preemption overlap at time_quantum 6, the tick count is skipped, and the following period stretches to 200 us]

The right approach to handle this problem is to call different interrupt routines for software and hardware interrupts.

4.6 Final Observations

The graphical representations match the numerical results obtained in sections 4.2 and 4.3 exactly, which confirms the existence of the system flaws in ORTS. Processes whose execution time is small (approximately half the sampling period) can achieve sampling frequencies of 5 and 10 kHz without difficulty. In the graphical representations, a few samples at higher frequencies appear when the process is put to run for the first time, and although these samples were not captured in the numerical results, they exist and are real. As the execution time gets closer to the sampling period, both the numerical results and the graphical representations show that a process can still achieve a sampling frequency of 5 kHz. However, for a 10 kHz sampling frequency, the process is forced to run at higher frequencies, thus rarely or never achieving its nominal period. When the execution time of a process gets too close to the end of a time-quantum, a software context switch and a hardware context switch may overlap. This leads to a missed time-quantum count, which retards the execution of the process by one time-quantum.

Chapter 5 C67 DMA

This chapter explains the implementation of the C67 DMA mechanism for performing fast data transfers between links. The first section is an introduction to the C67 DMA and an explanation of why the C67 DMA can be used to enhance and improve data transfers in ORTS. The following sections describe the implementation details of two different DMA algorithms used for mediating access to the DMA controller among the processes, and explain the main problem encountered when implementing a DMA algorithm, along with the proposed solution to that problem.

5.1 C67 DMA controller

The TMS320C67 DSP provides a direct memory access (DMA) controller. The DMA controller transfers data between regions in the memory map without intervention by the CPU. It allows movement of data to and from internal memory, internal peripherals, or external devices to occur in the background of CPU operation. The DMA controller has four independent programmable channels, allowing four different contexts for DMA operation [36]. The Daytona's manufacturer supplies a high-level function that uses the C67 DMA engine to perform high-speed transfers of data between the C6x internal data RAM and the following [42]:
• SSRAM
• SDRAM
• DPRAM
• DSP-LINK3
• PEM site

DMA had never been implemented in any of the previous ORTS versions. The C67 DMA is implemented for the first time in dual ORTS to enhance the way system calls handle burst data transfers between links. In the single-processor ORTS versions, data transfers among links are performed by calling the run-time-support function "memcpy", which moves a specified number of characters from a source location to a destination location.
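To make the memcpy-based transfer concrete, a minimal sketch follows. The link structure and its field names are assumed here for illustration only; the actual ORTS link layout is described elsewhere in this thesis.

    #include <string.h>    /* memcpy is the standard C / TI run-time-support function */

    /* Hypothetical link layout; field names are assumed for illustration. */
    typedef struct {
        char *buf;              /* destination buffer of the link */
        unsigned int ItemSize;  /* item length in bytes           */
    } Link;

    /* Copy one item into the link buffer with memcpy, as the single-processor  */
    /* ORTS versions do; the CPU is occupied for the whole duration of the copy. */
    void WriteItemWithMemcpy(Link *linkX, const void *data)
    {
        memcpy(linkX->buf, data, linkX->ItemSize);
    }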
The "memcpy" function has been used in ORTS for both single word transfers and large blocks of data transfers, regardless of the time the transfer takes. The C67 DMA is a mechanism for high-speed data transfers, therefore, it is important to determine how convenient or not, the "memcpy" function is to keep as the main transfer mechanism between links. Tables 5.1 and 5.2 show data transfers of variable lengths performed with the "memcpy" function and with the C67 DMA mechanism. For these benchmarks, data is transferred from a location in the SDRAM to a location in the SSRAM. 64 Transfer Length (64 bit word) 1 10 20 30 40 50 Cycles 0x84 0x25C 0x52C 0x66C 0x870 OxABC Mb/sec 9.6523 21.0945 19.2463 23.2501 23.5456 23.1824 Time[us] .79042 3.61677 7.92281 9.8443 12.9341 16.4550 Transfer Length (64 bit word) 60 70 80 90 100 1000 Cycles 0xCC8 0xE8C OxIODC 0x1310 Ox14E4 0xCF7C Mb/sec 23.3638 23.9494 23.61647 23.4979 23.8240 23.9872 Time[us] 19.5928 22.2994 25.8443 29.2215 32.0239 318.0598 Table 5.1 "memcpy" function Transfer rates SDRAM-SSRAM Transfer Length (64 bit word) 1 10 20 30 40 50 Cycles 0x110 0x148 0x1A8 0x1 F4 0x23C 0x28C Mb/sec 4.6842 38.8447 60.0994 76.4465 89.0985 97.7077 Timers] 1.62874 1.96407 2.53892 2.99401 3.42515 3.90419 Transfer Length (64 bit word) 60 70 80 90 100 1000 Cycles 0x2EC 0x338 0x3B8 0x3D0 0x41C 0x21A4 Mb/sec 102.201 108.2374 107.0679 117.4895 121.1130 147.9457 Timefus] 4.47904 4.93413 5.70060 5.84431 6.29940 51.56886 Table 5.2 C67 DMA Transfer rates SDRAM-SSRAM From tables 5.1 and 5.2 it can be seen that the "memcpy" function is convenient for very small data block transfers, however, as the data block length increases, the C67 DMA provides a higher transfer rate than that of the "memcpy" function. Even though the user data transfers 65 actually don't exceed a 60-word block length (which could be considered a small data block transfer), in this specific case, the difference in transfer rates between the "memcpy" function and C67 DMA becomes apparent for block lengths as small as 10 words. Another important point to consider when comparing the use of the "memcpy" function with the use of the C67 DMA mechanism, is context switching. If a context switch happens to occur during a "memcpy" transfer, the transfer will be suspended and will not be resumed until the next time the process is put to run. If a context switch happens to occur during a C67 DMA transfer, the DMA will continue the transfer while the processor executes the context switch. When the time comes to resume the execution of that same process, the transfer will most probably already be completed. It is important to notice, though, that the C67 DMA could be executing the data transfer of another process when a completely different process is actually executing and requiring a DMA transfer. Although four DMA channels are available for executing four different operations, a mechanism to control access to the DMA controller among processes needs to be implemented. 5.2 DMA algorithm In order to mediate access to the shared DMA controller among processes, a so-called DMA algorithm is implemented. Two different approaches are proposed in this thesis. The first one is intended for very large data block transfers. The second one is a more general approach, the one implemented in the dual ORTS version. 
The following subsections explain what the main problem around the implementation of a DMA algorithm for ORTS is, explain how that problem is addressed, and explain the implementation of the two proposed DMA algorithms. 5.2.1 Main problem Despite transferring data to a local link, or to a PC link, or to an Internode link, all system calls to either read or write a link wait for the completion of the transfer before performing any other updates and returning to the calling function. If a context switch happens to occur during a "memcpy" transfer, the function will resume execution just after the process in question is put to run again. After the transfer is completed, the system call can proceed and return to the calling function. The main problem arises when a context switch takes place when a DMA transfer is being performed. A DMA transfer can execute during context switching and extend execution even during the running period of the next scheduled process. When the system call of a particular process resumes, it is necessary to check if the DMA transfer corresponding to that 66 process is completed, in order to return to the calling function. Three possible situations may occur: • the DMA is still transferring the running process data(DMA not available); • the DMA is transferring data of another process (DMA not available); • the DMA is not executing any transfer (DMA available). If the DMA controller is found to be available, it is certain that any data transfers initiated up to that moment has been completed. At this point the system call can return to the calling function. However, if the DMA is not available, there is no way to know simply by checking the DMA status, if the current DMA transfer still corresponds to the just resumed process, or if it corresponds to any other process. If the DMA transfer happens to correspond to any other process, that means that the DMA transfer of the resumed process (running process) has been completed. The latter is true since a transfer is only performed when the DMA has completed previous transfers and it becomes available, otherwise it is not possible to perform a new transfer. Briefly, a system call should resume and return to the calling function when the DMA transfer of a process different from the one that is running is taking place, but should wait until completion otherwise. It is unacceptable to wait for another process transfer to be completed in order to proceed. The main problem to be addressed in the next subsection is how to determine whether the DMA transfer of the just resumed process has already been completed or is still in progress when the "DMA not available" flag is read. 5.2.2 Addressing the DMA algorithm main problem This problem is addressed by using what in this thesis is called, "tracking flags". The main idea is to keep track of the process whose transfer was last to be completed by the DMA controller. Every process should hold an identifier flag that is different from all other process flags. This identifier flag is used as an index to update the state of a process transfer in the flag register. The flag register is necessary to hold the state of each transfer through its process unique identifier flag (e.g. TransferState [3]=busy). Finally, a second flag, called the "last running flag", holds the identifier flag of the process whose transfer was the last to be performed by the DMA controller. 
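The bookkeeping just introduced can be summarized in a small sketch; the update and checking rules it relies on are spelled out in the following paragraphs. All names, sizes and values below are assumed for illustration only and do not correspond to the actual ORTS sources.

    /* Sketch of the "tracking flags" bookkeeping (names assumed for illustration). */
    #define MAX_PROCESSES 16

    typedef enum { TRANSFER_IDLE = 0, TRANSFER_BUSY = 1 } TransferState_t;

    static volatile TransferState_t TransferState[MAX_PROCESSES]; /* flag register  */
    static volatile int LastRunningFlag = -1;  /* identifier flag of the process     */
                                               /* whose transfer was last started    */

    /* Executed atomically (e.g. with interrupts disabled) when the process whose   */
    /* identifier flag is myFlag is granted the DMA controller.                     */
    void TrackDmaAcquire(int myFlag)
    {
        if (LastRunningFlag >= 0)
            TransferState[LastRunningFlag] = TRANSFER_IDLE; /* previous transfer done */
        TransferState[myFlag] = TRANSFER_BUSY;              /* this one is in flight  */
        LastRunningFlag = myFlag;
    }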
When a new process initiates a DMA transfer, the transfer's state of the process held by the "last running flag", is updated to "idle" in the flag register. An idle state means that the transfer for that corresponding process has been completed. Then, the state of the current process transfer is 67 updated to "busy" in the flag register. A busy state means that the transfer is in progress. Finally, in order to keep track of which process' transfer was the last to be performed, the "last running flag" is updated with the currently running process identifier flag. Any changes in state are made in the flag register under the index of the process identifier flag, or under the index held by the "last running flag", depending on what is being updated. It is important to note that the next process to perform a DMA transfer is the one which modifies the state of the previous process transfer to "idle". When a process is put to run again, the system call resumes execution and checks both if the DMA controller is available and if the state of the transfer is held in the "idle" state. If either of these two conditions is true, the transfer has been completed. The system call can then proceed and return to the calling function. It is important to check not only the state of the transfer, but also the availability of the DMA controller. Checking both is necessary because it could happen that after the context switch no other process performs a DMA transfer, and the state of the last transfer will be kept in "busy" even after completion. The "tracking flags" mechanism is indivisible ('atomic'), in other words, the sequence of machine instructions cannot be interrupted. This mechanism is not a semaphore; that is, since the DMA controller can perform even when pre-emption occurs, it can make itself available even before the process that was first granted the DMA controller resumes, and frees it. Therefore, the DMA controller can be granted to any other process as soon as it is available, and when pre-emption takes place. We can prove that the tracking flags mechanism presented in this thesis works for all situations just by checking simple conditional structures. To explicate how the conditional structures are evaluated we present the operation of the DMA mechanism in 3 steps, and show all the possible conditions that may occur in each step, if any. 1. A process is granted the DMA controller only if the process requesting the DMA controller has the highest priority, if the DMA controller is available (the DMA controller cannot start any other transfer until the current transfer is completed), and if the transfer size is equal or greater than that indicated. This is simply achieved by using a conditional structure that checks that all conditions are met before granting the DMA controller. For this example, we assume that the transfer size is always greater than the expected size, so that condition is not evaluated. 68 priority DMA controller If [(priority==highest)&& (DMA==idle)] Action Highest Idle (1) lf[(1)&&(1)] DMA granted Highest Busy (0) lf[(1)&&(0)] DMA not granted Not highest Idle (1) lf[(0)&&(1)] DMA not granted Not highest Busy (0) If [(0)&&(0)] DMA not granted 2. When a process is granted the DMA controller, it clears the identification flag of the last process holding the DMA controller (CLEARED); this is done atomically when the DMA is first acquired. If no other process acquires the DMA, the flag of the last process remains set (SET). 3. 
When a process that was granted the DMA controller returns from preemption, for example process A, the process checks if its transfer has completed or not by checking both its identification flag, and the availability of the DMA controller through a DMA internal flag. Three possible situations may occur: • The DMA is not executing any transfer (DMA idle). This means that process A transfer has completed. This is also true for all DMA transfers that might have executed after process A, until that very moment. DMA controller Process Identification flag While [(DMA==busy)&&(flag==set)] Transfer State Idle S E T / C L E A R E D While [(0)&&(1 / 0)] Completed • The DMA is still transferring process A (DMA busy). This means that the process A transfer has not yet completed. Therefore, no other process has acquired the DMA controller after process A. DMA controller Process Identification flag While [(DMA==busy)&&(flag==set)] Transfer state Busy SET While [(1 )&&(!)] Not completed • The DMA is transferring data from another process, for example process D transfer (DMA busy). 69 This means that process A transfer has completed. This is true for process B and process C as well. DMA controller Process Identification flag While [(DMA==busy)&&(flag==set)] Transfer state Busy CLEARED While [(1)&&(0)] Completed 5.2.3 DMA algorithm first approach This first approach makes use of the four DMA channels provided by the DMA controller. Each channel is intended to handle transfers of a specific priority process. Channel 0 is designated to handle the highest priority processes' transfers, channel 1 is designated to handle the second highest priority, channel 2 the third highest priority, and channel 3 the fourth highest priority. The four highest priorities are dynamically determined at process creation. This priority assignation should be coherent with the priority assigned to the DMA channels for resource arbitration. Channel 0 should be assigned the highest priority, followed by channel 1, then channel 2, with channel 3 having the lowest priority. This way it is always true that a lower priority process will hand over the DMA control to a higher priority process when contending for the same resource [36]. Priorities lower than the fourth highest priority are handled directly with the run-time-support function "memcpy". Before any data transfer can take place, it is first checked whether the process priority is among the four highest priorities. If it is not, the transfer is directly handled by the "memcpy function". If the process priority is among the four highest priorities, then it is checked if the DMA channel that handles that priority is available to perform the transfer, if it is not, then the transfer is performed by "memcpy" instead. This first approach has the advantage of allowing four different contexts for DMA operation at the same time by using the four channels provided by the DMA controller. However, its disadvantage is the code size required to handle all four DMA channels. In order for a channel to perform a transfer, parameters and arguments specific to that channel need to be used, so different cases for each channel must be coded to handle all four channels. In this case, the code grows considerably for a relatively simple design. The code size affects the system in two different ways: first in memory space, and second in execution time. 
As explained in section 3.4, Memory Distribution, the internal program memory size where the real time operating system and the run-time-support (RTS) functions are allocated is not large, just 64KB, so it is important to optimize the usage of this space. In this first approach, the code size drawback can be worthwhile for very 70 large DMA data transfers, and the increase of code size can be acceptable. However the user's applications, as mentioned earlier in this thesis, roughly define an item size (data block size) greater than 60 words, and executing this first approach for such small data block transfers considerably increases the execution time of the processes. That is why this approach is intended only for very large data block transfers, and is not implemented as part of ORTS. 5.2.4 DMA algorithm second approach This second approach focuses exclusively on the highest priority processes. The highest priority processes are enabled to perform DMA transfers, any other process data transfers are handled directly by the "memcpy" function. Unlike in the previous approach, only the DMA channel 0 is used. A transfer is performed via DMA only if the following conditions are true: • the process' priority is zero; • the transfer length ("item-size") is greater than the minimum allowed transfer length; • the DMA channel is available for performing a new transfer. Even if a process' priority is zero, if the data block length is smaller than that specified, or the DMA is not available, the transfer is immediately handled by the "memcpy" function. C67 DMA is dynamically enabled, depending on the data block length, to transfer. There is an optimal block length per type of link that makes more convenient and faster transfers with the use of C67 DMA; that is, C67 DMA is not performed if the execution of the DMA algorithm is too costly for transferring the specified data block. This block length is related to the type of memory used for each link, and is determined through some benchmarks tests. Tables 5.3, 5.4 and 5.5 show the duration of a system call to write to a PC link, to a Local link and to an Internode link, respectively, for different block lengths of 64 bit words. The results were obtained by performing transfers with the "memcpy" function and with C67 DMA. It is assumed that the source location of the transferred data is in the SDRAM. 
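Before turning to the measurements, a minimal sketch of the selection rule just described may help. Every name and the threshold value below are assumptions made for illustration; the per-link thresholds actually used are derived from tables 5.3 to 5.5.

    #include <string.h>

    #define DMA_MIN_WORDS 25u                  /* illustrative threshold, in 64-bit words */

    extern int  CurrentProcessPriority(void);  /* assumed kernel query                    */
    extern int  DmaChannel0Available(void);    /* assumed DMA status query                */
    extern void StartDma64(void *dst, const void *src, unsigned int nWords); /* assumed  */

    void TransferItem(void *dst, const void *src, unsigned int nWords)
    {
        if (CurrentProcessPriority() == 0 &&   /* only the highest priority processes     */
            nWords >= DMA_MIN_WORDS &&         /* block long enough for DMA to pay off    */
            DmaChannel0Available()) {          /* channel 0 free to start a new transfer  */
            StartDma64(dst, src, nWords);      /* DMA completes in the background         */
        } else {
            memcpy(dst, src, nWords * 8u);     /* fall back to memcpy (length in bytes)   */
        }
    }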
Transfer length (64-bit words)   System call duration, "memcpy" function [us]   System call duration, C67 DMA [us]
1                                13.26946                                        19.23353
5                                14.89820                                        19.28144
10                               16.07186                                        19.44910
15                               17.86826                                        19.47305
20                               19.56886                                        20.04790
25                               28.33533                                        27.59281
30                               29.86826                                        27.68862
40                               33.24551                                        28.33533
50                               36.09581                                        29.14970
60                               39.56886                                        29.29341
100                              59.25749                                        38.41916
Table 5.3 System call duration to write a PC/Hurricane link; transfer SDRAM-SSRAM

Transfer length (64-bit words)   System call duration, "memcpy" function [us]   System call duration, C67 DMA [us]
1                                17.26946                                        22.94611
5                                19.47305                                        23.42515
10                               24.11976                                        23.64072
15                               28.43114                                        24.02395
20                               41.60479                                        32.43114
25                               47.16168                                        33.34132
30                               50.15569                                        32.52695
40                               58.20359                                        32.57485
50                               67.16168                                        32.83832
60                               75.95210                                        33.43713
100                              119.78443                                       43.13772
Table 5.4 System call duration to write a Local link; transfer SDRAM-SDRAM

Transfer length (64-bit words)   System call duration, "memcpy" function [us]   System call duration, C67 DMA [us]
1                                13.58084                                        19.20958
5                                15.30539                                        20.45509
10                               18.41916                                        22.08383
15                               21.26946                                        23.16168
20                               31.20958                                        32.86228
25                               33.94012                                        34.10778
30                               36.76647                                        35.90419
40                               42.15569                                        39.18563
50                               42.15569                                        41.61078
60                               53.60479                                        45.67665
100                              (not given)                                     (not given)
Table 5.5 System call duration to write an Internode link; transfer SDRAM-DPRAM

From the previous tables it can be seen that there are optimal transfer lengths beyond which the C67 DMA is more convenient than the "memcpy" function. The optimal transfer length when reading and/or writing PC links, Local links and Internode links is reached when the data block length is greater than 25, 10 and 30 words (64-bit words), respectively. This DMA algorithm therefore makes convenient use of both the "memcpy" function, for very small data block transfers, and the C67 DMA, which considerably improves transfer rates for medium to very large blocks of data. In this approach, the use of a single DMA channel helps keep the code simple, which in turn helps reduce the system call duration. By keeping the code simple, the advantages of implementing the C67 DMA mechanism together with the DMA algorithm in dual ORTS become apparent for data block transfers as small as 10 words. This second approach is developed in dual ORTS mainly to help reduce the duration of system calls to write and read links for burst data transfers for the highest priority processes.

Chapter 6 Static WCET analysis

In this chapter we present an analysis for bounding the WCET of the system calls to write data to Local and PC links for the previous ORTS version, which runs on a TI TMS320C32 processor. In this scenario, data transfers are performed exclusively via the "memcpy" function (i.e. DMA transfers are not used), and the processor does not exploit instruction parallelism. Next, we present two methods for extending the previous analysis to the current ORTS version. The first method deals with the microarchitecture of the TI TMS320C67 processor, which integrates VLIW technology and therefore complex instruction parallelism. The second method proposes a solution for bounding DMA data transfers.

6.1 WCET analysis for the C32 ORTS version

Low-level analysis

In this analysis we consider a simple processor architecture with no cache or pipeline. Therefore, the microarchitecture (low-level) analysis of the processor reduces to concluding that the execution time of each machine instruction is fixed. In this case, the execution time of a sequence of instructions is equal to the sum of the execution times of the individual instructions.
The execution time (or execution cycles) of the instructions is obtained from the reference manuals provided by the manufacturer [38].

High-level analysis

The program path (high-level) analysis performed in this thesis is based on the method proposed by Li, Malik and Wolfe [28]. In their method, the execution time of a program (or part of a program) is calculated by summing, over the basic blocks of the program, the product of the execution count of each basic block and its execution time. This is shown next:

Total execution time = sum for i = 1 to N of (x_i * c_i)

where
N = number of basic blocks in the program,
x_i = the execution count of basic block B_i,
c_i = the execution time of basic block B_i.

The execution count (x_i) of a basic block is constrained by the program structure itself (program structural constraints), and by the possible values the program variables can take (program functionality constraints). Program structural constraints are derived directly from the control flow graph (CFG) of the program. These constraints define the execution counts on the edges of the control flow graph. That is, for each node, the execution count is equal to the number of times the control flow enters the node (inflow), and it is also equal to the number of times the control flow exits the node (outflow); see Figure 6.1. The purpose of these constraints is to preserve the flow of the program.

[Figure 6.1. Control flow graph: for every node, execution count = inflow = outflow]

The program functionality constraints, on the other hand, are provided by the user. These are basically loop bounds and other path information that cannot be obtained from the program structural constraints themselves (e.g. the number of times a block is executed given a specific result or a specific case). In the analysis presented in this thesis, the WCET is obtained when the execution count of each basic block equals the number of times that basic block is executed in the worst-case scenario. The first step in this analysis is to obtain the control flow graph (CFG) of all the different parts of the kernel which directly or indirectly affect the performance of the data transfers in ORTS. Then, the number of cycles needed to execute each basic block of machine code is obtained from the assembler code by adding the execution cycles of each instruction. A basic block, as defined in [28], is a maximal sequence of consecutive instructions in which the flow of control enters at the beginning and leaves at the end without halt or possibility of branching, except at the end. Next, worst-case execution paths, infeasible paths, and execution counts of basic blocks are analyzed. Finally, preemption, blocking, and the generation of external events are taken into consideration to estimate the WCET of the transfers. The timer interrupt is the only external event that interrupts the RTOS. The different kernel sections analyzed include the following:
• Scheduler,
• Semaphores,
• Interrupt handler,
• Software interrupts,
• System call to write data to direct Local links,
• System call to write data to buffered PC links.

For each of these kernel sections we present its control flow graph. The execution time associated with each node (or block) of the CFG is represented as t_node, and corresponds to the time it takes to execute that node (basic block) in isolation; t_node is the equivalent of c_i in the total-execution-time expression above. The following analysis presents representative code (not the real code) for each kernel section; the purpose of this code is solely to support the WCET analysis, and the reader is not expected to follow it in detail.
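Before turning to the individual kernel sections, a small hypothetical example (not taken from ORTS) shows how the objective function and the constraints interact. Consider a function consisting of a loop whose body is a single basic block B1 with c1 = 20 cycles, followed by an exit block B2 with c2 = 5 cycles, and suppose the user supplies the functionality constraint that the loop iterates at most 10 times per call. The structural constraints give x2 = 1 (the exit block is entered and left exactly once per call), and the loop bound gives x1 <= 10 * x2. The WCET estimate is then the maximum of

Total execution time = x1*c1 + x2*c2 = 20*x1 + 5

subject to these constraints, which is reached at x1 = 10 and equals 205 cycles. The analyses below follow exactly this pattern, with the execution counts fixed by the worst-case paths identified for each kernel section.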
6.1.1 The scheduler

Scheduler()
A: while (sleep_list != null && sleep_list.WakeUpTime <= current_time) {
B:     pcb = sleep_list
C:     RemoveProcessFromList (sleep_list, pcb)
D:     if (pcb periodic)
E:         AddProcessBeginningList (pcb->priority_level, pcb)
       else
F:         AddProcessEndList (pcb->priority_level, pcb)
G:     pcb->State = Running
   }
H: for (i = 0; i < num_priority_levels; i++)
I:     if (priority_level[i] != null) {
J:         current_process = priority_level[i]
K:         priority_level[i] = current_process->Next
L:         return
       }
M: current_process = idle_pcb

ORTS has a round-robin, preemptive, time-sliced, prioritized scheduler. It has 8 priority levels and consists of 8 priority_level lists or queues where the processes wait to be scheduled. When the scheduler starts executing, it iterates over a sleep list and puts back in their priority_level lists all those processes whose sleep period has elapsed, if any. If the process is periodic, it is placed at the beginning of its priority_level list; otherwise, it is placed at the end of its priority_level list. This can be observed as path ABCDEFG, which is part of a 'while loop'. Finally, a process is scheduled (granted the CPU) if it is the first process on a given priority_level list and there are no other processes waiting to be scheduled on higher priority_level lists. The latter can be observed as path HIJKL, which is part of a 'for loop' that iterates from the highest priority 0 to the lowest priority 7.

Next, we present two approaches to the worst-case execution of the scheduler. The first approach is too pessimistic to be used for the analysis; the second approach is tighter, and is the one used to estimate the worst-case execution time for the data transfers in ORTS.

Pessimistic approach: The worst-case execution of the scheduler occurs when a process of the lowest priority (priority 7) is scheduled. This case is only possible if the following hold:
1. Only the lowest priority processes in the sleep list are ready to be awakened (put back in their priority_level list).
2. There are no processes waiting in higher priority_level lists (i.e. priority_level list 0 to priority_level list 6).

In this scenario, the worst-case execution of the scheduler is

WCE_Sched = (j+1)A' + jB' + (k+1)C' + D'  [cycles]   (1)

where j = number of processes of the lowest priority (i.e. all processes of priority 7), and k = 7 = the lowest priority.

Tighter approach: Although equation (1) theoretically represents the worst-case execution of the scheduler (where the lowest priority process is scheduled), if a process of priority higher than the lowest is ready to be scheduled, equation (1) is not feasible; that is, that path can never occur. This is true because the lowest priority processes can only be scheduled when no other processes are waiting to be scheduled on higher priority_level lists. If we want a tighter (not overestimated) analysis, it is necessary to come up with a worst-case execution of the scheduler that is feasible for every priority. If we follow the idea that in ORTS a preempted process remains in its priority_level list to be scheduled again, equation (1) can be generalized so that the worst-case execution of the scheduler is feasible for preempted processes of any priority. This is shown in equation (2).
WCE_Sched = (m+1)A' + mB' + (n+1)C' + D'  [cycles]   (2)

where
n = priority of the process that is being preempted (highest '0', lowest '7'),
m = total number of processes (other than the preempted process) of priority lower than or equal to 'n' ("m" is then the maximum number of processes in the sleeping list ready to be put in their corresponding priority_level list).

Since we are interested in performing the WCET analysis for the highest priority processes, the worst-case execution of the scheduler for this case is defined as follows:

WCE_Sched = (m+1)A' + mB' + C' + D'  [cycles]   (3)

where
A' = execution cycles for path A,
B' = execution cycles for path BCDEG,
C' = execution cycles for path HI,
D' = execution cycles for path JKL.

If we sum the execution cycles across the basic blocks, we get the following equation:

WCE_Sched = 45 + 114(m) cycles   (4)

The worst-case execution time of the scheduler depends on the number of processes in the sleeping list, which in turn depends on the number of processes (of all priorities) running on the system.

6.1.2 The timer interrupt handler

PreemptionInterruptHandler()
A: Push all registers
B: Current_Process->Stack_Pointer <- SP
C: SP <- Kernel_Stack_Pointer
D: If (Soft_Preempt)
E:     Soft_Preempt = 0
   Else
F:     Current_Time++
G: Scheduler()
H: Kernel_Stack_Pointer <- SP
I: SP <- Current_Process->Stack_Pointer
J: Pop all registers

[CFG of the interrupt handler; per-block execution times (cycles): INIT = 0, t_A = 31, t_B = 4, t_C = 1, t_D = 5, t_E = 6, t_G = 4 + WCE_Sched, t_H = 1, t_I = 4, t_J = 31, END = 12]

The Timer Interrupt Handler is the dispatcher of the operating system. It is responsible for context switching among tasks, for calling the scheduler, and for incrementing the global timer variable (Current_Time). The Timer Interrupt Handler services the timer interrupt, which has a granularity of 100 us; that is, every time_quantum a task is preempted and a context switch occurs. The worst-case execution of the Interrupt Handler, defined by the path ABCDEGHIJ, is:

WCE_Int.Handl = 99 + WCE_Sched  [cycles]

If we substitute equation (4) for WCE_Sched, we get the following:

WCE_Int.Handl = 144 + 114(m) cycles   (5)

6.1.3 Software preemption

Yield()
A: Soft_Preempt = 1
B: PreemptionInterruptHandler()

[CFG of Yield; per-block execution times (cycles): t_A = 2, t_B = 8 + WCE_Int.Handl]

The Yield function forces, by software, a task preemption or context switch to occur even before the time_quantum (100 us) has elapsed. The worst-case execution of the Yield function is calculated as follows:

WCE_Yield = 10 + WCE_Int.Handl  [cycles]

If we substitute equation (5) for WCE_Int.Handl we get:

WCE_Yield = 154 + 114m cycles   (6)

6.1.4 Semaphore P

SemaphoreP()
A: DisableInterrupts()
B: while (s->count == 0) {
C:     RemoveProcessFromList (Current_Process)
D:     AddProcessEndList (s->WaitingProcesses, Current_Process)
E:     Current_Process->WaitingForSemaphore = s
F:     Current_Process->State = PS_Waiting
G:     if (s->PriorityInheritance && s->Holder->priority_level < Current_Process->priority_level) {
H:         if (!s->HolderOriginalPriorityPlusOne)
I:             s->HolderOriginalPriorityPlusOne = s->Holder->priority_level + 1
J:         ChangePriority (s->Holder, Current_Process->priority_level)
       }
K:     Yield()
   }
L: if (--s->count == 0)
M:     s->Holder = Current_Process
   else
N:     s->count--
O: EnableInterrupts()
[CFG of SemaphoreP; recoverable per-block execution times (cycles): INIT = 4, t_A = 5, t_B = 6, t_C = 45, t_D = 41, t_E = 3, t_F = 2, t_G = 14, t_H = 6, t_I = 5, t_J = 146, t_L = 8, t_M = 7, t_N = 4, t_O = 5; t_K not legible]

Semaphores are used to ensure that only one process is accessing a shared resource at a time. Semaphore P is used to request a shared resource. If the semaphore counter is other than zero, the resource is free and the process can proceed to make use of it. If the semaphore counter is zero, the resource is being used by another process. In this case, the requesting process is said to be blocked by another process, and it is therefore removed from its priority_level list and placed in the waiting list of the associated semaphore. The WCE of semaphore P occurs when a process gets blocked because a lower priority process is holding the semaphore for the shared resource. This path is defined as ABCDEFGHIJK, and lies inside the 'while loop'. In this case, the process is removed from its priority_level list and put in the semaphore's waiting list. Then, the process holding the semaphore inherits the priority of the process it blocks, and is put first in its priority_level list. The latter is performed via path GHIJ, which is the path of the priority inheritance protocol. On this worst-case execution path, after block K is executed, the blocked process yields to a new process, and therefore does not go back to the 'while loop' until it is put to run again. This means that the 'while loop' is executed at most once every time the process enters semaphore P. The worst-case execution when the 'while loop' is iterated for the first time is:

WCE_SemP = 281 + WCE_Yield  [cycles]   ; first 'while loop' iteration

If we substitute equation (6) for WCE_Yield we get:

WCE1_SemP = 435 + 114(m) cycles   ; first 'while loop' iteration   (8)

When the process is put to run again, it continues iterating the 'while loop', but without executing the initialization part and block A. Therefore, the WCE of semaphore P for consecutive iterations of the 'while loop' is defined by path CDEFGHIJK and is presented as follows:

WCE_SemP = 272 + WCE_Yield  [cycles]   ; consecutive 'while loop' iteration

If we substitute equation (6) for WCE_Yield we get:

WCE2_SemP = 426 + 114(m) cycles   ; consecutive 'while loop' iteration   (9)

The best-case execution of semaphore P is when the semaphore is not held by any other process. When the 'while loop' is iterated for the first time, the best-case execution is defined by path ABLNO and is presented as follows:

BCE1_SemP = 38 cycles   (10)

The best-case execution of semaphore P for consecutive iterations is defined by path BLNO.

BCE2_SemP = 29 cycles   (11)

6.1.5 Semaphore V

SemaphoreV (Semaphore *s)
A: DisableInterrupts();
B: if (s->Count < s->MaxCount)
C:     s->Count++;
D: if (s->WaitingProcesses) {
E:     pcb = s->WaitingProcesses;
F:     RemoveProcessFromList (s->WaitingProcesses, pcb);
G:     AddProcessBeginningOfList (priority_level[pcb->priority_level], pcb);
H:     pcb->WaitingForSemaphore = null;
I:     pcb->State = PS_Running;
J:     if (s->HolderOriginalPriorityPlusOne) {
K:         s->HolderOriginalPriorityPlusOne = 0;
L:         ChangePriority (Current_Process, s->HolderOriginalPriorityPlusOne - 1);
M:         Yield();
       }
   }
N: EnableInterrupts();

[CFG of SemaphoreV with per-block execution times (only partially legible)]

Semaphore V is used by a process to signal that the shared resource it was using is now free.
The worst-case execution of semaphore V occurs when the semaphore's waiting list has at least one process waiting to use the shared resource, and when the process executing semaphore V is running with an inherited priority (due to priority inversion). In this case, the first process in the waiting list is put back in its priority_level list, and the process that inherited a higher priority takes back its original priority and yields to a higher priority process. The worst-case execution path is defined as path ABCDEFGHIJKLM, where path JKLM is due to the priority inheritance protocol. The worst-case execution is presented next:

WCE1_SemV = 279 + WCE_Yield  [cycles]   (with priority inheritance)

If we substitute equation (6) for WCE_Yield we get:

WCE1_SemV = 433 + 114(m) cycles   (with priority inheritance)   (12)

For the analysis, it is also necessary to consider the worst-case execution of semaphore V when there is no priority inversion, and therefore no yielding to other processes (e.g. the highest priority process executing semaphore V). This is obtained by following the path ABCDEFGHIJN, where path JKLM is not executed.

WCE2_SemV = 137 cycles   (no priority inversion)   (13)

The best-case execution of semaphore V is when there are no processes waiting in the semaphore list. The best-case execution is defined by path ABDN.

BCE_SemV = 31 cycles   (14)

6.1.6 System calls to write direct Local links and buffered PC links

Up to this point we know the WCE of the different software structures that directly or indirectly affect data transfers in ORTS. It is now time to analyze the program code responsible for executing such transfers, and to estimate the worst-case execution for writing data to both Local links and PC links. In order to estimate the worst-case execution time for writing data to Local links and PC links, it is necessary to consider the worst-case scenario, which occurs given the following:
1. The running process gets blocked by a lower priority process. In this case, the maximum blocking time has to be calculated. Since the running process accesses only one semaphore, chain blocking is not feasible and is therefore not considered in the analysis.
2. The "memcpy" function transfers the maximum ItemSize (N) handled by the user.

6.1.6.1 Writing to direct Local links

WriteDirectLocalLink()
A: SemaphoreP (linkX->EmptySpace)
B: memcpy (linkX->buf, data, N)
C: SemaphoreV (linkX->FullSpace)
D: return

where N = ItemSize (the length of the item to be transferred by the memcpy function, specified in 16-bit words for the C32 processor).

[CFG of WriteDirectLocalLink; per-block execution times (cycles): INIT = 8, t_A = 8 + WCE_SemP, t_B = 32 + 15N, t_C = 8 + WCE_SemV, t_D = 4, END = 5]

The WriteDirectLocalLink function first executes semaphore P to gain access to the shared local memory. Once the memory is granted, it proceeds to transfer an item of 'N' 16-bit words to the local memory, and finally, upon completion, it executes semaphore V to signal that the shared memory is free again.

Worst-case execution time calculations:

1. Calculating the maximum blocking time

The maximum blocking time is the time it takes the lower priority process to finish its task (writing to a Local link with the maximum value of N) and execute semaphore V to unlock the resource. For the worst-case scenario we assume that the low priority process was preempted just after it locked the semaphore, but before it could execute even a single instruction to write data to the Local link. The maximum blocking time is defined by the path BC of the WriteDirectLocalLink function. Block D is not executed since the process yields to a higher priority process before returning from semaphore V. The maximum blocking time is presented next:
The maximum blocking time is defined by path BC of the WriteDirectLocalLink function. Block D is not executed, since the process yields to a higher-priority process before returning from semaphore V. The maximum blocking time is:

MaximumBlockingTime = [(32 + 15N) + (8 + WCE1 SemV)] * instruction_cycle_execution_time

The user handles items no larger than 60 words; therefore, the worst-case transfer performed by the "memcpy" function is when N = 60. The TMS320C32 processor runs at 60 MHz; however, the execution time of a single instruction cycle is twice the period of its clock, i.e. 33.33 ns. Refer to [39] for more information about the TI TMS320C32 processor. If we substitute equation (12) for WCE1 SemV, with instruction_cycle_execution_time = 33.33 ns and N = 60, we get:

MaximumBlockingTime = [(32 + 15(60)) + (8 + 433 + 114m)] * 33.33 ns
MaximumBlockingTime = (1373 + 114m) * 33.33 ns   (15)

2. WCET calculation

In order to give an upper bound for the transfer function (WriteDirectLocalLink), it is necessary to make one last assumption. This assumption concerns the variable 'm' that appears in all the equations that involve preemption; that is, the upper bound depends on the number of processes in the sleeping list (ready to be put in their priority_level lists) when the scheduler is executed (refer to equations 2 and 3). The assumption is that there is only one process per priority level (i.e. 8 processes of different priorities). If 'm' is the total number of processes (other than the running process) with a priority lower than or equal to 'n', and n was assumed to be priority 0 (refer to equation (3)), then m = 7. With this assumption we can also conclude that when the highest-priority process is blocked by a lower-priority process, it immediately resumes execution once the resource is unlocked (i.e. no other process is put to run instead). This is true because there are no higher-priority processes, nor any other process in the same priority_level list, that could preempt it.

The execution cycles to write to a local link (considering blocking) follow path ABCD:
• initialization,
• execution of semaphore P,
• the process getting blocked by a lower-priority process,
• the process resuming execution and executing semaphore P (second iteration of the 'while loop'),
• the process executing the maximum data transfer to the local link,
• the process executing semaphore V to unlock the resource,
• end.

ExecutionCycles WriteLocalLink = INIT + (8 + WCE1 SemP) + MaximumBlocking + BCE2 SemP + B + (8 + WCE2 SemV) + D + END   (16)

Substituting (8), (15), (11) and (13) in equation (16), we get:

ExecutionTime WriteLocalLink = (2939 + 228m) * 33.33 ns   (17)

If we substitute 'm' in equation (17), we get this execution time:

ExecutionTime WriteLocalLink = 151.15 µs   (18)

In order to calculate the worst-case execution time of this function, it is first necessary to know how many context switches can occur while the function is executing (remember that context switches occur every 100 µs). To the result obtained in (18), we first have to add the dispatch latency time, which is the time the dispatcher takes to execute before the process is actually put to run:

DLT = context switching time + scheduler time + interrupt latency time   (19)

where context switching time + scheduler time = interrupt handler time:
WCET Int.Handl = (144 + 114m) * 33.33 ns = 31.39 µs   (m = 7)

The interrupt latency for the C32 processor is given by the manufacturer [39] and is equal to 0.26664 µs (8 cycles). The dispatch latency time is therefore:

DLT = 31.39 µs + 0.26664 µs = 31.65 µs

Execution time = 151.15 µs + 31.65 µs = 182.80 µs   (20)

From this result we can be sure that the function is interrupted at least once, but it could actually be interrupted more. To find out, we add the effect of that first interrupt (dispatch time) to the total execution time of the function:

Execution time = 182.80 µs + 31.65 µs = 214.45 µs

The execution time now exceeds two time_quantums, which means that the function is interrupted twice, not once. Since the effect of the second interrupt (31.65 µs) does not push the execution time of the function to the boundary of a third time_quantum, we can conclude that the WriteDirectLocalLink function is interrupted at most twice. Finally, we calculate the worst-case execution time of the function by adding the effect of that second interrupt:

WCET WriteLocalLink = 214.45 µs + 31.65 µs
WCET WriteLocalLink = 246.10 µs

A graphical representation of the WCET to write a local link is presented in Figure 6.2. From Figure 6.2 we can observe that 13.83% (31.06 µs) of the CPU time is dedicated to the transfer itself (writing to the local link), and that the remaining 86.16% (215.04 µs) is the result of the blocking time and the overhead of executing semaphore P. In the worst-case scenario, periodic processes that write to local links cannot be expected to run at frequencies higher than 3.33 kHz (1/300 µs). That is, since sampling periods are multiples of the time_quantum, which is 100 µs, the closest possible frequency to the theoretical frequency of 4 kHz (1/246.10 µs) is 3.33 kHz (1/300 µs). A very important fact is that the upper-bound frequency could be even smaller than 3.33 kHz when the total execution time of the entire process (not only the system call to write a link) is considered.

[Figure 6.2. WCET to write a local link: timeline showing the initial dispatch (31.65 µs), semaphore P (INIT + block A), the interval in which P(s) is held by the lower-priority process, the transfer itself (block B), and semaphore V (blocks C + D + END).]

6.1.6.2 Writing to buffered PC links

WriteBufferedPCLink()
A:  lh = LinkX->PCLinkHeader
B:  if (lh->Type != PCLink)
C:      FatalError()
D:  SemP(LinkX->TailSem)
E:  while (lh->Tail == lh->Head) {
F:      Sleep(0)
    }
G:  memcpy(linkX->buf, data, N)
H:  lh->Tail = (lh->Tail + 1)
I:  if (lh->Tail == (lh->Head + 1))
J:      InterruptPC()
K:  SemV(LinkX->TailSem)
L:  return

[Annotated flow graph of WriteBufferedPCLink with per-block timing labels (INIT = 8 cycles, tA = 3 c, tB = 7 c, tC = 12 c, tD = 8 + WCE SemP, tE = 6 c, tG = 36 + 15N c, tH = 23 c, tI = 24 c, tJ = 20 c, tK = 8 + WCE SemV, tL = 4 c, END = 5 c); the graph itself is not reproduced here.]

The WriteBufferedPCLink function first checks whether the link to write to is indeed a PC link. If it is not, a fatal error is displayed; otherwise it executes semaphore P to gain access to the memory. Then it iterates a 'while loop' (block E) to check whether there is space to write in the buffer. This is done by comparing the tail and head from the link header. Remember that the header is used for producer-consumer synchronization between the PC and the DSP (refer to section 2.1.2). If there is no space to write to, the process is put to sleep (suspended) until the end of the current time quantum, thus yielding to other processes. If there is space to write to, the transfer is performed.
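To show where this system call sits in an application, a hypothetical periodic logging process might drive it as follows. The surrounding structure (link creation, sample acquisition, period wait) uses illustrative names that are assumptions, not the actual ORTS API; only WriteBufferedPCLink corresponds to the call discussed here.

    /* Hypothetical user process - helper names are illustrative assumptions. */
    #define ITEM_SIZE 60     /* 16-bit words per item; the maximum handled by the user */

    void PositionLogger(void)
    {
        short item[ITEM_SIZE];
        Link *log = CreateBufferedPCLink("position_log", ITEM_SIZE);  /* assumed helper */

        for (;;) {
            ReadSamples(item, ITEM_SIZE);      /* produce one item per period (assumed) */
            WriteBufferedPCLink(log, item);    /* the call analysed in this section     */
            WaitForNextPeriod();               /* suspend until the next time quantum   */
        }
    }

Given the worst-case figures derived in this section, such a process should not be scheduled above roughly 3.33 kHz; the rest of its period must still cover the application work itself.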
After the data transfer, the header tail is updated and the PC is interrupted to let it know that new data has been written to the PC link. Finally, semaphore V is executed to free the resource. For this analysis it is necessary to make some initial assumptions:

1. The process does not take a path that leads to suspension (it is never put to sleep).
2. In order to comply with assumption 1, the buffer is always available for writing (i.e. the PC is constantly reading from the buffer, and therefore always leaves space for the DSP to write data in).

These assumptions are necessary in order to bound the WCET to write to a PC link; otherwise, in the worst case (e.g. a PC crash), the suspension of a process could be unbounded.

Worst-case execution time calculations:

To calculate the WCET to write to a buffered PC link, we follow the same analysis performed on the WriteDirectLocalLink function.

1. Calculating the maximum blocking time

The maximum blocking time is the time it takes the lower-priority process to finish its task (writing to the buffered PC link with the maximum value of N) and execute semaphore V to unlock the resource. The maximum blocking time is defined by path EGHIJK of the WriteBufferedPCLink function:

MaximumBlockingTime = (117 + 15N + WCE1 SemV) * 33.33 ns

If N = 60 and m = 7, and we substitute equation (12) for WCE1 SemV, we get:

MaximumBlockingTime = 74.92 µs   (21)

2. WCET calculation

The execution time to write to a buffered PC link (considering blocking) follows path ABDEGHIJKL:

ExecutionTime WriteBufferedPCLink = 156.6055 µs   (22)

The worst-case execution time is calculated by adding the dispatch latencies involved in the total execution of the function. This function is interrupted twice. Furthermore, as explained in the previous analysis, the function is not run for the first time until the initial dispatcher has executed (i.e. after 31.65 µs). Adding these latencies, we obtain the WCET to write to a buffered PC link:

WCET WriteBufferedPCLink = 251.55 µs

A graphical representation of the WCET to write a buffered PC link is presented in Figure 6.3. From Figure 6.3 we can observe that the transfer time itself (writing to the buffered PC link) takes only 13.36% of the time (18.87 µs + 14.75 µs). The rest of the time, 86.64% (217.93 µs), is the result of the blocking time and the overhead of executing semaphore P. In the worst-case scenario we obtain the same result as in the previous analysis; that is, periodic processes that write to buffered PC links cannot be expected to run at frequencies higher than 3.33 kHz. Once again, the upper-bound frequency could be even smaller than 3.33 kHz when the total execution time of the entire process (not only the system call to write a link) is considered.

[Figure 6.3. WCET to write a buffered PC link: timeline showing the interrupt handler, the highest-priority process and the lower-priority process, with the initial dispatch (31.65 µs), semaphore P (blocks A + B + D + INIT), the interval in which P(s) is held by the lower-priority process, the execution of blocks E + G + H + I + J, and semaphore V (blocks K + L + END).]

6.2 Important timing considerations

With the assumption of having at most one process per priority level (8 processes), we obtained m = 7 (i.e. the total number of processes in the sleeping list when the scheduler is run).
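As a quick cross-check of the arithmetic in Section 6.1, the headline figures can be reproduced directly from the cycle equations. The short program below is only a verification aid (not part of ORTS); it assumes m = 7, N = 60 and an instruction cycle of 33.33 ns, as in the analysis above.

    #include <stdio.h>

    int main(void)
    {
        const double cyc = 33.33e-3;          /* instruction cycle in microseconds (33.33 ns) */
        const int m = 7, N = 60;

        double wce1_semP   = (435 + 114*m) * cyc;          /* eq. (8)                     */
        double int_handler = (144 + 114*m) * cyc;          /* timer interrupt handler     */
        double dlt         = int_handler + 8 * cyc;        /* + interrupt latency         */
        double blocking    = (1373 + 114*m) * cyc;         /* eq. (15)                    */
        double exec_local  = (2939 + 228*m) * cyc;         /* eq. (17)                    */
        double wcet_local  = exec_local + 3 * dlt;         /* initial dispatch + 2 interrupts */
        double blocking_pc = (117 + 15*N + 433 + 114*m) * cyc;   /* eq. (21)              */

        printf("WCE1 SemP                  = %6.2f us\n", wce1_semP);   /* ~41.1 us  */
        printf("Interrupt handler          = %6.2f us\n", int_handler); /* ~31.4 us  */
        printf("Dispatch latency           = %6.2f us\n", dlt);         /* ~31.65 us */
        printf("Max blocking (local link)  = %6.2f us\n", blocking);
        printf("Exec time (local link)     = %6.2f us\n", exec_local);  /* ~151.2 us */
        printf("WCET WriteDirectLocalLink  = %6.2f us\n", wcet_local);  /* ~246.1 us */
        printf("Max blocking (PC link)     = %6.2f us\n", blocking_pc); /* ~74.9 us  */
        return 0;
    }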
With this assumption we next calculate some important latencies, and the worst-case execution times of the different basic structures of the RTOS.

CPU utilization of the timer interrupt handler

Since the timer interrupt is serviced every 100 µs, the worst-case CPU utilization of the timer interrupt handler is:

Utilization Int.Handl = (31.65 µs / 100 µs) * 100%
Utilization Int.Handl = 31.65%

This means that 31.65% of the CPU time is dedicated to context switching among tasks. The worst-case execution of the scheduler itself is:

WCET Sched = 28.09 µs

This means that 88.75% of the interrupt handler time (dispatch time) is dedicated solely to the scheduler.

Maximum interrupt disable time

The maximum interrupt disable time in ORTS occurs inside semaphore P. Semaphore P is the longest region inside the ORTS kernel where interrupts are disabled. In the worst-case path of semaphore P, interrupts are not re-enabled until the Yield function is executed. Therefore, the maximum interrupt disable time equals the worst-case execution of semaphore P, described in equation (8):

Maximum Interrupt Disable Time = WCET SemP = (435 + 114m) * 33.33 ns
Maximum Interrupt Disable Time = 41.09 µs

The worst-case execution times of the data transfers exemplified in this study are not directly affected by this latency, and it is therefore not included there. That is, this latency can postpone the start of a transfer, but the total execution time of the transfer itself is not affected. This latency does, however, have a big impact when performing schedulability analysis: it postpones the execution of a process, thus affecting its capability to meet its deadlines. If we include this latency in the dispatch latency time defined in equation (19), we obtain:

DLT = context switching time + scheduler time + interrupt latency time + max. interrupt disable time
DLT = 31.65 µs + 41.09 µs
DLT = 72.74 µs

This means that 72.74 µs is the interval between the generation of the timer interrupt (which cannot be processed while interrupts are disabled) and the execution of the process responding to the interrupt. In other words, once it is time for a higher-priority process to preempt a lower-priority process, the higher-priority process is not executed until 72.74 µs later; consequently, only 27.26% of its time slice is usable.

Semaphores

We calculate the time it takes to acquire a semaphore, and the time it takes to release a semaphore and schedule a higher-priority process that was pending on that semaphore. The time it takes to acquire a semaphore is defined by the best-case execution of semaphore P, presented in equation (10):

BCET SemP = 1.266 µs

The time it takes to release a semaphore and schedule a higher-priority process waiting on that semaphore is defined by the worst-case execution of semaphore V, presented in equation (12):

WCE SemV = 41.02 µs

6.3 Microarchitecture analysis for the TMS320C67 processor

When pipelining is considered, the total execution time of a basic block is no longer equal to the sum of its individual instruction times, but less, due to parallelism among instructions. For this reason it is necessary to analyze the interaction of each instruction within and across basic blocks. This thesis presents a method for the microarchitecture analysis of the TMS320C67 processor, based on the method proposed in [29].
6.3.1 TMS320C67 pipeline operation overview

It is not the purpose of this thesis to elaborate on the pipeline operation of the TMS320C67 processor; detailed information can be found in the processor's reference guide [38]. The TMS320C67 is a floating-point DSP processor based on a VLIW architecture. It has eight functional units and can execute up to eight instructions in a single cycle, with each instruction executing in an independent functional unit. The pipeline has three stages: fetch, decode, and execute. First, the TMS320C67 fetches a 256-bit, eight-instruction group (called a fetch packet) at the same time. Each fetch packet can be split into one to eight execute packets, depending on how many instructions can be executed in parallel; that is, each execute packet consists of one instruction, or of two to eight parallel instructions. The fetch stage consists of four phases: program address generate (PG), program address send (PS), program access ready wait (PW), and program fetch packet receive (PR). Next, all instructions in an execute packet are assigned to the appropriate functional units and dispatched together. The decode stage consists of two phases: instruction dispatch (DP) and instruction decode (DC). Finally, the execute stage is divided into ten phases (E1...E10); each type of instruction requires a different number of phases to complete its execution. The pipeline operation is presented in Figure 6.4.

[Figure 6.4. Pipeline operation: successive fetch packets n, n+1, n+2, ... enter the pipeline one clock cycle apart, each passing through the phases PG, PS, PW, PR, DP, DC and E1-E10.]

6.3.2 Pipeline analysis

The purpose of pipeline analysis is to model the execution time of instructions when they overlap. This overlap makes the execution time of basic blocks (or instructions) less than the sum of the execution times of the individual basic blocks (or instructions). The execution time of a sequence of instructions in a basic block is the time between the first instruction entering the pipeline and the last instruction leaving the pipeline. For the analysis it is necessary to consider the interaction of each instruction within and across basic blocks. It is necessary first to identify, in each basic block, every fetch packet (remember that each fetch packet consists of 8 instructions), and within each of those, to identify the execute packets. Execute packets can consist of single instructions (when none of the 8 instructions execute in parallel), or of two to eight parallel instructions. Once the fetch packets and the execute packets of each basic block are identified, it is convenient to draw a pipeline operation graph showing the cycles taken to execute each fetch packet. Next, we combine the pipeline effects of the different basic blocks in a combined pipeline operation graph. An example of how to obtain the execution time of basic blocks for a pipelined processor is shown next.

Example: In this example we show symbolic machine instructions (as presented in the assembler code) for block A and block B. In each block the fetch packets are identified, and within each fetch packet the execute packets are identified.
For this example, each instruction is assigned an instruction type that indicates the number of execute phases required to complete the instruction. The instruction types for all the different machine instructions can be found in [38]. With this information, a pipeline operation graph is presented for each block in Figures 6.5 and 6.6. The combined pipeline operation for blocks A and B is presented in Figure 6.7.

Block A:

Fetch packet 1
Execute packet 1:    Instruction 1   ; single cycle (E1)
                  || Instruction 2   ; single cycle (E1)
                  || Instruction 3   ; 2-cycle (E1, E2)
Execute packet 2:    Instruction 4   ; single cycle (E1)
                  || Instruction 5   ; single cycle (E1)
Execute packet 3:    Instruction 6   ; single cycle (E1)
                  || Instruction 7   ; 2-cycle (E1, E2)
                  || Instruction 8   ; single cycle (E1)

Fetch packet 2
Execute packet 1:    Instruction 9   ; single cycle (E1)
                  || Instruction 10  ; single cycle (E1)
                  || Instruction 11  ; store (E1, E2, E3)
                  || Instruction 12  ; single cycle (E1)
                  || Instruction 13  ; single cycle (E1)
                  || Instruction 14  ; single cycle (E1)
                  || Instruction 15  ; single cycle (E1)
                  || Instruction 16  ; single cycle (E1)

The parallel bars (||) mean that an instruction executes in parallel with the previous instruction.

[Figure 6.5. Pipeline operation for symbolic code in block A: fetch packet 1 is fetched in cycles 1-4; its three execute packets are dispatched and decoded on consecutive cycles, which stalls the fetch of packet 2 in cycle 5; the single execute packet of fetch packet 2 reaches E3 in cycle 12.]

In Figure 6.5, fetch packet 1 moves through the fetch phases from cycle 1 to cycle 4. In cycle 5 it is detected that there are three execute packets, which are decoded sequentially. This causes the pipeline to stall in cycle 5. Once execute packet 3 moves on to the instruction decode phase, the stall is released and fetch packet 2 proceeds to decoding. The execution of this block completes when instruction 11 leaves the pipeline after stage E3, which occurs in cycle 12.

Block B:

Fetch packet 1
Execute packet 1:    Instruction 1   ; single cycle (E1)
                  || Instruction 2   ; single cycle (E1)
                  || Instruction 3   ; single cycle (E1)
                  || Instruction 4   ; 2-cycle (E1, E2)
Execute packet 2:    Instruction 5   ; single cycle (E1)
                  || Instruction 6   ; single cycle (E1)
                  || Instruction 7   ; store (E1, E2, E3)
                  || Instruction 8   ; single cycle (E1)

[Figure 6.6. Pipeline operation for symbolic code in block B: fetch packet 1 is fetched in cycles 1-4; execute packet 1 runs E1-E2 in cycles 7-8; execute packet 2 follows one cycle later, and its store completes E3 in cycle 10.]

The execution of block B completes when instruction 7 leaves the pipeline after stage E3, which occurs in cycle 10. Combining blocks A and B we get the following:

[Figure 6.7. Pipeline operation for combined blocks A and B: the execute packets of block B enter the pipeline immediately behind those of block A; the same stall occurs in cycle 5, and the last instruction of block B leaves the pipeline in cycle 15.]

In Figure 6.7 we have the same pipeline stall in cycle 5. Once again, this is due to the first three execute packets coming from fetch packet 1 in block A. The execution cycles for the combined pipeline of blocks A and B are 15. Once we have the execution cycles for each basic block, and the execution cycles for the overlapped blocks, it is possible to calculate the WCET of a program considering the effects of pipelining.
The worst-case execution time without considering pipelining was calculated by summing the products of the execution counts (for the worst-case scenario) and the execution times of the basic blocks:

WCET = Σ∀node (xnode * tnode)

Engblom and Ermedahl [29] integrate a new timing variable into the worst-case execution equation, called the timing effect (δs), which is associated with sequences of nodes in a control flow graph. Timing effects (δs) can be negative, indicating a speedup over a sequence of nodes, or positive, indicating a slowdown. Just as each execution time variable (tnode) has an associated execution count variable (xnode), each timing effect has an associated execution count variable (xs) representing the number of times the sequence is executed in the worst-case program execution. The worst-case execution time for a pipelined processor can then be generalized as follows:

WCET = max( Σ∀node xnode * tnode + Σ∀s xs * δs )

The timing effect (δs) for a sequence can be calculated using the execution times of the sequence:

δs = ts − ts[2..l] − ts[1..l−1] + ts[2..l−1]

where s[i..j] denotes the subsequence of s from its ith to its jth node, and l is the length of s. The timing effect of the sequence formed by basic block A and basic block B of the previous example can be calculated as:

δAB = tAB − tA − tB

If tAB = 15 cycles, tA = 12 cycles and tB = 10 cycles, the timing effect for sequence AB becomes:

δAB = −7 cycles

We can now represent the pipeline effects of the processor in the control flow graph by associating the timing effect with the sequence AB. This is shown in Figure 6.8.

[Figure 6.8. Control flow graph with associated timing effect δs: node A (tA = 12 c) followed by node B (tB = 10 c), with δAB = −7 c attached to the edge from A to B.]

6.4 Bounding DMA effects on a program

In dual ORTS, a DMA mechanism is integrated to improve data transfers between memories (refer to Chapter 5). In this section we present a method for bounding the effects of DMA on a program when the DMA runs in cycle-stealing mode. For this purpose we follow the research done in [7]. Huang, Liu, and Hull [7] present a method for bounding the WCET of a program when it executes concurrently with cycle-stealing DMA for I/O operations. It assumes a commonly used contention protocol, in which the devices (in this case the CPU and the DMA controller) contend for the I/O bus by asserting a bus request line. If the bus is free, the requesting device gains control of the bus and may continue to hold it unless a higher-priority device (e.g. the CPU) asserts the bus request line. In I/O operations there is a short delay, called the bus master time (BMT), which occurs before the device that was granted the bus can start to transfer data; this delay is due to the bus switching between devices. Huang, Liu, and Hull classify machine cycles into two categories: B (bus-access) cycles and E (execution) cycles. B-cycles are those in which the CPU holds the bus, whereas E-cycles are those in which the CPU is executing and does not make use of the bus. The DMA controller can therefore gain control of the bus during the E-cycles. To find the upper bound on the WCET of a program when it executes concurrently with cycle-stealing DMA, it is first necessary to consider the worst-case delay suffered by the CPU when cycle-stealing operations are allowed.
Assuming that each transfer performed by the DMA takes the same amount of time, defined as DT, and that T is the total execution time of k consecutive E-cycles, the maximum number of data units (m) that the DMA controller can transfer during the k consecutive E-cycles can be bounded in terms of T, DT and the bus master time BMT (equation (1)). Figure 6.9 illustrates the DMA transfers over a sequence of E-cycles.

[Figure 6.9. Concurrent execution of DMA and a sequence of E-cycles [7]: the DMA controller acquires the bus (one BMT), performs m transfers of DT each during the E-cycles E1...Ek, and releases it (a second BMT) when the CPU asserts the bus request.]

The worst-case delay suffered by the CPU during the k consecutive E-cycles can be computed as follows:

delay = ⌈(m*DT + 2*BMT − T) / Tc⌉ * Tc   (2)

where Tc is the period of a clock cycle. That is, since each machine cycle is triggered by the processor clock, the first B-cycle after the sequence of k E-cycles cannot start until the next processor clock cycle. The worst-case execution time of an instruction, denoted by W(I), is obtained by summing the execution time of the instruction when executing without DMA and the worst-case delay of all the E-cycle sequences in the instruction:

W(I) = tI + dI   (3)

Equation (3) can be used to bound the WCET of a basic block:

W(Bi) = Σ(j=1..k) W(Ij)   (4)

where Ij is the jth instruction in the basic block and k is the number of instructions in the basic block. Derived from the equation proposed by Li, Malik and Wolfe,

WCET = Σ(i=1..N) xi * ci

Huang, Liu and Hull define the WCET of a program executing concurrently with a cycle-stealing DMA as follows:

WCET = Σ(i=1..N) W(Bi) * xi   (5)

where xi is the execution count of the basic block Bi (associated with the worst-case execution of the program). This method can be used to account for the TMS320C67 cycle-stealing DMA controller used in dual ORTS, not for I/O operations, but for transferring data to external memory through the external memory interface (EMIF). The EMIF connects the DSP processor to external memory (e.g. SSRAM or SDRAM) on the DSP board. The CPU and the DMA controller contend for the EMIF every time they need to access external memory, and follow the same contention protocol as described by Huang, Liu, and Hull. In this scenario, there is a memory latency associated with switching between requestors (CPU and DMA controller) to external memory. This latency depends on both the type of memory (e.g. SSRAM or SDRAM) and the current EMIF activity (i.e. reading or writing). These switching times are presented in Table 6.1. For internal data memory accesses, there is no delay associated with switching between requestors. For more information, refer to the TI peripherals guide [36].

                       Subsequent access
Current access         SDRAM READ     SDRAM WRITE    SSRAM READ     SSRAM WRITE
SDRAM READ             16-18 cycles   18-20 cycles   13-17 cycles   13-17 cycles
SDRAM WRITE            10 cycles      8 cycles       5-7 cycles     5-7 cycles
SSRAM READ             7-9 cycles     7-9 cycles     4-6 cycles     4-6 cycles
SSRAM WRITE            5-7 cycles     5-7 cycles     4 cycles       4 cycles

Table 6.1. External switching time between accesses by different requestors [36]

If we substitute the bus master time (BMT) in equation (2) with the memory latency times presented in Table 6.1, we can use equation (5), proposed by Huang, Liu, and Hull, to obtain the WCET of a program executing concurrently with the TMS320C67 DMA controller in dual ORTS. The EMIF supports burst capabilities to enhance high-speed transfers. The DMA controller exercises this capability through its internal 11-deep FIFO. The CPU, on the other hand, is optimized for internal accesses, and it cannot burst data from external memory.
It must, therefore, wait for each data element before proceeding with the next transfer. Using the DMA to transfer data between internal and external memory therefore allows better processing speeds to be achieved. To lessen the effects of memory access latencies, DMA accesses should be performed in bursts. If there is a risk of frequent interruptions to the burst activity by a higher-priority requestor (e.g. the CPU), the RBTR8 arbitration bit can be set in the EMIF so that a minimum of eight accesses of the current burst are performed without interruption before switching to the higher-priority requestor.

6.5 ORTS kernel optimization

Most of the millions of instructions per second (MIPS) in DSP applications occur in loops. Fortunately, loops are software structures that have more parallelism than any non-loop structure, because there are multiple iterations of the same code executing with limited dependencies between iterations. Through software pipelining, the TMS320C67 achieves its highest performance on loops; however, this performance is greatly affected by how well the compiler can software-pipeline, and this in turn depends on whether a loop qualifies for software pipelining at all. There are conditions that loops must meet in order to qualify for software pipelining [37]:

• The loop cannot call another function unless the called function is inlined. Any break in control flow makes it impossible to software-pipeline.
• The loop cannot contain too many instructions. Loops that are too big typically require more registers than are available; without enough registers, full parallelism cannot be achieved.

In ORTS, the main source of overhead is the kernel itself. If the kernel is optimized, the entire system speeds up; however, the loops that are part of the main kernel structures do not qualify for software pipelining (they mainly violate the first condition). For example, there are two main loops in the ORTS scheduler, one for iterating the sleeping list (blocks A through G) and the other for scheduling the process in its priority_level list (blocks H through L) (refer to section 6.1.1). Within the first loop, three different functions are called. Although this could be solved by inlining the functions, inlining can cause code bloat (the code gets larger), and too many inlined functions may cause the operating system to thrash, i.e. to spend most of its time pulling in chunks of code. The second loop is interrupted by a "return" statement, which prevents it from being pipelined. In semaphore P, the main loop (blocks B to K in section 6.1.4) calls three functions, one of which implements the inheritance protocol; furthermore, the loop is interrupted by the Yield function, which cannot be inlined. Other than these loops, the rest of the code for the scheduler, SemaphoreP and SemaphoreV is built out of conditional or straight-line structures, which cannot be software-pipelined. Software pipelining can, however, highly optimize user processes. The use of the C67 DSP processor (running at 167 MHz) helps the ORTS kernel execute each instruction faster than on the C32 DSP running at 60 MHz. However, the ORTS kernel does not take full advantage of software pipelining (parallelism); for that, the kernel first needs to be refined.
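To make the first qualification condition concrete, the fragment below contrasts a loop the C6000 compiler can normally software-pipeline with one it cannot. It is a generic illustration, not ORTS code; Log() is a hypothetical non-inlined function.

    extern void Log(float sample);   /* hypothetical, non-inlined function */

    /* This loop contains no calls and no early exits, so the C6000 compiler
     * can normally software-pipeline it, overlapping iterations across the
     * eight functional units. */
    void scale_buffer(float *dst, const float *src, float gain, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            dst[i] = gain * src[i];
    }

    /* The call to Log() breaks the control flow inside the loop, so this
     * version does not qualify for software pipelining unless Log() can be
     * inlined. */
    void scale_and_log(float *dst, const float *src, float gain, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            dst[i] = gain * src[i];
            Log(dst[i]);
        }
    }

The kernel loops discussed above are of the second kind: they call list-manipulation routines and, in semaphore P, the Yield function, so the compiler cannot overlap their iterations.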
106 Chapter 7 Scaling issues The next section discusses the issues of scaling ORTS in the number of DSP processor nodes, and issues regarding scaling the C67 DMA implementation to multiple DMA channels. 7.1 Multiple DMA channels The Tl TMS320C67 DMA controller (C67 DMA) consists of four independent channels. Each channel can access any portion of the DSP processor memory map. The C67 DMA is implemented in ORTS to transfer data between memories within the DSP processor node through a single DMA channel (channel 0). The highest priority processes are the only processes enabled to perform DMA transfers (through channel 0); any other processes data transfers are handled directly by the "memcpy" function. Multiple DMA channels can be used to perform data transfers. Since DMA channels are prioritized, each DMA channel can be implemented to handle data transfers of a different priority process. Although this has the advantage of allowing four different contexts for DMA operation, scaling to multiple channels brings two main disadvantages. The first disadvantage is the large overhead which the system call (to either write or read data between memories) incurs, due to the size of the DMA algorithm required to handle not one, but four DMA channels. In order to perform a transfer via a DMA channel, different parameters and arguments specific to that channel need to be used, so different conditional branches for each channel must be coded to handle all four channels. Therefore, the DMA algorithm to handle 4 different channels grows considerably, compared to the code (DMA algorithm) for a single channel. Although there is a cost associated with the execution of the DMA algorithm, for the single DMA channel approach, the overhead of executing the DMA algorithm is very small. By keeping this overhead small, the high transfer rates that can be achieved with the C67 DMA mechanism become apparent for very small data transfers. On the other hand, if multi-channel DMA is implemented, the overhead of executing the DMA algorithm is no longer small, thus leading to an increase in the execution time of the processes for small data transfers. Multi-channel DMA scheme becomes convenient for medium to very large data transfers. For this reason it is not adequate for ORTS where the user roughly defines transfers larger than 50 words. The second disadvantage has to do with the architectural design of the C67 DMA controller itself. Each DMA channel has 2 holding registers that work as a 2-deep FIFO (First-in-first-out). These 107 registers are used to temporarily store data. This way, data can be read without being dependent on the destination. In order to enhance high-speed transfers, a 9-deep FIFO is shared among all 4 channels. The 9 deep-FIFO combined with the holding registers of each channel create an 11-deep FIFO, which facilitates data bursting, and therefore improves transfer rates. When a single channel is used, as implemented in dual ORTS, the 11-deep FIFO is completely dedicated to that channel, and data bursting can be performed. When more than one channel is in play, arbitration between channels takes place. Arbitration allows a higher priority channel to start operation before the lower priority channel even has a chance to complete its operation. 
This arbitration causes a negative effect on the performance of the DMA controller, because if the source of the higher priority channel is the same as the destination of the lower priority channel, the 9-deep FIFO (holding the lower priority channel data) will not be able to flush its data. In this case, the higher priority process cannot hold the 9-deep FIFO, and restrain itself to its 2-deep FIFO, which is not deep enough to perform data bursting. In RTOS it would not be acceptable for a higher priority process transferring data over the highest priority DMA channel to perform poorly due to a lower priority process data that could not be flushed out of the 9-deep FIFO. As a matter of fact, this DMA structure was redesigned by Texas Instruments for the TMS320C62 processor to overcome this problem and improve performance [43]. 7.2 Multiple DSP processor nodes Each DSP node in the Daytona board runs a separate copy of ORTS, therefore running independently from each other, but communicating through a Dual Port RAM. The work that needs to be done when scaling ORTS to a multiple processor node board highly depends on the hardware characteristics of the new DSP board. If the same hardware architecture as the Daytona board is maintained (e.g. scaling to Spectrum's Barcelona board which has four C67 processor nodes and a 4-port RAM for the communication between the four nodes) scaling ORTS to multiple processor nodes is as transparent as defining and declaring the newly added nodes in the right functions and classes in the code. The main scaling issue is the arbitration to access the local PCI bus, which is performed among several processor nodes instead of only two. This could bring some delays to processes wanting to access the PCI bus. However, since arbitration is prioritized (through the hurricane bridge of each node), the highest priority process of the node with the highest priority to access the PCI bus will always gain access to the shared resource before any other process running on the board. 108 Chapter 8 Future Work • The most important task to be considered in future work is performing an extensive WCET analysis to ORTS to guarantee that any real-time application can be executed in a timely, reliably and safe manner. ORTS must provide execution time specifications as any RTOS that is in the market. • The timing flaws indicated in this thesis must be fixed. First, it is suggested that different interrupt routines be used for software and hardware interrupts. Second, in order to prevent cumulative delays in the process wake up time, the initial "WakeUpTime" value should be calculated to the time-quantum that the process is put in the running list for the first time. • The ORTS kernel should be redesigned both to reduce its excessive overhead, and to make better and more effective use of the processor's VLIW technology. • To improve Hurricane links implementation, a second storing buffer must be implemented to reduce sample losses to zero. Although there is no need to allocate a second buffer as large as the first one, the SSRAM may not be large enough to hold the second buffer; therefore, the SDRAM should be used instead. • Today's user applications make use of a single process to log data to the host. If multiple processes are considered to communicate with the host in the future, a DMA algorithm should be implemented to control access to the Hurricane DMA. 109 Chapter 9 Conclusions ORTS was ported to Daytona. Host-DSP communication was an important development area in porting ORTS. 
Hurricane DMA, the first of its kind in ORTS, was implemented to achieve host-DSP communication. Host-DSP communication through PC links showed poor performance for periodic processes; a huge amount of data losses occurred due to the host low throughput rate when reading from the SSRAM. On the other hand, host-DSP communication carried out through Hurricane links proved to be the best solution for both aperiodic and periodic processes communicating between host and DSP. Samples losses, although few, cannot be brought to zero unless a second storing buffer is implemented, which would hold the new data a process might produce while a DMA transfer is taking place. The C67 DMA mechanism was implemented for the first time in ORTS along with a DMA algorithm, designed and implemented for multitasking purposes. The C67 DMA implementation in ORTS helps reduce the duration of system calls to write and read links. Although this mechanism is more advantageous for large data transfers, rates higher than those provided by the "memcpy" function can be achieved for data transfers, in some cases, as small as 10 words. The single DMA channel approach implemented in this thesis proved to be the most convenient and suitable solution for ORTS due to the small data transfers handled by the user. The actual "wake up times" implementation for periodic processes, and the implementation of a single interrupt service routine for both software and hardware interrupts, prevent periodic processes from being executed at the defined frequency. It was demonstrated that processes whose execution time is small (around half the sampling period), can perfectly achieve the defined sampling frequency of either 5 or 10 kHz, except for the initial sampling frequencies, which are higher than the desired. As the execution time gets closer to the sampling frequency, a process can still achieve a frequency of 5 kHz, with few higher frequencies appearing at start up. However, for a 10 kHz sampling frequency, the process is forced to run at higher frequencies, and hardly ever achieves the 10 kHz. In this thesis we present an analysis that provides a first insight into the ORTS kernel in terms of worst case execution times; however, an extensive analysis must be performed of the entire system to guarantee that any real-time application can be executed in a timely, reliable and safe manner. From the results presented in this analysis we observed that the ORTS kernel is too overhead-prone. The system takes up too much processor time to execute context switching and semaphores, consequently reducing the available computing time for the application. 110 For the previous version of ORTS, the worst-case process dispatch latency time could take 72.74 us of the CPU time, which is extremely high for a microkernel. It was also observed that the worst-case execution time of the system calls to write local and buffered links is 246.10 us and 251.55 us, respectively. In both cases nearly 87% of the CPU time is dedicated to execute operations other than the data transfer itself. Periodic processes that write to local and buffered links cannot be expected to run at frequencies higher than 3.33 kHz. In its current state, it is neither safe nor recommended to use ORTS for real-time applications. In the results given in this thesis, we present relevant findings that suggest that any proprietary commercial real-time kernel (properly selected to handle ORTS applications) be used instead. 
These findings are the following: • Nowadays, there is no certainty about the behavior of ORTS for real-time applications. ORTS lacks an extensive and well designed analysis that can demonstrate whether it is indeed, capable of supporting real time applications effectively, reliably and safely. • In this thesis, we quantify the kernel's overhead and demonstrate that the system overutilizes the CPU, in some cases, by 72% of the time, which is extremely high for a microkernel. Alleviating the effects of this large source of overhead is not completely the responsibility of a faster processor, but of a better kernel design. • ORTS kernel, in its current form (code), does not comply with the necessary conditions to be software pipelined; therefore, it does not make effective use of the VLIW super technology it is implemented on. The ORTS kernel needs to be redesigned. • Most commercial proprietary RTOS are supported by widely known corporations, and by a growing community using the same RTOS for all kinds of applications. ORTS lacks this support. • Some commercial RTOS offerings (of well-known corporations such as WindRiver, and OSE Systems, among others), already support DSP technologies, including Tl DSP processors, infrastructure that MAL lab already counts on. "The cost of uniquely developing and maintaining a homegrown kernel, as well as the increasing quality of the commercial offerings, is significantly reducing the practice of generating homegrown kernels [10]". References: [I] N. A. Erol, "Open Real-time Operating System for CNC Machine Tools," M.A.Sc. Thesis, University of British Columbia, 1996 [2] Y. Altintas, N.A. Newell, and M. Ito, " A hierarchical open architecture multi-processor CNC system for motion and process control," in ASME Manufacturing Science and Engineering, vol. 64, pp. 195-205, 1993. [3] B. Otkunc, "Open Architecture Controllers: Real-Time Capabilities and Interoperability," M.A.Sc. Thesis, University of British Columbia, 2000. [4] A. Gurcan, "Real-Time System Design for Machines Robots and Process Control", M.A.Sc. Thesis, Istanbul Technical University, 2000. [5] N.A. Erol, Y. Altintas, and M. Ito, "Open Architecture Modular Tool Kit for Motion and Machining Process Control," in ASME Manufacturing Science and Technology, MED-vol. 6-1, pp. 15-22, 1997. [6] J.A. Stahkovic, "Misconceptions About Real-Time Computing: A Serious Problem for Next-Generation Systems," IEEE Computer, vol. 21, no. 10, pp. 10-19, Oct. 1988. [7] T-Y. Huang, J.W.-S. Liu, and D. Hull, "A method for bounding the effect of DMA I/O interference on program execution time", in 17th IEEE Real-Time Systems Symp., pp. 275-285, Dec. 1996. [8] J. Takala.J, M. Kuulusa, P. Ojala, and J. Nirmi, "Enhanced DSP core for embedded applications", in IEEE Workshop on Signal Processing Systems, pp. 271-280. Oct. 1999. [9] Y. Li; M. Potkonjak, and W. Wolf, "Real-time Operating Systems for embedded computing," in Proc. of the IEEE International Conference on Computer Design: VLSI in Computers and Processors, pp. 388-392, 1997. [10] K. Ramamritham, and J. Stankovic, "Scheduling algorithms and operating systems support for real-time systems," in Proc. of the IEEE, vol. 82, no.1, pp. 55-67, Jan. 1994. [II] A. Colin, and I. Puaut, "Worst-case execution time analysis of the RTEMS real-time operating system," in 13th Euromicro Conference on Real-Time Systems, pp. 191-198, June 2001. 112 [12] M. Lindgren, H. Hansson, and H. Thane, "Using measurements to derive the worst-case execution time," in Proc. 
7th International Conference on Real-Time Computing Systems and Applications, pp. 15-22, Dec. 2000. [13] C.Ditze, "DREAMS- Concepts of a distributed real-time management system," Engineering Practice, vol. 4, no. 10, pp. 1451-1460, Oct. 1996. [14] D. Kalinsky, and J. Ready, "Real-time operating system kernel technology in the twenty-first century," in the Second International Conference on Software Engineering for Real-Time Systems, pp.214-218, 1989. [15] J. Stankovic, "The Spring architecture," in Proc. Euromicro Workshop on Real-Time, pp. 104-113, June 1990. [16] K. Schwan, A. Geith, and H. Zhou, "From Chaos b a s e to Chaos a r c : A family of real-time kernels," in Proc. Real-Time Systems Symp., pp. 82-91, Dec 1990. [17] R. Rajkumar, " Real-time synchronization protocols for shared memory multiprocessors," in Proc. 10th International Conference on Distributed Computing Systems, pp.116-123, June 1990. [18] S.K. Baruah, "Scheduling periodic tasks on uniform multiprocessors," in 12th Euromicro Conference on Real-Time Systems, pp. 7-13, June 2000. [19] S. Ramamurthy, and M. Moir, " Static-priority periodic scheduling on multiprocessors," in Proc. 21st IEEE on Real-Time Systems Symp., pp.69-78, Nov. 2000. [20] S. Baruah, J. Gehrke, and G. Plaxton, "Fast scheduling of periodic tasks on multiple resources," in Proc. 9th International Parallel Processing Symp., pp.280-288, Apr. 1995. [21] R. Rajkumar, L. Sha, and J.P. Lehoczky, " Real-time synchronization for multiprocessors," in Proc. Real-Time Systems Symp., pp.259-269, 1988. [22] P. Tsigas, and Yi Zhang," Non-blocking data sharing in multiprocessor real-time systems," in (>h International Conference on Real-Time Computing Systems and Applications, pp.247-254, Dec. 1999. [23] A. Garcia Martinez, J.F.Conde, and A. Vina, " A comprehensive approach in performance evaluation for modern real-time operating systems," in Proc. 22 n d Euromicro Conference. Beyond 2000: Hardware and Software Design Strategies, pp. 61-68, Sep. 1996. 113 [24] L. Keate, "A real-world approach to benchmarking DSP real-time operating systems," in Conference Proceedings WESCON'97, pp. 418-424, Nov. 1997 [25] M. Machtel, and H. Rzehak, "Measuring the influence of real-time operating systems on performance and determinism," Control Engineering Practice, vol. 4, no. 10, pp. 1461-1469, Oct. 1996. [26] J. Engblom, A. Ermedahl, and P. Altenbernd, "Facilitating worst-case execution times analysis for optimized code," in Proc. 10lh Euromicro Workshop on Real-Time Systems, pp. 146-153, Jun. 1998. [27] H.A. Aljifri, A. Pons, and M.A. Tapia, "Tighten the computation of worst-case execution-time by detecting feasible paths," in Proc. of the IEEE International Conference of Performance, Computing and Communications, pp. 430-436, Feb. 2000. [28] Y-T.S.Li, S. Malik, and A. Wolfe, "Efficient microarchitecture and path analysis for real-time software," in Proc. 16th Real-Time Systems Symp., pp. 298-307, Dec.1995 [29] J. Engblom, and A. Ermedahl, "Pipeline timing analysis using a trace-driven simulator," in 6th International Conference on Real-Time Computing Systems and Applications, pp. 88-95, Dec. 1999. [30] VxWorks operating system, WindRiver, http://www.windriver.com, 2002. [31] LynuxOS, LynuxWorks, http://www.lynuxworks.com [32] Virtuoso operating system, WindRiver, http://www. windriver.com, 2002. [33] OSE real-time kernel, OSE systems, http://www.enea.com, 2001. [34] A. Burns, A. Wellings, "Real-Time Systems and Programming Languages," Addison Wesley, 1997. 
[35] Digital Signal Processing Solutions, "TMS320C6000 C Compiler" Texas Instruments, 2000 [36] Digital Signal Processing Solutions, "TMS320C6000 Peripherals" Texas Instruments, 2000 [37] Digital Signal Processing Solutions, "TMS320C6000 Programmer's Guide" Texas Instruments, 2000 [38] Digital Signal Processing Solutions "TMS320C6000 CPU and Instruction Set" Texas Instruments, 2000 114 [39] Digital Signal Processing Solutions "TMS320C3x General purpose applications user's guide" Texas Instruments, 1997 [40] Spectrum Signal Processing, "Daytona Dual C6x PCI Board Technical Reference", 1999 [41] Spectrum Signal Processing, "Hurricane DSP Bridge ASIC Data Sheet", 1999 [42] Spectrum Signal Processing, "Daytona/Barcelona C6x PCI Board Windows NT Programming Guide", 1999. [43] D. Bell, and E. Biscondi, "TMS320C620X/TMS320C6701 DMA and CPU: Data access performance," Texas Instruments, Application Report, Aug. 2000. 115 Appendix A: Spectrum's Download Function Bug A bug was found in the "FT_Systeml_oad" Spectrum's function. This function is responsible for downloading the code to the DSP board, and for freeing the board's memories from previous allocations. However, this function simply does not free any memory. An application cannot be run more than twice without the system running out of memory. To overcome this problem, the code had to first be downloaded with the Code Composer Studio (Tl's debugger for the DSP C6x) to force free the board's memories, and to finally initiate the application. This problem, or bug, was completely unknown to the manufacturer (Spectrum), who would not accept responsibility until we asked them to perform some specific tests aimed to prove the failure of the "FT_SystemLoad " function. After Spectrum performed those tests, it sent a patch that was supposed to fix the problem. The patch sent by Spectrum did not fix the bug. The patch works only if used with a small size source code. However, for large source code such as the ORTS real-time kernel, it does not work. The following message is displayed after running the application: "Error Reading from Coff file." To circumvent the problem, Spectrum proposed a possible solution. The idea was that every message, for example, the displayed with the "print" function, be modified to make them be multiples of 4 characters. ORTS has hundreds of those messages, and modifying each one of them would have represented a total waste of time. Furthermore, MAL, as a Spectrum's user should not be limited to that kind of constraint. This problem was approached, though not yet solved, by splitting the source code into different memories. When the user applications (user code) were added to ORTS, the ORTS real-time kernel and user applications stopped fitting together on on-chip memory. The real-time kernel was then allocated to on-chip memory, while the rest of the code was allocated to the SSRAM. This distribution made the code appear smaller, and thus, possible to be correctly handled by the Spectrum's loader function, with the help of the patch. So far, this apparent solution seems to be working fine. It is important to remember that the bug is not yet fixed. Spectrum is responsible for doing so; however, they have made it known that the problem is something they won't address soon. 116 In this thesis we found out that very specific and important rules need to be followed to successfully design a loader (function to download code to the C6x DSP), when using the "-cr" linker option for initialization of variables at load time. 
This explanation can be found in the TI User's guides [35,37]. Spectrum somehow fails to follow Tl's rules. After discovering the relation between the loader problem and the linker initialization options ("-cr" initialization of variables at load time, and "-c" autoinitialization of variables at run time), we performed some tests and found the following: If the patch is not used: 1) Use the " -c " linker option instead of the "-cr" option. 2) If the "-cr" option is used, the code needs to first be downloaded with the CCS to de-allocate memory. If the patch is used: Use a small size source code, or if using a large code, split it into different memories as was done with ORTS. 117 
