Approximation Algorithms for Task Allocation with QoS and Energy Considerations by Bader N. Alahmad B.S. Computer Engineering, Jordan University of Science and Technology, 2007 a thesis submitted in partial fulfillment of the requirements for the degree of Master of Applied Science in the faculty of graduate studies (Electrical and Computer Engineering) The University Of British Columbia (Vancouver) May 2011 c© Bader N. Alahmad, 2011 Abstract We consider general problems of allocating tasks to processors where each task is associated with a set of service classes. A service class is a tuple: the first element represents the resource utilization and the second element the quality of service associated with that resource utilization. Before al- locating a task to a processor, we need to assign it to a service class. We consider some elementary problems that arise from this setting. What is the minimum number of processors needed if we also need to attain a minimum aggregate QoS level? Given a fixed set of processors, what is the maximum attainable aggregate QoS? Such questions apply to the partitioned schedul- ing of real-time tasks on multiprocessors and to other resource allocation problems. The basic questions are NP-Hard, and we present polynomial time approximation algorithms for certain special, but practical, cases. We make interesting use of maximum flows in a bipartite graph while developing the polynomial time approximation schemes. We then integrate energy ex- penditure to the model above and address the problem of partitioning a set of independent, periodic, real-time tasks over a fixed set of heterogeneous processors while minimizing the energy consumption of the computing plat- form subject to a guaranteed quality of service requirement. This problem is NP-Hard and we present a fully polynomial time approximation scheme for this problem. The main contribution of our work is in tackling the prob- lem in a completely discrete, and possibly arbitrarily structured, setting. In other words, each processor has a discrete set of speed choices. Each task has a computation time that is dependent on the processor that is chosen to execute the task and on the speed at which that processor is operated. ii Further, the energy consumption of the system is dependent on the decisions regarding task allocation and speed settings. iii Preface Chapter 3 is based on a draft initiated by Sathish Gopalakrishnan1, in col- laboration with Nathan Fisher 2. My contribution was writing parts of the aforementioned chapter, and deriving some of the mathematical proofs, in addition to verifying previously written proofs and correcting some mistakes. In general, I contributed to about one third of the content of chapter 3. I would like to indicate that the work in chapter 3 has not been published in any scientific conference or journal to the date of submission of the thesis. 1Supervisor, Assistant Professor, Department of Electrical and Computer Engineering, University of British Columbia 2Assistant Professor, Department of Computer Science, Wayne State University iv Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . v List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . x 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2 Background and Related Work . . . . . . . . . . . . . . . . 6 2.1 Complexity Theory, Combinatorial Optimization and Approx- imation Algorithms: Preliminaries . . . . . . . . . . . . . . . 6 2.1.1 Complexity Theory . . . . . . . . . . . . . . . . . . . . 6 2.1.2 Combinatorial Optimization . . . . . . . . . . . . . . . 12 2.1.3 Approximation Algorithms . . . . . . . . . . . . . . . 14 2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3 On Task Allocation Problems with Service Classes . . . . 23 3.1 Motivation: Virtual Machine Allocation . . . . . . . . . . . . 23 3.2 Definitions and Formal Problem Formulation . . . . . . . . . 25 v 3.2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . 25 3.2.2 Problem Definition . . . . . . . . . . . . . . . . . . . . 27 3.2.3 Problem Hardness . . . . . . . . . . . . . . . . . . . . 28 3.2.4 Restricted Problem Formulation . . . . . . . . . . . . 29 3.3 Solving Restrictions to TASC . . . . . . . . . . . . . . . . . . 29 3.3.1 Fixed Service Classes . . . . . . . . . . . . . . . . . . 30 3.3.2 Minimal Service Class . . . . . . . . . . . . . . . . . . 32 3.3.3 TASC . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.3.4 Maximizing QoS with m Processors . . . . . . . . . . 35 3.4 Experimental Evaluation and Further Extensions . . . . . . . 35 4 Energy Efficient Task Partitioning and Real-Time Schedul- ing on Heterogeneous Multiprocessor Platforms with QoS Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.2 System Model and Notation . . . . . . . . . . . . . . . . . . . 39 4.2.1 Platform . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.2.2 Task and Scheduling Model . . . . . . . . . . . . . . . 40 4.2.3 Energy Model . . . . . . . . . . . . . . . . . . . . . . . 41 4.2.4 Quality-of-Service Model . . . . . . . . . . . . . . . . 42 4.2.5 Task Set Encoding . . . . . . . . . . . . . . . . . . . . 42 4.2.6 Notes Regarding Assumptions . . . . . . . . . . . . . . 42 4.3 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . 45 4.4 Assignment of Tasks to Processors . . . . . . . . . . . . . . . 49 4.4.1 Exact and Optimal Dynamic Program Formulation . . 49 4.4.2 The Fully Polynomial-Time Approximation Scheme . 53 5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 Appendix A AModified Pruning Technique: Energy Round- ing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 A.1 Running-Time Analysis . . . . . . . . . . . . . . . . . . . . . 80 vi List of Tables Table 3.1 Amazon EC2 Instance Pricing and Configuration. (1 EC2 Compute Unit Provides CPU Speeds of About 1 GHz.) (Amazon.com [5]) . . . . . . . . . . . . . . . . . . . . . . . 23 Table 4.1 Basic Math Benchmark (Single Iteration) . . . . . . . . . . 38 Table 4.2 Fast Fourier Transform (FFT) Benchmark (1000 Iterations) 39 vii List of Figures Figure 3.1 Hierarchical Scheduling Using a VMM/Hypervisor . . . . 26 Figure 3.2 Network Flow and Feasible Assignments . . . . . . . . . . 31 viii Glossary EDF Earliest Deadline First PTAS Polynomial Time Approximation Scheme FPTAS Fully Polynomial Time Approximation Scheme FFT Fast Fourier Transform TM Turing Machine DVS Dynamic Voltage Scaling P Polynomial-time, the complexity class containing languages decidable by deterministic TMs in polynomial-time NP Non-Deterministic Polynomial-time, the complexity class containing languages decidable in polynomial-time by non-deterministic TMs CONP Complement Non-Deterministic Polynomial-time, a language is in the complexity class coNP if and only if its complement is in NP L Logarithmic-space, the complexity class containing languages decidable in polynomial-time by deterministic TMs that use space that is logarithmic in the size of the input SAT Satisfiability Problem RM Rate Monotonic ix VMM Virtual Machine Monitor WCEC Wosrt Case Execution Cycles WCET Wosrt Case Execution Time x Acknowledgments I offer my enduring gratitude to the faculty, staff and my fellow students at UBC, who have inspired me to continue my work in this field. I owe particular thanks to my adviser, Dr. S. Gopalakrishnan, for his continuing support, and for giving me the liberty to work on what I love, and whose penetrating questions and provocative discussions taught me to question more deeply. Special thanks are owed to my parents, who have supported me through- out my years of education, both morally and financially. I owe particular thanks to my friends for their support, and especially Yazan Boshmaf, who dedicated considerable time listening to me talking about my research problems and engaging in fruitful discussions, and whose ideas and thoughts certainly contributed to the work. xi Chapter 1 Introduction Several fundamental questions in resource allocation for multiprocessor com- puting systems relate to task allocation schemes that satisfy a set of con- straints. For a real-time system, timeliness, or the ability of the system to meet deadlines, is a vital property. Implicit-deadline periodic tasks are well understood from a scheduling perspective Buttazzo, Liu [12, 42]. These are tasks that release jobs periodically, and each job needs to be completed be- fore the next job of the same task is released. A task τ is characterized by a period P and execution time e; the tasks utilization is u = e/P . For a set Γ = {τ1, . . . , τn} of tasks, the total utilization is U = ∑n i=1 ui = ∑n i=1 ei/Pi. In one of the simplest models for real-time multiprocessor systems, a set of implicit-deadline periodic tasks needs to be allocated to processors such that no task will miss its deadline Carpenter, Funk, Holman, Srini- vasan, Anderson, and Baruah [13]. When tasks are scheduled using the earliest deadline first policy, the problem of partitioning tasks among a set of processors is akin to the bin packing problem. Using EDF, a set of implicit-deadline tasks running on the same processor is schedulable if the total processor utilization is no greater than 1. Partitioning a task set over m processors is feasible if the corresponding bin packing problem Coffman, Garey, and Johnson [18], i.e., how many unit-capacity bins are needed to pack items of sizes u1, u2, . . . , un? has a solution that is no greater than m. The bin packing problem is computationally hard (NP-hard), and so is the 1 partitioning problem (Garey and Johnson [25]). Extensive work has concentrated on understanding the schedulability of real-time task sets on uniprocessor and multiprocessor platforms only in the homogeneous or “single-mode” task model (Buttazzo, Liu [12, 42]). We, however, view the system from a quality-of-service perspective. In general, quality of service is a measure of the maximum error a task might tolerate, or the minimum perceived precision in a dual sense. In our model, quality of service is fully specified by service classes, each service class being a tuple (uj , qj). Each tuple represents a utilization for the task and the associated quality index, and task τi is associated with a set of service classes. We restrict our attention to monotonically-increasing tuples, i.e., if (uj , qj) and (uk, qk) are two service classes of τi and uj > uk then qj > qk. We consider the problem of selecting the optimal choice of service classes in a multiprocessor setting. Although we describe the problem of interest in the specific setting of real-time scheduling for multiprocessors, this prob- lem is a generic resource allocation problem and our interest in the problem stemmed from recent developments in server consolidation and virtualiza- tion, which establishes some more context for the applicability of this model and the results (Section 3.1). We extend the setting above to incorporate energy expenditure of the computing system as the objective value to be minimzed, while maintaining a desired quality of service score. We consider a resource allocation problem where we are given a set of heterogeneous processors with discrete speed settings and a set of periodic, independent, real-time tasks. How do we par- tition the tasks across the available set of processors and choose appropriate speed settings for those processors such that we achieve minimum energy consumption while satisfying some specified quality of service requirement? In this setting, the energy consumed by the complete system depends on the task allocation and on the speeds of the processors. This problem is motivated by two important issues to consider in the design of current and future embedded systems: energy consumption and processor heterogeneity. Furthermore, the emphasis on a system model with completely discrete set of choices is based on architectural considerations as well as empirical evi- 2 dence that suggests that such a model is indeed the appropriate model to apply. Energy in embedded systems and especially battery-powered devices is a valuable resource whose expenditure needs to be kept at minimum in order to increase the lifetime of such systems. At the same time, tasks running on those systems should be appropriately serviced according to their com- putational and timeliness requirements. There are two major contributors to the overall energy consumption in a processor: (i) the dynamic power consumption due to switching activities and (ii) the leakage power con- sumption due to leakage current draw in CMOS circuits (Jejurikar, Pereira, and Gupta [32].) The former depends on the speed at which the processor is operating, while the latter is present whenever the processor is on, and is a constant. In addition to the energy consumed by the processor, the total energy consumed by the computing systems depends on the energy consumed by other devices and peripherals in the system (e.g., memory, I/O). This energy consumption is not dependent on the processor speed. This aspect of energy modeling is important because it allows us to capture energy consumption beyond what is specific to the processing units. Most embedded systems now constitute multiple processors in order to increase the processing throughput. In addition, each processor can be con- figured to run at a speed (frequency) from a limited set of allowable speeds. Moreover, the processor’s operating speed can be varied while the system is running without interrupting task execution. Heterogeneous computing systems consist of multiple processing components having different architec- tures and computing capabilities, interconnected using different connectivity paradigms. For example, the NomadikTM platform (Wolf [58]) includes an ARM processor, a video accelerator and an audio accelerator, each of which is itself a heterogeneous multiprocessor. Such systems better meet the de- mand of applications with diverse requirements, because the computational requirements of a task might differ significantly on different processing ele- ments, and heterogeneous systems allow tasks to be matched to computing elements that better serve their requirements. In a heterogeneous platform the assignment of tasks to processors may 3 impact the end-user quality of service because certain processor types may be better suited to certain tasks. For example, executing a graphics task on a specialized graphics processor yields better results than performing the same operation on a general-purpose processor. In a system with hetero- geneous compute units and multiple speed settings for each processor, a system architect may explore the trade-off between quality of service and energy efficiency. This is the stage of the design process that we target in this thesis. To simplify the discussion, we can think of a platform with a fixed num- ber of processors. We need to decide the processor type for each processor from a set of available processor types (permitting heterogeneity in the plat- form), then selecting a particular speed for a processor and finally assigning tasks to that processor. A task would have a certain worst-case execution time at the selected speed and each instance of a periodic task will consume a certain amount of energy (active-time energy) that depends on the type of processor selected and the associated speed of the processor. Additionally when a processor is idle it will consume some energy (idle-time energy). Our energy expenditure model is central to our contribution. We relax many of the impractical assumptions underlying state-of-the-art solutions, including the work of Yang, Chen, Kuo, and Thiele [60], which is the closest to our efforts. In a sense, prior models assume 1) continuity of speed levels on machines 2) linearity of WCET requirements of tasks with respect to processor speed 3) constant WCEC of tasks with respect to speed levels on processors 4) linear interpolation of the energy expenditure when calcula- tions result in speed levels that are not available on the processor, in case of a discrete processor speed model, and 5) that energy expenditure on a processor depends solely on the total utilization of the processor. These as- sumptions are agnostic to the fact that different tasks exercise differential, and somewhat arbitrary quality of execution as speed varies on a processor. A consequence of such assumptions is that previous models cannot capture the energy as consumed by devices other than the processor, which is due to nonlinearities of task execution requirements and the overall energy con- sumption when tasks spend considerable fraction of their time interacting 4 with I/O devices, such as memory and disk (I/O bound tasks). Therefore, those models are very difficult to scale to account for energy expenditure of the compute system as a whole (see chapter 4, section 4.1 for an experi- mental illustration of some of the intricacies mentioned above, and section 4.2.6 for a more elaborate discussion). Due to the practical needs of discrete and arbitrarily structured settings, our solutions are combinatorial, and our algorithms employ a blend of matching and enumeration techniques, with dynamic programming being the general algorithmic tool. In summary, our contributions are 1. New model for energy expenditure that alleviates previous impractical assumptions; 2. PTASs for varieties of (strongly NP-Hard) task allocation problems, where tasks execute on identical machines, and the goal is to minimize the number of machines on which the task set is to be deployed, while maintaining a certain quality of service score; 3. An FPTAS for the (NP-Hard in the ordinary sense) problem of allo- cating tasks on unrelated parallel machines (heterogeneous platform), where the number of machines is fixed, and the goal is to find an assign- ment of tasks to service classes and simultaneously build a platform from an assortment of given machine types, and then assign tasks (as they are equipped with service classes) to the machines comprising the platform such that the platform expends as minimum energy as possible while the aggregate quality of service score of the solution assignment is above that of a given quality of service threshold, and 4. Provable worst case bounds on the quality of the solutions returned by our approximation schemes relative to the optimal solution (produced by the exact, intractable algorithm), expressed in terms of a user- supplied error factor that determines the quality vs. efficiency trade- off. 5 Chapter 2 Background and Related Work 2.1 Complexity Theory, Combinatorial Optimization and Approximation Algorithms: Preliminaries This section highlights important definitions and sets the theoretical ground upon which our work is based. The presentation in this section is highly influenced by the beautiful books of Papadimitriou and Steiglitz [45], Vazi- rani [54] and Arora and Barak [7]. Readers familiar with the notions intro- duced here can skip directly to the core problems in chapters 3 and 4. 2.1.1 Complexity Theory A decision problem is one which demands either a “yes” or a “no” answer. Decision problems are best described in terms of languages. Consider an alphabet Σ. A language L ⊆ Σ∗ is a set of strings over Σ, where L encodes instances whose answer to the decision problem is “yes”. We say that a language L is recognizable if there exists a Turing Machine (TM) 1 that, when run on an input x ∈ Σ∗, produces “yes” and halts and if x ∈ L. This 1The terms “Turing Machine” and “Algorithm” will be used interchangeably. 6 TM, however, is not required to halt and say “no” if x /∈ L, but might loop forever without producing an answer; it is only capable of determining membership in the language L if x is a “yes” instance. A stronger notion of computability is that of decidability ; that is, if membership in the language for any input can be determined by some TM that always halts: if x ∈ L, then the TM produces “yes” and halts, and if x /∈ L, then the TM halts and produces “no”. Therefore, a language L ⊆ {0, 1}∗ is decidable if both L and its complement L are recognizable. An example of an undecidable language is the famous Halting Problem, which can be stated informally as: given a computer program description and input, is the program ever going to halt when run on the input? This problem is reminiscent of the program verification problem, in which we need to check that the program adheres to its specifications over all possible inputs. The formal language associated with the halting problem is LHALT = {〈M,w〉 |M is a TM (or program) that halts on input w}, where LHALT contains all valid program (TM) encoding and input pairs for which program M halts when presented with input w. In his seminal pa- per, Turing [52] showed that regardless of the computing power, the halting problem cannot be solved by any algorithm (computer), thus drawing the limits on solvability by machines in general. Intuitively, a decision problem is a language that has context specific semantics attached to it. For example, consider the following decision prob- lem. CLIQUE: Given a graph G = (V,E) and an integer k, does G contain a clique of size at least k ? The required answer to CLIQUE as posed above is in the form “yes”/“no”. We can write the language corresponding to CLIQUE as LCLIQUE := {G | G contains a clique of size ≥ k}. (2.1) Accordingly, LCLIQUE is the set of graph encodings, assuming any reasonable encoding such as the adjacency matrix representation, which contain cliques having at least k vertices. Therefore, the language LCLIQUE is decidable if there exists a TM that, given x, can decide in a finite number of steps 7 whether or not x is a valid graph encoding and that it contains a clique of size at least k, which is exactly deciding whether or not x ∈ LCLIQUE. The definitions above concerning computation do not consider the effi- ciency of the algorithm deciding a language: They only require the compu- tation to halt in a finite number of steps, with no regard to the number of steps the machine takes until it finishes its computation. In the following, we shall classify languages, or problems, into sets, according to their hardness, which is quantified in terms of the efficiency of the algorithms that so far exist to solve them. The most accepted dichotomy of problems according to their hardness of computation is that of polynomial time solvability. A language is said to be polynomial-time decidable if there exists an algorithm that can decide the language in time polynomial in the size of the input, assuming a reasonable input encoding. For input x, we shall denote the size of x as SIZE(x), which we define as the number of bits required to encode x input in binary. In fact, we will always assume a binary encoding and thus require our alphabet to be Σ = {0, 1}, so that x ∈ {0, 1}∗. For example, consider the problem PRIMES: Given an integer N , is N prime?. According to the definition of the size of an instance above, SIZE(N) = log2N . As another example, an instance of CLIQUE is of the form I = (G, k); the size of the input is comprised of the dimension of I, namely n := |V |, and the magnitudes of numbers involved in the input, namely k whose size is log2 k, so SIZE(I) = n+ log2 k. Later when we discuss our algorithms, the input encoding will be crucial to the analysis of the running time of the algorithm and thus some notions need to be made precise at this point. For a rational number of the form x = p/q, where p, q ∈ Z and q 6= 0, we will assume that x can be represented as a pair (p, q) and thus will require O(log p + log q) bits to be encoded in binary. If we talk of real numbers, then we will assume that we have at our disposal a function APPROX(n, t), where n is the real number and t is the desired number of digits after the decimal point. The function APPROX(n, t) thus returns t decimal places of the real number n. Accordingly, the number of bits required to represent any real number n restricted to the first t decimal digits is equal to the number of bits required to store the procedure 8 APPROX(n, t). Note that, since the number of valid procedures (or, in general, TM encodings) is countably infinite, and the real numbers are uncountable, it follows that not all real numbers can be represented this way. More specifically, we cannot establish a one-to-one correspondence between the integers (procedures, TM encodings) and real numbers, for otherwise we can fall into a contradiction, which can be shown by a simple diagonalization argument. The class P contains all languages that are polynomial-time decidable (or computable if we talk of problems or functions). For example, if PRIMES is to be in P, then there should exist an algorithm that decides whether or not a given integer N is prime in O((logN)c) for some constant c > 0. In fact, PRIMES was shown to belong to the class P by Agrawal, Kayal, and Saxena [2]. Another example of a problem in P is the undirected graph reachability, or the undirected s-t connectivity problem USTCON: Given an undirected graph G = (V,E), and two nodes s, t ∈ V , is there a path in G from s to t ?. As a matter of fact, Reingold [48] proved, in a breakthrough, the stronger result that USTCON is in log-space (the class ), thus settling a long lasting open question. The class contains those languages for which there exist TMs that can decide them using O(log n) space, where n is the size of the input. On the other hand, a language L ⊆ {0, 1}∗ is said to be in the class NP if there is a polynomial p : N→ N and a polynomial-time TM M such that, on input x ∈ {0, 1}∗: • if x ∈ L, then there exists a certificate of the solution, a string z whose size is polynomially bounded by the size of x, i.e., SIZE(z) 6 p(SIZE(x)), or z ∈ {0, 1}p(SIZE(x)), such that M(x, z) accepts, and • if x /∈ L, then for any string z ∈ {0, 1}p(SIZE(x)), M(x, y) rejects; M is called a verifier TM. Note that the definition above requires only “yes” certificates to be polynomially bounded in the size of the input, but not necessarily the “no” certificates. Some authors define a language as being in NP if there exists a non-deterministic polynomial-time TM that decides the language. The name NP stands for Non-deterministic Polynomial-time, 9 which is derived from the latter definition. An example of a problem in NP is the above-mentioned CLIQUE; a verifier takes as input a graph G and a subgraph H and checks whether or not H is a complete subgraph of G with V (H) ⊆ V (G) of size at least k by checking, for every pair of vertices i, j ∈ V (H), whether there exists an edge e = {i, j} in H that is also an edge in G. This checking requires examining (|V (H)| 2 ) = O(|V (H)|2) vertices, which is certainly a polynomial in the size of G, namely |V |. In a sense, the class NP captures those problem that have “succinct” and “efficiently” verifiable “yes” certificates. Note that our definition of the class NP requires only “yes” certificates to be polynomially bounded in the size of the input, but not necessarily the “no” certificates. For example, consider the Boolean Satisfiability Problem SAT: Let ϕ be a Boolean formula (a propositional logic formula) with n variables x1, . . . , xn, and m clauses c1, . . . , cm in Conjunctive Normal Form (CNF), i.e., ϕ = c1 ∧ . . . ∧ cm, where each clause is a disjunction (∨) of literals (variables either negated or not). The question is: given such a formula, is there an assignment of the variables x1, . . . , xn such that ϕ is satisfied? A “yes” certificate is any assignment for which ϕ evaluates to true (for a suitably defined truth system); such a certificate has length n and is therefore linear in the size of ϕ, thus SAT ∈ NP. On the other hand, a “no” certificate (i.e., for deciding whether ϕ is NOT satisfiable) is an enumeration of all possible truth assignments of the variables, which is 2n, an exponential in the size of the formula. A language L ⊆ {0, 1}∗ in the class coNP if its complement L = {x ∈ {0, 1}∗ | x /∈ L} is in NP. The languages in coNP have succinct and quickly verifiable “no” certificates (counterexamples). For example, the complement of SAT, namely Boolean formulas that are NOT satisfiable for any truth assignment of their variables, is in coNP; we define the language of SAT as LSAT = {ϕ | ϕ /∈ SAT}. A “yes” certificate to an instance of SAT is the set of all truth assignments of variables (2n), while a “no” certificate is a single truth assignment of the 10 variables such that the input formula evaluates to true. If a language L contains some instance that have short “yes” certificates and others that have short “no” certificates, then L ∈ NP ∩ coNP. An example of a problem in NP ∩ coNP is FACTORING: Given three numbers N , L, U decide if N has a factor M in the interval [L,U ]. The certificate for membership in NP is the factor M . The certificate for membership in coNP is a certificate of membership in PRIMES. To verify membership in PRIMES, one should exhibit an integer x that is coprime to N and satisfies the following • xN−1 ≡ 1 (mod N), • for every prime factor p of N − 1, x(p−1)/N 6≡ 1 (mod N). In fact, one should be able present the prime factorization of N − 1 for each N recursively as required by the second condition (but prime factors get mercifully smaller at each recursive step). This primality test was conceived by Pratt [47], and he showed that the test requires time that is polynomial in logN , thus showing that PRIMES ∈ NP. Now we overview the concept of NP-Completeness. Given two languages L1 ⊆ {0, 1}∗ and L2 ⊆ {0, 1}∗, we say that L1 is polynomial time reducible to L2, denoted as L1 ≤p L2, if there exists a polynomial-time computable function f : {0, 1}∗ → {0, 1}∗ such that for every x ∈ {0, 1}∗, x ∈ L1 iff f(x) ∈ L2. Language L2 is NP-Hard if L1 ≤p L2 for every L1 ∈ NP. If, in addition, L2 ∈ NP, then L2 is NP-Complete. We point out that the afore- mentioned SAT is the first problem that was shown to be NP -Complete, in- dependently by Cook [19] and Levin [39] (The Cook-Levin Theorem). There are, however, problems that in NP but are not known to be NP-Complete, for they lack a polynomial-time reduction from a known NP-Complete prob- lem. An example of such a problem is GRAPH_ISOMORPHISM: Given two n×n adjacency matrices M1, M2, decide if M1 and M2 define the same graph, up to renaming of vertices. Assume that the graph G = (V,E) represented by matrix M has its vertices labeled by integers in [n] = {0, . . . , n− 1} (the multiplicative group of integers modulo n), where n := |V |. Then the cer- 11 tificate is the permutation pi : [n] → [n] such that M2 is equal to M1 after reordering M1’s indices according to pi. 2.1.2 Combinatorial Optimization Optimization problems concern themselves with finding the “best” config- uration or a set of parameters to achieve some goal. If the variables to be determined are discrete, then optimization problems are called combinato- rial. On the other hand, in continuous optimization problems one seeks an assignment of variables from the real line, or sometimes one needs to search for a function. We will be concerned only with combinatorial optimization problems. The variables of an optimization problem obey a set constraints, usually specified by a set of equalities or inequalities. Configurations whose variables are set such that all constraints are satisfied are called feasible. The goal that an optimization problem must achieve is either the minimiza- tion or the maximization of the value of an objective function, by making a choice of a feasible configuration. Formally, an optimization problem Π consists of set of instances. Each instance I ∈ Π is a pair (F, c), where F is a set; the domain of feasible points, and c is the cost function c : F → R. The problem is to find f ∈ F such that c(f) 6 c(y) for all y ∈ F. The point f is called (globally) optimal with respect to instance I, and we shall denote its cost as OptΠ(I). We will often drop the subscript Π when the problem is well understood from the context, and we shall denote the optimal value of the problem over all instances (or if we are considering a generic instance) simply as Opt. Intuitively, an instance of a problem is a fixed input that encodes enough information to produce a solution. For instance, consider the Traveling Salesperson Problem TSP: Given a set 12 of n nodes, ( n 2 ) numbers di,j denoting the distances between all pairs of nodes, find a closed circuit (i.e., a “salesperson tour”) that visits every node exactly once and has minimum total length. If we denote as pi a cyclic permutation on n objects, then if we interpret pi(j) as the city visited after city j, j ∈ {1, . . . , n}, then the goal is to minimize the length of the tour having a cost function defined as the map c : pi → n∑ j=1 dj,pi(j). The set F corresponding to an instance of TSP can thus be defined as F = {all cyclic permutations pi on n objects}. As another example, consider the Linear Programming problem LP: Let m, n be positive integers, b ∈ Zm, c ∈ Zn, and A be an m×n matrix matrix such that ai,j ∈ Z. Then an instance of LP is defined as F = {x ∈ Rn | Ax 6 b, x > 0}, and the goal is to maximize the cost c, which is defined as the map c : x→ cTx. Consider the following legitimate question: what if the decision version corresponding to a combinatorial optimization problem is NP-Complete? Does Hardness carry over from decision problems to their optimization ver- sions? The answer to this question is YES. For any minimization (maxi- mization) optimization problem, one can formulate its decision version by upper (resp. lower) bounding the value of its objective (cost) function by some rational number B and ask: “Is there f ∈ F such that c(f) 6 B (resp. c(f) > B for maximization)?”. If we have an “oracle” that returns the op- timal solution of an optimization problem in time polynomial in the input size, then an algorithm for the decision version can query the oracle for the 13 optimal solution, compute its cost c (assuming, of course, the c is easy to compute) and then output the answer by comparing the value of the cost to the constant (bound) associated with it. But this means that the decision problem is decidable is polynomial-time and, if the decision problem is NP- Complete, then this entails that P 6= NP. Therefore, we conclude that the optimization problem is at least as hard as the decision version. For example, let us consider the optimization problem associated with the decision clique, which we call MAX_CLIQUE: Given a graph G = (V,E), find a clique (complete subgraph) in G of maximum size. If there is a polynomial-time algorithm for solving the optimization MAX_CLIQUE, then an algorithm for solving the recognition CLIQUE would simply query the optimization algorithm for the max. clique, compute its cost, and say “yes” if cost > B, and otherwise reject and say “no”. Conversely, if an algorithm exists for deciding the recognition version of an optimization problem, then an algorithm for solving the optimization problem can use the decision version decider as a subroutine and compute the optimal solution by dynamic programming techniques. We will abuse notation and denote NP-optimization problems whose decision versions are NP-Complete as NP-Hard. 2.1.3 Approximation Algorithms It is highly believed that, given the conjecture that P 6= NP, no efficient, polynomial-time algorithms for solving NP-Hard optimization problems will ever exist. Therefore, one sacrifices the exactness of the returned solution and settles for algorithms that produce feasible sub-optimal, but provably near-optimal solutions, which run in time polynomial in the size of the input. We make the notion of approximation algorithms for NP-optimization problems precise by introducing some definitions (see Vazirani [54]). Let Π be a minimization problem, and let δ : N→ Q+, be a function, where δ > 1. We say that algorithm A is a δ-factor approximation algorithm for problem Π if, on each instance I ∈ Π, A produces a feasible solution solution f for I 14 such that c(f) 6 δ(SIZE(I))Opt(I) [δ > 1, minimization], and the running time of A is bounded by a polynomial in SIZE(I). The definition for maximization problems is the same as that of minimization problems, except that we require that δ 6 1 and that c(f) > δ(SIZE(I))Opt(I) [δ 6 1, maximization]. We see that, for both minimization and maximization problems, the closer δ is to 1, the better the quality of the solution produced by A. Approximation algorithm design techniques vary greatly among prob- lems, and problems that are seemingly related might have inherently differ- ent approximability characteristics. For example, one might ask the ques- tion: since NP-Complete problems are polynomial-time reducible to each other, can a δ-factor approximation algorithm for one NP-Complete prob- lem be used to approximate other NP-Complete problems with the same approximation factor? The answer to this question is NO, because not all polynomial-time reductions preserve the approximation factor. Consider, for example, the following NP-Hard optimization problems • Maximum Independent Set MAX_IND_SET: Given a graph G = (V,E), find an independent set of vertices of maximum size. An independent set in a graph is a set of vertices, no two of which are adjacent. • Minimum Cardinality Vertex Cover MIN_VERTEX_COVER: Given a graph G = (V,E), find a subset of vertices V ′ ⊆ V that covers all edges in G, i.e., every v′ ∈ V ′ is incident on at least one e ∈ E, and such that V ′ has minimum size. There exists a polynomial-time reduction MAX_IND_SET ≤p MIN_VERTEX_COVER with regard to exact solution: since no edge is common between any nodes in an independent set, the remaining vertices not in the independent set must touch all edges at least once in G and thus form a vertex cover; therefore, the 15 minimum vertex cover is the complement of the maximum independent set. Denote as Vc the maximum independent set in G and let Is denote the mini- mum vertex cover in G. Thus Vc = V \Is and therefore |Vc| = |V |−|Is|. As an illustration of the phenomenon above, consider an input graphG = (V,E) with |V | = 1000 vertices. If |Is| = 510, then |Vc| = 490. Suppose that we have a 2-approximation algorithm for MIN_VERTEX_COVER. Let Vc′, Is′ de- note the sets containing the approximate vertex cover and independent set in G, respectively. According to the values above, an application of the ap- proximation algorithm for MIN_VERTEX_COVER might result in |Vc′| getting as large as 980 in the worst case. If we apply the reduction described above, then we get |Is′| = |V | − |Vc′| = 20, which is indeed a sever deterioration compared to the optimal value of 510. As a matter of fact, |Is′| approaches 0 in the worst as |Vc| approaches |V |/2. Therefore we see that having an approximation algorithm for either MAX_IND_SET or MIN_VERTEX_COVER cannot be used directly to solve the other problem by the simple reduction, and thus different techniques might be needed for different problems. Some problems cannot be approximable to within any δ. One example is the TSP in its full generality: A δ-factor approximation algorithm for TSP, for any δ, can be used to decide the Hamiltonian Circuit problem, which is NP-Complete, thus showing that P = NP ! Once the distance matrix [di,j ] is equipped with a metric space, then the triangle inequality automatically holds and we get the metric TSP (∆-TSP) problem, for which Christofides[17] showed a 3/2-approximation algorithm. On the other hand, some problems allow approximability to any required degree. Let Π be an NP-Hard optimization problem. Let > 0 be an error parameter that specifies the desired trade-off between the accuracy (quality) of the solution and the running time of the algorithm. We say that an algorithm A is an approximation scheme for Π if, given as input (I, ), where I ∈ Π is an instance of Π, then A produces a solution f such that • c(f) 6 (1 + )Opt, if Π is minimization; • c(f) > (1− )Opt, if Π is maximization. The bounds above say that the relative error of the value of solution 16 produced by A with respect to the value of the optimal solution should be bounded above by at most |c(f)−Opt| max(Opt, c(f)) 6 . Since in minimization problems c(f) > Opt, we require that c(f)−Opt c(f) 6 , from which we get c(f) 6 1 1− Opt 6 (1 + )Opt. Similarly for maximization problems, c(f) 6 Opt, so we require that Opt− c(f) Opt 6 , from which we get c(f) > (1− )Opt. If A runs in time polynomial in SIZE(I), then A will be called a Polynomial-Time Approximation Scheme (PTAS). A PTAS is not restricted to run in time polynomial in 1/. For example, the functions (1/)3n4 and n3 1/2 are allowable running times of PTASs. Such running times make PTASs of theoretical significance only, but they might be doomed to per- form poorly in practice (e.g., setting = 0.1 in n31/ 2 ). If A is required to run in time that is polynomial in both SIZE(I) and 1/, then A becomes a Fully Polynomial-Time Approximation Scheme. Examples of running time functions for FPTAS include (1/)3n4 and n5/2(log(1/))7. An FPTAS is the best one can hope (both practically and theoretically) for an NP-Hard optimization problem modulo the conjecture P 6= NP. One such problem that admits an FPTAS is the knapsack problem. Consider the optimization integer knapsack problem INT_KNAPSACK: Given an integer K and n objects having values v1, . . . , vn and weights w1, . . . , wn, find integers 17 x1, . . . , xn such that ∑n i=1 vixi is maximized and such that the constraint∑n i=1wixi 6 K is satisfied. INT_KNAPSACK is NP-Hard, since it decision ver- sion is NP-Complete, and it was shown to be NP-Hard by a polynomial-time reduction from the partition problem (see Papadimitriou and Steiglitz [45] chap. 15), which we define as follows. PARTITION: given integers c1, . . . , cn, is there a subset S ⊆ {1, . . . , n} such that ∑j∈S cj =∑j /∈S cj ? Luckily enough, there is a dynamic programming-based, pseudo-polynomial time algorithm that solves INT_KNAPSACK exactly in O(n2c), where c is the value of the optimal cost. The value c that appears in the running time need not be polynomially bounded in the dimension of the problem, namely n, and hence the “pseudo”-polynomiality. However, the mere existence of a pseudo-polynomial time algorithm for INT_KNAPSACK is the key for obtaining an FPTAS for it. In general, dynamic programing techniques are usually employed to derive approximation schemes for optimization problems, al- though the running time of the resulting algorithms can be prohibitively large to be used in practice. We will rely on a special dynamic program- ming formulation later to derive an FPTAS for the problem concerning energy minimization on multiprocessor platforms (chapter 4). 2.2 Related Work The problem of task allocation with service classes and QoS constraints is closely connected to several well-studied problems, particularly bin packing. de la Vega and Lueker [21] and Karmarkar and Karp [33] have developed approximation schemes for bin packing. Many variants of the bin packing problem have been studied, including bin packing with variable bin sizes (Friesen and Langston [23]). The class constrained bin packing problem deals with bins with C compartments Shachnai and Tamir [51]. The goal is to find a bin packing such that there are no more than C different classes of items in a bin. The close connection between bin packing and partitioned scheduling for real-time tasks on multiprocessors has been explored by Lopez, Diaz, Garcia, and Garcia [43], who identified utilization bounds for partitioned 18 multiprocessor scheduling with EDF. The closest effort related to our work is that of Lee, Lehoezky, Rajku- mar, and Siewiorek [36] on QoS optimization with discrete QoS options. The resource allocation methods described by Lee, Lehoezky, Rajkumar, and Siewiorek is applicable to the problem described in this article but the methods with good approximation ratios have pseudo-polynomial runtime complexity while our algorithms are polynomial-time approximations. Lee, Lehoezky, Rajkumar, and Siewiorek [36] developed a framework for opti- mizing QoS with discrete settings in real-time environments from the end users’ perspective. The authors considered QoS quantitatively and pre- sented resource planning algorithms for multiple applications, multiple re- sources (disk, network, processor, etc.) and multiple QoS dimensions, so as to maximize the utility of end users. Their best algorithm in terms of performance is a PTAS for solving the single resource multiple QoS dimen- sions problem (we look at the case of multiple processors or resources and one QoS dimension). Their algorithm is based on dynamic programming, and produces a solution that is at most (1 − ) far from optimal. Further, the authors developed a user friendly interface through which end users can specify their QoS requirements by choosing a parameterized utility curve that best matches their needs. The imprecise computation model (Liu, Lin, Shih, Yu, Chung, and Zhao [41]) divides the execution requirement of a job intomandatory and op- tional cycles. A task should be fully granted the execution of its mandatory requirements before it reaches its deadline, whereas the optional execution is a reward that the task receives depending on the state of the system. Accordingly, the reward is an increase in the precision of computation (e.g., increased frame rate in video streaming applications), and the goal is to maximize the reward (the amount of optional part executed), or minimize the error (the amount of unfinished optional part) in a dual sense. Khemka, Shyamsundar, and Subrahmanyam [34] developed an optimization scheme for reducing error when imprecise tasks are scheduled on a multiprocessor system. In their model they assumed continuous job sizes and convexity conditions on the error functions, and they permitted jobs to start on one 19 processor and then migrate to other processors. In the problems described in this article, utilization is varied in discrete steps, and each step is asso- ciated with a quality index. This approach offers some new ideas for task allocation. The Increasing Reward with Increasing Service (IRIS) model (Dey, Kurose, and Towsley [22]) does not require the execution times of tasks to be known apriori. A task receives as much execution time (value) as possible before it deadline expires, with no upper limits on the obtainable reward. Both the imprecise computation and IRIS models assume that the reward function is a non-decreasing concave function. We have presented a simple model for a service class. Lee, Shih, and Sha [37] used the term service class to refer to a pre-determined resource allocation in a problem related to radar systems. A service class in their work refers to some allocation of resources that pertain to certain situations (heavy load, light load, etc.), and precomputation allows them to perform online optimization depending on the system load. As energy minimization is concerned, a vast body of research is dedicated for investigating possible techniques for saving energy on computing systems. Approaches range from designing low power hardware, or that with power management capabilities, to designing energy-aware scheduling algorithms. On one hand, energy minimization can be achieved by means of Dynamic Frequency/Voltage Scaling, which is the most common technique used to minimize the energy consumption of processors in an online fashion. On the other hand, Power Management schemes aim at finding a suitable assign- ment of speeds to processors given prior knowledge of task parameters at design time, i.e. before tasks start execution. In their work, Irani, Shukla, and Gupta [30] considered uniprocessor sys- tems, and assumed that the power function P (s) is continuous and convex with respect to the speed s. Further they assume that speed is a contin- uous function of time s(t) with no upper limit. The authors defined the critical speed, which is the speed at which the energy function P (s)/s is at minimum. They developed an offline approximation algorithm that runs the processor at the critical speed and provided a lower bound of 3 from 20 optimal. Moreover, the authors studied the gain of turning the processor to the dormant mode if it becomes idle, by incorporating the wakeup switching overhead to the analysis. They concluded that it is rewarding to do so if the processor’s idle period is no less than the break-even time, which is equal to the ratio between the energy consumed in waking up from the sleep mode to the power consumption at the minimum frequency. Otherwise tasks can be run at a higher speed to create longer idle periods, where turning the pro- cessor to the dormant mode becomes beneficial. The authors also designed an online algorithm with an optimized competitive ratio of 193 for a cubic power function. The impracticality of their solution lies in their assumptions about the continuity of speed and power. One of the significant results is that by Ishihara and Yasuura [31], where the authors considered processors with discretely variable voltages. They showed that a single voltage alternation can bring the energy expenditure to a minimum. Specifically, for an ideal system with continuous voltage levels, they found out that the voltage that brings the execution time of a task to exactly the deadline is the voltage that minimizes the energy expenditure. If that voltage is not available on the system, then its two immediate neighbors can minimize the energy consumption. Chen [16] proposed DVS-based algorithms for minimizing the expected energy expenditure of frame-based tasks on uniprocessors with discrete fre- quencies. Their approach is probabilistic and required a discrete probability distribution of task execution cycles. The authors further considered tasks with different power characteristics. They defined an operating point to be the normalized energy at a certain frequency. For a set of operating points per task, the authors applied a convex hull algorithm to eliminate energy- inefficient operating points and produce a set of usable points that forms a convex set. Afterward, they greedily changed the operating frequency of the processor at certain points starting from the highest frequency in the ordered set of usable points. This intuitively means that the processor slows down as a task executes more. Based on the work above by Ishihara and Yasuura [31], Bini, But- tazzo, and Lipari [11] also incorporated the findings of Seth, Anantaraman, 21 Mueller, and Rotenberg [50] (about how tasks’ WCET scale with speed) and considered task execution cycles that do not scale with speed (spent waiting for device bus). They devised algorithms for scheduling on a single processor with discrete frequency steps to minimize energy. Most recently, Yang, Chen, Kuo, and Thiele [60] proposed an FPTAS for scheduling real time tasks on multiprocessor with discrete speeds. They used the trimming-the-state-space technique that we used in our work. The authors assumed that the WCET of tasks are given at the highest speed smaxj and scaled the WCET linearly as speed changes. According to their energy model, the minimum energy is achieved by operating each processor at speed equal to Ujsmaxj . If this speed is not available on the processor, then they chose its immediate neighbors sj,k and sj,k+1 such that sj,k < Ujsmaxj < sj,k+1. The energy in their case is the value of the function resulting from the linear interpolation of the points (sj,k, tPj(sj,k)), (sj,k+1, tPj(sj,k+1)) at Ujs max j , where t is the interval of energy measurement. This assumes that energy is a linear function of the speed when the workload is fixed, falling in the subtlety that the WCEC of tasks remains constant with speed. More- over, the authors do not consider leakage power consumption in their model. Rusu, Melhem, and Mossé [49] considered a restricted version of the problem we study: their goal was to maximize the reward of real-time tasks executing on a uniprocessor system, subject to an energy budget that the system should never exceed. The authors, however, assume that the proces- sor speed and the power function associated with the speed are continuous. 22 Chapter 3 On Task Allocation Problems with Service Classes 3.1 Motivation: Virtual Machine Allocation Standard On-Demand Instances Linux/UNIX Usage Configuration Small 0.10 per hour 1.7 GB memory, 1 EC2 compute unit Large 0.40 per hour 7.5 GB memory, 4 EC2 compute units Extra Large 0.80 per hour 15 GB memory, 8 EC2 compute units High CPU On-Demand Instances Linux/UNIX Usage Configuration Medium 0.20 per hour 1.7 GB memory, 5 EC2 compute units Extra Large 0.80 per hour 7 GB memory, 20 EC2 compute units Table 3.1: Amazon EC2 Instance Pricing and Configuration. (1 EC2 Compute Unit Provides CPU Speeds of About 1 GHz.) (Amazon.com [5]) The initial motivation for this study is resource allocation and control in virtualized environments. Many computing applications are moving to consolidated hosting setups. These services are hosted by the providers using virtual machine technology (an example is VMWare’s ESX Server [55] or the Xen hypervisor [9, 46]). There are several reasons for choosing virtualization as an enabler for such computational clouds: performance isolation, security, management ease and flexibility for users. Of particular interest is performance isolation. End-users can instantiate virtual machines 23 for their applications and provision these virtual machines with the resources needed to achieve performance goals. The ability of an application/service to respond to requests or perform its tasks within some time constraints depends on the amount of resources allocated to the application. The resources could be the amount of pro- cessing time, the amount of memory or the fraction of the network data rate. Often the primary resource being controlled is the fraction of pro- cessing time allocated to an application. In a virtualized setting, several virtual machines share the same physical hardware and access to the hard- ware resources is controlled by some time sharing mechanism (Figure 3.1). The constant bandwidth server mechanism [1], for example, can be used to guarantee a slice of a shared processor to a virtual machine. Controlling performance in a virtualized hosting environment requires detailed performance monitoring. Such monitoring will determine the re- lationship between resource allotment and quality of service(Urgaonkar, Shenoy, and Roscoe [53]). This information will then be useful in improv- ing resource allocation and virtual machine migration in systems such as Sandpiper (Wood, Shenoy, Venkataramani, and Yousif [59]). The Sand- piper system does in fact find heuristic solutions to multi-capacity bin pack- ing problems for deciding upon resource management, and we believe our methods can be integrated into such a system easily. Moreover, the set of service classes can evolve over time to reflect the most recent observations correlating resource allocation and performance goals. This approach to resource allocation is also applicable to cloud comput- ing, which also uses virtualization as a substrate for offering infrastructure services. Many computing systems users have shifted hardware maintenance and management to third-parties such as Yahoo!, Google and Amazon [5]. This model of computational service hosting permits organizations and in- dividuals to deploy large-scale services and pay only for the marginal cost of resource usage. The cloud service operator can allocate multiple virtual machines to the same hardware unit as long as resources are not overloaded. Application developers who choose to utilize the computational clouds can provision their virtual machines for optimal performance. Cloud service 24 providers offer multiple tiers of service with graduated pricing. Amazon Web Services, for example, offers several choices for virtual machine sizing and pricing in their Elastic Compute Cloud service (Table 3.1 [5]). Service classes can be associated with pricing details and applications can be de- ployed across multiple virtual machine instances to meet performance goals while minimizing the cost of launching compute instances. This example further emphasizes the use of the service class model and the potential ap- plicability of our main results. Quality-of-service is almost always associated with certain cost/precision trade-offs. For example, consider a search engine where the back-end, having received a user query, searches its index database for matching documents, ranks those documents (using some ranking algorithm) and returns the top N documents in ranked order (Baek and Chilimbi[8]). The service provider might consider, for example, executing less cycles running the ranking al- gorithm for the sake of faster response, in addition to saving energy on the search engine servers. The consequences of such approximation of the output of computation become relevant when the QoS metric is clearly defined. We can define the QoS loss metric as the percentage of user requests, compared to the full fledged-service case, that either return the same top N documents but in a different rank order or return different top N documents. Towards the goal of practically enabling such approximations by systematic means, Baek and Chilimbi[8] developed a programming framework called “Green”, that allows programmers to approximate loops and expensive functions, and provides statistical QoS guarantees. 3.2 Definitions and Formal Problem Formulation 3.2.1 Definitions Given some computational tasks, and a set of service classes for each task, and an unlimited pool of processors, we would like to find a) an assignment of tasks to service classes, and b) an assignment of tasks to processors such that 25 VMM/Hypervisor VMM Scheduler Hardware Guest OS / VM Local Scheduler Processes Guest OS / VM Local Scheduler Processes Guest OS / VM Local Scheduler Processes 70% CPU 10% CPU 10% CPU 10% CPU hypervisor overhead Figure 3.1: Hierarchical Scheduling Using a VMM/Hypervisor • no unit is over-utilized, • the aggregate quality of service requirements are met, and • we use as few a number of processors as possible to achieve the goals. Definition 1 (Resource Set). The set of resource types, denoted R := {x1, x2, . . . , xr}, is called the resource, where r := |R|. CPU, memory, storage space and network bandwidth are typical exam- ples of resource types. A processor is a realization of a certain resource type. In our setting, a processor pik is a resource supply that provides fractions of certain resource types; pik = 〈y1, . . . , yr〉, 0 ≤ yj ≤ 1 is the amount of resource xj that can be supplied by the processor. Definition 2 (Service Class). A service class is a tuple sck = (pik, 〈(uk,1, qk,1), . . . , (uk,r, qk,r)〉), where uk,j , qk,j ∈ (, 1]. A service class represents some fraction of a resource supplied by a pro- cessor and the associated quality (as the minimum fraction of the maximum 26 quality of service) when these resources are allocated to the relevant task. Our definitions provide a very general description of processors and service classes. In this article we shall primarily discuss the case of identical pro- cessors with one resource (CPU). Therefore, we drop the processor index pik in the service class description, and thus we are left with a single pair (u, q) specifying a service class. There is no loss of generality when we assume q ∈ (, 1]; we can scale all quality metrics to this set. If (u, q) is a service class for some task then q is a scalar that represents that perceived quality of service when the task is guaranteed a resource utilization u. The reason that both the utilization and the quality index are bounded from below is two-fold. No application has arbitrarily low utilization; there is usually a lower bound to account for context switch and other overheads. Similarly, any task, when it runs, improves the QoS by some amount that is not arbitrarily small. Later we will discuss the implications of this prob- lem setting and propose methods to remove these restrictions for further generalization. Tasks: A task is an entity that requires computational resources and Γ = {τ1, τ2, . . . , τn} denotes the set of tasks. The number of tasks is |Γ| = n. Each task τi is associated with a service class set SCSi = {sci,j = (uj , qj)} that consists of service classes. The service class set is essentially a discrete function that maps utilizations of a task quality of service values. We do not make assumptions on how exactly the quality of service index changes with respect to changes in the utilization, but rather we assume an arbitrary utilization-to-quality of service function, with some structure (below). Increased service, increased quality: We will make a natural re- striction that for all service classes associated with a particular task, an increase in any resource dimension implies an increase in the quality index. 3.2.2 Problem Definition Our definitions provide a very general description of processors and service classes. In this article we shall primarily discuss the case of identical pro- 27 cessors with one resource (CPU). Later we shall briefly discuss the multiple resource case. We leave problems related to variable capacity processors to a future article. Task allocation with service classes (TASC): Given a set of tasks Γ and a service class set for each task, we seek an algorithm that selects, for all i, a service class for τi from SCSi, and then maps the tasks to processors without over-utilizing any of the processor’s resources. Further, we would like to minimize the number of processors used, m, subject to ∑n i=1 qi > Q, where qi is the quality index assigned to τi and Q is the aggregate QoS goal. 3.2.3 Problem Hardness The problem of interest, TASC, is NP-Hard in the strong sense (Papadim- itriou [44]); the simplest case of exactly one resource attached to a processor, equally priced and equal capacity processors, and a default service class for each task with no penalty, reduces the problem to a bin packing problem, which is NP-hard in the strong sense. For problems that are NP-Hard in the strong sense, we cannot find a polynomial time algorithm that approximates the optimal solution unless P = NP (Garey and Johnson [25]). The usual measure for understanding the performance of algorithms for NP-hard problems is the (asymptotic) approx- imation ratio. If algorithm A solves, approximately, an NP-hard problem, then its approximation ratio is defined as RA := lim sup n→∞ supI { A(I) Opt(I) : Opt(I) = n } , where I is the input to the algorithm, i.e., I is an instance of the NP-hard problem. Despite the above hardness results, there are in fact practical restrictions of the the TASC problem that permit polynomial-time approximations. In this article we describe an algorithm for solving a restricted, but practi- cal, case of TASC. For the related bin packing problem, de la Vega and Lueker [21] showed that there exists a polynomial time algorithm such that 28 AdL(I) ≤ (1 + )Opt(I) + 1 for all instances I of the bin packing problem. Karmarkar and Karp [33] showed the existence of a polynomial time algo- rithm such that AKK(I) ≤ Opt(I) + O(log2(Opt(I))) for all bin packing problem instances I. 3.2.4 Restricted Problem Formulation The hardness of TASC leads us to first consider some restricted cases and obtain some insight. For the general case we develop heuristics based on our understanding of the restricted cases. 1. Minimal service class (Problem MSC): In this variation, there is a minimal service class (umin, qmin). There may be many service classes but they are all associated with resource utilization greater than umin and with a quality score greater than qmin. 2. Fixed service classes (Problem FSC): This is the simplest variation where there is a finite number of priced service classes K (that the service provider offers), and each task is assigned to one of these K service classes. We start with the simplest case, FSC, and proceed to identify polynomial time algorithms for MSC. We shall then provide some comments about the unrestricted TASC problem. 3.3 Solving Restrictions to TASC We use the following intuition in designing algorithms for solving the re- stricted versions of TASC described in Section 3.2.4. An approximate so- lution to the bin packing problem can be found in polynomial time. The key step towards solving TASC then is to enumerate assignments of tasks to service classes. If each task has one of K choices for a service class, we may have to examine Kn possible service class assignments before we find an optimal solution. We need to avoid the exponential space search and examine only p(SIZE(I)) assignments of tasks to service classes where p is 29 a polynomial function of the size of the input instance. We can then solve the bin packing problem for each of these assignments and identify the least cost assignment of service classes and mapping of tasks to processors. To solve the bin packing problem, we do not apply the methods devel- oped specifically for the bin packing problem. Instead we rely on an approx- imation algorithm for the minimum makespan problem, where a problem instance consists of m parallel machines and a set of independent jobs that needs to execute on these machines and the goal is to minimize the time taken to complete all jobs (the makespan). This problem can be solved to within (1+ )Opt in polynomial time (Hochbaum and Shmoys [27]). In our case, we will treat a bin packing problem as a minimum makespan prob- lem, vary the number of processors from 1 to n (number of tasks), and use the polynomial time algorithm to determine if there is some schedule with makespan 1 + . From that solution we can retrieve the solution for the bin packing instance trivially. The minimum number of processors needed to obtain a solution with makespan of 1 + or less is the optimal solution for the bin packing (task allocation) problem; since the optimal makespan could be 1 + we may have to use processors with capacity 1 + . 3.3.1 Fixed Service Classes Let us consider the case when the number of service classes is fixed asK. We first need to find an assignment of tasks to service classes. Let us suppose that nj denotes the number of tasks assigned to scj . We need to find a vector 〈n1, n2, . . . , nK〉 such that n1 + n2 + · · ·+ nK = n, and a feasible allocation of tasks to processors such that no unit is over- loaded, the resulting total quality index is above the required threshold and the number of processors used is minimized. We know, using an elementary combinatorial argument, that there are( n+K−1 n ) integral vectors that satisfy the equality ∑K j=1 nj = n (this bound can be interpreted as the number of ways n indistinguishable balls can be 30 Source Sink Tasks Service Classes 1 1 1 1 1 1 1 1 . . . . . 1 n1 nk . . . . . Figure 3.2: Network Flow and Feasible Assignments packed into K bins). For fixed K, the number of assignments of tasks to service classes is a polynomial in n. For each of these assignments, we can solve a bin packing problem, in polynomial time, to find a good allocation of tasks to processors. There is, however, one intermediate step to consider. We need to verify that a vector 〈n1, n2, . . . , nK〉 is indeed a feasible as- signment of tasks to service classes. There may not exist a set {τi} with cardinality nj (for some value of nj) such that scj ∈ SCSi. For this ver- ification, we represent tasks and service classes as vertices in a bipartite graph and, interestingly, perform network flow computations to solve the subproblem. Consider a bipartite graph with a vertex for each task and a vertex for each service class. An edge with capacity 1 exists between a task and the service classes that are appropriate for that task. Add a source, and connect the source to each task with an edge of capacity 1. Add a sink and connect service class scj to the sink with an edge of capacity nj . A maximum flow of n exists iff 〈n1, n2, . . . , nK〉 represents a feasible assign- ment. The maximum flow in a network can be computed in polynomial time 31 Algorithm 1: Complete algorithm for FSC for all integral vectors 〈n1, n2, . . . , nK〉 such that ∑K j=1 nj = n do verify that 〈n1, n2, . . . , nK〉 is a feasible assignment of tasks to service classes using a polynomial time max flow algorithm. if 〈n1, n2, . . . , nK〉 is feasible then solve the task allocation problem using the Hochbaum-Shmoys minimum makespan algorithm (Hochbaum and Shmoys [27]) at most n times end if end for return the solution that uses the smallest number of bins and satisfies∑n i=1 qi ≥ Q. (Ahuja, Magnanti, and Orlin [3]) to verify feasibility and obtain the service class assignments. (Figure 3.2 illustrates the bipartite graph representation of this subproblem.) When the number of service classes is fixed for all TASC problem in- stances, and given an assignment of tasks to service classes, the subproblem of allocation tasks to processors using the minimum number of units is no longer NP-Hard, and the optimal solution may be found in polynomial time using various techniques including dynamic programming. Each subproblem of the FSC can be solved in polynomial time; we there- fore have a polynomial time algorithm (Algorithm 1) to find the optimal solution to this problem. The above discussion helps us state the following lemma. (The proof naturally follows from the discussion.) Lemma 1. An optimal solution to FSC can be found in polynomial time. 3.3.2 Minimal Service Class We shall now leverage the algorithm just described to solve the problem when service classes are part of the input instance but there is a particu- lar minimum service class (umin, qmin). Hochbaum-Shmoys’ PTAS for the minimum makespan problem requires that item sizes be chosen from a fixed 32 set. In our setting, both u and q are rational numbers lower bounded by and upper bounded by 1, thus they may assume an infinite number of val- ues. In order to bound the number of possible utilization/reward values, we use a chosen parameter , 0 < < min{umin, qmin}, and then quantize the resource utilization and the service quality. For service class sck = (uk, qk), we round up the resource utilization and service quality to the nearest pair of the form ( + α2, + β2) for some non-negative integers α, β. Setting α = ⌈ uk− 2 ⌉ and β = ⌈ qk− 2 ⌉ satisfies the quantization requirements. Let Sα, Sβ denote the sets containing values that α, β may assume, re- spectively. Since uk, qk 6 1, it follows that both α and β can get as large as ⌈ 1− 2 ⌉ . Accordingly, Sα = Sβ = { 0, . . . , ⌈ 1− 2 ⌉} . Therefore, following the quantization, there are at most |Sα|.|Sβ| 6 ⌈ 1 4 ⌉ service classes. We can find optimal solutions to instances of the quantized problem (qMSC) using processors with capacity 1+; the solution can be determined, in polynomial time, using the approach for FSC (Algorithm 1). We now have the following lemma relating MSC and qMSC. Lemma 2. The optimal solution to an instance of qMSC uses at most the same number of processors as the optimal solution to the corresponding MSC instance, except that processors may need to be of capacity 1 + for qMSC. The aggregate quality of the solution to qMSC is at most (1 − )Q if the corresponding instance of MSC has a feasible solution with quality of service score at least Q. Proof. Suppose that the optimal solution to an instance of MSC assigns some task τ to service class ( + α′2, + β′2) and this task includes a service class (+(α′−1)2, +β′2). The optimal solution to qMSC for this problem instance will assign τ to service class (+(α′−1)2, +β′2), which leads to a decrease in total utilization. If the optimal solution to the MSC instance used m processors then the optimal solution to the qMSC instance will use m′ ≤ m processors. When we replace the original service classes by the quantized service classes, we see that on any processor used there may be up to 1/ tasks, and the resource utilization of each of these tasks can increase by 2 at 33 most. Therefore the maximum possible increase in resource utilization of an individual processor is ; this increase can be satisfied by using a processor of capacity (1 + ). The quality factor achieved by the optimal solution to an instance of qMSC is either Q or higher. If Q′ is the quality factor achieved by qMSC, then the maximum increase in the quality score for each task is 2, which occurs when there are Q′/ tasks, which is, of course, the maximum number of tasks. If we undid the quantization then the maximum drop is at most Q′. The aggregate quality score will always be within a 1− factor of the target quality score. By the analysis above, the algorithm might require the capacity of some processors to be augmented by at most times their maximum capacity for task allocation to be possible. For a processor running at a certain speed, say s, this processor might need to operate at a speed of (1 + )s in the worst case. Practically, if s is the maximum speed at which the processor can run, then the operating speed of the processor cannot be increased and hence our task allocation scheme might not be possible to implement. One way to overcome this impracticality is to start operating the processor at speed (1 − )s and then increasing its speed to its maximum s if required. Algorithmically, this amounts to necessity starting the algorithm with bin capacities of 1 − instead of 1, so that the capacity of any processor does not exceed unity as a result of our algorithm. This, however, might increase the number of bins required to pack all tasks compared to the (1 + )Opt upper bound guaranteed by the Hochbaum-Shmoys algorithm. 3.3.3 TASC We now return to the general TASC problem. In the case when service classes can be arbitrarily small, it is possible to obtain service class assign- ments and processor allocations by separating tasks into large and small sizes (as is typically done to solve bin packing problems (de la Vega and Lueker [21])) using a parameter as the threshold. In the case of TASC, both the utilization and quality index can be large or small. Further, the 34 QoS constraint is disconnected from the optimality criterion that refers to the number of processors used. This is problematic because we are unable to bound the QoS loss when quality indices are small. If there was no lower bound on the quality index then the upper limit on the number of tasks with a low quality index is at least Q/, which is hard to bound. For most practical settings there is a lower bound on the utilization that can be assigned to a task and on the quality of service attained when a task is assigned the smallest utilization quantum. For that setting, MSC, we can find a solution in polynomial time for the smallest number processors using a small speedup of 1 + and within a 1− factor of the desired QoS. We shall now consider one other variant of MSC that is of practical interest. 3.3.4 Maximizing QoS with m Processors In this variant of MSC, we consider a set of m processors and the service classes are defined as expected. The goal is to maximize the QoS that can be achieved with a limited number of processors. We can approach this problem exactly as we did the MSC. The only modification needed is to use the minimum makespan routine only once and with m processors (in Algorithm 1). We search for the assignment and allocation with the highest QoS. Since we have a minimal service class, we can also prove (as we did in Lemma 2) that the obtained QoS is within 1− of the optimal QoS. 3.4 Experimental Evaluation and Further Extensions Although we have analyzed and shown that the algorithms for TASC and related problems run in polynomial time, they are tedious and do not lend themselves to efficient implementation. In this section we focus on a simple implementation choice that results in acceptable performance and makes the algorithms usable. The bin packing routine and the maximum flow routine are central elements that influence the performance of the algorithms for TASC variations. 35 We replaced the minimum makespan routine for finding a bin packing with an algorithm due to Korf [35] that performs optimal bin packing. This algorithm is not polynomial time but performs remarkably well in practice. For the maximum flow problem, we used an implementation of the highest label preflow-push algorithm (Ahuja, Kodialam, Mishra, and Orlin [4]). Our experimental platform used an Intel Core Duo processor running at 2.8GHz with 4GB memory. We generated, at random, 100,000 test in- stances for each problem size. For large problem sizes (> 60 tasks, 12 service classes per task), our implementation found an optimal solution in less than 9 seconds for all problem instances. For small problem sizes (< 25 tasks, 10 service classes per task), our implementation found an optimal solution in less than 1 seconds. Although we have not analyzed algorithms for task allocation on pro- cessors with multiple resources, the basic algorithms themselves are easy to extend. Obtaining approximation ratios, however, is challenging; Garey and Graham showed that with d-capacitated bins the approximation ratio for any reasonable algorithm (First Fit, Best Fit are examples) is at most d+1 (Garey and Graham [24]). Woeginger [57] has shown that there cannot be an asymptotic PTAS for the multi-capacity bin packing problem. We implemented heuristics (Leinberger, Karypis, and Kumar [38]) for the multi-capacity bin packing problem. We were able to solve problems with 40 tasks and 3 resources in less than 3 seconds. (We did not run any larger size experiments.) We have not tabulated all the results here but we believe that this quan- titative evaluation demonstrates the feasibility of using service-class driven task allocation in many performance management scenarios. 36 Chapter 4 Energy Efficient Task Partitioning and Real-Time Scheduling on Heterogeneous Multiprocessor Platforms with QoS Requirements 4.1 Motivation Prior attempts at solving variants of this problem (see Section 2.2 for a discussion of related work) assume that the worst-case number of execution cycles (WCEC) of a task is speed-independent. This assumption then im- plies that the worst-case execution time (WCET) of a task scales linearly with processor speed. The original assumption is inaccurate if a task accesses peripheral devices that run at speeds different from that of the processor. To illustrate this drawback in prior work and to motivate our work with dis- crete settings, we performed empirical studies using two applications from the MiBench embedded applications benchmark suite Guthaus, Ringenberg, Ernst, Austin, Mudge, and Brown [26]: Basic Math and Fast Fourier Trans- 37 form (FFT) (with 64 random sinusoids and 65536 samples). We executed the former once and the latter 1000 times on an AMD R© PhenomTMII X4 925 Quad Core Processor 2.8GHz, which has discretely variable speeds in the set {800MHz, 1.6GHz, 2.1GHz, 2.8GHz} per core, and measured the ex- ecution time as well as the energy consumption for each application at each speed step. The Basic Math benchmark includes some I/O operations and we found that speed scaling does not result in corresponding changes in ex- ecution time (Table 4.1). For this benchmark, we also found that energy consumption increases with increases in speed. For the FFT benchmark, which does not include the same amount of I/O as the Basic Math bench- mark, execution time decreased as we would expect with an increase in processor speed and it is energy-efficient to run this application at a higher speed (Table 4.2) . Our observations highlight the fact that variations in execution time and energy consumption are different for different applica- tions. These variations are not easily captured by closed-form functions and require discrete modeling. We present a system and task model (Section 4.2) that captures system implementations better than earlier models. We focus on determining static speed settings for processors and identifying a mapping between tasks and processors such that the the energy expenditure of the system is minimized and the timeliness requirements of tasks respected. In addition, the task allocation scheme should guarantee that the overall QoS satisfies a speci- fied requirement (Section 4.3). Our contribution is a fully polynomial-time approximation scheme (Section 4.4) for a problem that can be shown to be NP-Hard. Frequency (GHz) Average Power (Watts) Execution Time (sec) 0.8 86.5 32.34 1.6 89.9 31.7 2.1 94 31.61 2.8 100.2 31.75 Table 4.1: Basic Math Benchmark (Single Iteration) 38 Frequency (GHz) Average Power (Watts) Execution Time (sec) 0.8 84.3 1080.4 1.6 88.96 538.2 2.1 94.78 410.23 2.8 104.11 307.74 Table 4.2: Fast Fourier Transform (FFT) Benchmark (1000 Iterations) 4.2 System Model and Notation 4.2.1 Platform We have K distinct physical processor types at our disposal, with an unlim- ited pool of processors of each type. We will use the notation pik to refer to a processor type; therefore, the set Π = {pi1, . . . , piK} includes all distinct processor types. Each processor type may allow for a discrete number of speed settings. Let Sk = {s1, . . . , sλk} be the set of distinct speed levels associated with processor type pik, where λk := |Sk|. We use λmax to denote the maximum number of speed settings that any processor type may per- mit, i.e., λmax = maxk{λk}. We introduce the notion of a logical processor type, corresponding to a unique physical processor and speed pair; a phys- ical processor running at a certain speed. by creating λk logical processor types for processor type pik. Therefore a logical processor type is defined as pi′k := (pik, s`) ∈ {pik}×Sk. There are at most λ = λmax×K logical processor types. The set Π′ = {pi′1, . . . , pi′λ} includes all logical processor types. Our objective is to construct a platform composed of M heterogeneous physical processors, possibly with identical types, each operating at a certain speed (thus the platform consists of M logical processors). We denote such a platform composition as a platform configuration C = 〈pi′1, . . . , pi′M 〉, where pi′j ∈ Π′ and pi′js in C are not necessarily distinct. We can think 39 of a platform as having a fixed number, specifically M , processor “slots”, disregarding the order in which processors are arranged. Thus we seek to fill these slots with “suitable” logical processors. Since we allow identical logical processor types in platform configurations, it follows that the number of possible assignments of logical processors toM slots is at most the number of solutions to the equation m1 +m2 + · · ·+mλ =M, wheremj is a non-negative integer that represents the number of instances of logical processor type pi′j that a platform configuration includes. Therefore, we have at most Λ = ( λ+M−1 M ) possible platform configurations. The latter bound can be interpreted combinatorially as the number of ways M indis- tinguishable balls can be packed into λ bins, since the order of processors in a platform configuration does not matter. λ, M andK, and as a consequence Λ, are considered part of the platform model and are treated as constants. 4.2.2 Task and Scheduling Model We consider a task set, Γ, with N periodic, implicit-deadline, real-time tasks such that Γ = {T1, . . . , TN}. This task set needs to be executed on a suitable heterogeneous computing platform. Task Ti recurs every Pi time units and, when executing on logical processor type pi′k has a worst-case execution time of ei,k. Since ei,k is specified per logical processor type, it is dependent upon the physical processor type and the frequency at which it is operating. For convenience we will treat Pj to be a positive integer. Correspondingly, the utilization of Ti when assigned to logical processor type pi′k is ui,k = ei,k/Pi. In this discussion, for simplicity, we restrict our attention to the par- titioned scheduling of real-time tasks on the heterogeneous multiprocessor platform with earliest deadline first scheduling at each processor. For a cer- tain platform configuration, say C, zi,j is an indicator variable that is 1 if Ti is assigned to pi′j in C and 0 otherwise. All tasks satisfy their timing requirements if Uj = ∑N i=1 zi,jui,j 6 1, for every pi′j in C, j ∈ {1, . . . ,M}. 40 We also note that this work easily applies to the situation when tasks are simply given a fraction of a processor’s bandwidth, as may be the case when virtual machines are scheduled in a cloud computing environment. 4.2.3 Energy Model To account for the energy consumed by the processor and by other associ- ated units (memory, I/O, etc.), our energy model has an active-time energy component and an idle-time energy component. For each physical processor type pik we have an idle time power consumption, pidle,k (which, of course, all of its λk logical processor types inherit) that needs to be accounted for every time unit that a processor is idle. For each logical processor type and task pair, (pi′k, Ti), we denote the energy consumed by executing an instance of Ti on pi′k as Wi,k. In its simplest form, Wi,k can be defined as Wi,k = ui,kτpk, where τ is the interval of energy measurement and pk is the dynamic power consumption of logical processor type pi′k. Given its de- pendence on the dynamic power consumption of a processor operating at a certain speed, Wi,k can be interpreted as a penalty that task Ti incurs as it executes on logical processor type pi′k. We, however, do not restrict Wi,k to be of the latter form. Since Wi,k can include additional parameters that capture energy consumption at the system level, we treat Wi,k as a single quantity. The basic task set consists of periodic tasks and therefore it is sufficient to study and minimize the energy consumed in an interval of length L = LCM(P1, . . . , PN ) because all activity repeats in an identical manner every L time units. This time interval consists of active intervals and idle intervals on the different processors. Depending on the platform configuration, C, and the mapping of tasks to processors, the energy consumed by the system in an interval of length L can be computed as EC = N∑ i=1 M∑ j=1 L Pi zi,jWi,j + L M∑ j=1 (1− Uj)pidle,j , (4.1) where the first term represents the active-time energy consumption and the 41 second term represents the idle-time energy consumption. 4.2.4 Quality-of-Service Model We denote the quality of service derived (or delivered) by a task Ti when it is assigned to logical processor type pi′k using a real number ri,k. We can also interpret ri,k as a reward obtained by the particular task-to-processor mapping. Our model allows for further generality by allowing the reward to depend on the logical processor type. The aggregate quality of service delivered by the system, or equivalently the reward obtained by the system, is a function of the different ri,k values. We use a simple sum to describe the aggregate reward although other functions can be accommodated by our framework. 4.2.5 Task Set Encoding According to the system model above, we encode the requirements of each task Ti in a set of triples (options) Oi, where vectors in Oi are defined as Oi,k = 〈pi′k, ui,k, ri,k,Wi,k〉. Task Ti is said to be defined on logical processor type pi′k if there exists an integer k such that pi ′ k ∈ Oi,k. In other words, a task might not be allowed to execute on some logical processor types, for which its execution requirements will not be given. For an instance I of our problem, if we assume that rational numbers of the form x = p/q, p, q ∈ Z, q 6= 0 are represented as a pair (p, q), then we will need O(log p+log q) space to encode x in binary. Denote as SIZE(I) the number of bits required to represent instance I in binary (or more generally any combinatorial object). Then SIZE(I) 6 N+ ∑N i=1 log2 Pi+log2Q+ ∑N i=1 ∑λ k=1(log2 pi ′ k+SIZE(ui,k)+ SIZE(ri,k) + SIZE(Wi,k)), assuming P1, . . . , PN and Q are integers. 4.2.6 Notes Regarding Assumptions Assumption 1. Power ratings include leakage (static) power consumption. We assume that the leakage power consumption pleak is implicit in both Wi,k and pidle . For instance, when a is processor idle, its power consumption can be modeled in its simplest form as pidle = mink{pk} + pleak , where pk 42 is the dynamic power rating at some speed, say sk. Nevertheless, more sophisticated idle power-saving models are employed in current technologies (Intel R© CoreTM i7 Series incorporates advanced idle modes [29]). This assumption, however, makes the model clearer and simplifies the analysis, without degrading the accuracy of the model. This entails that no assumptions exist on how task utilizations scale with different speeds. Consider the relationship between the speed of the processor and the execution requirements of tasks. Suppose that some task is known to require a worst case utilization of uk when running on a processor operating at speed sk. If the task is to run on a different speed s`, where sk 6= s`, on the same processor, then a seemingly intuitive way to adjust the execution requirement of this task is to scale uk linearly as ûk = uksks` . The latter scaling is naïve for many reasons. First, assuming that the WCEC of a task is constant and does not change with processor speed, executing a task at a higher speed will shorten its utilization. It is evident from equation (4.1) that decreasing the utilization increases both the idle and the leakage power consumption. Second, the naïve approach assumes that the WCEC is constant with respect to speed. Indeed, this is inaccurate because a task might access other devices throughout the course of its execution, such as memory and disk. Such devices typically run at bus frequencies that might be divergent from the processor’s frequency (usually constant or have limited configurable speeds compared to those of the processor). Therefore, if the device bus has constant latency, then increasing the processor speed will increase the discrepancy between the speed of the processor and that of the device bus. Accordingly, the task will spend more cycles waiting for the device (doing no useful work). Therefore, increasing the processor speed might in fact increase the WCEC required by a task and therefore increase the WCET overall (which is not what one would expect from increasing the speed of the processor). The latter situation becomes more noticeable if the task is I/O-bound. Since energy depends on the WCET, the energy expenditure might be significantly larger than that captured by the naïve method. Finally, suppose that devices can operate in three states, namely the 43 active, standby and the off state. Further, assume that all devices that a task requires are turned on in the standby state at the instance the task starts execution. Slowing down the processor will cause devices to wait longer in the standby state to be accessed by tasks requiring them, thus increasing the device standby energy. If the task is I/O-bound, then the device idle energy might dominate the energy consumption of the task and the overall energy expenditure might increase. Assumption 2. Each processor adopts a uniprocessor scheduling policy that is capable of utilizing the processor up to 100%. An example of such a policy is Earliest Deadline First (EDF) ( Liu and Layland [40]). This assumption, alongside the previous assumption, guarantee that each processor is a bin of unit capacity. To see this, let Di, Di 6 Pi, denote the relative deadline of task Ti. Let demand(t) > 0 be the worst case execution requirements of a set ofN periodic tasks that arrive and must complete within a contiguous interval of length t. Baruah, Howell, and Rosier [10] derived the following expression for the execution-time demand of the task set on one processor demand(t) = N∑ i=1 ⌊ t+ Pi −Di Pi ⌋ ei. (4.2) Further, they derived sufficient and necessary conditions for the schedu- lability of N periodic tasks on one processor that employs EDF. Specifically, the task set is schedulable iff demand(t) 6 t. In our work we assume that tasks have implicit deadlines, i.e., Di := Pi, and since our energy measure- ment interval is L, the schedulability condition on one processor becomes demand(L) = N∑ i=1 ⌊ L Pi ⌋ ei 6 L. (4.3) 44 Moreover we have N∑ i=1 ⌊ L Pi ⌋ ei 6 N∑ i=1 L Pi ei = L N∑ i=1 ui 6 L, from which we get that ∑N i=1 ui 6 1, which is our unit-capacity bin condi- tion. Further, we assume that the uniprocessor scheduling policy is work- conserving, that is, the scheduler always dispatches a task that is ready for execution as soon as the processor becomes idle. The extension to other sim- ilar scheduling policies that provide utilization bounds is straightforward. Assumption 3. Problem settings are discrete. The set of processor speeds is arbitrarily structured, as well as the set of penalties incurred by different processor speeds. Further, we make no assumptions regarding the structure of the set of rewards obtained by tasks as they execute on different processors. More specifically, we neither make assumptions about the nature of the reward functions – whether they are uniform, concave, etc. – nor about the relationship of rewards to processors and speeds. In other words, tasks are heterogeneous with respect to their power characteristics; different tasks may be associated with different power functions. 4.3 Problem Formulation For the system model above, we wish to establish appropriate pairings be- tween (i) processors and speeds and (ii) tasks and processors, so as to satisfy the following requirements: 1. Each processor is assigned exactly one power rating (speed) for the whole duration of its operation; 2. The total energy expenditure of all processors is at a minimum; 3. A processor should never exceed its capacity; 45 4. Once assigned to a processor, a task should execute on that processor in its entirety; 5. For a certain aggregate quality of service score Q, a minimum guaran- tee at the system level should be achieved. For a certain platform configuration C = 〈pi′1, . . . , pi′M 〉, we formulate the task assignment problem as a mathematical program as follows: Given a task set encoding O = {Oi}Ni=1, we represent the task settings per C as follows: Let zi,j be the following indicator variable: zi,j = 1 if Ti has been assigned to processor pi′j ,0 otherwise. Its associated binary N ×M matrix, call the assignment matrix, is: ZC = z1,1 · · · z1,M ... . . . ... zN,1 · · · zN,M The utilization matrix is given by: UC = u1,1 · · · u1,M ... . . . ... uN,1 · · · uN,M where ui,j = ui,k if ∃k s.t. pi′j = pi′k ∈ Oi,k,0 otherwise. 46 The reward matrix is: RC = r1,1 · · · r1,M ... . . . ... rN,1 · · · rN,M where ri,j = ri,k if ∃k s.t. pi′j = pi′k ∈ Oi,k,0 otherwise. The energy matrix is given by: WC = W1,1 · · · W1,M ... . . . ... WN,1 · · · WN,M where Wi,j = Wi,k if ∃k s.t. pi′j = pi′k ∈ Oi,k,0 otherwise. Finally, define the idle power vector as: Pidle,C = [ pidle,1 · · · pidle,M ]T 47 where pidle,j = pidle,k if pi′j = pi ′ k for some k. The mathematical program can be written as: minimize E = N∑ i=1 M∑ j=1 L Pi zi,jWi,j + L M∑ j=1 (1− Uj)pidle,j (4.4) subject to M∑ j=1 zi,j = 1 ∀i ∈ {1, · · · , N} (4.5) N∑ i=1 ui,jzi,j 6 1 ∀j ∈ {1, · · · ,M} (4.6) M∑ j=1 N∑ i=1 ri,jzi,j > Q zi,j ∈ {0, 1} ∀i ∈ {1, · · · , N},∀j ∈ {1, · · · ,M} where Q is the minimum aggregate QoS score that the system should achieve. Constraint (4.5) says that a task should be assigned to a single processor, and constraint (4.6) guarantees that a processor’s capacity is not exceeded. Therefore, the goal is to find the matrix Z that represents an assignment of tasks to processors that minimizes the energy expenditure of configuration C. Intractability: The solution points represented by matrix ZC are re- quired to be binary, for we do not split a task across processors, so frac- tional assignments are not of our interest. This renders our mathematical program computationally intractable. In fact, it is not difficult to show that this problem is NP-Hard by reduction from the PARTITION problem, which was shown by Garey and Johnson [25] to be NP-Complete. Therefore a polynomial-time approximation algorithm for solving this problem within a factor of the optimal solution might be our only resort unless P = NP. 48 4.4 Assignment of Tasks to Processors We resort to a dynamic programming (DP) approach to solve this problem. We show that an exact DP formulation has complexity that is exponential in time and space requirements. We then use the exact DP to obtain a modified DP that is computationally tractable and is a fully polynomial- time approximation scheme (FPTAS). 4.4.1 Exact and Optimal Dynamic Program Formulation For a certain platform configuration of the from C = 〈pi′1, . . . , pi′M 〉, we equip each task Ti with a set of states Si = {ϕ1, . . . , ϕ|Si|}. Each state ϕ ∈ Si is an M -dimensional vector that encodes a partial feasible schedule from T1 up to Ti on every pi′j in C. More precisely, a state has the form ϕ .= 〈(Y ϕi,1, Eϕi,1), . . . , (Y ϕi,M , Eϕi,M )〉, where, for every pair (Y ϕi,j , E ϕ i,j), j ∈ {1, . . . ,M}, Y ϕi,j is a set that contains information about tasks assigned to processor pi′j , i.e., Y ϕ i,j .= {〈T`, u`,j , r`,j〉}` for every task assigned to pi′j , where 1 6 ` 6 i. E ϕ i,j is the energy expenditure of pi′j with respect to the schedule Y ϕ i,j . Let U ϕ i,j be the total utilization (workload) on pi′j Uϕi,j = ∑ 〈T`,u`,j ,r`,j〉∈Y ϕi,j u`,j , then Eϕi,j is computed as Eϕi,j = ∑ 〈T`,u`,j ,r`,j〉∈Y ϕi,j L P` W`,j + L(1− Uϕi,j)pidle,j , in accordance with equation (4.4). The input is a set of task option encodings Oi and the desired quality score Q. In order to reflect Ti’s parameters with respect to the configuration in consideration C, each Oi is processed in a way similar to the definition of UC and RC matrices in the mathematical program above to produce Xi. It is easy to see that this input processing is polynomial in the input size. 49 We utilize the framework developed by Woeginger [56] for formulating dynamic programs as follows. Let the initial state be 0 = 〈(∅, 0), . . . , (∅, 0)〉, where |0| =M . We start with the initial state space S0 = {0}, and generate new states in the state space Si from Si−1 by means of a finite set of mapping functions FC = {f1, . . . , fM}. Each fj ∈ FC specifies how a task is packed into the j-th processor pi′j in configuration C and generates a new state from it. For a certain ϕ ∈ Si−1 and task Ti, ϕ′ = fj(C,Xi, ϕ) is the state that results from ϕ by updating the j-th pair (Y ϕi−1,j , E ϕ i−1,j) in ϕ as follows: the triplet 〈Ti, ui,j , ri,j〉 is added to the schedule Y ϕi−1,j to produce Y ϕ ′ i,j and the value of the new energy expenditure of the updated schedule is computed according to ui,j as follows Eϕ ′ i,j = ∑ 〈T`,u`,j ,r`,j〉∈Y ϕi−1,j L P` W`,j + L Pi Wi,j + L(1− (Uϕi−1,j + ui,j)pidle,j = Eϕi−1,j + L Pi Wi,j − Lui,jpidle,j , (4.7) where E00,j = pidle,j is the initial energy that processor pi ′ j in configuration C consumes when it is not assigned any tasks. For each fj there exists a mapping hfj ∈ HC , that serves as a tool to rule out infeasible assignments. We define hfj as hfj (C,Xi, ϕ) = U ϕ i−1,j + ui,j − 1 if ui,j > 0, ∞ otherwise. (4.8) A new state ϕ′ ∈ Si can be produced from ϕ ∈ Si−1 if, and only if hfj (C,Xi, ϕ) 6 0. This definition of hfj captures both the requirements that a processor’s capacity should never be exceeded as stated in inequali- ties (4.6), and that a task must be assigned to at least one processor over which it is defined for a schedule to be feasible. A platform configuration will therefore be invalid if there is at least one task that cannot be as- signed to any processor in this configuration, over all states in the state 50 space of the preceding task. This can occur for a task Ti and configura- tion C either because Ti is not defined on any pi′j in C, i.e., ui,j = 0, and for which hfj (C,Xi, ϕ) = ∞ for all ϕ ∈ Si−1, or because assigning Ti to pi′j will violate its capacity constraint, i.e., U ϕ i−1,j + ui,j > 1, and for which hfj (C,Xi, ϕ) = U ϕ i−1,j + ui,j − 1 > 0, for all ϕ ∈ Si−1. Since SN contains all feasible schedules of all N tasks across the proces- sors in a platform configuration, it follows that there should exist an opti- mal schedule ϕ∗ ∈ SN for which E(ϕ∗) is minimum, i.e., Opt = E(ϕ∗) = min{E(ϕ) | ϕ ∈ SN}, where E(ϕ) = ∑M j=1E ϕ N,j . Algorithm ScheduleExact considers all configurations and, for each task Ti, generates all distributions of utilizations to processors ϕ from those of Ti−1 and eliminates infeasible ones according to hfj . For each configu- ration, it finds the state in the state space SN of the N -th task with the minimum energy expenditure and adds both its energy and reward to the so- lution space. Finally, the solution space is ordered by non-decreasing energy and the first solution point to achieve the desired QoS score is chosen. It should be noted that the solution points in the resulting solution space A form a partially ordered set, where some elements are incomparable. A point (Eq, Rq) is energy inefficient with respect to reward if there exists at least one point (E`, R`) such that Eq > E` and Rq 6 R`. This means that the first solution consumes more energy than the second while attaining less reward. Those points are incomparable, so prior to ordering A, the algorithm eliminates solution points that are energy inefficient with respect to reward to produce a total ordering on the solution space A. Proposition 1. ScheduleExact is exponential in both time and space. Proof. The algorithm starts with the initial state S0 = {〈(∅, 0), . . . , (∅, 0)〉}, so |S0| = 1. Consider some task assignment to a particular configuration C. At each iteration i of the algorithm, 1 6 i 6 N , each option xi ∈ Xi can generate one or more states from each ϕi−1 ∈ Si−1, because C might aggregate multiple identical processors that match xi. Whenever Ti is assigned to a processor, that processor is ‘stripped out’ because Ti cannot be assigned to the same processor more than once. Therefore the number of 51 Algorithm 2: ScheduleExact(〈O1, . . . , ON , 〉, Q) 1 A ← ∅ 2 S0 ← {〈(∅, 0), . . . , (∅, 0)〉} 3 foreach C of the ( λ+M−1 M ) configurations do 4 invalidConfiguration ← true 5 for i← 1 to N do 6 Si ← ∅ 7 foreach ϕ ∈ Si−1 do 8 Process Oi according to C to generate Xi 9 for j ← 1 to M do 10 if hfj (C,Xi, ϕ) 6 0 then 11 Si ← Si ∪ {fj(C,Xi, ϕ)} 12 invalidConfiguration ← false 13 end if 14 end for 15 end foreach 16 if invalidConfiguration is true then 17 Move to the next configuration 18 end if 19 end for 20 Examine every pair of states ϕ and ϕ′ ∈ SN . If E(ϕ) 6 E(ϕ′) and R(ϕ) > R(ϕ′) then remove ϕ′ from SN 21 Sort SN by non-decreasing E(ϕ); sort entries with equal energy value by non-increasing reward R(ϕ) 22 ϕ̂← the first ϕ in the sorted SN such that R(ϕ) > Q 23 if no ϕ in SN achieves R(ϕ) > Q then 24 Move to the next configuration 25 end if 26 A ← A∪ {(C, ϕ̂)} 27 end foreach 28 if all platform configurations are invalid then 29 Report “No feasible schedule exists” and terminate 30 end if 31 Examine every pair (C,ϕ) and (C ′, ϕ′) ∈ A. If E(ϕ) 6 E(ϕ′) and R(ϕ) > R(ϕ′) then remove (C ′, ϕ′) from A 32 Sort A by non-decreasing E(ϕ); sort entries with equal energy value by non-increasing reward R(ϕ) 33 return the first (C,ϕ) in the sorted A 52 possible assignments of a task to a configuration is at most M . This means that the state space expands by a factor of at most M for each task. Thus the total number of states in the i-th iteration is bounded from above by |Si| 6M |Si−1|. The total number of states is N∏ i=1 |Si| |Si−1| = |SN | |S0| 6M N so |SN | 6MN . Considering all configurations, ScheduleExact will require at least O (( λ+M−1 M ) MN ) time and space, which proves the claim. 4.4.2 The Fully Polynomial-Time Approximation Scheme The exponential state space generated above can be significantly reduced by observing that some states are ‘close’ to each other, and close states can be pruned without severely sacrificing optimality. Define the error factor to be an input that specifies the desired trade off between the accuracy of the solution and the running time of the algorithm. It can be in the range 0 < 6 1. (4.9) We use the technique of Trimming-The-State-Space, introduced by Ibarra and Kim [28] and treated rigorously by Woeginger [56], in order to thin out the state space by merging ‘close’ states and bring down the size of the state space to a polynomial in N and 1 . Doing so allows us to derive an FPTAS that produces a solution having a value Ẽ whose relative error with respect to the value of the optimal solution E∗ is bounded above by 53 the error factor , i.e., eE−E∗E∗ 6 , and therefore the approximate solution is at most (1 + ) far from the optimal solution: Ẽ 6 (1 + )E∗. In order to do so, we need a measure of closeness between states to quantitatively decide on how to prune states so that the error resulting from such pruning is bounded and controlled, which we state in the following definitions. Definition 3. Let x, y be any real numbers. We say that x and y are ∆- close, where ∆ > 1 is the trimming factor, if 1∆x 6 y 6 ∆x. The following proposition lists some useful and easily provable properties of ∆-closeness. Proposition 2. Properties of ∆-closeness 1. ∆-closeness is both reflexive and symmetric. 2. If x is ∆1-close to y and y is ∆2-close to z, then x is ∆1∆2-close to z. We extend the definition above naturally to M -dimensional vectors over RM . Definition 4. Let x,y be any vectors in RM . We say that x and y are ∆-close, where ∆ > 1 is the trimming factor, if 1∆xj 6 yj 6 ∆xj for every j ∈ {1, . . . ,M}. Now we make Definition 4 specific to the vectors in our application. Definition 5. Let Si be the state space of task Ti. Two states ϕ,ϕ′ ∈ Si are ∆-close with respect to energy expenditure, where ∆ > 1 is the trimming factor, if 1∆E ϕ i,j 6 E ϕ′ i,j 6 ∆E ϕ i,j for all j ∈ {1, . . . ,M}. We denote two states that are ∆-close with respect to energy as [E,∆]- close. This means that Eϕi,j and E ϕ′ i,j are ‘good’ representatives of each other, for all of the M processors in the platform, and either ϕ or ϕ′ can be kept 54 and the other discarded; the decision of which to keep depends on the appli- cation 1. By pruning states according to the measure above, we can obtain a solution that is ‘close-enough’ to the optimal while being able to control the error propagation resulting from such pruning. State Space Pruning: From now on, we shall denote the unpruned state space as U and the pruned state space (resulting from refining U) as P. The idea of state space pruning is to partition the state space into groups containing [E,∆]-close states, and then selecting one state from each non- empty group to enter the state space. We denote such groups as ∆-boxes Bk, where Bk ⊆ U for all k ( Woeginger [56]). Consider the state space Ui generated by Ti. Let Ei,j be a multiset containing the energy values of the j-th coordinate of every ϕ ∈ Ui Ei,j = {Eϕi,j | ϕ ∈ Ui}. In other words, Ei,j is the set of energy values associated with schedules on processor pi′j among all partial feasible schedules up to the i-th task Ti in some platform configuration. Let Ei,max = maxj max{Eϕi,j | ϕ ∈ Ui }. Then Ei,max is the maximum energy value across all processors and states in the state space Ui. Now let Li = dlog∆max(1, Ei,max)e. Every vector in Ui is in [0,∆Li ]M and thus Ui ⊆ [0,∆Li ]M . We create Li + 1 intervals, where I0 = [0, 1), Ik = [∆k−1,∆k) for all k ∈ {1, . . . , Li − 1}, and ILi = [∆Li−1,∆Li ]. Each Eϕi,j ∈ Ei,j resides in exactly one interval, and cannot exceed ∆Li . We partition the energy values in each Ei,j to intervals as defined above, mapping each Eϕi,j ∈ Ei,j to the interval to which it belongs. Assuming that SIZE(Ei,max) is polynomial in the size of the instance SIZE(I, ), the number of the thereby constructed intervals will be polynomially bounded in SIZE(I, ). Further, we claim that for any two states, if for every coordinate j the energy values of both states are in the same interval, the those states are [E,∆]-close. Claim 1. Let Ui be the state space of task Ti. Let ϕ, ϕ′ be two states in Ui. 1The reader can refer to Cormen, Leiserson, Rivest, and Stein [20] chap. 35 for an example of pruning as applied to the SUBSET_SUM problem. 55 If Eϕi,j and E ϕ′ i,j are in the same interval for every j ∈ {1, . . . ,M}, then ϕ and ϕ′ are [E,∆]-close. Proof. Let Eϕi,j , E ϕ′ i,j ∈ Ik = [∆k−1,∆k] for some k ∈ {1, . . . , Li}, for all coordinates j, j ∈ {1, . . . ,M}. Then ∆k−1 6 Eϕ ′ i,j 6 ∆k ∀j ∈ {1, . . . ,M} (4.10) and ∆k−1 6 Eϕi,j 6 ∆k ∀j ∈ {1, . . . ,M}. (4.11) Write the RHS of (4.11) as 1∆E ϕ i,j 6 ∆k−1 and the LHS of (4.11) as ∆k 6 ∆Eϕi,j . Substituting the rewritten inequalities into (4.10) yields 1 ∆ Eϕi,j 6 ∆k−1 6 E ϕ′ i,j 6 ∆k 6 ∆E ϕ i,j for every j ∈ {1, . . . ,M}. Thus ϕ and ϕ′ are [E,∆]-close by our definition of [E,∆]-closeness (definition 5). Algorithm SchedulePruned is a realization of the idea above. Pro- cedure Prune performs state pruning. For every coordinate of every state ϕ, in addition to the partial schedule Y ϕi,j and the energy value E ϕ i,j of this schedule, the algorithm maintains the number of the interval to which Eϕi,j belongs, which we will denote as `ϕi,j , where ` ϕ i,j ∈ {0, . . . , Li}. The pruning procedure will admit a single state from every Bk where Bk ∩ U 6= ∅ into the pruned state space P. Specifically, we choose the state with maximum reward over all states in the ∆-box in consideration to enter the pruned state space. Therefore, the number of states in the pruned state space is exactly the number of non-empty ∆-boxes. As a matter of fact, procedure Prune relies on the contra-positive of claim 1: if ϕ, ϕ′ ∈ Ui are [E,∆]-far, i.e., there exists j such that either Eϕ ′ i,j > ∆E ϕ i,j or E ϕ′ i,j < ∆ −1Eϕi,j , then Eϕ ′ i,j and E ϕ i,j must be in different ∆-boxes. Accordingly, the number of ∆-boxes of the i-th task will be bounded above by the maximum number of intervals over M coordinates (which we will show is polynomial in the input size when M is constant). 56 We use the observations above to bound the cardinality of the pruned state space as produced by our pruning procedure. First, we need to im- pose a technical condition on the packing functions fj with respect to the trimming factor ∆, so that the error propagation resulting from pruning is controlled. Proposition 3. If state ϕ is [E,∆]-close to state ϕ′, then for any fj , fk ∈ FC , fj(C,X,ϕ) is [E,∆]-close to fk(C,X,ϕ′). Functions that satisfy the condition above will be said to have the prop- erty of being [E,∆]-closeness preserving. This condition will be used later in establishing the correspondence between the state space that results from ScheduleExact and those produced by SchedulePruned. The pruning factor ∆: We choose the pruning factor ∆ to be equal to 1+ 2N . Therefore ∆ is an increasing function with respect to user specified error factor . Lemma 3. The number of states in each iteration after pruning, |Pi|, is polynomial in N and 1/. Proof. The cardinality of Pi is bounded above by the number of ∆-boxes Bk,i induced by the [E,∆]-closeness relation on Ui, which in turn is bounded above by the number of intervals Li in the i-th iteration. Let bi be the number of ∆-boxes Bi,k, k ∈ {1, . . . , bi}. Further let Bi = |{ k | Bi,k ∩ Ui 6= ∅, k ∈ {1, . . . , bi} }|. The number of ∆-boxes is at most (1 + Li)M and therefore |Pi| = Bi 6 bi 6 (1 + Li)M . We know that Li = dlog∆max(1, Ei,max)e = ⌈ lnmax(1, Ei,max) ln ( 1 + 2N ) ⌉ , (4.12) 57 Algorithm 3: SchedulePruned(〈O1, . . . , ON 〉, , Q) 1 A ← ∅ 2 P0 ← {〈(∅, 0), . . . , (∅, 0)〉} 3 foreach C of the ( λ+M−1 M ) configurations do 4 invalidConfiguration ← true 5 for i← 1 to N do 6 Process Oi according to C to generate Xi 7 Ui ← ∅ 8 foreach ϕ ∈ Pi−1 do 9 for j ← 1 to M do 10 if hfj (C,Xi, ϕ) 6 0 then 11 Ui ← Ui ∪ {fj(C,Xi, ϕ)} 12 invalidConfiguration ← false 13 end if 14 end for 15 end foreach 16 if invalidConfiguration is true then 17 Move to the next configuration 18 end if 19 Pi ← Prune(Ui, N, ) 20 end for 21 Examine every pair of states ϕ and ϕ′ ∈ PN . If E(ϕ) 6 E(ϕ′) and R(ϕ) > R(ϕ′) then remove ϕ′ from PN 22 Sort PN by non-decreasing E(ϕ); sort entries with equal energy value by non-increasing reward R(ϕ) 23 ϕ̂← the first ϕ in the sorted PN such that R(ϕ) > Q 24 if no ϕ in PN achieves R(ϕ) > Q then 25 Move to the next configuration 26 end if 27 A ← A∪ {(C, ϕ̂)} 28 end foreach 29 if all platform configurations are invalid then 30 Report “No feasible schedule exists” and terminate 31 end if 32 Examine every pair (C,ϕ) and (C ′, ϕ′) ∈ A. If E(ϕ) 6 E(ϕ′) and R(ϕ) > R(ϕ′) then remove (C ′, ϕ′) from A 33 Sort A by non-decreasing E(ϕ), sort entries with equal energy value by non-decreasing reward R(ϕ) 34 return the first (C,ϕ) in the sorted A 58 Procedure Prune(U , N, ) 1 ∆← (1 + 2N ) 2 P ← ∅ 3 Let Emax ← maxj max{Eϕj | ϕ ∈ U} 4 Let L = dlog∆max(1, Emax)e 5 Compute ∆2, . . .∆L 6 foreach ϕ ∈ U do 7 for j = 1 to M do 8 Consider (Y ϕj , E ϕ j , ` ϕ j ) 9 if Eϕj < 1 then 10 Set `ϕj ← 0 11 else /* Find the interval in which Eϕj resides */ 12 for `′ ← 1 to L do 13 if ∆`′−1 6 Eϕj 6 ∆` ′ then 14 Set `ϕj ← `′ and move to j + 1 15 end if 16 end for 17 end if 18 end for 19 end foreach 20 Group states in U for which `ϕj = `ϕ ′ j for every j ∈ {1, . . . ,M}, for every ϕ,ϕ′ ∈ U , in ∆-boxes B1, . . . ,Bb 21 for k ← 1 to b do 22 if Bk ∩ U 6= ∅ then 23 Add to P the state ϕ ∈ Bk for which R(ϕ) is maximum 24 end if 25 end for 26 return P 59 but lnmax(1, Ei,max) ln ( 1 + 2N ) 6 2N (1 + 2N ) lnmax(1, Ei,max) (4.13) 6 3N lnmax(1, Ei,max), (by inequality (4.9)) (Inequality (A.3) is obtained using ln(x+ 1) > xx+1 when x > −1). Thus Li 6 d1 + 3N/ lnmax(1, Ei,max)e. The cardinality of Pi is therefore |Pi| 6 (1 + Li)M 6 (1 + d1 + 3N/ lnmax(1, Ei,max)e)M . (4.14) The latter bound is polynomial in N , 1/, and the number of bits re- quired to encode the input in binary, where the number of processors M is constant. This concludes the proof. In order to obtain the approximation ratio of our FPTAS, we need a lemma relating the state spaces U and P of SchedulePruned to the state space S maintained by ScheduleExact. In fact, we need to show the effect of pruning on the value of the optimal solution produced by our algorithm. The following lemma shows how severely the optimal solution can deteriorate as a result of pruning. Lemma 4. If ϕ∗ ∈ SN is the state that yields the optimal solution in the exact DP formulation (Algorithm ScheduleExact), and ϕ̃ ∈ PN is the state returned by SchedulePruned, then ϕ̃ is at most [E,∆N ]-close to ϕ∗. Proof. State ϕ∗ is produced by a chain of N -applications of functions fj ∈ FC . Denote as ϕ∗i ∈ Si the state produced as a result of applying some fj ∈ FC to ϕ∗i−1 ∈ Si−1, i.e., ϕ∗i = fj(C,Xi, ϕ∗i−1) and thus ϕ∗ = ϕ∗N = fj(C,Xi, ϕ∗N−1). 60 Consider the decomposition of ϕ∗ into itsN partial schedules ϕ∗1, . . . , ϕ∗N . We show, by induction on i, that for ϕ∗i ∈ Si, there exists a state ϕ̃i ∈ Pi that is [E,∆i]-close to ϕ∗i . There is nothing to show for i = 0, since S0 = P0. Consider ϕ∗i−1 ∈ Si−1. By the induction hypothesis, there exists ϕ̃i−1 ∈ Pi−1 that is [E,∆i−1]-close to ϕ∗i−1. By construction of Ui, the set Ui contains the state ϕ̂i = fj(C,Xi, ϕ̃i−1). By construction of Pi, there exists a state ϕ̃i ∈ Pi that is [E,∆]-close to ϕ̂i. Since ϕ∗i−1 is [E,∆i−1]-close to ϕ̃i−1, the condition in proposition 3 guarantees that ϕ̃i is [E,∆i−1]-close to ϕ∗i . Since ϕ∗i is [E,∆ i−1]-close to ϕ̃i and ϕ̃i is [E,∆]-close to ϕ̂i, it follows by proposition 2.2 that ϕ∗i is [E,∆ i]-close to ϕ̂i. Applying this result for i = N proves the lemma. Now we are ready to obtain the desired approximation factor of our FPTAS. Lemma 5. The energy expenditure Ẽ of the solution produced by Sched- ulePruned is (1 + ) from the optimal solution E∗, i.e., Ẽ 6 (1 + )E∗ Proof. From Lemma 4 we have Eϕ̃N,j 6 ( 1 + 2N )N Eϕ ∗ N,j 6 (1 + )Eϕ ∗ N,j for every j ∈ {1, . . . ,M}, where we use the inequality (1 + xn)n 6 1 + 2x, 0 6 x 6 1, n > 1, with n = N and x = /2. This is a reasonable approximation that can be verified by noticing that the left hand side is convex over [0, 1] and the right hand side is linear. To formally prove its validity in our setting, we know that:( 1 + 2N )N = eN ln(1+/2N) (4.15) 6 e/2 (4.16) 6 1 + /2 + (/2)2 (4.17) 6 1 + , (4.18) 61 where (4.16) is obtained by ln(1+x) 6 x, when |x| 6 1, and inequality (4.18) follows from inequality (4.9) by the following 0 < 6 1 0 < (/2)2 6 /2 /2 < /2 + (/2)2 6 1 + /2 < 1 + /2 + (/2)2 6 1 + . Accordingly, the energy expenditure of our solution is Ẽ = E(ϕ̃) = M∑ j=1 Eϕ̃N,j 6 (1 + ) M∑ j=1 Eϕ ∗ N,j = (1 + )E(ϕ∗) = (1 + )Opt, which proves the lemma. Theorem 1. Algorithm SchedulePruned is a fully polynomial time ap- proximation scheme that runs in time polynomial in N , 1/, and the number of bits required to encode the input in binary, where the number of processors M is constant. Proof. Let us start by bounding the number of steps required by proce- dure Prune, which is the bottleneck of SchedulePruned. Setting the in- terval numbers for every energy value across all states and processors (line 6 to 19 ) requires at most a total ofM(Li+1)|Ui| comparisons with the differ- ent ∆`′ values, where Li is as defined in line 4. Grouping states into ∆-boxes (line 20) can be done as follows: Start with any vector in Ui, create a ∆-box for it and insert it in this ∆-box. Then pick another vector and compare 62 with the first vector: if both have exactly the same interval indexes for all M processors, then insert the vector being examined into the ∆ -box where the first vector resides, otherwise create a new ∆-box for it and insert it there. Therefore, for the k-th vector in Ui, say ϕk, we will need to compare the interval indexes of ϕk to only one vector in the thereby created ∆-boxes, and insert to the ∆-box to which it belongs, or otherwise create a new ∆- box for it. If, at the worst, when examining the k-th vector, every vector so far examined resides in its own ∆-box (i.e., all are ∆-far), then Prune will need to perform k − 1 vector comparisons. Therefore, the maximum number of interval index comparisons performed by the grouping procedure is bounded above by |Ui|∑ k=2 M(k − 1) = O(|Ui|2), which is the worst case asymptotic time complexity of Prune. Processing the input as in line 6 can be done in time linear in the size of the specification of each task input per platform configuration. For the elimination of incomparable solution points (line 21), we will need to examine at most (|PN | 2 ) = O(|PN |2) vectors. The solution points per PN can be sorted (line 22) by an optimal sorting algorithm using Θ(|PN | log |PN |) comparisons. Similarly for the set A, we note that the cardinality of A cannot exceed the number of configurations Λ; therefore, elimination of incomparable so- lution points (line 32) will need to examine at most (|Λ| 2 ) = O(|Λ|2) vectors, which is constant with respect to the size of the input (i.e., does not grow as the input size changes). Sorting A (line 33) requires Θ(|Λ| log |Λ|), which, again, is a constant in the size of the input. Since |Ui| 6 M |Pi−1|, we can write all of the bounds above in terms of Pi−1 instead of Ui, where |Pi−1| is as defined in (4.14). Let γi = (1 + d1 + 3N/ lnmax(1, Ei,max)e). Accordingly, the number of comparisons required for setting the index intervals in Prune as disscued above can be written as M(Li + 1)|Ui| 6M2γiγMi−1. 63 Further, let Ê = maxiEi,max. Let γ = (1 + d1 + 3N/ lnmax(1, Ê)e). If we write the bounds above in terms of Pi−1 and sum them over N , then the asymptotic time complexity of Algorithm SchedulePruned is T (N, ) = O(Nγ2M ). This concludes the proof. Practical Considerations The running time of algorithm SchedulePruned is a polynomial with de- gree equal to twice the number of machines comprising the platform, 2M . In addition, the hidden constants (at least M3 ( λ+M−1 M ) ) can be considerably large for large enough platforms. As a consequence, SchedulePruned might not exhibit a fast running time for a considerably large number of processors. This makes our algorithm most suitable for environments that exhibit a static behavior in terms of task requirements and system condi- tions, so that the algorithm is executed in an offline fashion. The running time of SchedulePruned, however, can be significantly enhanced by using the efficient implementation of partitioned scheduling using a look-up table approach, which was devised by Chattopadhyay and Baruah [15]. In brief, their algorithm computes all possible configurations for each bin independently of the task set requirements only once per platform (this is the computationally intensive part, but the number of configurations is polynomially bounded in the size of the problem instance), and then uses the stored tables to assign tasks to processors in an efficient manner (this is the efficient implementation part). Doing so will allow SchedulePruned to respond faster to changes in the platform, such as overload conditions, processor failures, etc., without the need to re-assemble the platform for every run of the algorithm. 64 Chapter 5 Conclusions The service class model for task assignment is a simple method to express resource requirements and the associated reward/quality of service/perfor- mance/cost. This model can be used for several resource management prob- lems, especially when the relationship between resource utilization and qual- ity is highly non-linear. We used a basic service class model and provided some preliminary answers to immediate questions raised by the model. For restricted cases of QoS optimization with task allocation, we described poly- nomial time approximation schemes. Polynomial-time exact algorithms can- not exist unless P = NP; thus, approximations are the best we can hope for in polynomial-time. Further, the only measure of aggregate quality that we used was summations. There may be other better metrics for capturing the aggregate quality of service, and optimization using those metrics requires more mathematical insight. An interesting question to ponder is whether there are certain classes of functions that are amenable to joint treatment with task allocation (and bin packing). Designing real-time computing systems to satisfy quality of service re- quirements while simultaneously reducing energy consumption is a compu- tationally hard problem. While it may not be possible to obtain an optimal solution in polynomial time, we have been able to establish that near-optimal solutions can be obtained in polynomial time. The strengths of our work are in this joint treatment of QoS and en- 65 ergy issues and in the accurate modeling of the execution time variations of tasks with processor speeds and in being able to capture capturing energy consumption in resources other than the processor. The holistic modeling of execution times and energy consumption improves the state of the art in this design problem. We chose to restrict our attention to implicit-deadline real-time tasks and employed utilization bounds as a test for schedulability. We would like to remark that this is not a significant restriction in itself. The framework can support exact schedulability tests and arbitrary deadlines as long as one employs a polynomial time schedulability test. While exact schedulability tests do not run in polynomial time, we can instead use approximate tests (such as the one proposed by Chakraborty, Künzli, and Thiele [14]) to obtain polynomial time approximation schemes. We have considered the situation when we have a fixed number of proces- sor types and processors belong to one of these processor types. This opens up the way for further work on heuristics that have good performance and are faster than the dynamic programming approach that we have presented. Limited heterogeneity in compute units is a dominant choice in the design of embedded systems and our scheme is well suited to addressing this com- mon case. Recently, Andersson, Raravi, and Bletsas [6] have examined the problem of task allocation without energy or QoS considerations on a plat- form with two processor types. Their heuristic methods may be an initial direction to consider for the more general problem we have defined. Whereas we have considered the quality of service as being related only to the processor type that a task is assigned to, we believe that it is possi- ble to extend our work to cover the case when a task has multiple service classes. Consider, as an example, a video application that can operate a high, medium or low quality. This adds another choice dimension where one needs to select the appropriate service class for a task in conjunction with the choices that we have discussed in this article. Task allocation and scheduling with multiple classes of service per task is a problem that has theoretical and practical implications for the design of real-time systems. We also note that we have dealt with the case where energy consumed by 66 executing a job is independent of other jobs that are executed on the same processor. This may not always be the case. For example, jobs that have mutual cache interference may lead to higher energy consumption because of an increased number of cache misses that need to be satisfied by main memory. Considering such interference between tasks is a direction for future work. It first needs extensive measurements and methods for estimating the interference between tasks. 67 Bibliography [1] L. Abeni and G. Buttazzo. Integrating multimedia application in hard real-time systems. In Proceedings of the IEEE Real-Time Systems Symposium, pages 4–13, Dec. 1998. [2] M. Agrawal, N. Kayal, and N. Saxena. Primes is in p. Ann. of Math, 2:781–793, 2002. [3] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network Flows: Theory, Algorithms and Applications. Prentice Hall, 1993. [4] R. K. Ahuja, M. Kodialam, A. K. Mishra, and J. B. Orlin. Computational investigations of maximum flow algorithms. In European Journal of Operations Research, volume 97, pages 509–542, 1997. [5] Amazon.com. EC2: Amazon Elastic Compute Cloud. http://aws.amazon.com/ec2/. [6] B. Andersson, G. Raravi, and K. Bletsas. Assigning Real-Time Tasks on Heterogeneous Multiprocessors with Two Unrelated Types of Processor. To appear at The 31st IEEE Real-Time Systems Symposium November 30 - December 3, 2010, San Diego, CA, USA. [7] S. Arora and B. Barak. Computational Complexity: A Modern Approach. Cambridge University Press; first edition, April 20, 2009. [8] W. Baek and T. M. Chilimbi. Green: a framework for supporting energy-conscious programming using controlled approximation. In Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation, PLDI ’10, pages 198–209, New York, NY, USA, 2010. ACM. ISBN 978-1-4503-0019-3. doi:http://doi.acm.org/10.1145/1806596.1806620. URL http://doi.acm.org/10.1145/1806596.1806620. 68 [9] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In Proceedings of the ACM Symposium on Operating Systems Principles, pages 164–177, 2003. [10] S. Baruah, R. Howell, and L. Rosier. Algorithms and complexity concerning the preemptive scheduling of periodic real-time tasks on one processor. Real-Time Systems, 2(4):301–324, November 1990. [11] E. Bini, G. Buttazzo, and G. Lipari. Minimizing cpu energy in real-time systems with discrete speed management. ACM Transactions on Embedded Computing Systems, 8(4):1–23, 2009. ISSN 1539-9087. doi:http://doi.acm.org/10.1145/1550987.1550994. [12] G. C. Buttazzo. Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and Applications, volume 23 of Real-Time Systems Series. Springer, 2 edition, 2005. [13] J. Carpenter, S. Funk, P. Holman, A. Srinivasan, J. Anderson, and S. Baruah. A categorization of real-time multiprocessor scheduling problems and algorithms. In J. Y.-T. Leung, editor, Handbook of Scheduling: Algorithms, Models, and Performance Analysis. Chapman Hall/CRC Press, 2004. [14] S. Chakraborty, S. Künzli, and L. Thiele. Approximate schedulability analysis. In RTSS ’02: Proceedings of the 23rd IEEE Real-Time Systems Symposium, page 159, Washington, DC, USA, 2002. IEEE Computer Society. ISBN 0-7695-1851-6. [15] B. Chattopadhyay and S. Baruah. A lookup-table driven approach to partitioned scheduling. In Real-Time and Embedded Technology and Applications Symposium (RTAS), 2011 17th IEEE, pages 257–265, april 2011. doi:10.1109/RTAS.2011.32. [16] J.-J. Chen. Expected energy consumption minimization in dvs systems with discrete frequencies. In SAC ’08: Proceedings of the 2008 ACM Symposium on Applied Computing, pages 1720–1725, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-753-7. doi:http://doi.acm.org/10.1145/1363686.1364095. [17] N. Christofides. Worst-case analysis of a new heuristic for the travelling salesman problem. Technical report no. 388, Graduate School of Industrial Administration, Carnegie-Mellon University, 1976. 69 [18] E. G. Coffman, M. R. Garey, and D. S. Johnson. Approximation algorithms for bin packing: a survey. In D. Hochbaum, editor, Approximation Algorithms for NP-Hard Problems, pages 46–93. PWS Publishing, Boston, 1996. [19] S. A. Cook. The complexity of theorem-proving procedures. In Proceedings of the third annual ACM symposium on Theory of computing, STOC ’71, pages 151–158, New York, NY, USA, 1971. ACM. doi:http://doi.acm.org/10.1145/800157.805047. URL http://doi.acm.org/10.1145/800157.805047. [20] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Third Edition. McGraw-Hill Science/Engineering/Math, 2009. ISBN 0262033844. [21] W. F. de la Vega and G. S. Lueker. Bin packing can be solved within (1 + ) in linear time. Combinatorica, 1(4):349–355, 1981. [22] J. Dey, J. Kurose, and D. Towsley. On-line scheduling policies for a class of iris (increasing reward with increasing service) real-time tasks. Computers, IEEE Transactions on, 45(7):802 –813, jul. 1996. ISSN 0018-9340. doi:10.1109/12.508319. [23] D. K. Friesen and M. A. Langston. Variable sized bin packing. SIAM Journal of Computing, 15(222–230), 1986. [24] M. R. Garey and R. L. Graham. Bounds for multiprocessor scheduling with resource contraints. SIAM Journal of Computing, 4(2):187–201, June 1975. [25] M. R. Garey and D. S. Johnson. Computers and intractability: a guide to the theory of NP-completeness. W. H. Freeman, 1979. [26] M. Guthaus, J. Ringenberg, D. Ernst, T. Austin, T. Mudge, and R. Brown. Mibench: A free, commercially representative embedded benchmark suite. pages 3 – 14, dec. 2001. doi:10.1109/WWC.2001.990739. [27] D. S. Hochbaum and D. B. Shmoys. Using dual approximation algorithms for scheduling problems: theoretical and practical results. Journal of the ACM, 34(1):144–162, January 1987. 70 [28] O. H. Ibarra and C. E. Kim. Fast approximation algorithms for the knapsack and sum of subset problems. J. ACM, 22(4):463–468, 1975. ISSN 0004-5411. doi:http://doi.acm.org/10.1145/321906.321909. [29] Intel Corporation. Intel R© CoreTM i-7 900 Mobile Processor Extreme Edition Series, Intel Core i7-800 and i7-700 Mobile Processor Series Datasheet. http://download.intel.com/design/processor/datashts/320765.pdf, 2009. [30] S. Irani, S. Shukla, and R. Gupta. Algorithms for power savings. ACM Transactions on Algorithms, 3(4):41, 2007. ISSN 1549-6325. doi:http://doi.acm.org/10.1145/1290672.1290678. [31] T. Ishihara and H. Yasuura. Voltage scheduling problem for dynamically variable voltage processors. In ISLPED ’98: Proceedings of the 1998 International Symposium on Low power electronics and design, pages 197–202, New York, NY, USA, 1998. ACM. ISBN 1-58113-059-7. doi:http://doi.acm.org/10.1145/280756.280894. [32] R. Jejurikar, C. Pereira, and R. Gupta. Leakage aware dynamic voltage scaling for real-time embedded systems. pages 275 – 280, 2004. [33] N. Karmarkar and R. M. Karp. An efficient approximation scheme for the one-dimensional bin packing problem. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science, pages 312–320, November 1982. [34] A. Khemka, R. K. Shyamsundar, and K. V. Subrahmanyam. Multiprocessors scheduling for iprecise computations in a hard real-time environment. In Proceedings of the International Parallel Processing Symposium, pages 374–378, April 1993. [35] R. E. Korf. An improved algorithm for optimal bin packing. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1252–1258, 2003. [36] C. Lee, J. Lehoezky, R. Rajkumar, and D. Siewiorek. On quality of service optimization with discrete qos options. In Real-Time Technology and Applications Symposium, 1999. Proceedings of the Fifth IEEE, pages 276–286, 1999. doi:10.1109/RTTAS.1999.777680. [37] C.-G. Lee, C.-S. Shih, and L. Sha. Online QoS optimization using service classes in surveillance radar systems. Real-Time Systems, 28 (1):5–37, October 2004. 71 [38] W. Leinberger, G. Karypis, and V. Kumar. Multi-capacity bin packing algorithms with applications to job scheduling under multiple constraints. In Proceedings of the International Conference on Parallel Processing, pages 404–413, November 1999. [39] L. Levin. Universal search problems. Problems of Information Transmission, translated into English by Trakhtenbrot, B. A. (1984). “A survey of Russian approaches to perebor (brute-force searches) algorithms”. Annals of the History of Computing, 6 (4): 384-400, 9 (3), 1973. [40] C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogramming in a hard real-time environment. Journal of the ACM, 20(1):46–61, Jan. 1973. [41] J. Liu, K.-J. Lin, W.-K. Shih, A.-s. Yu, J.-Y. Chung, and W. Zhao. Algorithms for scheduling imprecise computations. Computer, 24(5): 58 –68, may. 1991. ISSN 0018-9162. doi:10.1109/2.76287. [42] J. W.-S. Liu. Real-Time Systems. Prentice Hall, 2000. [43] J. M. Lopez, J. L. Diaz, M. Garcia, and D. F. Garcia. Worst-case utilization bound for EDF scheduling on real-time multiprocessor systems. In Proceedings of the Euromicro Conference on Real-Time Systems, pages 25–33, June 2000. [44] C. H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994. [45] C. H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Dover Publications, Mineola, NY, 2 edition, 1998. [46] I. Pratt, K. Fraser, S. Hand, C. Limpach, and A. Warfield. Xen 3.0 and the art of virtualization. In Linux Symposium, volume 2, pages 65–78, July 2005. [47] V. R. Pratt. Every prime has a succinct certificate. SIAM J. Comput., 4(3):214–220, 1975. [48] O. Reingold. Undirected st-connectivity in log-space. In Proceedings of the thirty-seventh annual ACM symposium on Theory of computing, STOC ’05, pages 376–385, New York, NY, USA, 2005. ACM. ISBN 72 1-58113-960-8. doi:http://doi.acm.org/10.1145/1060590.1060647. URL http://doi.acm.org/10.1145/1060590.1060647. [49] C. Rusu, R. Melhem, and D. Mossé. Maximizing rewards for real-time applications with energy constraints. IN ACM Transactions on Embedded Computing Systems, 2(4):537–559, 2003. ISSN 1539-9087. doi:http://doi.acm.org/10.1145/950162.950166. [50] K. Seth, A. Anantaraman, F. Mueller, and E. Rotenberg. Fast: Frequency-aware static timing analysis. ACM Transactions on Embedded Computing Systems, 5(1):200–224, 2006. [51] H. Shachnai and T. Tamir. Polynomial time approximation schemes for class-constrained packing problems. In Proceedings of the Workshop on Approximation Algorithms, pages 238–249, 1999. [52] A. M. Turing. On computable numbers, with an application to the entscheidungsproblem. Proceedings of The London Mathematical Society, s2-42:230–265, 1937. doi:10.1112/plms/s2-42.1.230. [53] B. Urgaonkar, P. Shenoy, and T. Roscoe. Resource overbooking and application profiling in shared hosting platforms. In Proceedings of the USENIX Symposium on Operating System Design and Implementation, pages 239–254, December 2002. [54] V. V. Vazirani. Approximation Algorithms. Springer, March 22, 2004. [55] VMWare, Inc. VMWare ESX Hypervisor. http://www.vmware.com/products/vi/esx/. [56] G. J. Woeginger. When does a dynamic programming formulation guarantee the existence of an fptas? Electronic Colloquium on Computational Complexity (ECCC), (084), 2001. [57] G. J. Woeginger. There is no asymptotic PTAS for two-dimensional vector packing. Information Processing Letters, 64(6):293–297, December 1997. [58] W. Wolf. The future of multiprocessor systems-on-chips. In DAC ’04: Proceedings of the 41st annual Design Automation Conference, pages 681–685, New York, NY, USA, 2004. ACM. ISBN 1-58113-828-8. doi:http://doi.acm.org/10.1145/996566.996753. 73 [59] T. Wood, P. Shenoy, A. Venkataramani, and M. Yousif. Black-box and gray-box strategies for virtual machine migration. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation, pages 229–242, April 2007. [60] C.-Y. Yang, J.-J. Chen, T.-W. Kuo, and L. Thiele. An approximation scheme for energy-efficient scheduling of real-time tasks in heterogeneous multiprocessor systems. In DATE, pages 694–699, 2009. 74 Appendix A A Modified Pruning Technique: Energy Rounding We present a slightly different technique for state space pruning, which might result in a slightly improved running time; one that depends on the ratio of energy values instead of maximum energy values, which was the case in the original pruning technique. In addition, we present a generalization to binary search, one that locates an element in a list based on ∆-closeness instead of arithmetic equality. We modify procedure Prune as following (the modified version is re- ported in Prune2). In addition to the input state space U , Prune2 main- tains an intermediate state space, Ũ , which contains energy-rounded versions of the states in U . For each state in the untrimmed input state space U , and for each processor pi′j in C, Prune2 sorts the states in Ũ in an increasing order of the energy value of pi′j , then for the state being examined, say ϕ, it determines whether the energy value Eϕj corresponding to pi ′ j is [E,∆]-close to any energy value of pi′j in the sorted list of rounded states. If so, it rounds the energy value Eϕj to the energy value of the corresponding processor in the state to which it was found to be [E,∆]-close in Ũ . Say that Eϕj is ∆-close to Eϕ̃j , where ϕ̃ ∈ Ũ , such that ∆−1Eϕ̃j 6 Eϕj 6 ∆Eϕ̃j . Then round- ing is performed simply by replacing ϕ’s j-th coordinate with that of ϕ̃’s, i.e., setting (Y ϕj , E ϕ j ) := (Y ϕ̃ j , E ϕ̃ j ). After rounding is done, Prune2 parti- 75 tions the rounded states into ∆-boxes, where states in the same ∆-box have equal energy components for all processors. Finally, the pruning procedure admits a single state from every non-empty ∆-box into the pruned state space. Therefore, the number of states in the pruned state space is exactly the number of non-empty ∆-boxes. ProcedurePrune2 relies onGeneralized Binary Search (procedureGBS) to locate the position of the energy value of the unrounded state in the ordered list of (rounded) energy values, for the same processor, based on [E,∆]-closeness instead of exact equality. Procedure GBS is a natural generalization of binary search in our context, and its correctness follows from the invariant that distinct energy values for the same processor in the pruned state space are always ∆-far. Further, note that GBS becomes the well-known equality-based binary search algorithm if we set ∆ = 1. The resulting pruned state space possesses the following property: any two energy values for the same processor across all states in the pruned state space are either equal or ∆-far, and equal energy values correspond to the same schedule. Note that any two vectors in different ∆-boxes are [E,∆]-far, i.e., for ϕ ∈ Bi,k and ϕ′ ∈ Bi,l, k 6= l, there exists j such that either Eϕ ′ i,j > ∆E ϕ i,j or Eϕ ′ i,j < 1 ∆E ϕ i,j . Thus the maximum number of Bks is upper bounded by a polynomial in the number of distinct energy values of each processor. In light of the new pruning procedure, we recast lemma 3 to obtain a bound on the cardinality of the pruned state space. Lemma 6. The number of states in each iteration after pruning as produced by Prune2, |Pi|, is polynomial in N and 1/. Proof. Let Ei,j be the set containing all distinct energy values of the j-th coordinate of every ϕ ∈ Pi Ei,j = {Eϕi,j | ϕ ∈ Pi}. The cardinality of Pi is bounded above by the number of non-empty ∆-boxes Bk,i induced by the [E,∆]-closeness relation on Ui, which in turn 76 Procedure Prune2(U , N, ) 1 ∆← (1 + 2N ) 2 P ← ∅ 3 Ũ ← ∅ 4 foreach ϕ ∈ U do 5 for j = 1 to M do 6 Let L ← {(Y ϕ̃j , Eϕ̃j ) | ϕ̃ ∈ Ũ}} 7 Sort L in ascending order of Eϕ̃j 8 Let L′ ← {Eϕ̃j | (Y ϕ̃j , Eϕ̃j ) ∈ L} 9 [found, k]← GBS(L′, Eϕj ,∆, 1, |L′|) 10 if found is true then /* k is the index of the entry (state) in the sorted L */ 11 (Y ϕj , E ϕ j )← L[k] 12 end if 13 end for 14 Add the rounded ϕ to Ũ 15 end foreach 16 Group states in Ũ for which Eϕj = Eϕ ′ j for every ϕ, ϕ ′ ∈ Ũ in ∆-boxes B1, . . . ,Bb 17 For every Bk, Add to P any ϕ ∈ Bk if Bk ∩ Ũ 6= ∅ 18 return P is bounded above by product of the number of distinct, ∆-far energy values across all M processor in the platform configuration. Let bi be the number of ∆-boxes Bi,k, k ∈ {1, . . . , bi}. Further let Bi = |{k | Bi,k ∩ Ui 6= ∅ k ∈ {1, . . . , bi}}|. Accordingly |Pi| = Bi 6 bi 6 M∏ j=1 |Ei,j |. Assume that energy values in each Ei,j are sorted in (strictly) increasing order such that E ϕq j < E ϕ` j when q < `, ∀j ∈ {1, . . . ,M}. 77 Procedure GBS(list, needle, ∆, min, max) output: A pair [found, index]: found is true if needle is successfully located in list; false otherwise. index is the location of of needle in list if it is found 1 while min < max do 2 mid = min+ (max−min)/2 3 if 1∆ list[mid] 6 needle 6 ∆list[mid] then 4 return [true,mid] 5 end if 6 if needle > ∆list[mid] then 7 min← mid+ 1 8 else 9 max← mid− 1 10 end if 11 end while 12 return [false, ] Any two consecutive non-zero energy values in the total orderings above are related by E ϕυ+1 j Eϕυj > ( 1 + 2N ) ∀υ ∈ {1, . . . , |Ei,j | − 1}. Therefore, Eϕ2i,j > ∆E ϕ1 i,j Eϕ3i,j > ∆E ϕ2 i,j > ∆ 2Eϕ1i,j ... E ϕ|Ei,j | i,j > ∆E ϕ|Ei,j |−1 i,j > · · · > ∆(|Ei,j |−1)Eϕ1i,j . (A.1) From (A.1) we get βi,j = E ϕ|Ei,j | i,j Eϕ1i,j > ∆(|Ei,j |−1). (A.2) 78 The constant βi,j is not defined if Ei,j does not contain two non-zero energy values, which happens in the following cases 1. The initial state, i.e. β0,j , 2. If |Ei,j | = 1, 3. If |Ei,j | = 2 and Eϕ1i,j = 0. If βi,j is defined, then, from (A.2), we get log∆∆ (|Ei,j |−1) < log∆ βi,j ⇒ |Ei,j | < 1 + lnβi,jln (1 + 2N ) 6 1 + 2N ( 1 + 2N ) lnβi,j (A.3) 6 1 + 3N lnβi,j (by inequality (4.9)) = O (N/ log βi,j) ∀i ∈ {1, . . . , N} (A.4) (Inequality (A.3) is obtained using ln(x+ 1) > xx+1 when x > −1). It might be the case that |Ei,j | > 2 and Eϕ1i,j = 0. Noticing that there cannot possibly be more than one zero element in Ei,j — since it contains distinct energy values — the situation can be handled easily by modifying equation (A.1) to use Eϕ2i,j as the minimum energy instead of E ϕ1 i,j . Never- theless, this will yield the same asymptotic upper bound as that in equa- tion (A.4). The cardinality of Pi is therefore |Pi| 6 M∏ j=1 |Ei,j | (A.5) 6 (N/)M M∏ j=1 log βi,j [if βi,j is defined] (A.6) 79 We will also assume that SIZE(βi,j) is polynomially bounded in SIZE(I, ) in order for the bound in (A.6) to be polynomial in SIZE(I, ). Further, under the assumption that the packing functions fj are [E,∆]-closeness preserv- ing (satisfy the condition in proposition 3), lemma 4 says that every energy value of the optimal solution can be rounded down at most N times as a result of pruning, so every energy value of the solution returned by Sched- ulePruned is at most ( 1 + 2N )N times the corresponding energy value in the optimal solution, and thus the 1 + approximation factor is directly obtained as we did earlier. A.1 Running-Time Analysis Procedure GBS is just an extension of binary search and therefore requires O(log n) time to locate an element in a list of size n. Therefore, the number of energy comparisons carried out for the state space Ui is bounded above by |Ui|∑ k=2 2M log k − 1 = O(|Ui| log |Ui|). Grouping states into ∆-boxes can be done as follows: Start with any vector in Ũi, create a ∆-box for it and then insert it in this ∆-box. Then pick an- other vector and compare with the first vector: if both have exactly the same energy values for all M processors, then insert the vector being examined in the ∆ -box where the first vector resides, otherwise create a new ∆-box for it and insert it there. Therefore, for the k-th vector in Ũi, say ϕk, we will need to energy-wise compare ϕk to only one vector in the thus created ∆-boxes, and insert to the ∆-box to which it belongs, or otherwise create a new ∆-box for it. If, at the worst, when examining the k-th vector, every vector so far examined resides in its own ∆-box, then the algorithm will need to perform k vector comparisons. Therefore, the maximum number of 80 vector comparisons of the grouping procedure is bounded above by |Ũi|∑ k=2 k = O(|Ui|2). Therefore the time complexity of Prune2 is O(|Ui|2). If we let β̂ := maximaxj βi,j , then |Pi| = O (( N/ log β̂ )M) . If we write the bounds above in terms of Pi−1 and sum them over N , then the the time complexity of Algorithm 3 with the new pruning procedure is T (N, ) = O ( N ( N/ log β̂ )2M) . This bound is polynomial in N , 1/, and the number of bits required to encode the input in binary, where the number of processors M is constant. 81
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Approximation algorithms for task allocation with QoS...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Approximation algorithms for task allocation with QoS and energy considerations Alahmad, Bader Naim 2011
pdf
Page Metadata
Item Metadata
Title | Approximation algorithms for task allocation with QoS and energy considerations |
Creator |
Alahmad, Bader Naim |
Publisher | University of British Columbia |
Date Issued | 2011 |
Description | We consider general problems of allocating tasks to processors where each task is associated with a set of service classes. A service class is a tuple: the first element represents the resource utilization and the second element the quality of service associated with that resource utilization. Before allocating a task to a processor, we need to assign it to a service class. We consider some elementary problems that arise from this setting. What is the minimum number of processors needed if we also need to attain a minimum aggregate QoS level? Given a fixed set of processors, what is the maximum attainable aggregate QoS? Such questions apply to the partitioned scheduling of real-time tasks on multiprocessors and to other resource allocation problems. The basic questions are NP-Hard, and we present polynomial time approximation algorithms for certain special, but practical, cases. We make interesting use of maximum flows in a bipartite graph while developing the polynomial time approximation schemes. We then integrate energy expenditure to the model above and address the problem of partitioning a set of independent, periodic, real-time tasks over a fixed set of heterogeneous processors while minimizing the energy consumption of the computing platform subject to a guaranteed quality of service requirement. This problem is NP-Hard and we present a fully polynomial time approximation scheme for this problem. The main contribution of our work is in tackling the problem in a completely discrete, and possibly arbitrarily structured, setting. In other words, each processor has a discrete set of speed choices. Each task has a computation time that is dependent on the processor that is chosen to execute the task and on the speed at which that processor is operated. Further, the energy consumption of the system is dependent on the decisions regarding task allocation and speed settings. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2011-06-02 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NoDerivs 3.0 Unported |
DOI | 10.14288/1.0103223 |
URI | http://hdl.handle.net/2429/35081 |
Degree |
Master of Applied Science - MASc |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2011-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nd/3.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2011_fall_alahmad_bader.pdf [ 820.25kB ]
- Metadata
- JSON: 24-1.0103223.json
- JSON-LD: 24-1.0103223-ld.json
- RDF/XML (Pretty): 24-1.0103223-rdf.xml
- RDF/JSON: 24-1.0103223-rdf.json
- Turtle: 24-1.0103223-turtle.txt
- N-Triples: 24-1.0103223-rdf-ntriples.txt
- Original Record: 24-1.0103223-source.json
- Full Text
- 24-1.0103223-fulltext.txt
- Citation
- 24-1.0103223.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0103223/manifest