Cross-layer Adaptive Transmission Scheduling in Wireless Networks
by
Minh Hanh Ngo
B.E., The University of Sydney, 2003
A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Doctor of Philosophy
in
The Faculty of Graduate Studies
(Electrical & Computer Engineering)
The University Of British Columbia
November, 2007
© Minh Hanh Ngo 2007
Abstract
A promising new approach to wireless network optimization takes a cross-layer perspective. This thesis focuses on exploiting channel state information (CSI) from the physical layer for optimal transmission scheduling at the medium access control (MAC) layer. The first part of the thesis considers exploiting CSI via a distributed channel-aware MAC protocol. The MAC protocol is analysed using a centralized design approach and a noncooperative game theoretic approach. Structural results are obtained, and provably convergent stochastic approximation algorithms that can estimate the optimal transmission policies are proposed. In particular, in the game theoretic MAC formulation, it is proved that the best response transmission policies are threshold in the channel state and that there exists a Nash equilibrium at which every user deploys a threshold transmission policy. This threshold result leads to a particularly efficient stochastic-approximation-based adaptive learning algorithm and a simple distributed implementation of the MAC protocol. Simulations show that the channel-aware MAC protocols result in system throughputs that increase with the number of users.
The thesis also considers opportunistic transmission scheduling from the perspective of a single user using Markov Decision Process (MDP) approaches. Both channel state information and channel memory are exploited for opportunistic transmission. First, a finite horizon MDP transmission scheduling problem is considered. The finite horizon formulation is suitable for short-term delay constraints.
It is proved for the finite horizon opportunistic transmission scheduling problem that the optimal transmission policy is threshold in the buffer occupancy state and the transmission time. This two-dimensional threshold structure substantially reduces the computational complexity required to compute and implement the optimal policy. Second, the opportunistic transmission scheduling problem is formulated as an infinite horizon average cost MDP with a constraint on the average waiting cost. An advantage of the infinite horizon formulation is that the optimal policy is stationary. Using Lagrange dynamic programming theory and the supermodularity method, it is proved that the stationary optimal transmission scheduling policy is a randomized mixture of two policies that are threshold in the buffer occupancy state. A stochastic approximation algorithm and a Q-learning based algorithm that can adaptively estimate the optimal transmission scheduling policies are then proposed.
Table of Contents
Abstract
Table of Contents
List of Figures
List of Abbreviations
Preface
Acknowledgements
Dedication
1 Introduction
1.1 Thesis Overview
1.2 Cross-layer Optimization
1.2.1 Cross-layer Medium Access Control
1.2.2 Opportunistic Transmission Scheduling
1.2.3 Multipacket Reception Wireless Networks
1.3 Analytical Tools
1.3.1 Game Theory for Communications
1.3.2 Markov Decision Processes
1.3.3 Stochastic Approximation Algorithms
1.4 Summary of Results and Contributions
1.4.1 Outline of Results
1.4.2 Overview of Contributions
1.5 Thesis Organization
2 Optimal Distributed Medium Access Control
2.1 Introduction
2.1.1 Main Results
2.1.2 Related Work
2.2 Distributed MAC for Wireless Networks
2.2.1 Transmission Scheduling in Random Access Wireless Networks
2.2.2 G-MPR Model
2.3 Distributed MAC Optimization Problem
2.3.1 Problem 1: Optimal Distributed MAC for Asymmetric Networks
2.3.2 Problem 2: Optimal Distributed MAC for Symmetric Networks
2.4 Structured Optimal Transmission Policy for Heterogeneous Networked Users
2.4.1 Optimality of Pure Transmission Policies
2.4.2 Piecewise Continuity
2.4.3 Optimality of Not Transmitting
2.5 Structured Optimal Transmission Policy for Homogeneous Networked Users
2.6 Estimating Optimal Transmission Policy via Stochastic Approximation
2.7 Examples
2.7.1 An Analytical Example
2.7.2 Numerical Examples
2.8 Conclusions
2.9 Analytical Proofs
2.9.1 Proof of Theorem 2.1
2.9.2 Proof of Theorem 2.2
2.9.3 Proof of Theorem 2.3
2.9.4 Proof of Theorem 2.4
2.9.5 Proof of Corollary 2.1
2.9.6 Proof of Corollary 2.2
2.9.7 Proof of Theorem 2.6
3 Game Theoretic Approach for Medium Access Control
3.1 Introduction
3.1.1 Main Results
3.1.2 Related Work
3.2 Distributed MAC for Wireless Networks
3.2.1 Random Access Wireless Network Model
3.2.2 G-MPR Model
3.3 Game Theoretic MAC Formulation
3.4 Structured Nash Equilibrium
3.4.1 Optimality of Threshold Policies
3.4.2 Existence of a Nash Equilibrium
3.5 Symmetric Nash Equilibrium for the Symmetric Game
3.6 Learning Best Responses via Stochastic Approximation
3.6.1 Deterministic Best Response Dynamics Algorithm
3.6.2 Convergence of Best Response Dynamics
3.6.3 Stochastic Best Response Dynamics Algorithm
3.6.4 Properties of Stochastic Best Response Dynamics Algorithm
3.7 Extension to Wireless Networks with Packet Priorities
3.7.1 MAC for Wireless Networks with Packet Priorities
3.7.2 Structured Nash Equilibrium
3.7.3 Learning Best Responses via Stochastic Approximation
3.8 Numerical Studies
3.9 Conclusions
3.10 Analytical Proofs
3.10.1 Proof of Theorem 3.2
3.10.2 Proof of Theorem 3.3
3.10.3 Proof of Theorem 3.4
3.10.4 Proof of Lemma 3.1
3.10.5 Proof of Lemma 3.3
3.10.6 Proof of Corollary 3.4
4 Finite Horizon Transmission Scheduling
4.1 Introduction
4.1.1 Main Results
4.1.2 Related Work
4.2 Finite Horizon Transmission Scheduling Problem
4.2.1 Packet Transmission in Correlated Fading Channels
4.2.2 MDP Formulation
4.3 Optimality of Threshold Transmission Policies
4.3.1 Finite Horizon Dynamic Programming
4.3.2 Threshold Structure of the Optimal Transmission Policies
4.3.3 Implementation Issues and Discussion
4.4 Numerical Studies
4.5 Conclusions
4.6 Analytical Proofs
4.6.1 Proof of Lemma 4.1
4.6.2 Proof of Theorem 4.1
4.6.3 Proof of Theorem 4.2
4.6.4 Proof of Lemma 4.2
4.6.5 Proof of Lemma 4.3
5 Constrained Infinite Horizon Transmission Scheduling
5.1 Introduction
5.1.1 Main Results
5.1.2 Related Work
5.2 Infinite Horizon Transmission Scheduling in Correlated Fading Channels
5.2.1 The Transmission Scheduling Problem
5.2.2 CMDP Formulation
5.3 CMDP Parameterization using Lagrange Multiplier Method
5.3.1 Buffer Stability and Recurrence of Markov Chains
5.3.2 Unconstrained Transmission Scheduling with Lagrangian Cost
5.4 Structure of Optimal Transmission Scheduling Policy
5.4.1 Threshold Structure of Discounted Cost Optimal Policy
5.4.2 Threshold Structure of Unconstrained Average Cost Optimal Policy
5.4.3 Monotonicity of Constrained Average Cost Optimal Policy
5.5 Policy Optimization Algorithms
5.5.1 Policy Optimization via Q-learning
5.5.2 Policy Optimization via Stochastic Approximation
5.6 Numerical Examples
5.6.1 Energy Efficiency Improvement
5.6.2 Optimal Policy Estimation via Lagrange Q-learning
5.6.3 Performance of the Gradient-based Learning Algorithm
5.7 Extension to the Multiple-action Case
5.8 Conclusions
5.9 Analytical Proofs
5.9.1 Proof of Lemma 5.2
5.9.2 Proof of Theorem 5.1
5.9.3 Proof of Corollary 5.2
6 Discussion and Conclusions
6.1 Overview of Thesis
6.2 Discussion on Results and Algorithms
6.3 Summation
6.4 Future Research
Bibliography
List of Figures
1.1 Outline of research project
2.1 Structure of the optimal channel-aware transmission policy
2.2 Performance gains of the optimal channel-aware MAC
2.3 Convergence of estimates by Algorithm 2.1
3.1 An example of a channel-aware transmission policy
3.2 Threshold structure of the optimal transmission policy
3.3 Example of a transmission policy for networks with packet priority
3.4 The optimal transmission policy is threshold for every packet priority
3.5 The unique Nash equilibrium for a 3-player game
3.6 Convergence and tracking capability of the SA Nash learning algorithm
3.7 Emergence of a Nash equilibrium
3.8 Comparison of system performance I: system throughput
3.9 Comparison of system performance II: successful transmission rates
3.10 Price of anarchy in system throughput
3.11 Price of anarchy in successful transmission rate
3.12 Equilibrium transmission thresholds vs network size
3.13 Example of a Nash Eq. transmission policy
3.14 Adaptive estimation of the Nash Eq. transmission threshold vector
4.1 Two-dimensional threshold structure of the optimal transmission policy
4.2 Throughput-energy tradeoff of the opportunistic transmission policy
4.3 Threshold structure of the optimal transmission scheduling policy I
4.4 Threshold structure of the optimal transmission scheduling policy II
4.5 Structure of optimal transmission policies for non-convex penalty costs
4.6 An example with a 2-state Markov chain channel model
5.1 Randomized mixture structure of the constrained optimal policy
5.2 Diagram of the Lagrange Q-learning algorithm
5.3 Optimal policy parameterization for SPSA
5.4 Energy efficiency improvement by opportunistic transmission
5.5 Threshold unconstrained optimal policy I
5.6 Estimate of unconstrained optimal policy by monotone policy Q-learning
5.7 Estimate of unconstrained optimal policy by standard Q-learning I
5.8 Threshold unconstrained optimal policy II
5.9 Adaptivity of monotone-policy Q-learning
5.10 Non-adaptivity of standard Q-learning
5.11 Convergence of monotone-policy Q-learning in terms of costs
5.12 Convergence of Lagrange Q-learning
5.13 Estimates of the constrained optimal policy by SPSA
5.14 Convergence of the SPSA algorithm
5.15 Structure of the optimal policy for power/rate control problems
List of Abbreviations
ACK       Acknowledgement
a.e.      Almost everywhere
ARQ       Automatic repeat request
CSI       Channel state information
CDMA      Code division multiple access
CMDP      Constrained Markov decision process
CSMA      Carrier sense multiple access
CSMA-CD   Carrier sense multiple access with collision detection
DS-CDMA   Direct sequence code division multiple access
G-MPR     Generalized multipacket reception
FSMC      Finite state Markov chain
i.i.d.    Independent and identically distributed
ODE       Ordinary differential equation
MAC       Medium access control
MAI       Multiple access interference
MDP       Markov decision process
MPR       Multipacket reception
NACK      Negative acknowledgment
PHY       Physical
POMDP     Partially observable Markov decision process
QCIF      Quarter common intermediate format
QoS       Quality of Service
SINR      Signal to interference noise ratio
SNR       Signal to noise ratio
SPSA      Simultaneous perturbation stochastic approximation
TDMA      Time division multiple access
WLAN      Wireless local area network
Preface
The thesis consists of 6 chapters.
• Chapter 1 is the introduction of the thesis. The chapter introduces the topic of cross-layer transmission scheduling in wireless networks, and outlines the thesis structure and the specific problems tackled. The chapter also contains a brief introduction to the main mathematical tools that are used in the thesis.
• Chapter 2 considers optimizing a distributed channel-aware medium access control (MAC) protocol to maximize the system stable throughput. Structural results and numerical algorithms are derived. Simulations highlight the performance gain obtained by exploiting channel state information for optimal transmission scheduling.
• Chapter 3 uses a game theoretic approach to analyse the distributed MAC protocol. Threshold structural results are proved. A gradient-based stochastic approximation algorithm with tracking capability is proposed for learning the optimal transmission policy online.
• Chapter 4 considers the finite horizon opportunistic transmission scheduling problem using a Markov decision process (MDP) approach. Due to two threshold structural results, dynamic programming is sufficiently efficient to be implemented in a real-time algorithm.
• Chapter 5 formulates the infinite horizon opportunistic transmission scheduling problem as a constrained MDP. Structural results and online learning algorithms are derived.
• Discussion of results, proposals of directions for future research and conclusions are presented in Chapter 6.
Acknowledgements
I would like to thank Dr. Vikram Krishnamurthy for his tireless efforts to provide me with the most dedicated supervision. I am indebted to Dr. Krishnamurthy for teaching me technical writing and introducing me to several exciting research areas. I am also very grateful for his encouragement, generosity with time and valuable guidance throughout my doctorate program.
I am grateful to Dr. Lang Tong at Cornell University for his collaboration, as well as the discussions that helped define some research directions for my degree. I would also like to express my sincere thanks to Dr. Fei Yu for his help and guidance during the early stage of my PhD program.
This research has been supported in part by an NSERC strategic grant and a Killam Predoctoral Scholarship, for which I am grateful.
My warmest thanks go to my colleagues, whom I will miss when I no longer come to the lab every morning. They are the nicest friends. Together we had a lot of happy times, and I am forever thankful for their sharing and caring, for the laughter, and for the friendship. In particular, Alex Wang, Lax Pillutla, Dr. Teerawat Issariyakul, Farnoush Mirmoeini and Irtiza Zaidi have been extremely kind, and Arsalan Farrokh has been very humorous. I would also like to thank my other friends at UBC: Trung Thanh Nguyen and Cuong Le have been very generous and supportive.
My heartfelt thanks go to Alex Wang, with whom I have shared many of my dreams. I thank him for his support and the happy smiles. His excellent proofreading skills have also helped me complete this thesis.
My deepest thoughts are, as always, with my parents and my big brother. All my life, they have always been there to love, to take care, to teach and to support. I thank my father and my mother for teaching me to be a good person, and for letting me know that their faith in me would never fade.
Dedication
To my father, who is my first teacher in maths, languages, and life.
To my mother, who loves me the most.
To my brother, who is my childhood role model, and the greatest friend.
Chapter 1
Introduction
Optimization of wireless networks is a vast area of research, in which cross-layer optimization has emerged as a promising new approach. Telecommunication systems such as cellular networks, the Internet and wireless local area networks (WLANs) have traditionally been designed using models based on a layered architecture. A layered architecture allows conceptual layers to be designed independently and hence simplifies the design, implementation and deployment of communication networks. For wireless networks, however, there are other challenges which create dependencies across layers and make the traditional modular design approach inefficient. For example, the wireless channel quality is typically time-varying due to fading and mobility. This time-varying nature of the wireless channel immediately motivates opportunistic scheduling or resource allocation schemes, which exploit channel conditions for transmission or power control decisions.
Another motivation for cross-layer optimization and design is the fact that the wireless channel is inherently a shared medium. This creates interdependencies between users, and also between the physical (PHY) layer and higher layers, such as the medium access control (MAC) layer [1, 2]. Additionally, the wireless medium offers some new modalities of communication that layered architectures do not accommodate [3]. For example, the physical layer of a wireless network may be capable of receiving multiple packets at the same time [4], or of partial collision resolution. In such scenarios, the traditional approach of designing MAC protocols independently of the PHY layer will not be adequate.
An additional difference between wireless and wireline networks that motivates cross-layer approaches is that the broadcast nature of the wireless channel allows users to collaborate [3]. Implementing user collaboration will be useful, for example, for designing multiple access schemes, for sensor sleep scheduling, or for retrieving data from correlated data sources.
Overall, it is apparent that cross-layer optimization can achieve performance gains for wireless networks. Recent wireless standards have been designed with cross-layer issues in mind. For example, CDMA2000 high data rate systems deploy a built-in mechanism where the channel state is periodically measured by the user and fed back to the base station [1]. This channel-state measurement mechanism is useful for dynamic scheduling where some form of priority is given to users with better instantaneous channel states.
The work in this thesis assumes an underlying channel-state measurement mechanism and revolves around transmission scheduling algorithms that adjust according to the channel states of users. More precisely, the topic is PHY-MAC cross-layer optimal multiuser and single-user transmission scheduling. In most cases, the resulting optimal transmission scheduling algorithms are distributed, and can adaptively track slowly time-varying systems.
The core of the optimization problems is channel-aware transmission scheduling, where the information at the physical layer is exploited at the MAC layer to optimize certain performance metrics, either with or without constraints, in both multiuser and single-user contexts. A merit of the thesis is the diversity and complementarity of the methodologies deployed to solve the transmission scheduling problems.
In the multiuser context, distributed channel-aware medium access control (or multiuser transmission scheduling) is analysed using a centralized design approach and a non-cooperative game theoretic approach, in order to mediate multiuser interference and enhance channel utilization. In the single-user context, opportunistic transmission scheduling is optimized by a finite horizon MDP approach and a constrained infinite horizon MDP approach in order to balance energy efficiency and throughput.
The primary theoretical contributions of the thesis include important analytical results on the structure of the optimal scheduling policies. The analytical results are original, and in most cases state that the optimal transmission scheduling policy has a threshold structure (see Chapters 3, 4, 5). Threshold structural results are of utmost significance in optimization problems as they drastically reduce the dimensionality of the optimization. For example, in the game theoretic MAC problem in Chapter 3, due to the result that the optimal transmission policy is threshold in the channel state, the transmission policy optimization problem is converted from a function optimization problem to a scalar optimization problem. This conversion allows for the use of many gradient-based stochastic approximation algorithms (see [5]) to solve the optimal policy estimation problem.
Other contributions of the thesis are the derivation of optimal policy learning algorithms, which make the results in the thesis complete. Due to the structural results, the estimation algorithms are very efficient. Except for an off-line algorithm in Chapter 2, all derived algorithms can track slowly time-varying systems.
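To illustrate how a threshold structure reduces policy optimization to a scalar problem, the following minimal Python sketch learns a single transmission threshold with a finite-difference stochastic approximation iteration. It is not the algorithm derived in the thesis: the channel model (i.i.d. channel state uniform on [0, 1], success probability equal to the channel state), the per-transmission cost, the step sizes and all function names are assumptions made purely for illustration.

```python
import random


def sample_reward(threshold, channel, success_draw, cost):
    """One-slot reward of a threshold policy under the assumed toy model."""
    if channel <= threshold:
        return 0.0  # channel not good enough: stay silent
    # Assumption: a transmission succeeds with probability equal to the channel state.
    success = 1.0 if success_draw < channel else 0.0
    return success - cost


def learn_threshold(cost=0.3, steps=20000, delta=0.05, seed=0):
    """Estimate the reward-maximizing threshold from noisy one-slot rewards."""
    rng = random.Random(seed)
    theta = 0.9  # arbitrary initial threshold
    for n in range(steps):
        channel = rng.random()       # assumed i.i.d. uniform channel state
        success_draw = rng.random()  # shared randomness for both perturbed policies
        # Two-sided finite-difference estimate of the reward gradient in theta.
        grad = (sample_reward(theta + delta, channel, success_draw, cost)
                - sample_reward(theta - delta, channel, success_draw, cost)) / (2 * delta)
        theta += grad / (n + 20)     # decreasing step size (gradient ascent)
        theta = min(1 - delta, max(delta, theta))  # keep the perturbations inside [0, 1]
    return theta


if __name__ == "__main__":
    print(learn_threshold())
```

Under these assumed dynamics the expected reward is maximized when the threshold equals the transmission cost, so the returned estimate should settle near `cost`. A decreasing step size is used here for convergence; a small constant step size would be the natural choice for tracking a slowly time-varying system.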
The real-time learning algorithms are based on stochastic approximation and do not require a central controller, communication or exchange of information between users, or knowledge of system parameters such as the number of active users or the channel state probability distribution functions.
In the cross-layer channel-aware MAC problems, the wireless network model considered is the random access network, where users communicate with a common base station using non-orthogonal communication channels such as in CDMA (code division multiple access) or multiple antenna networks. In comparison, in the single-user context, opportunistic transmission scheduling is considered for point-to-point packet transmission in correlated fading channels. Intuitively, for multiuser transmission scheduling, channel utilization enhancement is due to both multiuser diversity and opportunistic scheduling. In contrast, for single-user opportunistic transmission scheduling, channel utilization is enhanced because the user takes the current and predicted values of its channel state into account in its transmission decisions.
The applications of the distributed channel-aware MAC algorithms in this thesis are networks of primitive or sophisticated nodes, such as wireless local area networks (WLANs), sensor networks or cellular networks. In comparison, the single-user opportunistic transmission scheduling solutions are most useful for networks where both energy and delay are critical issues, such as in multimedia communication or mission-critical sensor networks.
The derivation of structural results and estimation algorithms in this thesis involves novel use of several powerful mathematical tools, such as game theory, dynamic programming and stochastic approximation, and the application of a number of mathematical concepts such as the equilibrium concept in games, the principle of optimality in Markov decision processes (MDPs), and supermodularity and complementarity [6].
The mathematical tools will be outlined later in the chapter.
The remainder of this introductory chapter is organized as follows. The next section contains a brief thesis overview. Then Section 1.2 outlines the state of the art of two cross-layer optimization areas, which are the topics of the studies in the thesis: cross-layer medium access control and opportunistic transmission scheduling. An introduction to the major mathematical tools utilized in the thesis, with emphasis on their applications in electrical and communication engineering, is given in Section 1.3. Section 1.4 is a summary of the contributions of the thesis. Lastly, Section 1.5 outlines the thesis organization.
[Figure 1.1: Outline of research project. Panels: Channel-aware Transmission Scheduling in Wireless Networks; Cross-layer Distributed Medium Access Control; Transmission Scheduling in Correlated Fading Channels; Analytical Tools (stochastic optimization, game theory, dynamic programming, Lagrange multiplier method for constrained optimization); Structural Results.]
1.1 Thesis Overview
The plan of the thesis is outlined in Fig. 1.1. The thesis consists of two main parts.
The first part, consisting of Chapter 2 and Chapter 3, is concerned with optimal distributed channel-aware MAC for multiuser random access networks with the multipacket reception (MPR) capability, e.g., CDMA networks. The multipacket reception capability will be defined later in the chapter. The second part of the thesis, which comprises Chapter 4 and Chapter 5, focuses on opportunistic transmission scheduling in correlated fading channels in the single-user context.
In the distributed channel-aware MAC problems for random access networks, it is assumed that each user accesses the common channel with its own channel-aware transmission policy. The objective is then to analyse the transmission policies that optimize some utility function, e.g., the system throughput. The optimization approaches taken are a centralized design approach and a game theoretic approach. In the centralized design approach, the objective is to design transmission policies for users in the network to maximize the system stable throughput. Once the transmission policies are designed, users transmit according to their individual policies without a central controller and without cooperating with other users. In comparison, in the game theoretic approach, users are selfish, rational, and aim to maximize their own expected rewards. The game theoretic framework leads to a particularly useful threshold structural result and hence allows users to learn their optimal transmission policies online, in a completely decentralized fashion.
The distributed MAC protocol that is obtained from the centralized design approach is optimal in the system throughput sense. However, due to the complexity involved in solving the centralized optimization problem (even numerically), the resulting MAC protocol is more suitable for time-invariant networks. In comparison, the game theoretic formulation yields a very efficient way to optimize the MAC protocol.
In particular, in the game theoretic setting, every user optimizes its own transmission policy; then at equilibrium, no user will unilaterally deviate from its policy. Furthermore, it will be shown later in the thesis that the game theoretic MAC solution has a particularly useful threshold structural result, which substantially simplifies the process of learning the optimal transmission policy. Specifically, it will be shown that a rational user should transmit if and only if its channel state is better than a certain threshold, i.e., a threshold transmission policy is optimal. Therefore, to track the game theoretic optimal transmission policy, each user needs to adaptively estimate only a single parameter. Effectively, the function optimization problem is converted to a single parameter optimization problem, which can be solved easily and very efficiently by many stochastic approximation algorithms.
The only drawback of the game theoretic approach is that it is suboptimal in the system throughput sense. However, when the system performance is limited by multiuser interference, the game theoretic MAC protocol can achieve system throughputs that are nearly optimal, as users implicitly learn that being over-aggressive is counterproductive for their own performance. Taking into account the simplicity of the game theoretic model and of the implementation of game theoretic solutions, the game theoretic approach is practically very useful, especially for large scale wireless networks, where certain statistical properties (e.g., number of active users, channel probability distribution functions) may vary with time.
The second part of the thesis focuses on opportunistic transmission scheduling in correlated fading wireless channels using Markov Decision Process (MDP) approaches. Due to mobility, multipath propagation and shadowing, transmission errors are statistically correlated.
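The statistical correlation of transmission errors can be illustrated with a two-state Markov channel. The Python sketch below is a toy example only: the two states (good and bad), the transition probabilities and the function names are assumed for illustration and are not taken from the thesis.

```python
import random


def simulate_channel(steps, stay_good=0.9, stay_bad=0.8, seed=0):
    """Simulate an assumed two-state Markov channel; 0 = good slot, 1 = bad slot."""
    rng = random.Random(seed)
    state = 0
    trace = []
    for _ in range(steps):
        trace.append(state)
        stay = stay_good if state == 0 else stay_bad
        if rng.random() >= stay:  # leave the current state with probability 1 - stay
            state = 1 - state
    return trace


def bad_state_statistics(trace):
    """Return (fraction of bad slots, fraction of bad slots among slots that follow a bad slot)."""
    frac_bad = sum(trace) / len(trace)
    after_bad = [nxt for cur, nxt in zip(trace, trace[1:]) if cur == 1]
    cond_bad = sum(after_bad) / len(after_bad)
    return frac_bad, cond_bad


if __name__ == "__main__":
    frac_bad, cond_bad = bad_state_statistics(simulate_channel(50000))
    print(f"P(bad) ~ {frac_bad:.3f}, P(bad | previous slot bad) ~ {cond_bad:.3f}")
```

With the assumed transition probabilities, the long-run fraction of bad slots is about one third, while a bad slot is followed by another bad slot about 80% of the time. This gap is the channel memory that an opportunistic scheduler can exploit.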
In a discrete time system, a correlated fading channel can be modeled by a finite state Markov chain (FSMC) [7, 8]. We assume such channel models and analyse the structure of the optimal channel-aware transmission scheduling policy. Due to uncertainty in the success of every transmission over a wireless link, and the dynamics of the channel state, a transmission policy optimization problem translates to an MDP. Two MDP-based formulations, which are complementary to each other, are considered. The structure of the optimal transmission policies is studied in both cases. In particular, it is shown that the optimal policy is threshold in the buffer occupancy state (and also in the residual transmission time for the finite horizon MDP formulation). The threshold structural results make the estimation and implementation of the optimal policy very efficient. In particular, the average cost MDP approach leads to useful reinforcement learning and adaptive stochastic approximation algorithms that exploit the threshold structural result to estimate the optimal transmission policy in real time.
1.2 Cross-layer Optimization
The layering principle offers a convenient means to separate the designs of different conceptual network layers. Due to fundamental differences between wireless and wireline communication channels, as well as new advances in signal processing and communications, cross-layer optimization can potentially boost the system performance of wireless networks [9, 10]. An overview of cross-layer proposals and current research directions is given in [1-3]. In [9], the authors discussed different cross-layer approaches for multiaccess uplink models, as well as associated issues such as stability and the fundamental tradeoffs in cross-layer wireless resource allocation. In summary, the need for cross-layer design is due to the following:
• Wireless communication channels are prone to high error rates due to multipath fading and shadowing. In comparison, in a wireline network, packet errors are mainly due to congestion [3].
• The average channel state condition of a user depends on its location and the interference level [1].
• The wireless channel varies over time and space and has short-term memory due to multipath fading and mobility [1, 11].
• In a wireline network, the collision channel model typically holds and it is optimal to have exactly one user transmit at any given time. However, in a wireless network, packet collisions can be resolved at the PHY layer by means of signal separation via multiuser detection or antenna array processing [2]. In the literature, the capability of receiving multiple packets simultaneously is referred to as the Multipacket Reception (MPR) capability.
Due to the above issues, there is a strong motivation for incorporating the CSI at the physical layer into transmission scheduling or power/rate control decisions at the MAC layer to optimize the performance of a wireless network.
In [1], cross-layer design and optimization problems are divided into three categories. The first category contains cross-layer optimization problems concerning fundamental properties of multiuser diversity and fading channels. An example of a cross-layer optimization problem in this category is designing MAC protocols taking into account the reception capability of the PHY layer, such as in Chapter 2 and Chapter 3 of the thesis, or in [4, 10, 12]. Furthermore, this first category also includes much work on wireless resource allocation and opportunistic transmission/power control, for example [13-15]. The second category is for modeling network performance. Lastly, the cross-layer approach has been widely adopted for designing scheduling algorithms in the downlink of multi-carrier systems [1, 16].
As mentioned above, the first part of this thesis is on PHY-MAC cross-layer transmission scheduling in the multi-access context.
Therein, multiuser diversity is exploited, and as a result, the system capacity improves with the number of users. In the second part of the thesis, both the channel memory and channel state information are utilized for opportunistic transmission scheduling in the single user context. In the thesis, the primary focus is to derive structural results and learning algorithms for optimal PHY-MAC transmission scheduling. The fundamental issues in wireless resource allocation and cross-layer optimization, e.g., stability, energy/throughput tradeoffs, and the price of anarchy in game theoretic MAC, are also dealt with.

1.2.1 Cross-layer Medium Access Control

As mentioned above, the first part of the thesis is concerned with optimal distributed MAC from a cross-layer perspective. This section contains an overview of the current MAC standards, proposals and research directions in cross-layer MAC. MAC protocols define how multiple users can efficiently share a physical channel to communicate with a common base station. The distributed MAC protocols that are used in practice can be broadly divided into three classes [2]: fixed allocation MAC protocols such as TDMA (time division multiple access); random access protocols such as ALOHA or CSMA (carrier sense multiple access), which is widely used in IEEE standards for WLANs, e.g., IEEE 802.11b; and reservation based protocols. Traditionally, when a collision channel model is assumed, collision resolution is solely accomplished, if at all, by the MAC protocol, and the MAC layer can be designed separately from the PHY layer. The most important factor in designing a MAC protocol is then the statistical properties of the traffic. Tradeoffs between these MAC protocol classes are as follows. Fixed allocation schemes can guarantee some minimum performance but are inefficient at light system loads.
Random access protocols are much more efficient and yield smaller delays than fixed allocation schemes at light system loads. However, at heavy or even moderate traffic loads, random access protocols can lead to excessive delay and waste of bandwidth due to collisions. Reservation based schemes stand between the two extremes. Work on cross-layer MAC for the collision channel model, such as exploiting channel state information for multiuser diversity, includes [17-19] and [20], where a network-assisted selective retransmission mechanism is proposed to achieve diversity. In modern communication systems, due to new advances in signal processing, the channel capacity is increased, and the collision channel model does not necessarily hold. Instead, collision resolution can be partially accomplished at the physical layer due to multiuser detection and antenna array processing [2]. It is then vital to optimize the MAC layer protocols with respect to the reception capability of the PHY layer to achieve the new channel capacity. Tutorials on how advances in signal processing lead to the need to go beyond the collision channel model and adopt cross-layer approaches for designing the MAC layer are given in [12, 21]. Furthermore, a wide range of cross-layer MAC solutions have been discussed in [2, 9]. The first challenge for cross-layer approaches is a proper model that interfaces the PHY layer and the network layer [12, 22, 23]. In [24], the authors proposed a multipacket reception (MPR) model that offers the generalization that, when there are simultaneous transmissions, the reception of packets is described by conditional probabilities instead of deterministic failure. In [24], it was shown that under the MPR model it is possible to achieve a nonzero asymptotic system stable throughput. Furthermore, an optimal decentralized MAC protocol for the MPR model is given in [25].
Recent MAC protocol proposals and analyses that are based on the MPR model include [26], where the authors proposed a multiqueue service room MAC protocol; [27], where a scheme in which a central controller decides which users can access the channel, based on a polling algorithm, was proposed; [28], where the network-assisted protocol in [20] is generalized for the MPR model; and [4], where a generalization of the MPR model to the asymmetrical network model and a stability analysis are discussed. The cross-layer MAC proposals mentioned above have considered the MPR capability of the PHY layer but not the channel state information. However, channel fluctuations at the PHY layer provide valuable information for random access [9, 12]. One of the early papers that proposed exploiting CSI for multiuser diversity in cross-layer approaches is [29], where the authors utilized CSI for power control to maximize the sum-rate in the uplink of a multiuser network. In [18, 19], it has been shown that exploiting CSI can improve the system performance. The channel model considered in [18, 19] was the collision channel model. In [12], the authors discussed channel-aware ALOHA for a modified version of the MPR model, which allows for asymmetric users. In [10] the authors proposed a generalization of the MPR model that explicitly incorporates CSI into the reception model and formulated the problem of optimizing a channel-aware ALOHA protocol for random access. The reception model in [10] is a suitable framework for exploiting CSI at the physical layer for random access at the MAC layer, and is adopted in the cross-layer MAC work in this thesis.

1.2.2 Opportunistic Transmission Scheduling

Cross-layer medium access control is opportunistic multiuser scheduling, which achieves multiuser diversity gains because, when users experiencing good channels are given priority to transmit, the system is more likely to operate at its peak performance rather than its average performance [30].
Here opportunistic single-user transmission scheduling, which is related to the second part of the thesis, is discussed. Single-user opportunistic scheduling means that the channel conditions are exploited to achieve higher performance. In wireless communications, the channel state of a user fluctuates with time. Furthermore, due to multipath propagation and shadowing, the fading process of a single-user channel typically has memory. An immediate cross-layer optimization topic is then exploiting the channel state information and possibly the channel memory for opportunistic transmission. Intuitively, it seems reasonable that the optimal transmission schemes result in transmitting more in the good channel states [9, 31]. However, when the channel memory is incorporated into the optimization, for example, by a Markov Decision Process (MDP) approach, it is unclear whether this result will hold. Furthermore, when a constraint, e.g., a delay constraint, is taken into account, it may not be feasible to wait for the good channel states to transmit. Opportunistic transmission scheduling (or power and rate control) typically deals with one or several of these issues. For example, [32] addressed the problem of opportunistic file transfer over a fading channel subject to energy and delay constraints. Therein it is observed that exploiting CSI can improve the system performance. Other papers that also consider channel-aware transmission/rate/power control include [33], where opportunistic transmission scheduling policies are analysed under a partially observable MDP framework, and [34], where the problem of minimizing the energy consumption subject to an average delay constraint in the Gilbert-Elliot channel model is considered. In the second part of the thesis, MDP approaches are deployed to tackle problems of opportunistic transmission scheduling in correlated fading channels, subject to delay constraints.
The specific context and overview of related work will be given therein.

1.2.3 Multipacket Reception Wireless Networks

The topic of the multipacket reception capability in wireless networks is particularly relevant to the distributed MAC protocol optimization problems considered in Chapters 2 and 3 of the thesis. Traditionally, the design of MAC protocols has relied on the collision channel model, which assumes that all packets are lost if two or more users transmit at the same time. The collision channel model typically holds with few exceptions (e.g., the capture effect) for wireline networks. However, for many modern wireless networks, the collision channel model is not correct [4, 10, 21]. In particular, multiple access techniques such as CDMA, as well as multiple antenna communications with space-time coding techniques, allow for soft collision. That is, some packets can be received error-free even in the case of simultaneous transmissions. In addition, there are error control coding techniques and signal processing advances that reinforce the multipacket reception capability in wireless networks [10, 21]. As the multipacket reception capability is easily achievable and can increase the channel capacity, it is important to optimize random access protocols to incorporate the multipacket reception capability. The first mathematical model for the multipacket reception capability at the PHY layer was proposed in [24]. In particular, the authors proposed a reception model, namely the multipacket reception (MPR) model, where the number of received packets depends (probabilistically) on the total number of transmitting users. The MPR model in [24] is novel in the sense that it is much more general than the collision channel model and allows packets to be correctly received with some probability in the presence of simultaneous transmissions.
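The difference between the collision channel and an MPR channel can be illustrated with a small numerical sketch. It is not taken from [24]: the reception matrix values, the number of users, the transmit probability and the function names are all illustrative assumptions. Entry R[n][k] is the probability that k packets are received when n users transmit, and every user is assumed to transmit independently with the same probability p in each slot.

```python
from math import comb

def expected_throughput(R, num_users, p):
    """Expected number of packets received per slot when each of
    num_users transmits independently with probability p.
    R[n][k] = P(k packets received | n users transmit)."""
    total = 0.0
    for n in range(num_users + 1):
        # Probability that exactly n users transmit in the slot.
        prob_n = comb(num_users, n) * p**n * (1 - p)**(num_users - n)
        # Mean number of packets received given n transmissions.
        mean_received = sum(k * R[n][k] for k in range(n + 1))
        total += prob_n * mean_received
    return total

def collision_matrix(num_users):
    """Collision channel as a special case: success only when n == 1."""
    R = [[0.0] * (num_users + 1) for _ in range(num_users + 1)]
    for n in range(num_users + 1):
        R[n][0] = 1.0
    R[1][0], R[1][1] = 0.0, 1.0
    return R

N = 3
collision = collision_matrix(N)
# Hypothetical MPR matrix (illustrative numbers only); each row sums to 1.
mpr = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.1, 0.3, 0.6, 0.0],
    [0.4, 0.3, 0.2, 0.1],
]
p = 0.5
t_collision = expected_throughput(collision, N, p)
t_mpr = expected_throughput(mpr, N, p)
print(t_collision, t_mpr)
```

Under these assumed numbers the MPR channel delivers more packets per slot than the collision channel at the same transmit probability, because slots with two or three simultaneous transmissions still contribute successful receptions.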
The MPR model in [24] provides a framework for designing new medium access protocols that can achieve a significantly better system throughput than the collision channel model. It was shown in [24, 25] that, assuming the proposed MPR model, it is possible to achieve a non-zero asymptotic system throughput. An optimal decentralized MAC algorithm for the MPR model was also proposed in [25]. The MAC algorithm in [25] requires global knowledge of the number of active users in the network. The above MPR model has some limitations. First, the MPR model is a coarse abstraction of the PHY layer, where channel states are not explicitly incorporated into the MPR model. In practice, channel states directly affect error rates in wireless communications. Second, the MPR model in [24] assumes that all users are symmetric, which typically is not the case for a wireless network. A generalization of MPR to the asymmetrical network model is given in [4]. In [10], the authors generalized the MPR model in [24] to allow for the case when the transmission success probabilities depend on the channel states of all transmitting users. In the thesis, this reception model is referred to as the Generalized Multipacket Reception (G-MPR) model. The G-MPR model proposed in [10] offers a suitable framework for exploiting CSI at the physical layer for random access at the MAC layer and has been adopted in [10, 35-38].

1.3 Analytical Tools

The thesis deploys several powerful mathematical tools that come from three research areas that have had many applications in electrical engineering. In particular, we use game theory, which is a universally accepted tool in economics and social sciences and has been widely used in communications and networking [39]; Markov decision processes, which have many applications in operations research, management science, and engineering; and stochastic approximation, which is a fundamental concept in stochastic control and optimization.
This section briefly overviews these three areas.

1.3.1 Game Theory for Communications

Game theory is a branch of applied mathematics that provides models and tools for analysing situations where multiple rational agents interact, either cooperatively or non-cooperatively, to achieve their goals. The classic book Theory of Games and Economic Behavior by John von Neumann and Oskar Morgenstern, published in 1944 [40], laid the first foundations for the research field of game theory. The research field now has numerous applications in economics, social sciences, engineering and, recently, computer science. A major breakthrough in game theory was due to John Nash when he introduced the solution concept of Nash equilibrium in the early 1950s (footnote 1). A strategy combination constitutes a Nash equilibrium if each agent's strategy is optimal against other agents' strategies [41]. As a result, at a Nash equilibrium agents have no motivation to unilaterally deviate from their strategies. Nash equilibrium plays an essential role in the analysis of conflict and cooperation in economics and social sciences. Besides Nash equilibrium, another outstanding solution concept in game theory is that of correlated equilibrium, which was first introduced in [42] (footnote 2). The correlated equilibrium is a generalization of the Nash equilibrium and requires coordination among players. The correlated equilibrium can improve the system performance, depending on the coordination mechanism. In this thesis, we focus on Nash equilibrium, whereas the correlated equilibrium for game theoretic transmission scheduling will be one of our future research topics.

Footnote 1: Dr. John Nash received the 1994 Nobel prize in economics, shared with R. Selten and J. Harsanyi.
Footnote 2: R. Aumann received the 2005 Nobel prize in economics for his contributions to game theory, together with Thomas Schelling.

Recent textbooks on game theory include [41, 43, 44].
The treatment of game theory in [43] is very comprehensive and complete. Additionally, [41] and [44] are two very well-written graduate level textbooks in game theory. In communications engineering, game theory is often used for distributed resource management algorithms. The underlying motivation is that game theoretic solutions are naturally autonomous and robust. Examples of applications of game theory in wireless communications include transmission or power control [45-49], pricing [50], flow and congestion control [51], and load balancing [52]. Game theory consists of two main branches, namely non-cooperative and cooperative game theory. In Chapter 3 of this thesis, non-cooperative game theory is used to analyse the structure of the channel-aware distributed medium access control that constitutes a Nash equilibrium. Work related to the work in Chapter 3 includes papers on game theoretic transmission and power control that have been mentioned above: [49, 53] are on game theoretic power control, and [46-48] focus on game theoretic analysis of the ALOHA random access protocol in the collision channel model. While game theory and the equilibrium solution concepts are extremely useful and have numerous applications, the theory of learning in games is still less complete, in the sense that convergence results are hard to establish and may require conditions that are either restrictive or hard to verify. [54] contains essential existing results in the field of learning in games, together with original work by the authors. The most studied methods of learning in games are fictitious play and best response dynamics [54]. These two methods have similar asymptotic behaviors, and convergence to a Nash equilibrium, in general, cannot be proved for either of the methods. The exceptions are potential games, where users have a common utility function; then both fictitious play and best response dynamics converge to a Nash equilibrium.
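As a rough illustration of best response dynamics in the common-utility case just described, the following sketch alternates best responses between two players who share one utility table. The payoff values, the function names and the alternating update order are assumptions made for illustration only; they are not the game analysed in Chapter 3.

```python
def best_response_dynamics(U, start, max_rounds=100):
    """Alternating best responses in a two-player game in which both
    players receive the common utility U[a1][a2]. Returns the action
    profile at which neither player changes its action."""
    a1, a2 = start
    n1, n2 = len(U), len(U[0])
    for _ in range(max_rounds):
        # Player 1 responds to a2, then player 2 responds to the new a1.
        new_a1 = max(range(n1), key=lambda a: U[a][a2])
        new_a2 = max(range(n2), key=lambda a: U[new_a1][a])
        if (new_a1, new_a2) == (a1, a2):
            return a1, a2
        a1, a2 = new_a1, new_a2
    raise RuntimeError("no convergence within max_rounds")

def is_nash(U, profile):
    """Check that no player gains by a unilateral deviation."""
    a1, a2 = profile
    return (all(U[a1][a2] >= U[b][a2] for b in range(len(U)))
            and all(U[a1][a2] >= U[a1][b] for b in range(len(U[0]))))

# Hypothetical common utility table (illustrative values only).
U = [[3, 0, 1],
     [0, 2, 0],
     [1, 0, 1]]
eq_a = best_response_dynamics(U, start=(1, 1))
eq_b = best_response_dynamics(U, start=(2, 0))
print(eq_a, is_nash(U, eq_a))
print(eq_b, is_nash(U, eq_b))
```

In this toy table the dynamics stop at different equilibria depending on the starting profile, which reflects the fact that convergence to a Nash equilibrium does not by itself select the equilibrium with the highest utility.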
Emergence of a Nash equilibrium can also be proved for some 2-person games. Other methods of learning in games include regret-minimizing algorithms. It has been shown in the literature that if a game is repeated infinitely many times such that every player plays according to a regret-minimizing strategy, then the empirical frequencies of play converge to a set of correlated equilibria [55, 56]. In the game theoretic analysis of the distributed channel-aware MAC protocol in this thesis, the structure of transmission policies that constitute the Nash equilibrium is analysed. In particular, it is proved that there exists a Nash equilibrium at which every player adopts a transmission policy that is threshold in the channel state. The structural result is useful as it substantially simplifies the problem of estimating the best-response policy. An adaptive stochastic approximation algorithm is then derived to allow users to learn their part of the Nash equilibrium, independently of other users.

1.3.2 Markov Decision Processes

A Markov Decision Process (MDP) is a discrete time stochastic control process, where the states possess the Markov property, which is that the probability distribution of the future states, given the current and past states, depends only on the current state and not on any past states. The classic books [57, 58] provide foundations for the research field. A recent textbook that provides an excellent exposition of MDPs is [59]. The MDP framework is abstract and very flexible [60], which allows it to be applied to many different problems. An MDP is defined by a set of states $\mathcal{S}$, an action set $A_s$ for each state $s \in \mathcal{S}$, a state transition function for every state-action pair, and a reward function, which maps the current state, the action and possibly the next state to a reward. The outcome of an MDP is partly random due to the uncontrollable state transition functions, and partly under the control of the decision maker.
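The ingredients of an MDP listed above (states, state-dependent action sets, a transition function and a reward function) can be written down directly in code. The sketch below is a toy example under stated assumptions: the state is the number of packets in a buffer, the success probability, costs and horizon are hypothetical numbers, and the finite horizon problem is solved by standard backward induction. It is not the formulation of Chapter 4, which also includes the channel state.

```python
def backward_induction(states, actions, transition, reward, terminal, horizon):
    """Finite-horizon dynamic programming.
    actions(s)       -> actions available in state s
    transition(s, a) -> list of (next_state, probability) pairs
    reward(s, a)     -> immediate reward
    terminal(s)      -> reward collected at the end of the horizon
    Returns (values, policy) with values[t][s] and policy[t][s]."""
    values = [dict() for _ in range(horizon + 1)]
    policy = [dict() for _ in range(horizon)]
    for s in states:
        values[horizon][s] = terminal(s)
    for t in range(horizon - 1, -1, -1):
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in actions(s):
                q = reward(s, a) + sum(
                    prob * values[t + 1][s2] for s2, prob in transition(s, a))
                if q > best_q:
                    best_a, best_q = a, q
            values[t][s] = best_q
            policy[t][s] = best_a
    return values, policy

# Toy buffer example; every number below is a hypothetical choice.
SUCCESS = 0.8          # probability that a transmission succeeds
states = [0, 1, 2]     # packets waiting in the buffer

def actions(s):
    return [0] if s == 0 else [0, 1]   # 0 = wait, 1 = transmit

def transition(s, a):
    if a == 0:
        return [(s, 1.0)]
    return [(s - 1, SUCCESS), (s, 1 - SUCCESS)]

def reward(s, a):
    return -1.0 * a                    # unit cost per transmission attempt

def terminal(s):
    return -3.0 * s                    # penalty per untransmitted packet

values, policy = backward_induction(
    states, actions, transition, reward, terminal, horizon=2)
print(values[0], policy[0])
```

Rewards are expressed as negative costs so that the same routine can be reused for cost minimization problems.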
MDPs provide a framework for modeling and analysing optimization problems in many areas such as management science, operations research and engineering. In communications, MDPs have been extensively used for modeling and analysing queueing problems [61], network performance, resource allocation (e.g., admission, transmission or power control) [62-65] and many other problems. Solutions of MDPs are characterized by the well-known principle of optimality, which is also known as Bellman's equation of optimality. The principle of optimality allows MDPs to be solved via dynamic programming, which restates the optimization problem in a recursive form. Dynamic programming constitutes the best known method for solving MDPs. However, dynamic programming assumes knowledge of state transition probabilities [59] and is affected by the "curse of dimensionality". These two issues can be resolved by simulation-based methods such as Monte-Carlo methods, temporal-difference learning (e.g., Q-learning) [66-68], or approximation and suboptimal control [67, 68]. Furthermore, MDPs can also be solved by gradient based methods; see [69-72] for example. Lastly, as wireless network optimization typically involves tradeoffs, for example between energy and delay, a type of MDP that is particularly relevant is the constrained MDP, which is discussed in [73]. In many cases, solutions of constrained MDPs can be found via constrained linear programming [59, 73], or a combination of the Lagrangian multiplier method and other classical MDP methods [73-76]. In the second part of this thesis, Markov Decision Process approaches are used to optimize transmission scheduling in correlated fading channels assuming perfect knowledge of the instantaneous channel states. As the correlated fading channel is modeled by an FSMC, the transmission scheduling optimization problems are doubly stochastic.
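To give a feel for a correlated fading channel modeled by an FSMC, the following sketch simulates a two-state (good/bad) chain of the Gilbert-Elliot type and compares the empirical fraction of good slots with the stationary probability. The transition probabilities, the seed and the function names are illustrative assumptions, not parameters used in the thesis.

```python
import random

def stationary_two_state(p_gb, p_bg):
    """Stationary distribution (good, bad) of a two-state chain with
    P(good -> bad) = p_gb and P(bad -> good) = p_bg."""
    total = p_gb + p_bg
    return p_bg / total, p_gb / total

def simulate_fsmc(P, start, num_slots, rng):
    """Sample a state trajectory of length num_slots from the
    transition matrix P, whose rows sum to 1."""
    state, path = start, [start]
    for _ in range(num_slots - 1):
        state = rng.choices(range(len(P)), weights=P[state])[0]
        path.append(state)
    return path

# Hypothetical channel: state 0 = good, state 1 = bad.
P_GB, P_BG = 0.1, 0.3
P = [[1 - P_GB, P_GB],
     [P_BG, 1 - P_BG]]
rng = random.Random(0)
path = simulate_fsmc(P, start=0, num_slots=10000, rng=rng)
pi_good, pi_bad = stationary_two_state(P_GB, P_BG)
frac_good = path.count(0) / len(path)
print(pi_good, frac_good)
```

Because the chain tends to remain in its current state, the present channel state carries information about the next one; this is the channel memory that the MDP formulations exploit.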
In Chapter 4 a finite horizon MDP formulation with a terminal penalty cost (on the number of untransmitted packets) is considered. The finite horizon corresponds to a hard delay constraint: every batch of packets must be transmitted within a certain number of time slots. It is proved for the finite horizon transmission scheduling problem that the optimal transmission policy is threshold in the buffer occupancy state and the residual transmission time. The two threshold results simplify the dynamic programming algorithm. In comparison, in Chapter 5, the optimal transmission scheduling problem is formulated as an infinite horizon, average cost, countable state MDP, subject to a constraint on an average (delay) cost. We prove that the constrained optimal transmission scheduling policy is a randomized mixture between two deterministic threshold (in the buffer occupancy state) policies. The structural result is exploited to derive a gradient-based stochastic approximation algorithm and a Q-learning algorithm to numerically solve the constrained MDP in real time. Due to the structure of the optimal policy, the implementation of the algorithm is very efficient.

1.3.3 Stochastic Approximation Algorithms

Stochastic approximation/optimization algorithms are widely used in electrical engineering to recursively estimate the optimum of a function or its root, see for example [5, 77, 78] for excellent expositions of this area. The first papers on the stochastic approximation methods are those by Robbins and Monro [79] and Kiefer and Wolfowitz [80] in the early 1950's. The well known least mean squares (LMS) adaptive filtering algorithm is a simple example of a stochastic approximation algorithm with a quadratic objective function.
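A minimal sketch of a Robbins-Monro type recursion may help fix ideas. The example below seeks the root of g(x) = x - TARGET from noisy observations only, which corresponds to minimizing a quadratic objective; the target value, noise level, seed and function names are assumptions chosen for illustration. With the decreasing step size 1/(n+1) the iterate coincides with the running sample mean of the observations.

```python
import random

def robbins_monro(noisy_g, x0, num_iters, step):
    """Iterate x_{n+1} = x_n - step(n) * noisy_g(x_n), seeking a root
    of the mean of noisy_g using noisy evaluations only."""
    x = x0
    for n in range(num_iters):
        x = x - step(n) * noisy_g(x)
    return x

rng = random.Random(1)
TARGET = 2.0      # hypothetical root, unknown to the algorithm
samples = []      # noisy observations, stored only for comparison

def noisy_g(x):
    """Noisy observation of g(x) = x - TARGET."""
    y = TARGET + rng.gauss(0.0, 1.0)
    samples.append(y)
    return x - y

x_hat = robbins_monro(noisy_g, x0=0.0, num_iters=5000,
                      step=lambda n: 1.0 / (n + 1))
sample_mean = sum(samples) / len(samples)
print(x_hat, sample_mean)
```

Replacing the decreasing step size by a small constant, e.g. `step=lambda n: 0.01`, gives a recursion that never stops adapting, which is the variant typically used when the quantity being estimated drifts over time.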
Stochastic approximation algorithms have been applied to reinforcement learning for stochastic control problems in [68], learning equilibria in games [35, 81], as well as optimization and parametric identification problems (e.g., recursive maximum likelihood and recursive expectation maximization algorithms, see [82]). In tracking applications, the step size of a stochastic approximation algorithm is chosen as a small constant. For such constant step size algorithms, one typically proves weak convergence of the iterates generated by the stochastic approximation algorithm. Weak convergence is a generalization of convergence in distribution to a function space. The weak convergence analysis of stochastic approximation algorithms with Markovian noise was pioneered by Kushner and co-workers, see [77] and references therein. It was demonstrated in the 1970s that the limiting behavior of a stochastic approximation algorithm can be modeled as a deterministic ordinary differential equation (ODE). This is the basis of the so-called ODE method for convergence analysis of stochastic approximation algorithms. In wireless networks, as there are usually underlying dynamics (e.g., a correlated fading channel) or factors that can only be measured in noise (e.g., the expected system throughput), stochastic approximation algorithms play an important role in optimization. In this thesis, gradient-based stochastic approximation algorithms are used for adaptive and distributed estimation of the optimal transmission policies.

1.4 Summary of Results and Contributions

This section outlines the results of the thesis, followed by a summary of the contributions from a top-down perspective.

1.4.1 Outline of Results

This section presents a summary of results that constitute the main contributions of the thesis. The structure of the thesis is outlined in Figure 1.1. As mentioned previously, the thesis consists of two main parts.
The first part of the thesis focuses on optimizing a channel-aware distributed MAC protocol. The second part of the thesis focuses on exploiting CSI for opportunistic transmission scheduling from a single user's perspective. In each optimization problem, analytical results on the structure of the optimal policies are proved, and numerical algorithms that exploit the analytical results to efficiently estimate the optimal policies are derived. An outline of the progression of ideas, analytical results and learning algorithms in the thesis is as follows.

Chapter 2 and Chapter 3 constitute the first main part of the thesis, where a distributed channel-aware MAC protocol is analysed. The distributed MAC protocol is such that every user accesses the channel using a transmission policy mapping its channel state information to transmit probabilities.

Chapter 2 considers a centralized design approach for optimizing the distributed channel-aware MAC protocol. Therein, the objective is to maximize the full-load system throughput, which is a lower bound of the system stable throughput. The structure of the optimal transmission policy is studied and a stochastic approximation algorithm is proposed for numerically solving the optimization problem. The main results in the chapter include the following.
• The distributed MAC protocol is optimized when every user deploys a transmission policy that is non-randomized and piecewise constant in the channel state. That is, every user transmits or does not transmit with certainty depending on its channel state.
• A stochastic approximation algorithm is proposed for numerically solving the optimization problem for the symmetric random access wireless network model.
• Simulations show that exploiting CSI leads to a significant improvement in channel utilization, reflected in the superior transmission success rates. Furthermore, the achieved system throughput does not degrade when the number of active users increases.
Intuitively, channel utilization is enhanced as the effect of multiple access interference is reduced due to probabilistic transmission. Furthermore, users with better channel states transmit with higher probabilities.

In Chapter 3, the distributed channel-aware MAC protocol is analysed using a noncooperative game theoretic approach. The use of game theory in resource allocation problems is not new. Examples where game theory is deployed to analyse power allocation and rate control problems in wireless networks include [45-49]. Game theoretic solutions are robust due to the equilibrium solution concept, and efficient as no communication among users is required. Furthermore, game theoretic solutions are expected to mediate conflicting interests, i.e., reduce the effect of multiuser interference. These features are desirable for a random access MAC protocol and serve as the motivation for the work in Chapter 3. Last but not least, in the game theoretic approach, each user can learn its strategy in real time, which is much more efficient than solving the centralized optimization problem in Chapter 2. The main results presented in Chapter 3 are listed below.
• A selfish and rational user can always maximize its expected utility via a threshold, non-randomized transmission policy. In other words, the best response of a user to the strategies played by other users is always a transmission policy that is threshold in the channel state.
• There exists a Nash equilibrium profile at which every user adopts a threshold transmission policy.
• Optimality of the threshold transmission policy converts the problem of finding the optimal transmission policy, which is a function optimization problem, into estimating the optimal threshold, which is a single scalar. We propose a gradient-based stochastic approximation algorithm with a constant step size for the purpose of estimating the optimal threshold in real time.
The algorithm has the capability to track changes in the statistical properties of the system. It is shown via simulations that as every user updates its policy using the proposed algorithm, the system converges to a Nash equilibrium. In fact, with primitive coordination of the times at which users update their policies, the system exhibits the best response dynamics.
• The game theoretic MAC problem is conveniently generalized to random access wireless networks with packet/user priorities, which are important for applications with many traffic types or classes of users. In particular, for a network with packet priorities, a transmission policy becomes a function mapping channel states and packet priorities into transmit probabilities. It is proved that there exists a Nash equilibrium profile at which every user adopts a policy which is threshold in the channel state for every packet priority. The optimal transmission policy of each user is now defined by a vector of thresholds, which can also be adaptively estimated via a stochastic approximation algorithm with a constant step size.

The structural results and the tracking capability of the proposed algorithm are illustrated via numerical examples. It is shown via simulations that the game theoretic MAC, while being simple to compute and implement, offers system throughputs and transmission success rates that are close to those of the optimal MAC obtained from the centralized design approach.

The second part of the thesis consists of Chapters 4 and 5. These chapters focus on opportunistic transmission scheduling in correlated fading channels using MDP approaches. The opportunistic transmission scheduling problems are valuable for systems where both energy efficiency and delay are critical issues, such as multimedia communication in cellular networks, or mission-critical sensor networks.
In Chapter 4, the transmission scheduling problem is formulated as a finite horizon MDP with a terminal data loss penalty cost. The correlated fading channel is modeled by a finite state Markov chain (FSMC). The optimization objective is to minimize the expected total cost, which is the sum of the accumulated transmission cost and the terminal penalty cost on the number of packets lost. Furthermore, the delay constraint is that packets in each batch can only be transmitted within a finite number of time slots (i.e., within a finite time horizon). In particular, at the end of the time horizon, untransmitted packets are dropped and a terminal penalty cost is incurred. The main results in Chapter 4 include the following.
• It is proved that if the terminal penalty cost is increasing in the number of untransmitted packets, the optimal transmission scheduling policy is deterministic, monotonically increasing, and hence threshold, in the buffer occupancy state and the transmission time. Intuitively, these threshold results imply that the optimal transmission policy is to transmit more aggressively when the number of packets remaining in the buffer is larger, or when the end of the allowed transmission time slots is closer.
• The above threshold structural results are exploited to reduce the computational complexity required for computing the optimal transmission policy via dynamic programming, and for implementing the optimal policy.

The finite horizon MDP approach accommodates the time-critical nature of the data transmission system. Another possibility is to formulate the problem as an infinite horizon average cost MDP with delay constraints. It is well-known that for such constrained MDPs, the optimal policy can be a randomized function of the state.
In [73] and [74], the authors have introduced a powerful methodology based on Lagrangian dynamic programming, which comprises a two-level optimization: at the inner step, one solves a dynamic programming problem with a fixed Lagrange multiplier to obtain a pure policy; at the outer step, one optimizes the Lagrange multipliers. These Lagrange multipliers determine the randomization coefficients for switching between the pure policies. In Chapter 5, the infinite horizon average cost constrained MDP formulation is used and the supermodularity method is applied to prove a threshold structural result for the inner step. Then, as a result of the outer optimization step, the constrained optimal policy is a randomized mixture of two threshold policies. In particular, the main ideas and results presented in Chapter 5 include the following list.
• The transmission scheduling problem is formulated as an infinite horizon average cost MDP, where the objective is to minimize the expected average transmission cost while the channel evolves according to a finite state Markov chain, subject to a constraint on an average waiting/delay cost.
• A sufficient condition under which every feasible policy induces a stable buffer and a recurrent Markov chain is derived.
• The Lagrange dynamic programming method is used to convert the constrained MDP into an unconstrained MDP which is parameterized by a Lagrange multiplier.
• The unconstrained average cost optimal policy is analysed by viewing the average cost MDP as the limit of a sequence of discounted cost MDPs with discount factors approaching 1, as in [59, 83]. The supermodularity method is used to prove that the discounted cost optimal transmission scheduling policy is deterministic and monotonically increasing (hence threshold) in the buffer occupancy state. Therefore, the unconstrained optimal policy is also threshold in the buffer occupancy state.
• As a result of the Lagrange dynamic programming method, it then follows that the constrained optimal transmission scheduling policy is a randomized mixture of two policies that are deterministic and threshold in the buffer occupancy state.
• A natural method for estimating the optimal policy, based on Lagrangian dynamic programming, then also comprises two levels. We propose an algorithm in which, at the inner level, the unconstrained threshold optimal policy is obtained via a threshold-policy Q-learning algorithm. At the outer level, the Lagrange multiplier is updated by the Robbins-Monro method. Simulations show that the threshold-policy Q-learning algorithm is much more efficient than the non-structured Q-learning algorithm.
• A stochastic-approximation-based algorithm is proposed to estimate the constrained optimal policy directly, given the randomized mixture structure. In particular, in this algorithm, the optimal transmission policy is approximated by a parameterized smooth function. A gradient-based stochastic approximation algorithm with tracking capability then estimates the optimal parameters in real time. Due to the structural result, the estimation and implementation of the optimal transmission scheduling policy are very efficient.
The analysis and algorithms for opportunistic transmission scheduling can be easily generalized to the case of an arbitrary number of actions (instead of only two actions: to transmit and not to transmit) to tackle power/rate control, or any other resource allocation problem with the same mathematical abstraction of the underlying system.

1.4.2 Overview of Contributions

An overview of the main contributions of the thesis is as follows.
• The cross-layer approaches and solutions in the thesis offer substantial performance gains.
Specifically, the cross-layer distributed MAC protocols in Chapters 2 and 3 of the thesis enhance multiuser diversity and result in a system throughput that increases with the number of users in the network. Furthermore, as CSI is exploited for MAC in Chapters 2 and 3, the average transmission success rates are also significantly improved. In Chapters 4 and 5 of the thesis, where the single-user opportunistic transmission scheduling problems are considered, opportunistic transmission algorithms, which exploit both channel state information and channel memory, improve the energy efficiency while satisfying delay constraints.
• The thesis contains many important analytical results, which are mostly on the structure of the optimal transmission scheduling policies. In most cases, the results state that the optimal scheduling policy has a threshold structure. The structural results are significant as they substantially reduce the optimization dimensionality and make the computation and implementation of the optimal policies much more feasible.
• In every chapter of the thesis, the analytical results are exploited to derive efficient optimal policy learning algorithms. Except for an off-line algorithm in Chapter 2, all algorithms derived in this thesis operate in real time, are provably convergent and have the tracking capability. As a result, the cross-layer transmission scheduling solutions in Chapters 3, 4 and 5 can be estimated and implemented very efficiently in real time. The availability of online learning algorithms alongside the structural results makes the results of every chapter in the thesis complete.
• Other contributions of the thesis include the methodologies deployed to formulate and tackle the cross-layer transmission scheduling problems.
The optimization problems considered in this thesis serve as tutorial examples of the use of several analytical tools and mathematical concepts in wireless network optimization, such as game theory for random access, stochastic approximation for optimal policy learning, and supermodularity for deriving monotone results. Furthermore, several results in this thesis may also find applications in other areas such as operations research, or general resource allocation problems.

1.5 Thesis Organization

The remainder of the thesis consists of 5 chapters. Chapter 2 and Chapter 3 constitute the first part of the thesis, which concerns cross-layer MAC. Chapter 2 considers optimizing a distributed channel-aware MAC protocol via a centralized design approach. Chapter 3 adopts the non-cooperative game theoretic approach for analysing the MAC optimization problem. The second part of the thesis, which concerns single-user opportunistic transmission scheduling, consists of Chapter 4 and Chapter 5. In Chapter 4, the optimal transmission scheduling policy is analysed under a finite horizon MDP framework. In comparison, Chapter 5 considers an average cost infinite horizon constrained MDP formulation of the opportunistic transmission scheduling problem. Lastly, Chapter 6 contains a discussion of the main results, conclusions and proposals for future research directions. Each of the main chapters of the thesis is self-contained. A review of the work related to the specific optimization context is also given in every chapter. The notation is defined separately for each chapter, but has been made consistent throughout the thesis for clarity and ease of understanding.

Chapter 2 Optimal Distributed Medium Access Control

2.1 Introduction

An important issue in optimizing the performance of wireless networks is to design a Medium Access Control (MAC) protocol that enhances radio resource utilization.
There are two types of random access MAC protocols that have been implemented in practice: ALOHA protocols, and the more modern MAC protocols that are based on carrier sensing and collision avoidance. ALOHA protocols are traditionally used in local area networks. In comparison, recent wireless local area network standards, e.g., IEEE 802.11b, use MAC protocols that are based on Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA) to reduce the collision probability. Both ALOHA and CSMA/CA are designed for the collision channel model, which assumes that all packets are lost when two or more users transmit at the same time, regardless of the channel states of the users. For networks with the multipacket reception capability, the collision channel model is incorrect, and these MAC protocols waste bandwidth and are highly inefficient. This chapter considers wireless networks with the multipacket reception capability instead of a collision channel and analyses the structure of the optimal channel-aware distributed MAC protocol. The optimal channel-aware MAC protocol can be implemented in a distributed fashion and offers substantial gains in the system throughput and the transmission success rate, especially when the system is heavily loaded with many active users. The multipacket reception capability is essential for increasing channel capacity and can be easily achieved with multiuser detection or array antenna processing. For example, Code Division Multiple Access (CDMA) with random signature sequences is a multiple access scheme that allows for soft collision, so that the limit on the number of users that can be supported simultaneously is a soft limit. In addition, in multiple-antenna systems, space-time coding schemes can be implemented to resolve packets received simultaneously and increase the channel capacity.
As in the literature, we refer to the capability of resolving packets that are transmitted simultaneously as the multipacket reception capability [10, 24]. Given the multipacket reception capability at the physical layer, it is necessary to design a cross-layer MAC protocol to improve the system performance. Since the introduction of the first MPR channel model in [24], there has been extensive work on analysis and proposals of ALOHA-based protocols for MPR models, for example [10, 12, 25, 26, 48]. This chapter analyses a distributed channel-aware MAC protocol for wireless networks with the multipacket reception model proposed in [10], which explicitly incorporates channel state information (CSI) into the packet reception process. In particular, in the multipacket reception model in [10], the probabilistic outcome of every transmission time slot depends on the channel states of the transmitting users. The G-MPR model is a suitable framework for exploiting CSI at the MAC layer. In [10], given the G-MPR model, the authors proposed letting every user transmit according to its policy, which maps CSI to transmit probabilities. This chapter adopts the same optimization framework, i.e. the G-MPR model and the incorporation of CSI into transmission policies. The objective of the chapter is to analyse the structure of the channel-aware transmission policies that yield the optimal system throughput. In other words, the distributed MAC protocol where every user deploys its individual channel-aware transmission policy is optimized in a centralized manner. The most appropriate example of the multiaccess network model in this chapter is perhaps Wireless Local Area Networks (WLANs) as proposed in our publication [36]. In addition to the analysis of the structure of the optimal transmission policies, the chapter presents a stochastic approximation algorithm that can be used to numerically solve the MAC optimization problem for the symmetric network model. 
For the asymmetric network model, the game theoretic approach leads to much more efficient algorithms for learning the optimal transmission policies. The game theoretic approach for optimizing the distributed channel-aware MAC protocol is covered in the next chapter.

Figure 2.1: Structure of the optimal channel-aware transmission policy (horizontal axis: channel state information, with the limit C marked).

2.1.1 Main Results

The chapter considers the problem of optimizing the distributed channel-aware MAC protocol for symmetric and asymmetric wireless networks with the G-MPR model proposed in [10]. For the asymmetric network model, users are allowed to have different transmission policies. In comparison, spatially homogeneous networked users deploy a common transmission policy. The following results are obtained.
• For symmetric networks, it has been shown in [10] that the system throughput when all users are backlogged is also the system maximum stable throughput. Here it is proved for the asymmetric network model that the full-load system throughput (i.e. the system throughput when all users are backlogged) is a lower bound to the system maximum stable throughput. The full-load system throughput is then used as the optimality criterion for the MAC protocol optimization problem.
• We prove for both symmetric and asymmetric network models that, under certain conditions, the optimal transmission policies have the properties outlined by Fig. 2.1. The transmission policy $u(\gamma)$ as a function of the channel state $\gamma$ in Fig. 2.1 is non-randomized, i.e. $u(\gamma)$ is either 0 or 1 for all $\gamma$, piecewise continuous in $\gamma$, and there exists a constant $C$ such that $u(\gamma) = 0$ for all channel states $\gamma > C$.
• An off-line stochastic approximation algorithm is proposed for estimating the optimal transmission policy, i.e. numerically solving the MAC protocol optimization problem, for the symmetric network model.
The obtained MAC protocol can then be implemented in a decentralized fashion and offers substantial improvement in the system throughput.

2.1.2 Related Work

The classical ALOHA protocol optimization problem is to consider the symmetric system with a collision channel model and pick a constant transmit probability, common to all users, that maximizes the system maximum stable throughput. The expression for the maximum stable throughput is known for the symmetric ALOHA system model [85], [86] and the optimal transmit probability can be derived explicitly. In comparison, for the asymmetric ALOHA system model, an expression for the system maximum stable throughput is still not known and the problem of estimating the optimal transmission probabilities is yet to be solved. In [85, 87, 88] the authors provided a necessary and sufficient condition for the stability of the asymmetric ALOHA system, but the exact expression for the maximum stable throughput is still not known. Inner and outer bounds on the stability region have been derived in [87], [88]. In [24], the symmetric Multipacket Reception (MPR) model, where the probability that a packet is correctly received depends on the total number of transmitting nodes, was proposed. The MPR model in [24] provides a framework for designing new protocols that can achieve a significantly better system throughput in comparison to the original slotted ALOHA protocol. It has been proved in [24], [89], [90] that by allowing the transmit probability to be a function of the number of active nodes, a non-zero asymptotic throughput can be achieved for the MPR model. A generalization of the MPR model to the asymmetric model and the investigation of the stability region have been discussed in [4].
The MPR model in [24] is a coarse abstraction of the physical (PHY) layer and is a black box to the network layer, as the channel states of the transmitting users do not explicitly affect the reception of packets at the base station. In [10] a Generalized MPR (G-MPR) model has been proposed to incorporate channel states explicitly in the reception model. In the G-MPR model, channel states directly affect the reception of packets. CSI can then be exploited at the MAC layer (by allowing transmit probabilities to depend on CSI) for much higher efficiency [12]. The most important instance of the G-MPR model is the Signal to Interference plus Noise Ratio (SINR) threshold reception model [91] for Direct Sequence CDMA (DS-CDMA) systems with random signature sequences. In [10], the author considered a channel-aware ALOHA protocol that is based on channel-aware transmission policies. An expression for the system maximum stable throughput is derived for the finite population symmetric network model. In addition, a bound on the asymptotic maximum stable throughput is given for the symmetric system model. Reference [10] does not give any analytical result on the structure of the optimal transmission policies. Early work on exploiting decentralized CSI includes [17] in an information theoretic setting and [18, 19], which combined ALOHA with information theoretic measures. The ALOHA aspect of [18, 19] is relevant to our setup although the encoding strategies therein lead to the collision channel model.
The remainder of the chapter is organized as follows. In Section 2.2 the multipacket reception wireless network model is described. The optimization problems are given in Section 2.3. Section 2.4 and Section 2.5 contain structural results for asymmetric and symmetric network models respectively. A stochastic approximation algorithm for estimating the optimal transmission policy for symmetric networked users is proposed in Section 2.6.
Section 2.7 contains examples. The conclusions of the chapter are in Section 2.8, and lastly, the proofs of analytical results are presented in Section 2.9.

2.2 Distributed MAC for Wireless Networks

In this section the symmetric (spatially homogeneous) and asymmetric (spatially heterogeneous) multipacket reception wireless network models considered in the MAC protocol optimization problems are described. Some important concepts including the system maximum stable throughput, the stability notion, and definitions of pure and randomized transmission policies are also introduced.

2.2.1 Transmission Scheduling in Random Access Wireless Networks

Consider a random access wireless network where a finite number of users indexed by $i = 1, 2, \dots, K$, where $K$ is a finite positive integer, communicate with a common base station using non-orthogonal communication channels, e.g. CDMA with random signature sequences. Assume that time is slotted and indexed by $t = 1, 2, \dots$. Let the Signal to Noise Ratio (SNR) represent the channel state of a user. Denote the channel state of user $i$ by $\gamma_i$ for all $i \in \{1, 2, \dots, K\}$ and assume that $\gamma_i$ is a continuous random variable in $[0, \gamma_{\max}]$, where $\gamma_{\max} \in \mathbb{R}^+$ and $\gamma_{\max} < \infty$. $\gamma_{\max}$ can be arbitrarily large so that $[0, \gamma_{\max}]$ includes the entire range of CSI that is of practical interest. Denote the continuous (channel) probability distribution function of user $i$ by $F_i(\cdot)$, i.e. $\gamma_i \sim F_i(\cdot)$. Furthermore, at the beginning of each transmission time slot, user $i$ is assumed to know its instantaneous CSI, $\gamma_i$, perfectly.
Assume that the packet arrival (input) process of user $i$ has mean $\mu_i$ and finite variance. Further assume that the arrival processes are independent across all users. Define the system average cumulative input rate $\mu$ by
$$\mu = \sum_{i=1}^{K} \mu_i. \tag{2.1}$$
We consider both asymmetric (spatially heterogeneous) and symmetric (spatially homogeneous) random access system models, e.g., a WLAN.
The above description is the asymmetric system model. The symmetric system model is a special case, where the arrival rates as well as the channel state probability distribution functions of all users are identical, i.e. $F_i(\cdot) = F_j(\cdot)$ and $\mu_i = \mu_j$ for all $i, j \in \{1, 2, \dots, K\}$. Assume that every user knows its instantaneous channel state realization perfectly at the beginning of each transmission time slot. A user can estimate its channel state by measuring the strength of a beacon signal or any acknowledgment (ACK) signal sent (or broadcast) by the local base station. At each time slot, every user $i$ picks a transmit probability according to its transmission policy, which is a function mapping CSI to transmit probabilities as follows:
$$u_i(\cdot) : [0, \gamma_{\max}] \to [0, 1]. \tag{2.2}$$
We now define pure and randomized transmission policies.
Definition 2.1. A randomized transmission policy is a transmit probability function $u(\cdot) : [0, \gamma_{\max}] \to [0, 1]$ such that $0 < u(\gamma) < 1$ for some non-zero measure set (with respect to the probability measure $F(\cdot)$ of the channel state $\gamma$) of values of $\gamma$.
Definition 2.2. A pure transmission policy is a transmit probability function $u(\cdot) : [0, \gamma_{\max}] \to [0, 1]$ such that $u(\gamma) \in \{0, 1\}$ for all $\gamma \in [0, \gamma_{\max}]$ except for possibly a zero-measure set (with respect to the probability measure $F(\cdot)$ of the channel state $\gamma$) of values of $\gamma$.
Let $B_i^{(t)}$ be the length of the buffer of user $i$ at time slot $t$. We use the stability notion of [10] and [86], i.e. the system is considered stable if for all $i \in \{1, 2, \dots, K\}$ there exists a function $H(x)$, whose domain is $\mathbb{Z}^+$ (the set of non-negative integers), such that
$$\lim_{t \to \infty} P\big(B_i^{(t)} < x\big) = H(x) \quad \text{and} \quad \lim_{x \to \infty} H(x) = 1. \tag{2.3}$$
Define the throughput of a system to be the average number of packets that are transmitted and successfully received at the base station in a time slot.
In addition, the maximum stable throughput is the supremum of all cumulative input (arrival) rates $\mu$, defined in (2.1), for which the system is stable [10].
The notation used in this chapter is as follows. A subscript is used to index users while a superscript $(k)$ indicates that there are $k$ transmitting users. $A_k^K$ is any unordered set of $k$ integers selected from $1, 2, \dots, K$, i.e. $A_k^K \subset \{1, 2, \dots, K\}$. In the chapter, $A_k^K$ is used to specify the set of transmitting users. $\gamma_{A_k^K} = (\gamma_i : i \in A_k^K)$ represents the channel states of the users indexed by $A_k^K$; $\gamma^{(k)} = (\gamma_i : i \in \{1, \dots, k\})$. $F_{A_k^K}(\cdot)$ and $F^{(k)}(\cdot)$ are the joint probability distribution functions of $\gamma_{A_k^K}$ and $\gamma^{(k)}$ respectively. Furthermore, $\mathbb{E}_F[\cdot]$ denotes the expectation with respect to the probability distribution function $F(\cdot)$. $I(\cdot)$ denotes the indicator function, i.e. $I : \{0, 1\} \to \{0, 1\}$ with $I(1) = 1$ and $I(0) = 0$. The expected full-load system throughput is a function of the transmission policies of all users and is denoted by $T_K(u_1(\cdot), \dots, u_K(\cdot))$. When it is clear which set of transmission policies is being referred to, $T_K(u_1(\cdot), \dots, u_K(\cdot))$ is abbreviated as $T_K$.

2.2.2 G-MPR Model

We consider a G-MPR model identical to that in [10]. In the G-MPR model, the probability of receiving a packet correctly depends on the channel states of all transmitting users. In particular, the outcome of a transmission time slot when $k$ users indexed by $A_k^K$ transmit belongs to an event space where each elementary event is represented by a binary $k$-tuple $\theta_{A_k^K} = (\theta_i : i \in A_k^K)$, where $\theta_i \in \{0, 1\}$; $\theta_i = 1$ indicates that the packet sent by user $i$ is correctly received and $\theta_i = 0$ indicates otherwise.
The reception capability of the system is described by a set of $K$ functions, where the $k$-th function $\Phi(\gamma_{A_k^K}; \theta_{A_k^K})$ assigns a probability to the outcome $\theta_{A_k^K}$ when $k$ users indexed by $A_k^K$ with channel states $\gamma_{A_k^K}$ transmit:
$$\Phi(\gamma_{A_k^K}; \theta_{A_k^K}) = P\big(\theta_{A_k^K} \mid k \text{ users transmit}, \gamma_{A_k^K}\big). \tag{2.4}$$
Equation (2.4) implies that the probability distribution of the outcome $\theta_{A_k^K}$ is determined by the channel states of all transmitting users. We assume that (2.4) is symmetric, i.e.
$$\Phi(\gamma_{A_k^K}; \theta_{A_k^K}) = \Phi\big(R_k(\gamma_{A_k^K}), R_k(\theta_{A_k^K})\big) \tag{2.5}$$
for any permutation $R_k$ of a $k$-element vector. (2.5) is also an assumption in [10, 92]. Our objective is to maximize the system throughput. That is, we are concerned with the expected total number of packets that are received correctly, or the success rate. This information is represented by the following set of $K$ functions:
$$\Psi^{(k)}(\gamma_{A_k^K}) = \sum_{i \in A_k^K} \mathbb{E}_{F_{A_k^K}}\big[\theta_i \mid k \text{ users tx}, \gamma_{A_k^K}\big]. \tag{2.6}$$
Throughout the chapter it is assumed that
$$\mathbb{E}_{F_{A_k^K}}\big[\theta_i \mid k \text{ users tx}, \gamma_{A_k^K}\big] \ge \mathbb{E}_{F_{A_k^K}}\big[\theta_i \mid k + 1 \text{ users tx}, \gamma_{A_k^K}, \gamma\big], \tag{2.7}$$
where $\gamma > 0$ is the channel state of the additional user. Intuitively, (2.7) means that the success probability of a user decreases when one more user transmits, which is satisfied by most non-trivial reception models, including the collision channel model. An example of the G-MPR model is the SINR threshold reception model for DS-CDMA systems with linear receivers. For example, for a DS-CDMA system with a matched filter receiver, (2.4) is given by
$$\Phi(\gamma_{A_k^K}; \theta_{A_k^K}) = \begin{cases} 1 & \text{if } \theta_{A_k^K} = (\hat{\theta}_i : i \in A_k^K), \\ 0 & \text{otherwise,} \end{cases} \quad \text{where } \hat{\theta}_i = I\left(\frac{\gamma_i}{1 + \frac{1}{N} \sum_{j \in A_k^K - i} \gamma_j} > \beta\right), \tag{2.8}$$
where $I(\cdot)$ is the indicator function, $\gamma_j$ is the channel state (SNR) of user $j$, $N$ is the spreading gain and $\beta$ is the SINR threshold for correct reception. The derivation of the SINR threshold reception model for DS-CDMA systems with other linear multiuser detectors is given in [91]. $\Psi^{(k)}(\cdot)$
defined by (2.6) is then given by
$$\Psi^{(k)}(\gamma_{A_k^K}) = \sum_{i \in A_k^K} I\left(\frac{\gamma_i}{1 + \frac{1}{N} \sum_{j \in A_k^K - i} \gamma_j} > \beta\right). \tag{2.9}$$
It should also be noted that the classical collision model and the MPR model proposed in [25] are two special cases of (2.6) in which the channel states of the transmitting users are not relevant. For the classical collision model we have
$$\Psi^{(k)}(\gamma_{A_k^K}) = \begin{cases} 1 & \text{if } k = 1, \\ 0 & \text{if } k \ge 2. \end{cases}$$
In addition, there is a straightforward translation from an MPR reception matrix in [25] to a corresponding set of $K$ functions defined as in (2.6) using Bayes' rule.

2.3 Distributed MAC Optimization Problem

Using the dominant system approach for analysing the stability of Markov chains, it is shown in [10] (Theorem 1) that a symmetric random access network is stable as long as the arrival rate is less than or equal to the expected throughput when the system is fully loaded, i.e. every user has at least one packet to transmit. That is, the maximum stable throughput of a symmetric network is equal to the expected throughput when all users are backlogged, which is in some sense the worst scenario. Given a conjecture in [86] and Theorem 1 in [85], one may expect that this result in [10] generalizes to the asymmetric model. We now show that for the asymmetric system model, the full-load system throughput is a lower bound for the system maximum stable throughput. The problem of optimizing the channel-aware ALOHA protocol is then formulated as finding the optimal transmission policies that maximize the full-load system throughput. The optimization problems for the asymmetric and symmetric network models are defined in Sections 2.3.1 and 2.3.2 respectively.
Theorem 2.1. Consider a multiaccess system of $K$ users indexed by $i = 1, 2, \dots, K$ where user $i$ has channel state probability distribution function $F_i(\cdot)$ and transmission policy $u_i(\cdot)$. Assume the G-MPR model (2.6).
The system maximum stable throughput is lower-bounded by the full-load system throughput given by
$$T_K(u_1(\cdot), \dots, u_K(\cdot)) = \int_0^{\gamma_{\max}} \!\!\cdots\! \int_0^{\gamma_{\max}} \sum_{k=0}^{K} \sum_{A_k^K} \prod_{i \in A_k^K} u_i(\gamma_i) \prod_{j \notin A_k^K} \big(1 - u_j(\gamma_j)\big)\, \Psi^{(k)}(\gamma_{A_k^K})\, dF_1(\gamma_1) \cdots dF_K(\gamma_K). \tag{2.10}$$
Proof. See Section 2.9.
For symmetric networks, the system maximum stable throughput and the full-load system throughput are equivalent [10]. In what follows we formulate the MAC protocol optimization problems.

2.3.1 Problem 1: Optimal Distributed MAC for Asymmetric Networks

Consider the asymmetric multiaccess system of $K$ users with the G-MPR model (2.6) described in Section 2.2. We allow different users to have different transmission policies and aim to maximize the full-load system throughput given by (2.10), which is a lower bound of the system maximum stable throughput (Theorem 2.1). Denote the policy of user $i$ by $u_i(\cdot)$. Consider the normed linear function space $L_\infty[0, \gamma_{\max}]$, which is the space of all Lebesgue measurable functions on $[0, \gamma_{\max}]$ that are bounded almost everywhere. The norm of a function in $L_\infty[0, \gamma_{\max}]$ is its essential supremum [93]. Let $B_{L_\infty[0, \gamma_{\max}]}[0, 1]$ be the subset of $L_\infty[0, \gamma_{\max}]$ that contains all Lebesgue measurable functions that have norms in the range $[0, 1]$. The problem of optimizing the system throughput for the asymmetric network model can be formulated on $B_{L_\infty[0, \gamma_{\max}]}[0, 1]$ as
$$\sup_{u_1(\cdot), \dots, u_K(\cdot)\, :\, u_i(\cdot) \in B_{L_\infty[0, \gamma_{\max}]}[0, 1]} T_K(u_1(\cdot), \dots, u_K(\cdot)), \tag{2.11}$$
where the objective function $T_K(u_1(\cdot), \dots, u_K(\cdot))$ is given by (2.10). Assuming that the supremum of (2.10) can be achieved in $B_{L_\infty[0, \gamma_{\max}]}[0, 1]$, i.e. that the optimal transmission policies are Lebesgue measurable, we rewrite the above optimization problem as
$$\max_{u_1(\cdot), \dots, u_K(\cdot)\, :\, u_i(\cdot) \in B_{L_\infty[0, \gamma_{\max}]}[0, 1]} T_K(u_1(\cdot), \dots, u_K(\cdot)), \tag{2.12}$$
where $T_K(u_1(\cdot), \dots, u_K(\cdot))$ is given by (2.10).
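The full-load throughput (2.10) is an expectation over the users' channel states and their independent transmit decisions, so for a given set of policies it can be estimated by simulation. The following is a minimal sketch, not the thesis's algorithm: it assumes the SINR threshold model (2.9), and the function names, the uniform SNR distribution, the band-shaped policy and the values of N and beta are all hypothetical choices made for illustration.

```python
import random

def success_count(gammas, N, beta):
    """Packets received in one slot under the SINR threshold model (2.9)."""
    total = sum(gammas)
    # User i succeeds if gamma_i / (1 + (1/N) * sum of the other SNRs) > beta.
    return sum(1 for g in gammas if g / (1.0 + (total - g) / N) > beta)

def full_load_throughput(policies, samplers, N, beta, slots=20000, seed=0):
    """Monte Carlo estimate of (2.10) with every user backlogged."""
    rng = random.Random(seed)
    received = 0
    for _ in range(slots):
        transmitting = []
        for u, draw in zip(policies, samplers):
            g = draw(rng)                # channel state gamma_i drawn from F_i
            if rng.random() < u(g):      # transmit with probability u_i(gamma_i)
                transmitting.append(g)
        received += success_count(transmitting, N, beta)
    return received / slots

# Hypothetical setup (assumption): 3 users, SNR uniform on [0, 20],
# each user transmits only when its SNR lies in a band.
K = 3
samplers = [lambda rng: rng.uniform(0.0, 20.0)] * K
policies = [lambda g: 1.0 if 4.0 < g < 15.0 else 0.0] * K
estimate = full_load_throughput(policies, samplers, N=8, beta=2.0)
```

Passing a different policy or sampler per user covers the asymmetric case; the estimate is only as accurate as the number of simulated slots allows.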
2.3.2 Problem 2: Optimal Distributed MAC for Symmetric Networks

Here we formulate the problem of maximizing the system maximum stable throughput for the spatially homogeneous multiaccess model, where all users have the same channel state probability distribution function, the same arrival rate and deploy the same transmission policy, i.e. $F_i(\cdot) = F(\cdot)$, $\mu_i = \mu$, $u_i(\cdot) = u(\cdot)$ for all $i \in \{1, 2, \dots, K\}$. As users are long-term symmetric, fairness is automatically guaranteed. The maximum stable throughput of the system is given in [10], and can also be derived from (2.10) as
$$T_K(u(\cdot)) = \sum_{k=1}^{K} \binom{K}{k} (1 - u_{av})^{K-k} \int_0^{\gamma_{\max}} \!\!\cdots\! \int_0^{\gamma_{\max}} u(\gamma_1) \cdots u(\gamma_k)\, \Psi^{(k)}(\gamma^{(k)})\, dF(\gamma_1) \cdots dF(\gamma_k)$$
$$= \sum_{k=1}^{K} \binom{K}{k} (1 - u_{av})^{K-k} u_{av}^{k} \int_0^{\gamma_{\max}} \!\!\cdots\! \int_0^{\gamma_{\max}} \Psi^{(k)}(\gamma^{(k)})\, dG(\gamma_1) \cdots dG(\gamma_k), \tag{2.13}$$
where
$$u_{av} = \int_0^{\gamma_{\max}} u(\gamma)\, dF(\gamma) \tag{2.14}$$
is the average transmit probability for each user and the distribution function $G(\cdot)$ is given by
$$G(\gamma) = \frac{\int_0^{\gamma} u(\gamma')\, dF(\gamma')}{u_{av}}. \tag{2.15}$$
The problem of optimizing the system maximum stable throughput for the symmetric network model can be formulated on the set $B_{L_\infty[0, \gamma_{\max}]}[0, 1]$ as
$$\max_{u(\cdot) \in B_{L_\infty[0, \gamma_{\max}]}[0, 1]} T_K(u(\cdot)), \tag{2.16}$$
with the underlying assumption in (2.16) being that the optimal transmission policy is Lebesgue measurable.
Remark: The optimization problems (2.12) and (2.16) can be formulated as continuous state space, infinite horizon, average reward constrained Markov Decision Processes, where the dynamics of the (channel) state are i.i.d. The constraint is multilinear (polynomial for the case of a symmetric network model) and reflects the fact that each user has to select its transmission policy independently of other users. Due to space restrictions, details are omitted. However, the MDP formulation of the optimization problems helps clarify that finding the structure of the optimal transmission policy is non-trivial and that numerically solving for the optimal transmission policy is very challenging.
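For the symmetric model, the second form of (2.13) separates the binomial weight on the number of transmitters from the average success count under the distribution G of (2.15). The sketch below evaluates it numerically under stated assumptions: the SINR threshold model (2.9) is used for the success count, u_av in (2.14) is estimated by sampling, and draws from G are produced by rejection sampling (draw a channel state from F and accept it with probability u). All function names and parameters are hypothetical and chosen for illustration.

```python
import math
import random

def psi_sinr(gammas, N, beta):
    """Success count (2.9) for the given transmitters' SNRs."""
    total = sum(gammas)
    return sum(1 for g in gammas if g / (1.0 + (total - g) / N) > beta)

def symmetric_throughput(u, draw, K, N, beta, samples=4000, seed=1):
    """Numerical evaluation of the second form of (2.13)."""
    rng = random.Random(seed)
    # (2.14): average transmit probability, estimated by sampling from F.
    u_av = sum(u(draw(rng)) for _ in range(samples)) / samples
    if u_av == 0.0:
        return 0.0  # nobody ever transmits

    def draw_G():
        # Rejection sampling from G in (2.15): accept gamma ~ F w.p. u(gamma).
        while True:
            g = draw(rng)
            if rng.random() < u(g):
                return g

    total = 0.0
    for k in range(1, K + 1):
        # Average success count when exactly k users transmit.
        mean_psi = sum(psi_sinr([draw_G() for _ in range(k)], N, beta)
                       for _ in range(samples)) / samples
        total += math.comb(K, k) * (1.0 - u_av) ** (K - k) * u_av ** k * mean_psi
    return total
```

For example, `symmetric_throughput(lambda g: 1.0 if g > 5.0 else 0.0, lambda rng: rng.uniform(0.0, 20.0), K=4, N=8, beta=2.0)` evaluates a threshold-type policy on a hypothetical uniform SNR distribution.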
2.4 Structured Optimal Transmission Policy for Heterogeneous Networked Users

In this section structural results for the optimization problem (2.12) for the asymmetric multiaccess system model are proved. It is shown that the optimal transmission policies have the properties outlined in Fig. 2.1, i.e., the optimal transmission policies are non-randomized (Theorem 2.2) and piecewise continuous (Theorem 2.3) with respect to the channel state. Furthermore, it is also proved that if a system throughput greater than 2 can be achieved, there exists a limit beyond which not transmitting is optimal.

2.4.1 Optimality of Pure Transmission Policies

We now claim that there exists a solution $(u_1^*(\cdot), \dots, u_K^*(\cdot))$ to (2.12) such that $u_i^*(\cdot)$ is non-randomized in the sense of Def. 2.2.
Theorem 2.2. Consider the asymmetric multiaccess model described in Section 2.2 and optimization problem (2.12). There exists a solution (i.e. a set of Lebesgue measurable transmission policies) $(u_1^*(\cdot), \dots, u_K^*(\cdot))$ that maximizes the full-load system throughput (2.10) and $u_i^*(\cdot)$ is non-randomized in the sense of Def. 2.2 for all $i \in \{1, \dots, K\}$.
Proof. See Section 2.9.
Remarks: For the special case of the classical collision channel model, if CSI is available and transmission policies are allowed to be functions of channel states, it is straightforward to verify that Theorem 2.2 holds. Indeed, for a system of $K$ users with a collision channel, it is optimal to select $u_i(\cdot) = 1$ for one user $i$ and $u_j(\cdot) = 0$ for all $j \ne i$. In addition, as is remarked after the proof of Theorem 2.2, if the expected system throughput when user $i$ transmits with positive power is almost surely different from the expected system throughput when user $i$ does not transmit, then we can claim a stronger result: the solution(s) of (2.12) must consist of pure policies only, i.e. pure policies are strictly optimal. Verifying this condition may be hard.
Nevertheless, we conjecture that this condition holds for the SINR threshold reception model for CDMA systems.

2.4.2 Piecewise Continuity

In Theorem 2.3 it is claimed that the optimal transmission policy for each user is not only pure but also piecewise continuous. Theorem 2.3 requires that the success rate functions (2.6) are piecewise continuous with respect to the channel states. Piecewise continuity of success rates means that if the channel state of a user changes by an infinitesimally small amount, e.g. because the user increases its transmit power slightly, the change in the success rate is also small. The condition is expressed mathematically by (2.17) and is in fact not very restrictive. For example, (2.17) is satisfied by the SINR threshold reception model (2.9) for DS-CDMA systems.
Theorem 2.3. Consider the spatially heterogeneous multiaccess system model described in Section 2.2 and the optimization problem (2.12). Assume that the success rate function is piecewise continuous, i.e.
$$\lim_{\gamma_i \to \gamma} \Psi^{(k)}(\gamma_{A_k^K - i}, \gamma_i) = \Psi^{(k)}(\gamma_{A_k^K - i}, \gamma), \tag{2.17}$$
for all $A_k^K$ and $\gamma \in [0, \gamma_{\max}]$, except for possibly a zero-measure set of values of $\gamma$. Then there exists a solution (i.e. a set of Lebesgue measurable transmission policies) $(u_1^*(\cdot), \dots, u_K^*(\cdot))$ that maximizes the full-load system throughput (2.10) such that $u_i^*(\cdot)$ is non-randomized in the sense of Def. 2.2 and piecewise continuous, i.e.
$$\lim_{\gamma \to \gamma_0} u_i^*(\gamma) = u_i^*(\gamma_0) \quad \forall \gamma_0 \in [0, \gamma_{\max}], \tag{2.18}$$
except for possibly a zero-measure set of values of $\gamma_0$, for all $i = 1, 2, \dots, K$.
Proof. See Section 2.9.
Remark: Theorems 2.2 and 2.3 imply that $u_i^*(\cdot)$ is piecewise constant for all $i = 1, 2, \dots, K$.

2.4.3 Optimality of Not Transmitting

This part of the chapter contains one more structural result for the most important example of the G-MPR model, which is the SINR threshold reception model (2.9) for CDMA systems.
Theorem 2.4 states that, under a mild condition, there exists a limit such that a user whose channel state is beyond this limit should not transmit. Intuitively, the performance of a CDMA system is limited by multiple access interference (MAI), and an increase in the transmit power of one user implies an increase in the interference to all other users. One might then suspect that if a user with an extremely good channel state transmits, it is likely to decrease the expected system throughput. Theorem 2.4 shows that this is indeed the case. In practice, a user should not be banned from transmitting because its channel is of very high quality. Rather, Theorem 2.4 implies that a power control scheme should be used to guarantee that the effective SNR of any single user does not exceed the limit.

Theorem 2.4. Consider the asymmetric multiaccess system model described in Section 2.2 and the optimization problem (2.12). Assume the SINR threshold reception model (2.9). Let the mean channel state value for user $i$ be $\mathbb{E}_{F_i}[\gamma_i] = \bar{\gamma}_i$ and define $\bar{\gamma} = \sum_{i=1}^{K} \bar{\gamma}_i$. Let $(u_1^*(\cdot), \ldots, u_K^*(\cdot))$ be a solution to (2.12). Assume that it is possible to achieve a system throughput greater than $2 + \epsilon_0$ for some $\epsilon_0 \in \mathbb{R}^+$. Let

$$C = \frac{N \bar{\gamma}}{\beta \epsilon_0}, \qquad (2.19)$$

where $N$ and $\beta$ are the spreading gain and the SINR threshold in (2.9). Then

$$u_i^*(\gamma_i) = 0 \quad \text{for } \gamma_i > C, \ \forall i = 1, 2, \ldots, K. \qquad (2.20)$$

Proof. See Section 2.9. □

It should be noted from Theorem 2.4 that the limit $C$ given by (2.19) decreases when $\epsilon_0$ increases, i.e. a tighter limit $C$ can be obtained by showing that a higher system throughput is achievable.

2.5 Structured Optimal Transmission Policy for Homogeneous Networked Users

Here we consider the symmetric (spatially homogeneous) multiaccess system model and the optimization problem (2.16) formulated in Section 2.3.2. We provide a sufficient condition under which the optimal transmission policy also has the properties outlined in Fig. 2.1.
First, recall that the spatially homogeneous network is a special case of the spatially heterogeneous network in which all users have the same channel distribution function and the same input (arrival) rate. The optimization problem (2.12), which has been analysed in Section 2.4, can also be formulated for the spatially homogeneous network model. In particular, for the spatially homogeneous network model, the system maximum stable throughput given by (2.10) can be restated as

$$T_K(u_1(\cdot), \ldots, u_K(\cdot)) = \int_0^{\gamma_{\max}} \!\!\cdots\! \int_0^{\gamma_{\max}} \sum_{k=0}^{K} \sum_{A_k^K} \prod_{i \in A_k^K} u_i(\gamma_i) \prod_{j \notin A_k^K} (1 - u_j(\gamma_j)) \, \Psi^{(k)}(\gamma_{A_k^K}) \, dF(\gamma_1) \cdots dF(\gamma_K). \qquad (2.21)$$

Hence, we can rewrite the optimization problem (2.12) for spatially homogeneous networked users as

$$\max_{u_1(\cdot), \ldots, u_K(\cdot) \,:\, u_i(\cdot) \in \mathcal{B}_L[0, \gamma_{\max}]^{[0,1]}} T_K(u_1(\cdot), \ldots, u_K(\cdot)), \qquad (2.22)$$

where $T_K(u_1(\cdot), \ldots, u_K(\cdot))$ is given by (2.21). Theorems 2.2, 2.3 and 2.4 hold for (2.22).

Second, it should be noted that the optimization problem (2.16) is actually (2.12) with the objective function rewritten as (2.21) above and subject to an extra constraint:

$$u_i(\cdot) = u_j(\cdot) \quad \text{for all } i, j \in \{1, 2, \ldots, K\}. \qquad (2.23)$$

In what follows, we claim that if (2.22) has a unique solution then the structural results outlined in Fig. 2.1 can be proved for the optimization problem (2.16), which is equivalent to (2.22) with the constraint (2.23).

Corollary 2.1. Consider the spatially homogeneous multiaccess model described in Section 2.2. If (2.22) has a unique solution, i.e. (2.21) has a unique global optimum, then the optimization problem (2.16) has a unique solution $u^*(\cdot)$ that is a non-randomized transmission policy in the sense of Def. 2.2.

Proof. See Section 2.9. □

The classical collision channel model and the MPR model without CSI [25] clearly do not satisfy the condition that the solution of (2.22) is unique. However, it can still be claimed for these models that (2.16) has a non-randomized solution.
Indeed, for the symmetric network model with a collision channel, the classical optimization problem is to pick a constant transmission probability that maximizes the system throughput, which is equivalent to the optimization problem (2.22) with one extra constraint that $u(\cdot) = a$ for some constant $a \in [0, 1]$. This constraint makes the optimal transmission policy randomized in the sense of Def. 2.1. However, if transmission policies are allowed to be functions of channel states as in (2.16), there exists a solution that is non-randomized in the sense of Def. 2.2. The underlying reason is that, when CSI is available, each (randomized) constant transmission policy is equivalent to at least one, and usually infinitely many, non-randomized transmission policies of the form (2.2), assuming that the physical channel distribution is continuous. The same arguments apply to the MPR model in [24, 25]. Using a similar approach to Corollary 2.1, we claim the other structural results for (2.16).

Corollary 2.2. Consider the spatially homogeneous multiaccess system model described in Section 2.2. Assume that (2.17) holds. If (2.21) has a unique global optimum, then the optimization problem (2.16) has a unique solution $u^*(\cdot)$ that is pure in the sense of Def. 2.2 and piecewise continuous, i.e.

$$\lim_{\gamma \to \gamma_0} u^*(\gamma) = u^*(\gamma_0) \quad \forall \gamma_0 \in [0, \gamma_{\max}] \qquad (2.24)$$

except for possibly a zero-measure set of values of $\gamma_0$.

Proof. See Section 2.9. □

Corollary 2.3. Consider the spatially homogeneous multiaccess model described in Section 2.2 and the SINR threshold reception model (2.9), which is an example of the G-MPR model. Assume that (2.22) has a unique solution, i.e. (2.21) has a unique global optimum. Further assume that it is possible to achieve a system throughput greater than $2 + \epsilon_0$ for some $\epsilon_0 \in \mathbb{R}^+$. Let $u^*(\cdot)$ be an optimal transmission policy.
Let $C = \frac{N \bar{\gamma}}{\beta \epsilon_0}$, where $N$ and $\beta$ are the spreading gain and the SINR threshold in (2.9). Then

$$u^*(\gamma) = 0 \quad \forall \gamma > C. \qquad (2.25)$$

Proof. Similar to the proofs of Corollaries 2.1 and 2.2, (2.25) follows from the assumption that (2.21) has a unique global optimum, the symmetry of the objective function (2.21) and Theorem 2.4. □

2.6 Estimating Optimal Transmission Policy via Stochastic Approximation

In this section we focus on deriving an algorithm to numerically solve the optimization problem (2.16) for the symmetric multipacket reception wireless network model described in Section 2.2. In particular, we use a novel spherical coordinate parameterization technique to convert the transmission policy optimization problem from a constrained problem, where the constraint is that the transmit probabilities are in [0, 1], to an unconstrained problem. Then, we propose a stochastic approximation algorithm (Algorithm 2.1) that can numerically solve the optimization problem (2.16). Algorithm 2.1 is an off-line gradient-based algorithm and can be used to find the optimal transmission policy for spatially homogeneous networked users. For spatially heterogeneous networked users, an analogous algorithm can be easily obtained. However, for the asymmetric network model, a game theoretic approach gives a more efficient way to update the transmission policy of each user in a real-time, distributed fashion [35].

The core of Algorithm 2.1 is the stochastic estimation of the gradient of the objective function, which cannot be derived in closed form. The gradient estimates in Algorithm 2.1 are obtained by the score function method [94]. The score function method is one of the exact gradient estimation methods that yield unbiased gradient estimates. Other methods for obtaining unbiased gradient estimates include the weak derivative method and the process derivative method [94].
In this chapter we select the score function method for its relative simplicity. The essence of the score function method is the use of the Rao-Blackwellization (smoothing by conditioning) technique to reduce the variance of the gradient estimates. An alternative, very simple and widely used method for gradient estimation is the finite difference method [5]. However, gradient estimates obtained by the finite difference method are subject to a random systematic error of mean 0 and variance of the order of $1/c$, where $c$ is the perturbation factor (see [94], Chapter 5). In other words, gradient estimates obtained by the finite difference method are biased, and as the perturbation factor tends to zero, the variance of the estimation error is unbounded.

First, quantize the channel state space into a finite number of ranges and convert (2.16) into a finite dimensional optimization problem. In particular, let SNR represent the channel state and quantize the SNR domain (i.e. the channel state space) into $M$ regions by $\alpha_i$, $i = 1, 2, \ldots, M+1$, with $\alpha_i < \alpha_j$ for all $i < j$. The function optimization problem (2.16) is then replaced by the problem of selecting an optimal transmit probability for each SNR range. We use the novel spherical coordinate parameterization method, first proposed in [95], [96], to represent a transmission policy by

$$u_\theta(\gamma) = \sum_{i=1}^{M} \sin^2(\theta_i) \, I(\alpha_i \leq \gamma < \alpha_{i+1}). \qquad (2.26)$$

The system maximum stable throughput (2.10) can be rewritten as

$$T_K(\theta) = \sum_{k=1}^{K} \binom{K}{k} (1 - \bar{p}_\theta)^{K-k} \, \mathbb{E}_{F^{(k)}}\!\left[ u_\theta(\gamma_1) \cdots u_\theta(\gamma_k) \, \Psi^{(k)}(\gamma^{(k)}) \right], \qquad (2.27)$$

where $\theta = (\theta_1, \ldots, \theta_M)$,

$$\bar{p}_\theta = \sum_{i=1}^{M} \sin^2(\theta_i) \left( F(\alpha_{i+1}) - F(\alpha_i) \right) \qquad (2.28)$$

is the average transmit probability, and $\Psi^{(k)}(\gamma^{(k)})$ is given by the reception model (2.6). The optimization problem (2.16) becomes that of finding $\theta$ that maximizes the system throughput, i.e.

$$\max_{\theta} T_K(\theta). \qquad (2.29)$$

In what follows, we will derive a numerical algorithm to estimate the parameter $\theta$ that maximizes $T_K(\theta)$ given by (2.27).
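The parameterization (2.26) and the average transmit probability (2.28) can be sketched numerically as follows. This is a minimal Python illustration that assumes an exponential channel distribution $F(\gamma) = 1 - e^{-\gamma/\bar{\gamma}}$, as in the examples of Section 2.7. The names `policy`, `exp_cdf` and `avg_transmit_prob`, the bin edges and the values of $\theta$ are hypothetical choices made only for this example.

```python
import bisect
import math


def policy(theta, edges, gamma):
    """u_theta(gamma) = sin^2(theta_i) for gamma in [edges[i], edges[i+1])."""
    if gamma < edges[0] or gamma >= edges[-1]:
        return 0.0
    i = bisect.bisect_right(edges, gamma) - 1
    return math.sin(theta[i]) ** 2


def exp_cdf(gamma, mean):
    """Exponential CDF, assumed here as the channel state distribution F."""
    return 1.0 - math.exp(-gamma / mean)  # exp(-inf) is 0.0 for the last edge


def avg_transmit_prob(theta, edges, mean):
    """p_bar = sum_i sin^2(theta_i) * (F(edges[i+1]) - F(edges[i]))."""
    return sum(
        math.sin(t) ** 2 * (exp_cdf(hi, mean) - exp_cdf(lo, mean))
        for t, lo, hi in zip(theta, edges[:-1], edges[1:])
    )


edges = [0.0, 2.0, 5.0, math.inf]   # M = 3 SNR ranges (illustrative)
theta = [0.0, 0.7, math.pi / 2]     # unconstrained real parameters

print(policy(theta, edges, 1.0))    # 0.0: never transmit in the lowest range
print(policy(theta, edges, 100.0))  # 1.0: always transmit in the highest range
print(avg_transmit_prob(theta, edges, mean=5.0))
```

Because each transmit probability is written as $\sin^2(\theta_i)$, any real-valued $\theta$ produced by a gradient step maps to a valid probability, so no projection onto $[0, 1]$ is needed.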
The use of the spherical coordinates as in (2.26) guarantees that the value of the estimated transmission policy is always in [0, 1] for all (estimated) parameters $\theta$.

First, define a stationary point $\theta^*$, which is also a local maximizer of the system throughput, by

$$\theta^* \in \left\{ \theta : \nabla_\theta T_K(\theta) = 0, \ \nabla_\theta^2 T_K(\theta) < 0 \right\}. \qquad (2.30)$$

The gradient of (2.27) with respect to $\theta_i$ is given by

$$\nabla_{\theta_i} T_K(\theta) = \sum_{k=1}^{K} \binom{K}{k} \Big[ (k - K)(1 - \bar{p}_\theta)^{K-k-1} \left( F(\alpha_{i+1}) - F(\alpha_i) \right) \sin(2\theta_i) \, \mathbb{E}_{F^{(k)}}\!\left[ X(\theta, \gamma^{(k)}) \right] + (1 - \bar{p}_\theta)^{K-k} \, \nabla_{\theta_i} \mathbb{E}_{F^{(k)}}\!\left[ X(\theta, \gamma^{(k)}) \right] \Big], \qquad (2.31)$$

where $X(\theta, \gamma^{(k)}) \triangleq u_\theta(\gamma_1) \cdots u_\theta(\gamma_k) \, \Psi^{(k)}(\gamma^{(k)})$.

Unbiased gradient estimates of every factor in (2.31), including $F(\alpha_i)$ for all $i$, can be easily obtained without knowing the channel state probability distribution function $F(\cdot)$. Hence, it is straightforward to derive a blind stochastic approximation algorithm for estimating the optimal transmission policy. However, when the channel probability distribution function is known, there are several methods for reducing the variance of the gradient estimates and greatly improving the convergence rate of the algorithm [94]. Below we assume that $F(\cdot)$ is known and propose an off-line optimal transmission policy learning algorithm that uses the Rao-Blackwellization (smoothing by conditioning) technique for gradient estimation.

Stochastic Approximation Algorithm with Likelihood Ratio Gradient Estimation

First, we need to generate unbiased estimates of the gradient $\nabla_\theta T_K(\theta)$ of the system throughput (2.27). Generating $\nabla_{\theta_i} T_K(\theta)$ requires unbiased estimates of $\mathbb{E}_{F^{(k)}}[X(\theta, \gamma^{(k)})]$ and of its gradient with respect to $\theta$. Since the channel distribution function is known, we can reduce the variance of the estimates by the Rao-Blackwellization (smoothing by conditioning) technique. We estimate $\mathbb{E}_{F^{(k)}}[X_k]$ and $\nabla_{\theta_i} \mathbb{E}_{F^{(k)}}[X_k]$ conditioned on the event that all $k$ users transmit, which we denote as event $Z_k$.
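The a posteriori channel state distribution of a user given that it transmits is computed in [92], [10]:

$$G_\theta(\gamma) = \frac{\int_0^{\gamma} u_\theta(s) \, dF(s)}{\int_0^{\gamma_{\max}} u_\theta(s) \, dF(s)}. \qquad (2.32)$$

The corresponding density function is $g_\theta(\gamma) = u_\theta(\gamma) f(\gamma) / \bar{p}_\theta$, where $\bar{p}_\theta = \int_0^{\gamma_{\max}} u_\theta(\gamma) \, dF(\gamma)$ is the average transmit probability corresponding to $\theta$. The expectation $\mathbb{E}[X(\theta, \gamma^{(k)})]$ can then be computed straightforwardly by conditioning on $Z_k$:

$$\mathbb{E}_{F^{(k)}}\!\left[ X(\theta, \gamma^{(k)}) \right] = P(Z_k) \, \mathbb{E}\!\left[ \Psi^{(k)}(\gamma^{(k)}) \mid Z_k \right] = \bar{p}_\theta^{\,k} \, \mathbb{E}_{G_\theta^{(k)}}\!\left[ \Psi^{(k)}(\gamma^{(k)}) \right]. \qquad (2.33)$$

It should be noted that after conditioning, the (a posteriori) channel state distribution $G_\theta(\gamma)$ is parameterized by $\theta$, i.e. $\mathbb{E}_{G_\theta^{(k)}}[\Psi^{(k)}(\gamma^{(k)})]$ is of the parameterized integrators form [94]. There are three methods to simulate the gradient of an expectation of the parameterized integrators form: the score function (likelihood ratio) method, the weak derivatives method and the process derivatives method [94]. In Algorithm 2.1, the score function method is selected due to its relative simplicity. Furthermore, as the estimates of $\Psi^{(k)}(\gamma^{(k)})$ are i.i.d. over time, the three methods are expected to have similar performance in terms of the variance of the gradient estimates. Due to (2.33) and by the score function method in [94], the gradient is

$$\nabla_{\theta_i} \mathbb{E}_{F^{(k)}}\!\left[ X_k(\gamma^{(k)}) \right] = k \, \bar{p}_\theta^{\,k-1} \sin(2\theta_i) \left( F(\alpha_{i+1}) - F(\alpha_i) \right) \mathbb{E}_{G_\theta^{(k)}}\!\left[ \Psi^{(k)}(\gamma^{(k)}) \right] + \bar{p}_\theta^{\,k} \, \nabla_{\theta_i} \mathbb{E}_{G_\theta^{(k)}}\!\left[ \Psi^{(k)}(\gamma^{(k)}) \right],$$

where

$$\nabla_{\theta_i} \mathbb{E}_{G_\theta^{(k)}}\!\left[ \Psi^{(k)}(\gamma^{(k)}) \right] = \mathbb{E}_{G_\theta^{(k)}}\!\left[ \Psi^{(k)}(\gamma^{(k)}) \, \frac{\nabla_{\theta_i} \left( g_\theta(\gamma_1) \cdots g_\theta(\gamma_k) \right)}{g_\theta(\gamma_1) \cdots g_\theta(\gamma_k)} \right] = \mathbb{E}_{G_\theta^{(k)}}\!\left[ \Psi^{(k)}(\gamma^{(k)}) \left( \frac{\nabla_{\theta_i} g_\theta(\gamma_1)}{g_\theta(\gamma_1)} + \cdots + \frac{\nabla_{\theta_i} g_\theta(\gamma_k)}{g_\theta(\gamma_k)} \right) \right]. \qquad (2.34)$$

Simple mathematical derivations give

$$\frac{\nabla_{\theta_i} g_\theta(\gamma_l)}{g_\theta(\gamma_l)} = \frac{\sin(2\theta_i) \, I(\alpha_i \leq \gamma_l < \alpha_{i+1})}{\sum_{j=1}^{M} \sin^2(\theta_j) \, I(\alpha_j \leq \gamma_l < \alpha_{j+1})} - \frac{\sin(2\theta_i) \left[ F(\alpha_{i+1}) - F(\alpha_i) \right]}{\sum_{j=1}^{M} \sin^2(\theta_j) \left[ F(\alpha_{j+1}) - F(\alpha_j) \right]}. \qquad (2.35)$$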
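As a rough illustration of (2.32) and (2.35), the Python sketch below draws channel states from the a posteriori distribution $G_\theta$ and evaluates the score $\nabla_{\theta_i} g_\theta(\gamma) / g_\theta(\gamma)$ for the piecewise-constant policy (2.26). It assumes an exponential $F$ as in Section 2.7. The helper names (`survival`, `bin_masses`, `sample_posterior`, `score`) and the inverse-transform sampling scheme are illustrative choices, not the thesis's implementation.

```python
import math
import random


def survival(gamma, mean):
    """1 - F(gamma) for the exponential distribution assumed here."""
    return math.exp(-gamma / mean)


def bin_masses(edges, mean):
    """Prior probability F(edges[i+1]) - F(edges[i]) of each SNR range."""
    return [survival(lo, mean) - survival(hi, mean)
            for lo, hi in zip(edges[:-1], edges[1:])]


def sample_posterior(theta, edges, mean, rng):
    """Draw gamma ~ G_theta, whose density is u_theta(gamma) f(gamma) / p_bar.

    Step 1 picks a range with probability proportional to
    sin^2(theta_i) * mass_i; step 2 draws from F truncated to that range
    by inverting the exponential survival function.
    """
    masses = bin_masses(edges, mean)
    weights = [math.sin(t) ** 2 * m for t, m in zip(theta, masses)]
    i = rng.choices(range(len(theta)), weights=weights)[0]
    s_lo, s_hi = survival(edges[i], mean), survival(edges[i + 1], mean)
    s = s_hi + (1.0 - rng.random()) * (s_lo - s_hi)  # lies in (s_hi, s_lo]
    return -mean * math.log(s)


def score(theta, edges, mean, gamma, i):
    """d/d theta_i of log g_theta(gamma), following (2.35).

    Assumes sin(theta_i) != 0 whenever gamma lies in range i, which holds
    for any gamma actually drawn from G_theta.
    """
    masses = bin_masses(edges, mean)
    p_bar = sum(math.sin(t) ** 2 * m for t, m in zip(theta, masses))
    in_bin = edges[i] <= gamma < edges[i + 1]
    first = math.sin(2 * theta[i]) / math.sin(theta[i]) ** 2 if in_bin else 0.0
    return first - math.sin(2 * theta[i]) * masses[i] / p_bar


rng = random.Random(0)               # fixed seed, for reproducibility only
edges = [0.0, 2.0, 5.0, math.inf]    # M = 3 SNR ranges (illustrative)
theta = [0.3, 0.7, 1.2]
gamma = sample_posterior(theta, edges, 5.0, rng)
print(gamma, [score(theta, edges, 5.0, gamma, i) for i in range(3)])
```

A useful sanity check on such a sketch is that the score has zero mean under $G_\theta$, which follows from differentiating $\int g_\theta(\gamma) \, d\gamma = 1$ with respect to $\theta_i$.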
Plugging (2.33), (2.34) and (2.35) into the gradient (2.31) of (2.27) gives the following score function estimate for $\nabla_{\theta_i} T_K(\theta)$:

$$\hat{\nabla}_{\theta_i} T_K(\theta) = \sum_{k=1}^{K} \binom{K}{k} \Big[ (k - K \bar{p}_\theta)(1 - \bar{p}_\theta)^{K-k-1} \, \bar{p}_\theta^{\,k-1} \left( F(\alpha_{i+1}) - F(\alpha_i) \right) \sin(2\theta_i) \, \Psi^{(k)}(\gamma^{(k)}) + (1 - \bar{p}_\theta)^{K-k} \, \bar{p}_\theta^{\,k} \, \Psi^{(k)}(\gamma^{(k)}) \left( \frac{\nabla_{\theta_i} g_\theta(\gamma_1)}{g_\theta(\gamma_1)} + \cdots + \frac{\nabla_{\theta_i} g_\theta(\gamma_k)}{g_\theta(\gamma_k)} \right) \Big], \qquad (2.36)$$

where $\gamma_i \sim G_\theta(\cdot)$, which is given by (2.32), for all $i \in \{1, 2, \ldots, K\}$.

Algorithm 2.1.
Initialize $l = 0$; $\theta^{(0)} = (\theta_1, \ldots, \theta_M) \in \mathbb{R}^M$.
For time index $l = 1, 2, \ldots$ repeat
• Generate $\gamma_i \sim G_{\theta^{(l)}}(\cdot)$ for $i = 1, \ldots, K$. Obtain $\hat{\nabla}_\theta T_K(\theta^{(l)})$ using (2.36).
• Update $\theta^{(l+1)} = \theta^{(l)} + \epsilon_l \, \hat{\nabla}_\theta T_K(\theta^{(l)})$. (2.37)
• Conditions on the decreasing step size $\epsilon_l$: $\sum_{l=1}^{\infty} \epsilon_l = \infty$; $\sum_{l=1}^{\infty} \epsilon_l^2 < \infty$. (2.38)

Theorem 2.5. The estimates $\theta^{(l)}$ generated by Algorithm 2.1 converge w.p.1 to a local maximizer $\theta^*$, defined in (2.30), of the system stable throughput (2.27).

Proof. For a fixed initial $\theta$, since the sequence $\{\hat{\nabla}_\theta T_K^{(l)}(\theta) : l = 1, 2, \ldots\}$ is i.i.d., equation (2.37) is an instance of the well-known Robbins-Monro algorithm. In [97], the convergence proof for gradient-based algorithms for much more general, correlated signals (e.g. Markovian signals) is given. The convergence conditions are (2.38) and uniform integrability of $\hat{\nabla}_\theta T_K^{(l)}(\theta)$. A sufficient condition for uniform integrability is that the channel distribution has finite variance. □

2.7 Examples

2.7.1 An Analytical Example

In this section we give a simple analytical example to provide more insight into the structural results in Section 2.4. In particular, we consider a system of two users and compute the optimal transmission policy explicitly to illustrate the results proven in Section 2.4. We relax the assumption that the channel state space lies in $[0, \gamma_{\max}]$ for some finite $\gamma_{\max}$ and let the channel state space be $[0, \infty)$.
Consider a system of two symmetric users that communicate with a common base station using DS-CDMA with random signature sequences. As in previous sections, we assume the SINR threshold reception model (2.9). Further assume that the physical channel is Rayleigh fading, which leads to an exponential channel state (SNR) probability distribution function, i.e. $F(\gamma) = 1 - \exp(-\gamma / \bar{\gamma})$. The full-load system throughput (2.10) can be rewritten as

$$L(u_1(\cdot), u_2(\cdot)) = \int_0^\infty \!\! \int_0^\infty \Big[ u_1(\gamma_1) u_2(\gamma_2) \Psi^{(2)}(\gamma_1, \gamma_2) + u_1(\gamma_1) \left( 1 - u_2(\gamma_2) \right) \Psi^{(1)}(\gamma_1) + \left( 1 - u_1(\gamma_1) \right) u_2(\gamma_2) \Psi^{(1)}(\gamma_2) \Big] \, dF(\gamma_1) \, dF(\gamma_2), \qquad (2.39)$$

where

$$\Psi^{(2)}(\gamma_1, \gamma_2) = I\!\left( \frac{\gamma_1}{1 + \gamma_2 / N} > \beta \right) + I\!\left( \frac{\gamma_2}{1 + \gamma_1 / N} > \beta \right), \quad \Psi^{(1)}(\gamma_1) = I(\gamma_1 > \beta), \quad \Psi^{(1)}(\gamma_2) = I(\gamma_2 > \beta).$$

The optimization problem (2.12) becomes

$$(u_1^*(\cdot), u_2^*(\cdot)) = \arg\max_{u_1(\cdot), u_2(\cdot)} L(u_1(\cdot), u_2(\cdot)). \qquad (2.40)$$

We will now solve (2.40) completely to show that the results in Section 2.4 indeed hold.

Theorem 2.6. Consider the problem (2.40). Let $A = \frac{N\beta}{N - \beta}$. If

$$1 + e^{-A/\bar{\gamma}} - 2 e^{-\beta/\bar{\gamma}} > 0 \qquad (2.41)$$

then the optimal solution of (2.40) is $u_1^*(\cdot) = u_2^*(\cdot) = I(\cdot > \beta)$.

Proof. See Section 2.9. □

2.7.2 Numerical Examples

In this section the structural results and the performance of Algorithm 2.1 are illustrated via numerical examples. We consider a spatially homogeneous multiaccess system of $K$ users that communicate with a common base station using the DS-CDMA scheme with random signature sequences of spreading gain $N = 32$. Assume that SNR represents the channel state and that the base station has a matched filter receiver to decode signals from different users. Then the reception model is given by (2.8) and (2.9). That is, in a transmission time slot,
a packet transmitted from user $i$ is successfully received if and only if

$$\frac{\gamma_i}{1 + \frac{1}{N} \sum_{j \in A_k^K, \, j \neq i} \gamma_j} > \beta,$$

where $A_k^K$ is the set of all transmitting users, $N$ is the spreading gain and $\beta$ is the SINR threshold. Assume the underlying channel is Rayleigh fading; then the channel state (SNR) is exponentially distributed. We relax the assumption that $\gamma \in [0, \gamma_{\max}]$ for some finite $\gamma_{\max}$ and let the channel state probability distribution function, which is identical for all users, be $F(\gamma) = 1 - e^{-\gamma/\bar{\gamma}}$, where $\bar{\gamma}$ is the mean value of $\gamma$.

[Figure 2.2 plot: "Throughputs and success rates for different transmission schemes". Curves: optimal throughput with decentralized CSI; optimal throughput without CSI; throughput without transmission control; optimal success rate with decentralized CSI; optimal success rate without CSI; success rate without transmission control.]

Figure 2.2: The optimal distributed channel-aware MAC protocol offers a substantial gain in the system maximum stable throughput and the success rate, especially when the network size is large. Simulation parameters: spreading gain $N = 32$, SINR threshold $\beta = 4$ dB, mean channel state value $\bar{\gamma} = 10$.

Improvement in system throughput and success rate: Let $\bar{\gamma} = 10$ and quantize the channel state space into 25 ranges by $\alpha_i = i$ for $i = 0, 1, \ldots, 25$ and $\alpha_{26} = \infty$. Fig. 2.2 compares three transmission schemes in terms of the full-load system throughput, which is also the system maximum stable throughput as the network is symmetric, and the average success rate. The transmission schemes are (i) using the optimal channel-aware transmission policies designed by Algorithm 2.1; (ii) using the non-channel-aware optimal transmission policy in [25]; (iii) using the policy of always transmitting. The empirical system maximum stable throughput is computed from (2.10) and the average success rate is defined by $R = \frac{T}{K \bar{p}}$, where $\bar{p}$ is the average transmit probability given by (2.14), $K$ is the total number of users, and $T$ is the system maximum stable throughput. The size of the network is varied from 1 to 40 users. It is seen that the system throughput improves with the size of the network only when optimal channel-aware transmission policies are deployed. In addition, the success rate remains at 0.76 for optimal channel-aware transmission policies even when the system is overloaded, i.e. when the number of active users is greater than the spreading gain. That is, when CSI is exploited, channel utilization is significantly enhanced.

[Figure 2.3 plot: "Convergence of the estimates of the optimal channel-aware transmission policy". Horizontal axis: channel state (SNR). Curves: initial transmit probabilities, throughput = 1.56; after 50 iterations, throughput = 4.6; after 200 iterations, throughput = 4.61.]

Figure 2.3: Estimates of the optimal transmission policy by Algorithm 2.1 converge to a transmission policy that has the properties outlined in Fig. 2.1. The throughput obtained by the estimated transmission policy is close to optimal after 50 iterations. Simulation parameters: $K = 20$ users, spreading gain $N = 32$, SINR threshold $\beta = 4$ dB, mean channel state value $\bar{\gamma} = 5$.

Convergence of the estimates by Algorithm 2.1: We quantize the channel state space into 25 ranges as in the previous example but assume that $\bar{\gamma} = 5$ instead of $\bar{\gamma} = 10$. Algorithm 2.1 is run to estimate the optimal transmission policy for the case when there are 20 active users in the network. Fig. 2.3 illustrates that the estimates $\theta^{(l)}$ generated by Algorithm 2.1 appear to converge to a transmission policy $u^*(\cdot)$ that has the properties outlined in Fig. 2.1, i.e. non-randomized as proved in Theorem 2.2, piecewise constant as
proved in Theorem 2.3, and $u^*(\gamma) = 0$ for all $\gamma$ greater than some constant $C$. It can be seen from Fig. 2.3 that the convergence of Algorithm 2.1 is not hampered by local optima and that the generated estimates $\theta^{(l)}$ reach a good neighbourhood of the optimum after about 50 iterations. In fact, the system throughput for the transmission policy obtained after 50 iterations of Algorithm 2.1 is very close to the optimal throughput, which is approximately 4.61 for a network of 20 users with $\bar{\gamma} = 5$.

2.8 Conclusions

This chapter considers optimizing a distributed MAC protocol in which each user deploys a channel-aware transmission policy to access the common channel. It is assumed that the network has the multipacket reception capability and that the outcome of each transmission time slot depends probabilistically on the channel states of all transmitting users. Structural results are derived for both spatially heterogeneous and spatially homogeneous networks. Specifically, the optimal transmission policy is non-randomized and piecewise constant for every user. In the second part of the chapter, an off-line stochastic approximation algorithm is proposed to numerically solve the MAC optimization problem for the symmetric network model. The algorithm uses the spherical coordinates parameterization method proposed in [95, 96] to confine the transmission probabilities to [0, 1]. The optimal transmission policy is then estimated by a gradient-based algorithm, where the gradient is estimated using the score function method in [94].

The optimization methodology in this chapter is a centralized design approach. In the next chapter, the structure of optimal channel-aware transmission policies is analysed using a non-cooperative game theoretic approach. The game theoretic MAC protocol can be adaptively estimated and implemented in a completely distributed fashion. Game theoretic MAC is sub-optimal in the system throughput sense.
However, it is shown via simulations that the decrease in the system throughput is minimal in many cases.

2.9 Analytical Proofs

This section contains proofs of the structural results presented in the chapter.

2.9.1 Proof of Theorem 2.1

The proof of this theorem closely follows the proof of the sufficiency part of Theorem 1 in [10]. For completeness and clarity we reproduce the proof with the necessary generalizations. Readers are referred to [10] for the original proof of the equivalence between the system maximum stable throughput and the full-load system throughput for homogeneous networks with the multipacket reception capability.

Let $B^{(t)} = (B_1^{(t)}, \ldots, B_K^{(t)})$, where $B_i^{(t)}$ is the length of the buffer of user $i$ at time $t$. The time evolution of $B_i^{(t)}$ is given by

$$B_i^{(t+1)} = \left( B_i^{(t)} - Y_i^{(t)} \right)^+ + X_i^{(t)}, \qquad (2.42)$$

where $Y_i^{(t)}$ is 1 if user $i$ transmits a packet successfully in time slot $t$ and 0 otherwise, and $X_i^{(t)}$ is the number of packets that arrive during time slot $t$. $B^{(t)}$ is therefore a $K$-dimensional Markov chain. Assume that the arrival process and the reception model are such that the Markov chain is aperiodic and irreducible. This assumption is satisfied for most nontrivial arrival processes and reception models [10]. With this assumption, the stability of the system in the sense of (2.3) is equivalent to the existence of a stationary distribution for the Markov chain $B^{(t)}$.

The theorem follows from two lemmas. First, Lemma 2.1 states that the full-load version $O_i^{(t)}$ of the Markov chain $B_i^{(t)}$ stochastically dominates $B_i^{(t)}$.

Lemma 2.1. (Also see Lemma 2 in [10].) Consider a system of $K$ spatially heterogeneous networked users. For each user $i = 1, 2, \ldots, K$ define a one-dimensional Markov chain $O_i^{(t)}$ that is the full-load version of $B_i^{(t)}$, i.e.

$$P\!\left( O_i^{(t+1)} = k \mid O_i^{(t)} = s \right) = P\!\left( B_i^{(t+1)} = k \mid B_i^{(t)} = s, \ B_j^{(t)} > 0, \ j = 1, \ldots, K, \ j \neq i \right).$$
Optimal Distributed Medium Access Control The Markov Chain 0( t) stochastically dominates Bi t) , i.e. for all t, s, k, x EN + ip(O(t+1) x 0?) =^(0)^k) —^ <p(o (t+i) > xio (t) p( e+i) > xiB(t)^s, L;(0)^< p(o(t+1)^xio(t) ^ s 1, 0 i(°) = k) (2.43) ^ k),^ ^O(°) (2.44) where k^(k J., . . • , kid • Proof (2.43) is implied by the fact that the probability that the buffer goes beyond level x at time t + 1 is greater if the buffer store more packets at time t [10]. We now show (2.44). The time evolution O t) can be written as ( o(t+i)^(0?) z(t) )+ x(t) , where 4 t) is 1 if user i transmits a packet successfully in time slot t and 0 otherwise, 4 t) is the number of packets that arrive during time slot t. Therefore, in order to show (2.44) it suffices to show that the probability of success is lower in the full-load system, i.e. P(Yi = 11B i(t) = s, BP) = k) > P(Zi = 110?) = s, O, °) = k).^(2.45) It should be noted that the success probability of a user decreases when there is one more user that has a packet to transmit (due to (2.7)). Hence, (2.45) clearly holds. ^^ Second, we restate Lemma 3 in [10], which is an application of Pakes' Lemma [98]. This Lemma is used to obtain a sufficient condition for the stability of 0( t) , which implies the stability of e) . Lemma 2.2. For the Markov chain 0? ) define the drift Ji(j) E[0? +1) — 0 (t) 10 (t) = jj. Suppose that the drift Ji(j) < oo for all j and that for some scalar 6 > 0 and integer Tj > 0 we have Ji(j) < —8 for all j Then the Markov chain 0? ) has a stationary distribution. For the full-load system the drift Ji (j) of the Markov chain corresponding to 50 the buffer Chapter 2. Optimal Distributed Medium Access Control of user i is independent of j and is given by K E ^ Ji(i) tU^ i frYmax k=1 ArDi forymax • u/I bi) dF ril(,),( 11) —. 
ui(-yi))E[Vijk users tx, 1A4c] • • 1E4^ivA AK dFK ('YK)• ^ (2.46) From the above equation it can be seen that a sufficient condition for the stability of 0( t) is < EE 7max af 7m x H uibi) H (1 - tiibi))E[79 k users tx, l r] i A k=1 ArDi °^1E4^TOAkK dFl ('Yl) ... dFK (-yK ), (2.47) which is also a sufficient condition for the stability of Bi t) . Therefore a sufficient condition for the stability of the system is K = i=1 ,ui< -ymax K Th r rrax f0 . .^ E^ui(-yi) fi (1-nibi»W(k)(1AK) k=0 AK JEAr^j OA r dF1 (71) • • • (2.48) d Fic (71() • Therefore, the system maximum stable throughput defined as the supremum of all accumulative input rates for which the system is stable is lower-bounded by (2.10). 2.9.2 Proof of Theorem 2.2 Let (u1(.),^, u*K (.)) be a solution to (2.12). Suppose that (74(.),/6(.),^u*K (.)) does not satisfy Theorem 1. Without loss of generality, suppose u*K (.) is randomized in the sense of Def. 2.1, i.e. u*K (7) {0,1} for some non-zero measure set of values of • y. - We will now show that the solution to the optimization problem max UK()^ 00 , imax) [0•11TIC (UT (•) U2(.), • • • 7141C-1(.), UK (•)) ^ (2.49) - is non-randomized in the sense of Def. 2.2. Indeed, the system throughput when user K 51 Chapter 2. Optimal Distributed Medium Access Control never transmits is given by L7max TK —1 = 1/. K-1 E E fi fi (1-u;(-y3))w(k)(1A L, -1) u'iTyi) k=0 4,c iEAfec- 1^jcZAIkc—i dl i(71) • ..dFK-1(7K-1)•^(2.50) l When user K uses the transmission policy uK(.) the new system throughput is given by TK(UTO)U 4 (.), • • • u *K —1(.),UK(•)) = f o -Ymax UK (y)6(7) + (1 — uK (7))TK- dFK (7) (2.51) where 6(7) = E F(71,-,7K -1 E H^H (1_,Li(7i ))(4,(k+1)(1A K_1, 7 ))]. , k=0^jEAlk(-1 (2.52) It is clear that uK (7) { 1 if 8(7) > TK-1 0 otherwise (2.53) is a Lebesgue measurable transmission policy that maximizes (2.51). Hence, if we assign uK(•) = uK(•) given by (2.53), the system throughput is still maximized. 
This step can be repeated until $(u_1^*(\cdot), u_2^*(\cdot), \ldots, u_K^*(\cdot))$ contains only non-randomized policies, which proves Theorem 2.2. □

Remark: If $\delta(\cdot)$ is non-zero almost everywhere, we can claim the stronger result that the solution(s) of (2.12) must consist of pure policies only. Intuitively, $\delta(\cdot)$ being non-zero almost everywhere means that the system throughput when user $i$ always transmits is almost surely different from that when user $i$ never transmits. Verifying this condition is hard. Nevertheless, we conjecture that this condition holds for the SINR reception model for CDMA systems. This condition does not hold for the collision model or the MPR reception model without CSI.

2.9.3 Proof of Theorem 2.3

We follow the same approach as in the proof of Theorem 2.2. Let $(u_1^*(\cdot), \ldots, u_K^*(\cdot))$ be a solution to (2.12). We will show that there exists a non-randomized piecewise continuous solution to (2.49). In particular, we will show that the solution (2.53) is a piecewise continuous, non-randomized transmission policy. This can be done by showing that $\delta(\gamma)$ given by (2.52) is piecewise continuous. Given the channel states of all users, the probability that only the users indexed by $A_k^{K-1}$ transmit is given by

$$P\!\left( A_k^{K-1} \mid \gamma_1, \ldots, \gamma_{K-1} \right) = \prod_{i \in A_k^{K-1}} u_i(\gamma_i) \prod_{j \notin A_k^{K-1}} \left( 1 - u_j(\gamma_j) \right).$$

Hence $\delta(\cdot)$ can be rewritten as

$$\delta(\gamma) = \mathbb{E}_{P(\gamma_1, \ldots, \gamma_{K-1}, A_k^{K-1})}\!\left[ \Psi^{(k+1)}(\gamma_{A_k^{K-1}}, \gamma) \right] - T_{K-1}, \qquad (2.54)$$

where $P(\gamma_1, \ldots, \gamma_{K-1}, A_k^{K-1})$ is the joint distribution of $\gamma_1, \ldots, \gamma_{K-1}, A_k^{K-1}$. As expectation is a smoothing operator, $\delta(\gamma)$ is piecewise continuous if (2.17) holds. □

2.9.4 Proof of Theorem 2.4

Let $(u_1^*(\cdot), u_2^*(\cdot), \ldots, u_K^*(\cdot))$ be a solution to (2.12). Without loss of generality we will show that (2.20) holds for $u_K^*(\cdot)$. For $(u_1^*(\cdot), u_2^*(\cdot), \ldots, u_K^*(\cdot))$ to be an optimum, it must be the case that $u_K^*(\gamma) = \bar{u}_K(\gamma)$ for all $\gamma$ such that $\delta(\gamma) \neq 0$, where $\bar{u}_K(\gamma)$ is given by (2.53) and $\delta(\gamma)$ is given by (2.52). The proof will be complete if we can show that $\delta(\gamma) < 0$ for all $\gamma > C$.
Due to the theorem's assumption, $T_K(u_1^*(\cdot), \ldots, u_K^*(\cdot)) > 2 + \epsilon_0$. Furthermore, when user $K$ never transmits, the system throughput is $T_{K-1}$ given by (2.50). From (2.50) it is easy to see that $T_{K-1} \geq T_K - 1 > 1 + \epsilon_0$. For the SINR threshold reception model (2.9) we have

$$\Psi^{(k+1)}(\gamma_{A_k^{K-1}}, \gamma) = I\!\left( \frac{\gamma}{1 + \frac{1}{N} \sum_{j \in A_k^{K-1}} \gamma_j} > \beta \right) + \sum_{i \in A_k^{K-1}} I\!\left( \frac{\gamma_i}{1 + \frac{\gamma}{N} + \frac{1}{N} \sum_{j \in A_k^{K-1}, \, j \neq i} \gamma_j} > \beta \right) \leq 1 + \sum_{i \in A_k^{K-1}} I\!\left( \gamma < \frac{N \gamma_i}{\beta} \right) \leq 1 + \sum_{i=1}^{K-1} I\!\left( \gamma < \frac{N \gamma_i}{\beta} \right).$$

Hence the following inequalities and equality hold:

$$\delta(\gamma) \leq 1 + \sum_{i=1}^{K-1} \mathbb{E}\!\left[ I\!\left( \gamma < \frac{N \gamma_i}{\beta} \right) \right] - T_{K-1} = 1 + \sum_{i=1}^{K-1} P\!\left( \gamma_i > \frac{\beta \gamma}{N} \right) - T_{K-1} \leq 1 + \sum_{i=1}^{K-1} \frac{N \bar{\gamma}_i}{\beta \gamma} - T_{K-1} \ \text{(by Markov's inequality)} \leq 1 + \frac{N \bar{\gamma}}{\beta \gamma} - T_{K-1} < 1 + \frac{N \bar{\gamma}}{\beta C} - T_{K-1} = 1 + \epsilon_0 - T_{K-1} < 0. \qquad (2.55)$$

Hence $\delta(\gamma) < 0$ for all $\gamma > C$. □

2.9.5 Proof of Corollary 2.1

Under the assumption that all users have the same channel distribution, (2.10) becomes (2.21) and the optimization problem (2.12) becomes (2.22). Let $(u_1^*(\cdot), \ldots, u_K^*(\cdot))$ be the unique solution of (2.22). Then, by Theorem 2.2, $u_i^*(\cdot)$ is a pure policy for all $i = 1, 2, \ldots, K$. Since (2.21) is symmetric with respect to the transmission policies $u_1(\cdot), \ldots, u_K(\cdot)$, any permutation of $(u_1^*(\cdot), \ldots, u_K^*(\cdot))$ is also a global optimum of (2.21). It follows from the uniqueness of the global optimum that $u_i^*(\cdot) = u_j^*(\cdot) = u^*(\cdot)$ for all $i, j \in \{1, \ldots, K\}$. Then $u^*(\cdot)$, which is non-randomized in the sense of Def. 2.2, is the unique optimum of (2.13). □

2.9.6 Proof of Corollary 2.2

Similar to the proof of Corollary 2.1, under the assumption that all users have the same channel distribution, the objective function (2.10) becomes (2.21), which is symmetric with respect to the transmission policies of all users. This symmetry implies that if $(u_1^*(\cdot), \ldots, u_K^*(\cdot))$ is a global optimum of (2.21) then any permutation of $(u_1^*(\cdot), \ldots, u_K^*(\cdot))$ is also a global optimum of (2.21). It then follows that $u_i^*(\cdot) = u^*(\cdot)$ for all $i \in \{1, \ldots, K\}$, and $u^*(\cdot)$ is piecewise continuous due to Theorem 2.3 and $u^*(\cdot)$
is the unique optimum of (2.13). 54 ^ ^ Chapter 2. Optimal Distributed Medium Access Control 2.9.7 Proof of Theorem 2.6 Our approach to prove the theorem is as follows: first, we aim to solve for uT(.) without any assumption or constraint on u2(.). Then by symmetry, we claim the same result for 14(.). 1. yl > A ^ > /3. (2.39) can be rewritten as f Yl) - 2) '1 ) + 1(72 > 13 + 13:Ar A 00 fo °0 (71) 71 2('Y2)^> + 0 1 + nib]) + u2 (72)g-y2 > 13)dF('-x2)dF(71). We need to select u1(.) to maximize f (71) (f 2 u2 (72) ( 1 (72 < N 2-1- — N) +1(1'2 > + 3 21 ) — 2) dF(72) +1) dF(71) For 71 > Awe have N if - - N > 0+0*. Hence I(-y2 < Naii N)+I(72 > - N aici) 1 and ^f u2(-y2) (1(72 <^- N) +1(72 > 13 +,3a) _ 2) dF(72) + 1 > 0.^(2.56) ^> A Therefore ut(71) = 1 for all 71 > A. Since (2.56) does not require any assumption on u2(.), by symmetry we can claim that u2(72) = 1 for all 72 > A. 2. /3 <^< A = Airv_130 For 71 in this region we have 0 < Naff 1V 2,3 - N) + 1(72 > + /FP fi, °3 - N < + 0 21\71, < A and therefore 1(72 < 2 > -1 for all 72 > A. It then follows that u2(72) (1(72 < N - TTI ) - 2) N) + I( -y2 >+ OI A -2^u2(-y2)dF(-y2) - ^)3 - f ^ A dF (72 ) + 1 U2 (72)dF(72) + 1 A^cx) > -21 dF(72) - f dF(-y2) + 1 A = 1 e -A h - 2e -13 h It follows that a sufficient condition for the strict optimality of ui (71) = 1 in this 55 Chapter 2. Optimal Distributed Medium Access Control region is (2.41). We can repeat the arguments to show that if ( 2.41) holds u'(.) = 1 for all • E [13,A]. Lastly, it is clear from (2.39) that tl(-yi) = 0 and u2(72) = 0 for all -yi,y2 E [0 03]. ^ 56 Chapter 3 Game Theoretic Approach for Medium Access Control 3.1 Introduction The previous chapter has deployed a centralized design approach to optimize the distributed channel-aware MAC protocol in the system throughput sense. 
This chapter uses a noncooperative game theoretic approach to analyse the distributed channel-aware MAC protocol. In particular, it is assumed here that each user accesses the common channel using its own transmission policy to maximize its individual expected reward (utility), instead of pursuing a common objective as in the previous chapter. It will be shown that even though users do not cooperate, multiuser diversity is still enhanced, resulting in a system throughput that increases with the number of active users in the network.

Game theoretic solutions are autonomous, robust and very scalable. Therefore, game theory has been widely used for distributed resource allocation in wireless networks. For example, [46-48] adopted a game theoretic approach for random access in a collision channel model, and [45, 48, 49] used a non-cooperative game theoretic formulation for power control. In general, the drawbacks of game theoretic approaches are the sub-optimality of the solutions and the fact that it is rarely straightforward to compute an equilibrium of a multiuser game, either analytically or numerically.

In this chapter, we use a non-cooperative game theoretic approach to exploit CSI for random access in multipacket reception networks. The above mentioned drawbacks are partly overcome in this chapter, mostly due to a threshold structural result that will be discussed below.

Figure 3.1: An example of a channel-aware transmission policy (horizontal axis: channel state).

Figure 3.2: Threshold structure of the optimal transmission policy proved in this chapter: $u^*(\gamma) = I(\gamma > \theta^*)$ (horizontal axis: channel state, from 0 to $\gamma_{\max}$). This optimal threshold $\theta^*$ can be estimated using an adaptive gradient ascent stochastic approximation algorithm. At a Nash equilibrium, every user adopts a policy of this threshold structure.

Similarly to the previous chapter, the multiaccess system model of this chapter is wireless
networks with the G-MPR model proposed in [10]. Every user has access to its CSI at every transmission time slot and transmits with a certain probability according to its channel-aware transmission policy. An example of a channel-aware transmission policy is given in Fig. 3.1. Every user is selfish and rational and selects a channel-aware transmission policy to maximize its expected reward, without any concern for other users. As there are infinitely many channel-aware transmission policies, the transmission game has an infinite strategy space. Therefore, the existence of a Nash equilibrium for the transmission game is not at all straightforward.

In this chapter, we prove that the best response transmission policy of each user is always threshold in the channel state, and furthermore, there exists a Nash equilibrium profile at which every user adopts a threshold transmission policy. In other words, in order to optimize its utility (and at a Nash equilibrium), a user will transmit with certainty if and only if its channel state is better than a threshold (see Fig. 3.2). These results are intriguing because they imply that in order to find a Nash equilibrium, it suffices to restrict attention to the class of threshold policies. Furthermore, in order to optimize its transmission policy, a user only has to optimize a channel state threshold, i.e. a single scalar, instead of a function. In this chapter, the optimal channel state threshold is estimated by a stochastic approximation algorithm that can track slowly time-varying systems. In other words, due to the threshold structural results, each user can adaptively learn its part of the equilibrium, and the game theoretic MAC solutions can be estimated online, in an adaptive and completely distributed fashion.

It has been mentioned above that a drawback of the game theoretic MAC protocol is the suboptimality in the system performance sense.
However, it is shown via simulations that the system throughput achieved at the Nash equilibrium is very close to that of the centralized design optimal MAC. Furthermore, the game theoretic formulation has three parameters that can be designed to improve the system performance: the transmission cost, the waiting cost and the successful transmission reward. These parameters offer a means to take into account factors such as battery constraints and delay requirements. The waiting cost, for example, can be adjusted to enhance long-term fairness in the network and can be increased to encourage transmission and reduce delay. In addition, the game theoretic MAC solutions can be estimated online, in a distributed fashion, where every user adaptively learns its optimal threshold. That is, game theoretic MAC solutions are much easier to compute. Furthermore, the proposed stochastic approximation algorithm has tracking capability and requires virtually no information other than individual instantaneous channel states, costs and rewards. However, it should be noted that, while the algorithm for estimating the best response policy for each user is provably convergent, the emergence of a Nash equilibrium while users learn their optimal policies is not analytically proved (except for the two-user case). Nevertheless, in simulations the system always converges to a Nash equilibrium while each user adapts its optimal policy in slowly time-varying systems.

Furthermore, in Section 3.7 of this chapter, we also discuss the generalization of the game theoretic MAC results to the case of wireless networks with packet/user priorities. Packet priority is a means to prioritize specific application functions and is important for networks with multiple types of traffic or classes of users, e.g., wireless local area networks (WLANs) or cellular networks.
Packet priorities can even be a means of coordination or collaboration which does not require explicit communication between users.

3.1.1 Main Results
The random access channel-aware MAC problem is formulated as a non-cooperative transmission game, where each user aims to find a transmission policy to maximize its individual expected reward. The results presented in this chapter include the following.
• If the probability of correct reception of a packet transmitted by a user (player) is a non-decreasing function of its CSI, there exists a threshold transmission policy that maximizes the expected reward of the user (see Figs. 3.1, 3.2).
• There exists a Nash equilibrium at which every user deploys a threshold transmission policy. Therefore, in order to find the optimal policy for a user, or an equilibrium profile of the system, it suffices to consider only threshold policies.
• For the symmetric network model, where users have identical channel state probability distribution functions and identical transmission and waiting costs, there exists a symmetric Nash equilibrium profile. That is, at the Nash equilibrium, all users deploy the same threshold transmission policy. In addition, the common threshold is increasing in the number of active users.
• Due to the threshold structural result, in order to determine the optimal (best response) transmission policy for a user, it suffices to estimate only the optimal threshold, which is a single scalar. We propose a gradient-based stochastic approximation algorithm with a constant step size for adaptive estimation of the best response transmission policy for any single user, without knowing the policies or the channel probability distribution functions of other users.
• The deterministic best response dynamics algorithm and the emergence of a Nash equilibrium are studied.
Convergence of the best response dynamics algorithm is proved for the two-user transmission game. For the general case of a multiuser transmission game, the best response functions are explicitly characterized for Rayleigh fading channels. Local asymptotic stability of a Nash equilibrium can then be deterministically (but numerically) verified.
• A stochastic best response dynamics algorithm is derived by combining the deterministic best response dynamics algorithm with the adaptive stochastic approximation algorithm. Specifically, in the stochastic best response dynamics algorithm, at each iteration a user updates its best response transmission policy using the proposed adaptive stochastic approximation algorithm. Then the system behavior is asymptotically identical to that of the deterministic best response dynamics algorithm. Simulations show that even without this time coordination, the system still converges to a Nash equilibrium. Furthermore, the stochastic best response dynamics algorithm can track variations in the system, such as fluctuations in the number of active sensors, or any other statistical properties. However, it should be noted that the convergence to a Nash equilibrium is not analytically proved, except for the two-user case.

Lastly, in Section 3.7, we present the generalization of all the above structural results and algorithms to incorporate packet priorities into the wireless network model. In the numerical studies section, characteristics of the Nash equilibrium and performance of the proposed algorithms are studied, and the system performance at equilibrium is compared to the optimal distributed MAC protocol derived in the previous chapter.

3.1.2 Related Work
The framework of this chapter is very similar to that of Chapter 2.
In particular, here we also consider exploiting CSI for random access in wireless networks with the generalized multipacket reception (G-MPR) model proposed in [10]. It has been mentioned in the previous chapter that the first MPR model (without CSI) was proposed in [24] and generalized to the case of an asymmetric network model in [4]. Furthermore, [25, 89, 90] studied the performance of the ALOHA random access protocol for MPR wireless networks, where CSI is not available. In [10], the G-MPR model, which is a generalization of the MPR model in [24], was proposed. In the G-MPR model, the correct reception probabilities depend on the channel states of all transmitting users. The G-MPR model explicitly incorporates CSI into the packet reception process and is suitable for exploiting CSI for cross-layer MAC optimization.

Additionally, research work related to this chapter includes papers on the use of game theory for various wireless network optimization problems. The game theoretic approach has been used to analyse the ALOHA random access protocol in [39, 46-48]. The channel model assumed in [39, 46] was the collision channel model. In comparison, the channel model assumed in [48] was the MPR model proposed in [25], where the number of transmitting users determines the probabilistic outcome of each transmission time slot. In the ALOHA game in [48], every user selects its own transmission policy, mapping the number of users contending for the channel to transmit probabilities, to maximize its individual utility. The main assumption in [48] was global knowledge of the number of users contending for the channel at every time slot. The authors proved the existence of a symmetric Nash equilibrium for the collision channel model in [46], [39] and for the MPR reception model in [48]. CSI was not exploited in these three papers.
Besides medium access (or transmission) control, the game theoretic approach has also been used for power allocation in [45, 49, 53], flow and congestion control [51], load balancing [52], and pricing [50].

The main differences between the game theoretic transmission scheduling problem in this chapter and early work on game theoretic analysis of ALOHA networks [39, 46-48] are the exploitation of CSI via channel-aware transmission policies, the relaxation of the assumption that all nodes are symmetric, and the elimination of the assumption that the number of active users is known globally. In comparison, the most important difference between the work in this chapter and other work on exploiting CSI for optimal transmission control, such as in the previous chapter or [10, 17-19], is the non-cooperative game theoretic formulation, which leads to a highly scalable, adaptive transmission scheduling algorithm.

In this chapter the Nash equilibrium is learned via a stochastic best response dynamics algorithm, where every user updates its policy using a stochastic approximation algorithm. As mentioned previously, stochastic approximation is useful for problems of optimizing an objective function that can only be observed in noise, with respect to some adjustable parameters. Stochastic approximation is widely used in electrical engineering and wireless network optimization; for example, see [100-103].

The remainder of the chapter is organized as follows: Section 3.2 describes the wireless network model. The game theoretic MAC formulation is given in Section 3.3. Section 3.4 presents the theoretical results on the structure of the optimal transmission policies as well as the existence of a Nash equilibrium. Section 3.5 analyses the existence of a symmetric Nash equilibrium for the symmetric game. The stochastic approximation algorithm for estimating the optimal transmission policy and the Nash equilibrium is proposed in Section 3.6.
In Section 3.7, the structural results and numerical algorithms are generalized to incorporate packet priorities into the network model. Section 3.8 contains numerical examples. Lastly, the conclusions are given in Section 3.9, followed by Section 3.10, which contains most proofs of the analytical results in the chapter.

3.2 Distributed MAC for Wireless Networks
This section describes the wireless network model and the distributed MAC mechanism that will be analysed and optimized using a non-cooperative game theoretic approach in the later sections.

3.2.1 Random Access Wireless Network Model
The wireless network model considered in this chapter is almost identical to that in Chapter 2. In particular, we consider a wireless network of $K$ users, where $K$ is a finite positive integer. Index the users by $i = 1, 2, \ldots, K$. Assume the users communicate with a common base station via non-orthogonal communication channels, e.g., using CDMA with random signature sequences. Assume that time is divided into slots of equal size and transmissions are synchronized at the beginning of each time slot.

Let SNR represent the channel state of a user. It should be noted that besides SNR, any parameter that influences the reception of packets at the base station can be chosen as the channel state. For example, the position of a user with respect to the base station can represent its channel state. If SNR represents the channel state, a user can estimate its individual CSI by measuring the strength of a beacon signal broadcast from the base station to all users. Denote the channel state of user $i$ by $\gamma_i$ for all $i \in \{1, 2, \ldots, K\}$ and assume that $\gamma_i$ is a continuous random variable in $[0, \gamma_{\max}]$, where $\gamma_{\max} \in \mathbb{R}^+$ and $\gamma_{\max} < \infty$. The bound $\gamma_{\max}$ can be arbitrarily large so that $[0, \gamma_{\max}]$ includes the entire range of CSI that is of practical interest.
Denote the continuous (channel) probability distribution function of $\gamma_i$ by $F_i(\cdot)$. Furthermore, at the beginning of each transmission time slot, user $i$ is assumed to know its instantaneous CSI, $\gamma_i$, perfectly.

In the network, a user that does not have any packet to transmit is referred to as an inactive user. In contrast, a user with at least one packet to transmit is referred to as an active user. At each time slot, inactive users perform no action while active users must either transmit or wait, i.e. not transmit. The transmit probability of an active user is determined by its instantaneous CSI and its transmission policy.

Define a transmission policy to be a function mapping the channel states of a user to its transmission probabilities:

$$u_i(\cdot) : [0, \gamma_{\max}] \to [0, 1] \tag{3.1}$$

for $i \in \{1, \ldots, K\}$. A transmission policy is sometimes referred to as a transmit probability function. We now define pure, randomized and threshold transmission policies.

Definition 3.1. A transmission policy $u(\cdot) : [0, \gamma_{\max}] \to [0, 1]$ is pure if $u(\gamma) \in \{0, 1\}$ for all $\gamma \in [0, \gamma_{\max}]$, except possibly on a set of values of $\gamma$ of measure zero (with respect to the probability measure $F(\cdot)$ of the channel state $\gamma$).

Definition 3.2. A randomized transmission policy is a transmission policy that is not pure. Equivalently, a randomized policy is a transmit probability function $u(\cdot) : [0, \gamma_{\max}] \to [0, 1]$ such that $0 < u(\gamma) < 1$ on some set of values of $\gamma$ of non-zero measure (with respect to the probability measure $F(\cdot)$ of the channel state).

Definition 3.3. A threshold transmission policy is a transmission policy $u(\cdot) : [0, \gamma_{\max}] \to [0, 1]$ such that

$$u(\gamma) \overset{\text{a.e.}}{=} \begin{cases} 0 & \gamma < \theta \\ 1 & \text{otherwise} \end{cases} \tag{3.2}$$

for some $\theta$ with $0 \le \theta \le \gamma_{\max}$.

For a precise formulation of the game theoretic MAC optimization problem, we need to define the space of admissible transmission policies.
Consider the normed linear function space $L_\infty[0, \gamma_{\max}]$, which contains all Lebesgue measurable functions defined on $[0, \gamma_{\max}]$ that are bounded almost everywhere (a.e.). The norm of a function in $L_\infty[0, \gamma_{\max}]$ is its essential supremum [93]. Let $B_{L_\infty[0,\gamma_{\max}]}[0, 1]$ be the set of all functions in $L_\infty[0, \gamma_{\max}]$ that have norms in the range $[0, 1]$. Then $B_{L_\infty[0,\gamma_{\max}]}[0, 1]$ is the admissible transmission policy space.

In a transmission time slot, if user $i$ does not transmit, a waiting cost $w_i > 0$ is recorded; if it transmits, it has to pay a transmission cost $c_i > 0$. At the end of a transmission time slot, if a packet is received correctly the user receives a reward of 1 unit. Further conditions on the costs and the derivation of the instantaneous and expected rewards for each user are deferred to Section 3.3. The objective of each user in the network is to find a transmission policy mapping channel states to transmit probabilities to maximize its individual expected reward. This distributed MAC problem is formulated as a non-cooperative game in Section 3.3.

3.2.2 G-MPR Model
This chapter assumes the G-MPR model that is proposed in [10]. A short summary of the G-MPR model is given below for notational clarity and to make the chapter self-contained. Section 2.2.2 gives a more detailed description of the G-MPR model.

In the G-MPR model, the outcome of a transmission time slot, when $k$ users indexed by $A_k^K$ transmit, belongs to an event space where each elementary event is represented by a binary $k$-tuple $\theta_{A_k^K} = (\vartheta_i : i \in A_k^K)$, where $\vartheta_i \in \{0, 1\}$; $\vartheta_i = 1$ indicates that the packet sent by user $i$ is correctly received and $\vartheta_i = 0$ indicates otherwise.
The reception capability of the system is described by a set of $K$ functions, where the $k$-th function $\Phi(\gamma_{A_k^K}; \theta_{A_k^K})$ assigns a probability to the outcome $\theta_{A_k^K}$ when $k$ users indexed by $A_k^K$ with channel states $\gamma_{A_k^K}$ transmit:

$$\Phi(\gamma_{A_k^K}; \theta_{A_k^K}) = P\big(\theta_{A_k^K} \mid k \text{ users transmit}, \gamma_{A_k^K}\big). \tag{3.3}$$

Equation (3.3) means that the distribution of the possible outcomes $\{\theta_{A_k^K}\}$ is determined by the channel states of the transmitting users. It is assumed that the reception model (3.3) is symmetric (as in (2.5)). Furthermore, the probability that the transmission from user $i$ is successful depends on the channel states of all transmitting users as specified below:

$$\psi_i(\gamma_i, \gamma_{A_k^K - i}) = E\big[\vartheta_i \mid k \text{ users tx}, \gamma_i, \gamma_{A_k^K - i}\big], \tag{3.4}$$

where $A_k^K$, a set of $k$ of the $K$ users that contains user $i$, is the set of transmitting users. For a time slotted CDMA system with matched filter receivers and the SINR threshold reception model [91], (3.4) can be rewritten as:

$$\psi_i(\gamma_i, \gamma_{A_k^K - i}) = I\left( \frac{\gamma_i}{1 + \frac{1}{N} \sum_{j \in A_k^K - i} \gamma_j} > \beta \right). \tag{3.5}$$

Throughout it is assumed that $\psi_i(\gamma_i, \gamma_{A_k^K - i})$ defined by (3.4) is a Lebesgue measurable function of $\gamma_i$. Besides Lebesgue measurability, the assumptions listed below are also used. These assumptions are satisfied by most non-trivial systems, e.g., the SINR threshold reception model (3.5).

1. The success probability of user $i$, defined by (3.4), is non-decreasing in its channel state: for all $\tilde{\gamma}_i > \gamma_i$,

$$E\big[\vartheta_i \mid k \text{ users tx}, \tilde{\gamma}_i, \gamma_{A_k^K - i}\big] \ge E\big[\vartheta_i \mid k \text{ users tx}, \gamma_i, \gamma_{A_k^K - i}\big]. \tag{3.6}$$

Some of the analytical results in this chapter can be strengthened if the inequality in (3.6) is strict, i.e. for all $\tilde{\gamma}_i > \gamma_i$,

$$E\big[\vartheta_i \mid k \text{ users tx}, \tilde{\gamma}_i, \gamma_{A_k^K - i}\big] > E\big[\vartheta_i \mid k \text{ users tx}, \gamma_i, \gamma_{A_k^K - i}\big]. \tag{3.7}$$

Unless it is stated otherwise, it is assumed that (3.6) holds, but not (3.7).

2. The success probability of a user is lowered when one more user transmits. This is also a condition in [36]. For all $h > 0$,

$$E\big[\vartheta_i \mid k \text{ users tx}, \gamma_i, \gamma_{A_k^K - i}\big] \ge E\big[\vartheta_i \mid k + 1 \text{ users tx}, \gamma_i, (\gamma_{A_k^K - i}, h)\big], \tag{3.8}$$

where $h > 0$ is the channel state of the additional user.
3. When the channel state of a user is 0, its success probability is 0:

$$\psi_i(0, \gamma_{A_k^K - i}) = 0. \tag{3.9}$$

3.3 Game Theoretic MAC Formulation
The distributed MAC optimization problem can be formulated as a non-cooperative game with a continuous state space as follows:
• The set of players is the set of active users indexed by $i = 1, 2, \ldots, K$.
• At each time slot user $i$ can choose an action $a_i \in A = \{W, T\}$, where $W$ means to wait and $T$ means to transmit. A user can also choose to transmit with some probability.
• A strategy is a transmission policy, defined by (3.1). Pure and randomized transmission policies are defined in Definitions 3.1 and 3.2 respectively. Since the space of pure policies is not finite, the existence of a Nash equilibrium is not straightforward.
• A profile is a set of strategies deployed by all users in the network, denoted by $u = \{u_1(\cdot), \ldots, u_K(\cdot)\}$.

The only remaining element of the non-cooperative game is the expected reward (utility) function of each user. In a transmission time slot, if user $i$ does not transmit, a waiting cost $w_i$ is recorded; if it transmits, it has to pay a transmission cost $c_i$. At the end of a transmission time slot, if a packet is received correctly, the user receives a reward of 1 unit. The instantaneous reward of user $i$ is then determined as follows:

$$r_i = \begin{cases} 1 - c_i & \text{if } a_i = T,\ \vartheta_i = 1 \\ -c_i & \text{if } a_i = T,\ \vartheta_i = 0 \\ -w_i & \text{if } a_i = W, \text{ i.e. user } i \text{ did not transmit.} \end{cases} \tag{3.10}$$

The condition that ensures that a successful transmission is preferable to no transmission, and that no transmission is preferable to an unsuccessful transmission, is

$$1 > c_i > w_i > 0 \quad \forall i = 1, 2, \ldots, K. \tag{3.11}$$

By the symmetric property of the reception model (3.3), the expected reward of user $i$, denoted by $G_i(u_i(\cdot), \{u_{-i}(\cdot)\})$, can be easily derived:

$$
\begin{aligned}
G_i(u_i(\cdot), \{u_{-i}(\cdot)\})
&= \int_0^{\gamma_{\max}} \Bigg\{ u_i(\gamma_i) \Bigg[ \int_0^{\gamma_{\max}} \!\!\cdots\! \int_0^{\gamma_{\max}} \sum_{k=1}^{K} \sum_{A_k^K \ni i} \prod_{l \in A_k^K - i} u_l(\gamma_l) \prod_{j \notin A_k^K} \big(1 - u_j(\gamma_j)\big) \\
&\qquad \times \Big( (1 - c_i)\, \psi_i(\gamma_i, \gamma_{A_k^K - i}) - c_i \big(1 - \psi_i(\gamma_i, \gamma_{A_k^K - i})\big) \Big) \prod_{j \ne i} dF_j(\gamma_j) \Bigg] - \big(1 - u_i(\gamma_i)\big) w_i \Bigg\}\, dF_i(\gamma_i) \\
&= \int_0^{\gamma_{\max}} u_i(\gamma_i) \Bigg[ \int_0^{\gamma_{\max}} \!\!\cdots\! \int_0^{\gamma_{\max}} \sum_{k=1}^{K} \sum_{A_k^K \ni i} \prod_{l \in A_k^K - i} u_l(\gamma_l) \prod_{j \notin A_k^K} \big(1 - u_j(\gamma_j)\big) \\
&\qquad \times \psi_i(\gamma_i, \gamma_{A_k^K - i}) \prod_{j=1, j \ne i}^{K} dF_j(\gamma_j) - c_i + w_i \Bigg]\, dF_i(\gamma_i) - w_i.
\end{aligned}
\tag{3.12}
$$

Let $\Psi_i(\gamma_i, \{u_{-i}(\cdot)\}) : [0, \gamma_{\max}] \to [0, 1]$, $i = 1, 2, \ldots, K$, be a function mapping channel states of user $i$ to the (average) probability of receiving its packet correctly given the policies of other users. $\Psi_i(\gamma_i, \{u_{-i}(\cdot)\})$ can be calculated from (3.4) to be

$$\Psi_i(\gamma_i, \{u_{-i}(\cdot)\}) = E_{\{F_{-i}\}} \Bigg[ \sum_{k=1}^{K} \sum_{A_k^K \ni i} \prod_{l \in A_k^K - i} u_l(\gamma_l) \prod_{j \notin A_k^K} \big(1 - u_j(\gamma_j)\big)\, \psi_i(\gamma_i, \gamma_{A_k^K - i}) \Bigg]. \tag{3.13}$$

For example, for the SINR threshold reception model (3.5) for CDMA systems we have

$$\Psi_i(\gamma_i, \{u_{-i}(\cdot)\}) = \int_0^{\gamma_{\max}} \!\!\cdots\! \int_0^{\gamma_{\max}} \sum_{k=1}^{K} \sum_{A_k^K \ni i} \prod_{l \in A_k^K - i} u_l(\gamma_l) \prod_{j \notin A_k^K} \big(1 - u_j(\gamma_j)\big)\, I\left( \frac{\gamma_i}{1 + \frac{1}{N} \sum_{j \in A_k^K - i} \gamma_j} > \beta \right) \prod_{j \ne i} dF_j(\gamma_j). \tag{3.14}$$

Using (3.13), the expected reward of user $i$ given by (3.12) can be rewritten as

$$G_i(u_i(\cdot), \{u_{-i}(\cdot)\}) = \int_0^{\gamma_{\max}} u_i(\gamma_i) \big( \Psi_i(\gamma_i, \{u_{-i}(\cdot)\}) - c_i + w_i \big)\, dF_i(\gamma_i) - w_i. \tag{3.15}$$

In the remainder of the chapter, (3.12) and (3.15) are used interchangeably as the utility function of user $i$. The problem of maximizing the expected utility for user $i$ is:

$$\sup_{u_i(\cdot) \in B_{L_\infty[0,\gamma_{\max}]}[0,1]} G_i(u_i(\cdot), \{u_{-i}(\cdot)\}), \tag{3.16}$$

where $G_i(u_i(\cdot), \{u_{-i}(\cdot)\})$ is given by (3.12) and $B_{L_\infty[0,\gamma_{\max}]}[0, 1]$ is defined in Section 3.2.1. The optimization problem (3.16) completes the formulation of the non-cooperative game.
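The expected reward (3.15) can be checked numerically by averaging the instantaneous reward (3.10) over many simulated time slots. The following is a minimal sketch, not the procedure used in this thesis: it assumes Rayleigh fading (exponentially distributed SNR with a common mean `mean_snr`), the SINR threshold reception model (3.5), and that every user follows a threshold policy. The function name and all parameter values are hypothetical.

```python
import numpy as np

def expected_reward(i, thresholds, c, w, N, beta, mean_snr, n_slots, rng):
    """Monte Carlo estimate of user i's expected reward, cf. (3.15).

    Assumptions (illustrative only): exponential SNR with mean `mean_snr`
    for every user, SINR threshold reception (3.5), threshold policies.
    """
    thresholds = np.asarray(thresholds, dtype=float)
    K = thresholds.size
    gamma = rng.exponential(mean_snr, size=(n_slots, K))  # channel states per slot
    tx = gamma > thresholds                               # u_j(gamma_j) = I(gamma_j > theta_j)
    others = tx.copy()
    others[:, i] = False                                  # interferers exclude user i
    interference = np.where(others, gamma, 0.0).sum(axis=1)
    success = gamma[:, i] / (1.0 + interference / N) > beta   # reception rule (3.5)
    # Instantaneous reward (3.10): 1 - c on success, -c on failure, -w when waiting.
    reward = np.where(tx[:, i], success.astype(float) - c, -w)
    return float(reward.mean())

rng = np.random.default_rng(0)
print(expected_reward(0, [2.0, 2.0, 2.0], c=0.3, w=0.1, N=16, beta=2.0,
                      mean_snr=5.0, n_slots=20000, rng=rng))
```

By construction, the estimate lies between $-c_i$ (every transmission fails) and $1 - c_i$ (every transmission succeeds) whenever (3.11) holds.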
The remainder of the chapter is dedicated to analysing the existence and structure of Nash equilibrium profiles. A Nash equilibrium profile is a profile at which no player can benefit by unilaterally deviating from its current policy [43], [41].

Definition 3.4. A profile $u^* = \{u_1^*(\cdot), \ldots, u_K^*(\cdot)\}$ is a Nash equilibrium if and only if

$$G_i(u_i^*(\cdot), \{u_{-i}^*(\cdot)\}) \ge G_i(u_i(\cdot), \{u_{-i}^*(\cdot)\})$$

for all players $i = 1, 2, \ldots, K$ and all $u_i(\cdot) \in B_{L_\infty[0,\gamma_{\max}]}[0, 1]$.

It should be stressed that at a Nash equilibrium point (3.16) must hold for all users. One observation that can be made at this point is that relaxing the assumption that all users are indistinguishable (as in [24, 25, 89, 90]), by allowing for different channel state distributions $F_i(\cdot)$, can lead to some unfairness. One possible solution is to design the transmission and waiting costs to guarantee certain fairness requirements. The design of transmission and waiting costs has not been studied in this chapter.

3.4 Structured Nash Equilibrium
Having formulated the problem of optimal decentralized transmission control as a non-cooperative transmission game in the previous section, in this section we study the structure of the Nash equilibrium transmission policies. We follow a common technique in game theory to prove the existence of a structured Nash equilibrium profile. This technique consists of three steps:
1. Showing that a particular class of policies is optimal (Theorem 3.1).
2. Proving the existence of a Nash equilibrium when the policy space is restricted to this class of policies (Theorem 3.2).
3. Proving the existence of a Nash equilibrium in the original game by showing that a Nash equilibrium in the game with the restricted policy space is also a Nash equilibrium in the original game (Corollary 3.2).
Readers are referred to [43], [41] for examples of the early use of this technique in game theory.

3.4.1 Optimality of Threshold Policies
We prove that the utility of a user can always be maximized by a threshold transmission policy.

Theorem 3.1. Consider the multipacket reception random access network of $K$ active users described in Section 3.2. Consider the non-cooperative transmission game formulated in Section 3.3, where the problem of optimizing the utility for user $i = 1, 2, \ldots, K$ is given by (3.16). Assume the reception model (3.4) of the network satisfies (3.6) and (3.9). There exists a transmit probability function that maximizes user $i$'s expected reward (3.12) and is a threshold policy:

$$u_i^*(\gamma) \overset{\text{a.e.}}{=} \begin{cases} 0 & \gamma < \theta \\ 1 & \text{otherwise} \end{cases} \tag{3.17}$$

for some $\theta \in [0, \gamma_{\max}]$.

Proof. The proof of this theorem is an application of the bang-bang principle, presented in [104]. The objective of user $i$ is to maximize its utility, which is given by (3.15):

$$G_i(u_i(\cdot), \{u_{-i}(\cdot)\}) = \int_0^{\gamma_{\max}} u_i(\gamma_i) \big( \Psi_i(\gamma_i, \{u_{-i}(\cdot)\}) - c_i + w_i \big)\, dF_i(\gamma_i) - w_i.$$

It can easily be seen that if $\Psi_i(\cdot, \{u_{-i}(\cdot)\})$, defined by (3.13), is Lebesgue measurable then

$$u_i^*(\gamma_i) = \begin{cases} 1 & \text{if } \Psi_i(\gamma_i, \{u_{-i}(\cdot)\}) - c_i + w_i > 0 \\ 0 & \text{otherwise} \end{cases} \tag{3.18}$$

is a function in $B_{L_\infty[0,\gamma_{\max}]}[0, 1]$, and

$$G_i(u_i^*(\cdot), \{u_{-i}(\cdot)\}) = \sup_{u_i(\cdot) \in B_{L_\infty[0,\gamma_{\max}]}[0,1]} G_i(u_i(\cdot), \{u_{-i}(\cdot)\}).$$

In other words, if $\Psi_i(\cdot, \{u_{-i}(\cdot)\})$ is Lebesgue measurable then it is clear that the supremum of $G_i(u_i(\cdot), \{u_{-i}(\cdot)\})$ is attained in $B_{L_\infty[0,\gamma_{\max}]}[0, 1]$ at $u_i^*(\cdot)$ given by (3.18).

It will now be shown that $\Psi_i(\gamma_i, \{u_{-i}(\cdot)\})$ is indeed Lebesgue measurable. Recall that $\Psi_i(\gamma_i, \{u_{-i}(\cdot)\})$ is given by (3.13), i.e.

$$\Psi_i(\gamma_i, \{u_{-i}(\cdot)\}) = E_{\{F_{-i}\}} \Bigg[ \sum_{k=1}^{K} \sum_{A_k^K \ni i} \prod_{l \in A_k^K - i} u_l(\gamma_l) \prod_{j \notin A_k^K} \big(1 - u_j(\gamma_j)\big)\, \psi_i(\gamma_i, \gamma_{A_k^K - i}) \Bigg].$$

Due to the Lebesgue measurability of the set of functions $\psi_i(\gamma_i, \gamma_{A_k^K - i})$ and of all transmission policies, it is clear that $\Psi_i(\gamma_i, \{u_{-i}(\cdot)\})$ is Lebesgue measurable. Therefore, the supremum of $G_i(u_i(\cdot), \{u_{-i}(\cdot)\})$ is attained at $u_i^*(\gamma_i)$ given by (3.18).
In addition, $\Psi_i(\gamma_i, \{u_{-i}(\cdot)\})$ is non-decreasing in $\gamma_i$ due to (3.6), and $\Psi_i(0, \{u_{-i}(\cdot)\}) = 0$ due to (3.9). The latter implies that $\Psi_i(0, \{u_{-i}(\cdot)\}) - c_i + w_i < 0$. If $\Psi_i(\gamma_{\max}, \{u_{-i}(\cdot)\}) - c_i + w_i \le 0$ then it must be the case that $\Psi_i(\gamma, \{u_{-i}(\cdot)\}) - c_i + w_i \le 0$ for all $\gamma \in [0, \gamma_{\max}]$. The optimal solution is then not to transmit, i.e. $u_i^*(\gamma_i) = 0$ for all $\gamma_i \in [0, \gamma_{\max}]$. This function is a threshold function with the threshold equal to $\gamma_{\max}$. If $\Psi_i(\gamma_{\max}, \{u_{-i}(\cdot)\}) - c_i + w_i > 0$, there must exist a threshold $\theta \in [0, \gamma_{\max}]$ such that $\Psi_i(\gamma, \{u_{-i}(\cdot)\}) - c_i + w_i \le 0$ for all $\gamma < \theta$ and $\Psi_i(\gamma, \{u_{-i}(\cdot)\}) - c_i + w_i > 0$ for all $\gamma > \theta$. Again, the optimal transmission policy defined by (3.18) is a threshold policy. Hence the proof is complete.

Remarks:
• It can be noted from the above proof that the optimal transmission threshold for player $i$ is the threshold $\theta_i \in [0, \gamma_{\max}]$ that satisfies

$$\Psi_i(\gamma, \{u_{-i}(\cdot)\}) - c_i + w_i \begin{cases} \le 0 & \forall \gamma < \theta_i \\ > 0 & \forall \gamma > \theta_i. \end{cases} \tag{3.19}$$

• If the condition (3.6) is strict, i.e. if (3.7) holds, then $\Psi_i(\gamma_i, \{u_{-i}(\cdot)\})$ is strictly increasing in $\gamma_i$. In this case, there exists a unique $\theta_i$ that satisfies (3.19). In other words, the transmission policy defined by (3.18) is the unique optimal policy, and its optimality is strict. Hence, Theorem 3.1 can be strengthened to state that the optimal transmission policy $u_i^*(\gamma_i)$ is unique and of the threshold structure for every user $i$, given the policies of the other users.

Corollary 3.1. Consider a time slotted CDMA network of $K$ active users, where $K$ is a finite integer, with the SINR threshold reception model (3.5) for matched filter receivers. There exists a threshold transmission policy solution to the optimization problem (3.16) for user $i$ in the network, for any $i \in \{1, 2, \ldots, K\}$.

Proof. The SINR threshold reception model (3.5) satisfies (3.6) and (3.9).
The proof then follows from Theorem 3.1.

3.4.2 Existence of a Nash Equilibrium
We reformulate the transmission game allowing only transmission policies in the class of threshold policies and prove the existence of a Nash equilibrium for this case. The existence of a Nash equilibrium at which every user adopts a threshold transmission policy for the original game then follows.

Reformulation of the non-cooperative transmission game: In light of Theorem 3.1, it is clear that in order to maximize the expected reward, it is sufficient for a user to search for its transmission policy in the class of threshold policies. Each threshold transmission policy is completely defined by a single parameter: the CSI threshold beyond which the user transmits with certainty. Assuming every user seeks a transmission policy in the class of threshold policies, the non-cooperative transmission game can be reformulated as below:
• The set of players is the set of users indexed by $i = 1, 2, \ldots, K$.
• Each player $i = 1, 2, \ldots, K$ selects a transmission policy in the class of threshold policies defined by (3.2). Equivalently, player $i$ selects a threshold $\theta_i$, which determines its transmission policy as:

$$u_i(\gamma_i) = \begin{cases} 0 & \gamma_i < \theta_i \\ 1 & \text{otherwise.} \end{cases} \tag{3.20}$$

The space of pure policies is then the set of possible values of $\theta_i$, which is $[0, \gamma_{\max}]$.
• The utility function of user $i$ is obtained from (3.15) by replacing $u_j(\cdot)$ by $I(\cdot > \theta_j)$ for $j = 1, 2, \ldots, K$:

$$
\begin{aligned}
G_i(\theta_i, \theta_{-i}) = \int_0^{\gamma_{\max}} I(\gamma_i > \theta_i) \Bigg[ &\int_0^{\gamma_{\max}} \!\!\cdots\! \int_0^{\gamma_{\max}} \sum_{k=1}^{K} \sum_{A_k^K \ni i} \prod_{l \in A_k^K - i} I(\gamma_l > \theta_l) \prod_{j \notin A_k^K} \big(1 - I(\gamma_j > \theta_j)\big) \\
&\times \psi_i(\gamma_i, \gamma_{A_k^K - i}) \prod_{j \ne i} dF_j(\gamma_j) - c_i + w_i \Bigg]\, dF_i(\gamma_i) - w_i.
\end{aligned}
\tag{3.21}
$$

Then the optimization problem (3.16) can be rewritten as

$$\sup_{\theta_i \in [0, \gamma_{\max}]} G_i(\theta_i, \theta_{-i}), \tag{3.22}$$

where $G_i(\theta_i, \theta_{-i})$ is given by (3.21).
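To illustrate how a single scalar determines the best response, the sketch below estimates the threshold characterized by (3.19) for one user when the other users' thresholds are fixed. It is an illustration under stated assumptions, not the stochastic approximation algorithm proposed later in this chapter: it swaps in a plain grid search over $[0, \gamma_{\max}]$, assumes Rayleigh fading (exponential SNR with a common mean) and the SINR threshold reception model (3.5), and all names and parameter values are hypothetical.

```python
import numpy as np

def best_response_threshold(i, thresholds, c, w, N, beta, mean_snr,
                            gamma_max, n_samples, rng, n_grid=400):
    """Grid-search estimate of the threshold theta_i characterized by (3.19).

    Assumptions (illustrative only): exponential SNR with mean `mean_snr`,
    SINR threshold reception (3.5), other users follow threshold policies.
    """
    thresholds = np.asarray(thresholds, dtype=float)
    K = thresholds.size
    gamma = rng.exponential(mean_snr, size=(n_samples, K))
    tx = gamma > thresholds
    tx[:, i] = False                                  # only the other users interfere
    interference = np.where(tx, gamma, 0.0).sum(axis=1)
    grid = np.linspace(0.0, gamma_max, n_grid)        # candidate channel states
    sinr = grid[:, None] / (1.0 + interference[None, :] / N)
    psi_hat = (sinr > beta).mean(axis=1)              # empirical Psi_i on the grid
    gain = psi_hat - c + w                            # sign decides transmit vs wait
    positive = np.nonzero(gain > 0)[0]
    # If transmitting never pays off, the threshold is gamma_max (never transmit).
    return gamma_max if positive.size == 0 else float(grid[positive[0]])

rng = np.random.default_rng(1)
print(best_response_threshold(0, [0.0, 3.0, 3.0], c=0.3, w=0.1, N=16, beta=2.0,
                              mean_snr=5.0, gamma_max=20.0, n_samples=5000, rng=rng))
```

Because the same interference samples are reused for every grid point, the empirical success probability is non-decreasing along the grid, so the first grid point with positive gain is a consistent estimate of the threshold. Calling the function for each user in turn gives one simple way to experiment with best response iterations.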
Having reformulated the transmission game, in Theorem 3.2 we use a classical result on the existence of a Nash equilibrium for games with continuous utility functions ([105], [106], [107]; see also Theorem 1.2 in [43]) to prove the existence of a Nash equilibrium for the new game.

Theorem 3.2. Consider a multipacket reception random access network of $K < \infty$ active users, where the system and reception models are given in Section 3.2. Assume that the users only select their transmission policies in the class of threshold transmission policies. Assume that the reception model (3.4) of the network satisfies (3.6) and (3.9). The non-cooperative transmission game formulated above, where the problem of optimizing the expected reward for user $i = 1, 2, \ldots, K$ is given by (3.22), has a Nash equilibrium profile.

Proof. See Section 3.10. □

Remark: The assumption that the channel distribution function is continuous is essential in the proof of Theorem 3.2. If the channel distribution is not continuous, it is not guaranteed that the space of pure policies is a compact, convex subset of a Euclidean space, and the classical result on the existence of a Nash equilibrium for infinite games with continuous payoffs ([105], [106], [107]) cannot be used.

Finally, we claim that the original strategic non-cooperative transmission game formulated in Section 3.3, where the players search for their policies in the whole subset $B_{L_\infty[0,\gamma_{\max}]}[0, 1]$ of the normed linear function space $L_\infty$, also has a Nash equilibrium profile.

Corollary 3.2. Consider a multipacket reception random access network of $K$ active users described in Section 3.2. Assume that the reception model (3.4) of the network satisfies (3.6) and (3.9). Consider the non-cooperative transmission game formulated in Section 3.3, where each user can select a transmission policy in $B_{L_\infty[0,\gamma_{\max}]}[0, 1]$.
This non-cooperative transmission game has a Nash equilibrium at which every user deploys a threshold transmission policy.

Proof. According to Theorem 3.2, the transmission game where only threshold transmission policies are considered has a Nash equilibrium. Furthermore, due to the optimality of threshold policies (Theorem 3.1), a Nash equilibrium profile in the transmission game considering only threshold policies must also be a Nash equilibrium profile in the original transmission game, where a user can use any transmission policy in $B_{L_\infty[0,\gamma_{\max}]}[0, 1]$. The existence of a Nash equilibrium at which every user deploys a threshold policy then follows. □

Remark: If the inequality (3.6) is strict, i.e. if (3.7) holds, then threshold transmission policies are strictly optimal (see the remark after the proof of Theorem 3.1). As a result, all Nash equilibria can be obtained in the class of threshold policies. In comparison, if (3.7) does not hold, the game may admit a Nash equilibrium outside the class of threshold policies. Furthermore, in this case, it is not immediately clear whether there exists a Nash equilibrium that is Pareto dominant outside the class of threshold policies.

In light of the theoretical results presented in this section, in the remainder of the chapter we only consider transmission policies in the class of threshold policies. In that case, the policy of each user $i$ is defined by a channel state threshold $\theta_i$ and the utility by (3.21). For convenience, we introduce the definition of a Nash equilibrium channel state threshold vector.

Definition 3.5. A channel threshold vector $\theta^* = (\theta_1^*, \ldots, \theta_K^*) \in [0, \gamma_{\max}]^K$ constitutes a Nash equilibrium if and only if
$$G_i(\theta_i^*, \theta_{-i}^*) \ge G_i(\theta_i, \theta_{-i}^*) \tag{3.23}$$
for all players $i = 1, 2, \ldots, K$ and all $\theta_i \in [0, \gamma_{\max}]$.

In the next section, we study the existence of a symmetric Nash equilibrium for the symmetric network model where all users have the same channel distribution function and the same transmission and waiting costs.

3.5 Symmetric Nash Equilibrium for the Symmetric Game

This section considers the special case of a symmetric multipacket reception network model, where all users have the same continuous channel distribution function, i.e. $F_j(\cdot) = F(\cdot)$ for all $j \in \{1, 2, \ldots, K\}$, and the same transmission and waiting costs. In light of the optimality of threshold policies (Theorem 3.1), let the users search for their transmission policies in the class of threshold policies. The existence of a symmetric Nash equilibrium for the symmetric transmission game is proved under a mild condition on the continuity of the success probability function (3.13) (Theorem 3.3). It is also shown that the symmetric Nash equilibrium threshold is a non-decreasing function of the number of active users in the network (Theorem 3.4).

We now define the two conditions required by Theorems 3.3 and 3.4. Let all users except user $i$ deploy the same threshold transmission policy, i.e. $u_j(\gamma_j) = I(\gamma_j > \theta)$ for all $j \ne i$, for some $\theta \in [0, \gamma_{\max}]$. The success probability function, given by (3.13) for user $i$, becomes
$$\Psi_i(\gamma_i, \theta) = E_{F_{-i}}\Big[ \sum_{k=1}^{K} \sum_{A_k \ni i} \prod_{l \in A_k^{-i}} I(\gamma_l > \theta) \prod_{j \notin A_k} \big(1 - I(\gamma_j > \theta)\big)\, \psi_i(\gamma_i, \gamma_{A_k^{-i}}) \Big]. \tag{3.24}$$

The first condition of Theorem 3.3 is that (3.24) is a continuous function of the CSI $\gamma_i$. That is,
$$\lim_{\gamma_i \to \gamma_0} \Psi_i(\gamma_i, \theta) = \Psi_i(\gamma_0, \theta) \quad \forall \gamma_0, \theta \in [0, \gamma_{\max}], \tag{3.25}$$
for $i \in \{1, \ldots, K\}$. Writing $\Psi_i(\gamma_i, \theta)$ in the integral form reveals that the continuity of $\psi_i(\gamma_i, \gamma_{A_k^{-i}})$ with respect to (w.r.t.) $\gamma_i$ is a sufficient condition for (3.25).
The second condition of Theorems 3.3 and 3.4 is that
$$\Psi_i(\gamma_{\max}, 0) > c - w, \tag{3.26}$$
which implies that $\Psi_i(\gamma_{\max}, \theta) > c - w$ for all $\theta \in [0, \gamma_{\max}]$, since $\Psi_i(\gamma_i, \theta)$ is non-decreasing in $\theta$ and $\gamma_i$ by conditions (3.8) and (3.6), respectively. Intuitively, (3.26) means that if the channel state of a user is $\gamma_{\max}$, which is the maximum value in the channel state space, and the user transmits, then it will gain a positive expected reward. This condition is not very restrictive because $\gamma_{\max}$ can be arbitrarily large. In addition, it might be desirable that the transmission and waiting costs are adjusted so that (3.26) holds in order to eliminate the possibility that the optimal policy for user $i$ is to never transmit. It should be noted that (3.26) is sufficient (but not necessary) to ensure that $u(\cdot) = 0$ is not optimal.

Theorem 3.3. Consider a symmetric multipacket reception random access network of $K$ active users. Assume that (3.25) and (3.26) are satisfied. There exists a Nash equilibrium profile at which every user deploys the same threshold transmission policy, i.e. there exists a Nash equilibrium profile $\mathbf{u}^* = \{u_1^*(\cdot), \ldots, u_K^*(\cdot) : u_i^*(\gamma_i) = I(\gamma_i > \theta^*)\ \forall i = 1, 2, \ldots, K\}$ for some $\theta^* \in [0, \gamma_{\max})$.

Proof. See Section 3.10. □

Theorem 3.4. Consider a symmetric multipacket reception random access network of $K$ active users. Assume that (3.25) and (3.26) are satisfied. Let $\theta_K^*$ be a transmission threshold such that $\mathbf{u}_K^* = \{u_i(\gamma_i) = I(\gamma_i > \theta_K^*) : i = 1, 2, \ldots, K\}$ is a Nash equilibrium profile. Consider the same network but with $K + 1$ users. There exists $\theta_{K+1}^* \ge \theta_K^*$ such that $\mathbf{u}_{K+1}^* = \{u_i(\gamma_i) = I(\gamma_i > \theta_{K+1}^*) : i = 1, \ldots, K + 1\}$ is a Nash equilibrium profile for the transmission game of $K + 1$ players.

Proof. See Section 3.10. □
3.6 Learning Best Responses via Stochastic Approximation

The main contribution of the chapter has been to prove the optimality of threshold transmission policies and the existence of a Nash equilibrium at which every user adopts a threshold transmission policy. In this section, we are interested in the secondary problem of estimating a Nash equilibrium for the transmission game. Although there exist many learning algorithms in game theory for estimating a Nash equilibrium [54], proving the convergence of these algorithms is very difficult. Here we focus on the best response dynamics algorithm for estimating a Nash equilibrium and exploit our structural result on the optimality of threshold transmission policies. It is known that if the best response dynamics algorithm (Algorithm 3.1 below) converges, then it converges to a Nash equilibrium [54]. However, apart from the two-user case, we have been unable to give useful sufficient conditions for convergence of Algorithm 3.1. To give more insight into the sufficient condition for convergence of Algorithm 3.1, we also give an explicit characterization of the best response functions for networks with exponentially distributed channel state (i.e. Rayleigh fading channels).

Algorithm 3.1 is the deterministic best response dynamics algorithm. At each iteration of Algorithm 3.1, it is assumed that each user can solve the stochastic optimization problem (3.22) for its best response transmission policy. This optimization problem can be solved efficiently using the stochastic gradient ascent method given by (3.33) in Section 3.6.3. Combining (3.33) and Algorithm 3.1, we propose the stochastic best response dynamics algorithm (Algorithm 3.2), which behaves asymptotically identically to Algorithm 3.1. In the stochastic best response dynamics algorithm, each user uses a gradient estimator and a constant step size to estimate its best response transmission policy via (3.33).
Algorithm 3.2 can track the slowly time varying best response dynamic adjustment process for a slowly time varying network.

3.6.1 Deterministic Best Response Dynamics Algorithm

First, we restate the utility optimization problem that needs to be solved numerically by each user. Specifically, in order to maximize its individual expected reward, user $i$ must solve the optimization problem (3.22):
$$\sup_{\theta_i \in [0, \gamma_{\max}]} G_i(\theta_i, \theta_{-i}).$$
In light of Theorem 3.1 the supremum of $G_i(\theta_i, \theta_{-i})$ is attainable, and it follows from (3.19) that
$$H_i(\theta_{-i}) = \arg\max_{\theta_i \in [0, \gamma_{\max}]} G_i(\theta_i, \theta_{-i}) = \theta_i \in [0, \gamma_{\max}] \text{ such that } \Psi_i(\gamma, \theta_{-i}) - c_i + w_i \begin{cases} \le 0 & \forall \gamma < \theta_i \\ > 0 & \forall \gamma > \theta_i, \end{cases} \tag{3.27}$$
where $\Psi_i(\gamma, \theta_{-i})$ is given by (3.13) with $u_j(\cdot) = I(\cdot > \theta_j)$ for all $j = 1, 2, \ldots, K$. $H_i(\theta_{-i})$ defined above is the best response function for user $i$: it gives the threshold that defines the optimal policy for user $i$ in response to the policies of all other users. In this section, we assume (3.7), which implies that the best response of each player is unique (see the remark after the proof of Theorem 3.1). $H_i(\theta_{-i})$ is then a well-defined scalar-valued function mapping $[0, \gamma_{\max}]^{K-1}$ to $[0, \gamma_{\max}]$. In addition, $H_i(\theta_{-i})$ is non-increasing in every component of $\theta_{-i}$ due to (3.8).

The best response dynamics algorithm (see [54] for details) for the transmission game, where the users take turns to update their transmission policies, can be written as follows.

Algorithm 3.1. Best Response Dynamics of the Transmission Game
• Initialization: select $\theta^{(0)} \in [0, \gamma_{\max}]^K$ and set the batch index $n = 1$.
• For $n = 1, 2, \ldots$
  For user $i = 1, 2, \ldots, K$:
$$\theta_i^{(n)} = H_i(\theta_{-i}^{(n)}), \quad \text{where } \theta_{-i}^{(n)} = \big( (\theta_j^{(n)})_{j<i},\ (\theta_j^{(n-1)})_{j>i} \big). \tag{3.28}$$

3.6.2 Convergence of Best Response Dynamics

At this stage, we have not yet given a method to solve (3.27); we will do so in Section 3.6.3 via a stochastic approximation algorithm that can estimate the optimal $\theta_i$ for user $i$.
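When the success probability is available in closed form, the update (3.28) can be illustrated directly. The following Python sketch is a hedged example rather than the thesis implementation. It assumes two symmetric users with exponentially distributed channel states of unbounded support, an SINR rule under which a packet with channel state $\gamma$ tolerates total interference up to $a = N\gamma/\beta - N$, and hypothetical costs $c = 0.6$, $w = 0.1$. The best response (3.27) is located by bisection, which relies on the success probability being non-decreasing in $\gamma$.

```python
import math


def success_prob_two_user(gamma, theta_other, rate, spreading_gain, beta):
    """Psi_i(gamma, theta_j) for K = 2 with exponential channel states."""
    a = spreading_gain * gamma / beta - spreading_gain  # tolerable interference
    if a <= 0.0:
        return 0.0  # SINR target unreachable even without interference
    p_other_silent = 1.0 - math.exp(-rate * theta_other)
    # Other user transmits (state above its threshold) but stays below a.
    p_other_tolerable = max(
        0.0, math.exp(-rate * theta_other) - math.exp(-rate * a))
    return p_other_silent + p_other_tolerable


def best_response(success_prob, cost_tx, cost_wait, gamma_max, iterations=60):
    """Bisection for (3.27): smallest gamma with Psi(gamma) - c + w > 0."""
    target = cost_tx - cost_wait
    if success_prob(gamma_max) <= target:
        return gamma_max  # never transmitting is optimal
    lo, hi = 0.0, gamma_max
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if success_prob(mid) > target:
            hi = mid
        else:
            lo = mid
    return hi


def best_response_dynamics(theta0, response_maps, max_batches=100, tol=1e-9):
    """Users update in turn, each responding to the latest thresholds."""
    theta = list(theta0)
    for _ in range(max_batches):
        previous = list(theta)
        for i, response in enumerate(response_maps):
            # theta[:i] holds this batch's values, theta[i+1:] the previous ones.
            theta[i] = response(theta[:i] + theta[i + 1:])
        if max(abs(x - y) for x, y in zip(theta, previous)) < tol:
            break
    return theta


def two_user_example(mean_snr=10.0, spreading_gain=16, beta=3.0,
                     cost_tx=0.6, cost_wait=0.1, gamma_max=100.0):
    rate = 1.0 / mean_snr

    def response(others):
        def psi(gamma):
            return success_prob_two_user(
                gamma, others[0], rate, spreading_gain, beta)
        return best_response(psi, cost_tx, cost_wait, gamma_max)

    return best_response_dynamics([0.0, 0.0], [response, response])


if __name__ == "__main__":
    print(two_user_example())
```

With these hypothetical parameters both thresholds settle at the same value after two batches, which is consistent with the two-user convergence result stated in Lemma 3.1.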
However, temporarily assuming that (3.27) can be solved, i.e. the mapping (3.28) can be carried out in each iteration, it is well known that if Algorithm 3.1 converges then it converges to a Nash equilibrium [43]. So we first focus on giving conditions for Algorithm 3.1 to converge. First, we present a lemma on global convergence, i.e. convergence for any initial condition, of Algorithm 3.1 for the two-user case.

Lemma 3.1. For a two-user transmission game, Algorithm 3.1 converges to a Nash equilibrium.

Proof. See Section 3.10. □

Apart from the two-user case, proving global convergence of Algorithm 3.1 is very difficult, hence we focus on local convergence conditions. Local asymptotic stability of a Nash equilibrium is sufficient for Algorithm 3.1 to converge locally, i.e. for initial conditions in some ball around the Nash equilibrium. In Lemma 3.2, we present a sufficient condition for local asymptotic stability of a Nash equilibrium.

Lemma 3.2. Local Asymptotic Stability of a Nash Equilibrium. Let $\theta^*$ be a fixed point of Algorithm 3.1. Assume that $\Psi_i(\theta_i, \theta_{-i})$ is differentiable (and hence continuous) at $\theta = \theta^*$ for all $i = 1, 2, \ldots, K$. Then
$$\Psi_i(\theta_i^*, \theta_{-i}^*) - c_i + w_i = 0 \quad \forall i = 1, 2, \ldots, K. \tag{3.29}$$
Denote the gradient matrix by $D(\theta) = [d_{ij}]$, where $d_{ij} = \nabla_{\theta_j} \Psi_i(\theta_i, \theta_{-i})$. If the eigenvalues of $D(\theta^*)$ lie strictly within the unit circle, then $\theta^*$ is an asymptotically stable fixed point of Algorithm 3.1.

Proof. (3.29) follows from (3.27) and the assumption on continuity of $\Psi_i(\cdot, \cdot)$ at the fixed point. It then follows straightforwardly from Theorem C.7.1, p. 334 of [108] that if the eigenvalues of $D(\theta^*)$ lie within the unit circle then $\theta^*$ is an asymptotically stable fixed point of Algorithm 3.1. □

It is difficult to analytically verify the condition in Lemma 3.2 since it is not known a priori where the Nash equilibrium is. Later on, we give an explicit representation for the success probability function $\Psi_i(\cdot, \cdot)$
in (3.27) and (3.29), which allows us to numerically evaluate the Nash equilibria and verify local asymptotic stability of a Nash equilibrium.

Characterization of Best Response Function for Rayleigh Fading Channels

In this section, we give an explicit representation of the best response functions for CDMA networks with Rayleigh fading channels. This allows us to numerically and deterministically (i.e. without requiring any stochastic simulation) characterize the Nash equilibrium and numerically verify local convergence of Algorithm 3.1 by studying the deterministic system of equations (3.27) for all $i = 1, 2, \ldots, K$. We assume a Rayleigh channel (i.e. exponentially distributed power) and the SINR threshold reception model (3.5), where SNR represents the channel state, and hence can explicitly compute $\Psi_i(\gamma_i, \theta_{-i})$ in (3.27). Denote the Rayleigh channel probability density function of user $i$ by $f_i(\gamma) = \lambda_i e^{-\lambda_i \gamma}$, where the exponential parameter $\lambda_i > 0$.

Lemma 3.3. For a symmetric network, i.e. all Rayleigh channels are identically distributed ($\lambda_i = \lambda$ for all $i = 1, 2, \ldots, K$), the best response function (3.27) for user $i$ has the following representation for $\Psi_i(\gamma, \theta_{-i})$:
$$\Psi_i(\gamma, \theta_{-i}) = \sum_{k=1}^{K} \sum_{A_k \ni i} \prod_{l \notin A_k} \big(1 - e^{-\lambda \theta_l}\big)\, I\Big(a > \sum_{j \in A_k^{-i}} \theta_j\Big) \Big[ I(k = 1) + I(k \ge 2) \Big( e^{-\lambda \sum_{j \in A_k^{-i}} \theta_j} - e^{-\lambda a} - \sum_{l=1}^{k-2} \frac{\lambda^l e^{-\lambda a} \big(a - \sum_{j \in A_k^{-i}} \theta_j\big)^l}{l!} \Big) \Big]. \tag{3.30}$$
Here $a = N\gamma/\beta - N$; $N$ and $\beta$ are the spreading gain and the QoS requirement parameter of the SINR threshold reception model (3.5), respectively.

For an asymmetric network, i.e. the Rayleigh fading channels have different probability density functions ($\lambda_i \ne \lambda_j$ for all $i \ne j$), $\Psi_i(\gamma, \theta_{-i})$ in (3.27) is given by
$$\Psi_i(\gamma, \theta_{-i}) = \sum_{k=1}^{K} \sum_{A_k \ni i} \prod_{l \notin A_k} \big(1 - e^{-\lambda_l \theta_l}\big)\, I\Big(a > \sum_{j \in A_k^{-i}} \theta_j\Big) \Big[ I(k = 1) + I(k = 2) \sum_{j \in A_k^{-i}} \big( e^{-\lambda_j \theta_j} - e^{-\lambda_j a} \big) + I(k \ge 3)\, e^{-\sum_{j \in A_k^{-i}} \lambda_j \theta_j} \Big( 1 + (-1)^{k-1} \sum_{j \in A_k^{-i}} e^{-\lambda_j \big(a - \sum_{m \in A_k^{-i}} \theta_m\big)} \prod_{m \in A_k^{-i} \setminus \{j\}} \frac{\lambda_m}{\lambda_j - \lambda_m} \Big) \Big]. \tag{3.31}$$
Here $a = N\gamma/\beta - N$; $N$ and $\beta$ are the spreading gain and the QoS requirement parameter of the SINR threshold reception model (3.5), respectively.

Proof. See Section 3.10. □

We will use Lemma 3.3 to analytically compute the best response functions for a simple three-player transmission game and verify local asymptotic stability of a Nash equilibrium in Section 3.8.

3.6.3 Stochastic Best Response Dynamics Algorithm

In Algorithm 3.1, it is assumed that at each iteration, user $i$ can carry out the mapping (3.28), i.e. it can solve (3.22) for its best response transmission policy. In this section, we propose an algorithm (Algorithm 3.2 below), where (3.28) in Algorithm 3.1 is replaced by a stochastic gradient ascent algorithm that can estimate the best response transmission threshold for any user $i$, i.e. numerically solve (3.22). As shown below, the asymptotic behaviour of Algorithm 3.2 is the same as that of Algorithm 3.1. It should be noted that each user needs to estimate its optimal transmission policy without knowing the policies or the channel distributions of other users, or the number of users in the network. Hence, the optimization problem (3.22) is a stochastic optimization problem.

Because gradient ascent is an unconstrained algorithm, an important condition for the validity of Algorithm 3.2 is that the optimization problem (3.22) without the constraint $\theta_i \in [0, \gamma_{\max}]$ actually has a solution in $[0, \gamma_{\max}]$, or equivalently, that $\Psi_i(\cdot, \theta_{-i}) - c_i + w_i$ changes sign from negative to strictly positive at some $\theta_i$ in $[0, \gamma_{\max}]$ for all $\theta_{-i} \in [0, \gamma_{\max}]^{K-1}$ (see (3.19)). A sufficient condition for this fact is
$$\Psi_i(\gamma_{\max}, \mathbf{0}) > c - w, \tag{3.32}$$
where $\Psi_i(\cdot, \cdot)$ is given by (3.13) with $u_j(\cdot) = I(\gamma_j > \theta_j)$ for all $j = 1, 2, \ldots, K$. Similarly to the explanation of condition (3.26) for the case of a symmetric network, $\Psi_i(\cdot, \cdot)$ is monotone, $\Psi_i(0, \theta_{-i}) - c + w = -c + w < 0$, and (3.32) implies that $\Psi_i(\gamma_{\max}, \theta_{-i}) - c + w > 0$ for all $\theta_{-i} \in [0, \gamma_{\max}]^{K-1}$. Hence, (3.32) is sufficient to guarantee that the solution of the unconstrained version of (3.22) is in $[0, \gamma_{\max}]$.

Algorithm 3.2. Stochastic Best Response Dynamics of the Transmission Game
• Initialization: outer loop index $n = 1$, a batch size $\Delta > 1$, initial thresholds $\theta^{(\Delta,0)}$, a constant step size $0 < \epsilon < 1$.
• For $n = 1, 2, \ldots$
  For user $i = 1, 2, \ldots, K$: user $i$ updates its transmission threshold once every $m$ transmission time slots, while the other users keep their policies fixed, i.e. $\theta_{-i} = \big( (\theta_j^{(\Delta,n)})_{j<i},\ (\theta_j^{(\Delta,n-1)})_{j>i} \big)$.
    For inner loop index $l = 1, 2, \ldots, \Delta$: estimate the gradient $\nabla_{\theta_i^{(l,n)}} G_i^{(l,n)}(\theta_i^{(l,n)}, \theta_{-i})$ via a gradient estimator (see (3.35) and (3.36)) and update
$$\theta_i^{(l+1,n)} = \theta_i^{(l,n)} + \epsilon_i \hat{\nabla}_{\theta_i^{(l,n)}} G_i^{(l,n)}(\theta_i^{(l,n)}, \theta_{-i}). \tag{3.33}$$

Implementation of Gradient Estimator

First, the utility function $G_i(\theta_i, \theta_{-i})$ of user $i$, given by (3.21), can be written as
$$G_i(\theta_i, \theta_{-i}) = \int_0^{\gamma_{\max}} I(\gamma_i > \theta_i) \big( \Psi_i(\gamma_i, \theta_{-i}) - c_i + w_i \big)\, dF_i(\gamma_i) - w_i = \int_{\theta_i}^{\gamma_{\max}} \big( \Psi_i(\gamma_i, \theta_{-i}) - c_i + w_i \big)\, dF_i(\gamma_i) - w_i, \tag{3.34}$$
where $\Psi_i(\gamma_i, \theta_{-i})$ is given by (3.13). $G_i(\theta_i, \theta_{-i})$, given by (3.34), is differentiable almost everywhere by the Fundamental Theorem of Calculus [109]. Furthermore, the gradient of $G_i(\theta_i, \theta_{-i})$ is
$$\nabla_{\theta_i} G_i(\theta_i, \theta_{-i}) = -\big( \Psi_i(\theta_i, \theta_{-i}) - c_i + w_i \big) f_i(\theta_i). \tag{3.35}$$
Since $f_i(\theta_i)$ can be absorbed into the step size, to estimate the gradient of $G_i(\theta_i, \theta_{-i})$ we only need to obtain an unbiased estimate of $\Psi_i(\theta_i^{(l,n)}, \theta_{-i})$, which is the probability that the packet from user $i$ is received correctly given that its channel state is equal to $\theta_i$. If the network is static during the period of $m$ transmission time slots of the $l$-th iteration, the unbiased estimate of $\Psi_i(\theta_i^{(l,n)}, \theta_{-i})$ is
$$\hat{\Psi}_i^{(l,n)}(\theta_i^{(l,n)}, \theta_{-i}) = \frac{\text{Number of ACK(s)} \mid \gamma_i = \theta_i}{\text{Number of Tx} \mid \gamma_i = \theta_i}. \tag{3.36}$$
As an example, consider a system where SNR represents the channel state of a user.
The probability of success of user $i$ when its channel state is $\theta_i$ can be sampled by using a power control mechanism to ensure that the received SNR is equal to $\theta_i$ and counting the number of ACK(s) that are sent to the user by the station. Hence a user can obtain the unbiased estimate $\hat{\Psi}_i^{(l,n)}(\theta_i^{(l,n)}, \theta_{-i})$ without knowing the channel distributions or transmission policies of other users.

3.6.4 Properties of Stochastic Best Response Dynamics Algorithm

Define a local maximizer of the objective function (3.34) as
$$\Theta_i = \{\theta_i : \nabla_{\theta_i} G_i(\theta_i, \theta_{-i}) = 0,\ \nabla^2_{\theta_i} G_i(\theta_i, \theta_{-i}) < 0\}. \tag{3.37}$$

Theorem 3.5. For each user $i$ and a fixed outer loop index $n$, if $\Delta > 1/\epsilon_i$, the sequence $\theta_i^{(l,n)}$, $l = 1, 2, \ldots, \Delta$, generated by (3.33) converges in probability (i.e. weakly) to the threshold level corresponding to a local optimizer $\Theta_i$ of user $i$'s expected reward $G_i(\theta_i, \theta_{-i})$, where $\theta_{-i} = \big( (\theta_j^{(\Delta,n)})_{j<i},\ (\theta_j^{(\Delta,n-1)})_{j>i} \big)$.

Proof. For a fixed initial $\theta_i^{(1,n)}$, the gradient estimates $\hat{\nabla}_{\theta_i^{(l,n)}} G_i^{(l,n)}(\theta_i^{(l,n)}) = -\big( \hat{\Psi}_i^{(l,n)}(\theta_i^{(l,n)}, \theta_{-i}) - c_i + w_i \big) f_i(\theta_i^{(l,n)})$, $l = 1, 2, \ldots, \Delta$, are i.i.d. Therefore, equation (3.33) is an instance of the well known constant step size Robbins-Monro algorithm. Define the continuous time interpolation of the discrete sequence $\{\theta_i^{(l,n)} : l = 1, \ldots, \Delta\}$ as follows (see [97] for details):
$$\theta_i^{\epsilon}(t) = \begin{cases} \theta_i^{(1,n)} & \text{for } t < 0 \\ \theta_i^{(l,n)} & \text{for } l\epsilon_i - \epsilon_i \le t < l\epsilon_i,\ l = 1, \ldots, \Delta. \end{cases} \tag{3.38}$$
In [97] (Theorem 2.1, p. 219), it is shown that under the condition of uniform integrability of $\nabla_{\theta_i} G_i(\theta_i, \theta_{-i}) = -\big( \Psi_i(\theta_i, \theta_{-i}) - c_i + w_i \big) f_i(\theta_i)$ and four other conditions that are trivially satisfied (conditions A1.2, A1.3, A1.4, A1.10 on page 217 of [97]), as $\epsilon_i \to 0$, $\theta_i^{\epsilon}(t)$ defined as above converges weakly in trajectory (on the space of functions that are continuous on the right with limits on the left, endowed with a Skorohod metric) to the continuous time function $\theta_i(t)$, i.e.
$$\lim_{\epsilon_i \to 0} P\Big( \sup_{0 < t < \Delta} \big| \theta_i^{\epsilon}(t) - \theta_i(t) \big| > \eta \Big) = 0 \quad \text{for any } \eta > 0.$$
$\theta_i(t)$ satisfies
$$\frac{d\theta_i}{dt} = -\big( \Psi_i(\theta_i, \theta_{-i}) - c_i + w_i \big) f_i(\theta_i). \tag{3.39}$$
A sufficient condition for uniform integrability of $\nabla_{\theta_i} G_i(\theta_i, \theta_{-i})$ is that the channel distribution has finite variance. The ordinary differential equation (3.39) is asymptotically stable in the sense of Liapunov because $G_i(\theta_i, \theta_{-i})$ is bounded and quasi-concave w.r.t. $\theta_i$, and a local maximizer of $G_i(\theta_i, \theta_{-i})$ is also its global maximizer (see the remarks after the proof of Theorem 3.2). Therefore, on a time scale $\Delta = 1/\epsilon_i \to \infty$ as $\epsilon_i \to 0$, the sequence of transmission thresholds generated by (3.33) converges weakly to a local maximizer of the utility function $G_i(\theta_i, \theta_{-i})$. In [97], the weak convergence proofs for much more general correlated signals (e.g. Markovian signals) are given. □

Since at every outer loop iteration the users take turns to update their policies, and for each user the estimates generated by (3.33) converge in probability to a global optimizer of its utility function, Algorithm 3.2 converges weakly in trajectory to the deterministic best response dynamic adjustment process. Hence, if Algorithm 3.1 converges to a Nash equilibrium transmission threshold vector $\theta^*$, Algorithm 3.2 also converges in probability to $\theta^*$ [110].

Remark: For the convergence of Algorithm 3.2, we have the condition that $\Delta > 1/\epsilon_i$. When $\Delta$ is small, Algorithm 3.2 does not exhibit the best response dynamic adjustment process for the transmission game but may still converge to a Nash equilibrium. In fact, if a small value of $\Delta$ is used, the algorithm may converge at a much faster rate and may be able to track changes in the network more efficiently. We illustrate the convergence of Algorithm 3.2 with $\Delta = 1$ via a numerical example in Section 3.8.
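The interplay of the ACK-counting estimator (3.36) and the threshold update (3.33) can be sketched in simulation. The Python code below is an illustration under stated assumptions, not the thesis implementation: two symmetric users with exponentially distributed channel states, an SINR rule of the form $\gamma_i / (1 + \frac{1}{N}\sum_j \gamma_j) > \beta$, hypothetical costs $c = 0.6$ and $w = 0.1$, and the density $f_i(\theta_i)$ absorbed into the step size. Clipping the threshold to $[0, \gamma_{\max}]$ is an added safeguard that is not part of the algorithm as stated.

```python
import random


def ack_fraction(i, thresholds, mean_snrs, spreading_gain, beta, slots, rng):
    """Estimate (3.36): fraction of ACKs when user i's SNR is held at theta_i."""
    acks = 0
    for _ in range(slots):
        interference = 0.0
        for j, (theta, mean) in enumerate(zip(thresholds, mean_snrs)):
            if j == i:
                continue
            snr = rng.expovariate(1.0 / mean)
            if snr > theta:  # user j follows its threshold policy
                interference += snr
        if thresholds[i] / (1.0 + interference / spreading_gain) > beta:
            acks += 1
    return acks / slots


def stochastic_best_response(thresholds, mean_snrs, cost_tx, cost_wait,
                             step=0.5, batch_size=10, slots_per_estimate=200,
                             batches=30, spreading_gain=16, beta=3.0,
                             gamma_max=100.0, seed=0):
    """Users take turns; each runs batch_size noisy gradient steps (3.33)."""
    rng = random.Random(seed)
    theta = list(thresholds)
    for _batch in range(batches):
        for i in range(len(theta)):
            for _step in range(batch_size):
                psi_hat = ack_fraction(i, theta, mean_snrs, spreading_gain,
                                       beta, slots_per_estimate, rng)
                # Gradient ascent on G_i: the gradient is -(Psi - c + w) f_i.
                theta[i] -= step * (psi_hat - cost_tx + cost_wait)
                # Safeguard (assumption): keep the threshold in range.
                theta[i] = min(max(theta[i], 0.0), gamma_max)
    return theta


if __name__ == "__main__":
    print(stochastic_best_response([6.0, 6.0], [10.0, 10.0], 0.6, 0.1))
```

With a constant step size the thresholds keep fluctuating around the equilibrium rather than settling exactly, which is the price paid for the tracking capability discussed above.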
Figure 3.3: Example of a transmission policy for networks with packet priority (axes: channel state, packet priority).

Figure 3.4: The optimal transmission policy is threshold for every packet priority (axes: channel state, packet priority).

3.7 Extension to Wireless Networks with Packet Priorities

This section discusses the generalization of the game theoretic MAC protocol to incorporate packet priority into the wireless network model. Packet priority is a means to prioritize specific application functions and is important for ensuring Quality of Service (QoS) for multiple types of traffic or different classes of users, e.g., in wireless local area networks (WLANs) or cellular networks. Many recent wireless network standards have taken packet priorities into account. For example, in IEEE 802.11e, packet reordering is deployed to adjust the time spent by packets in queues according to their priorities. In addition, in the contention-free MAC protocol of the IEEE 802.15.4 standard, slots are reserved for high priority packets. Furthermore, in systems such as sensor networks, where the objective usually is to extract information from correlated data from multiple sensors, packet priorities can be a means of collaboration which does not require explicit communication among sensors.

The game theoretic formulation and the results presented in previous sections can be straightforwardly generalized to incorporate packet priorities. Here it is assumed that the packet priority is i.i.d. over time for every user. A transmission policy is then a function mapping channel states and packet priorities into transmit probabilities (see Fig. 3.3).
The results derived previously are generalized as follows: it is proved for the model with packet priorities that the optimal transmission scheduling policy of each user is threshold in the channel state (see Fig. 3.4). Furthermore, there exists a Nash equilibrium at which every user adopts a threshold policy. The generalization of the threshold results is straightforward due to the assumption that the packet priority is i.i.d. over time. Specifically, the main argument is that the problem of optimizing the expected reward of a user can be decomposed into maximizing the expected reward obtained for every packet priority. In what follows, we simply state the new game theoretic formulation and the results. The proofs follow the same lines as the proofs of the results in the previous sections and the details are omitted.

3.7.1 MAC for Wireless Networks with Packet Priorities

Consider a network model that is almost identical to the network model of Section 3.2 except that data packets can have different priorities. Specifically, assume there are $q$ packet priorities. Denote the packet priority of user $i$ by the discrete random variable $\eta_i$; then $\eta_i \in \{1, 2, \ldots, q\}$ for all $i$. A higher value of $\eta_i$ represents a higher priority. Let $\pi_i(\cdot)$ denote the probability mass function of $\eta_i$, i.e. $P(\eta_i = j) = \pi_i(j)$ and $\sum_{j=1}^{q} \pi_i(j) = 1$ for all $i = 1, 2, \ldots, K$. Each user accesses the medium by a transmission policy mapping channel states and packet priorities to transmit probabilities (see Fig. 3.3 for an example of a transmission policy for the case when there are 2 packet priorities):
$$u_i(\cdot, \cdot) : [0, \gamma_{\max}] \times \{1, 2, \ldots, q\} \to [0, 1], \tag{3.40}$$
which can also be rewritten as
$$u_i(\gamma, \eta) = \sum_{j=1}^{q} I(\eta = j)\, u_{i,j}(\gamma) \tag{3.41}$$
for all $i \in \{1, \ldots, K\}$.

Non-cooperative Transmission Game with Packet Priorities

In order to incorporate packet priorities into the transmission game, the transmission and waiting costs and the correct reception rewards are redefined as functions of actions and packet priorities. In a transmission time slot, if user $i$ does not transmit its current packet of priority $j$, a waiting cost $w_{i,j}$ is recorded. If a transmission is attempted, a transmission cost $c_{i,j}$ is incurred. However, if the packet is received correctly, the user receives a reward $\alpha_{i,j}$. Analogously to (3.10), the instantaneous reward of user $i$ is given by
$$r_{i,j} = \begin{cases} \alpha_{i,j} - c_{i,j} & \text{if } a_i = T \text{ and the packet is received correctly} \\ -c_{i,j} & \text{if } a_i = T \text{ and the packet is not received correctly} \\ -w_{i,j} & \text{if } a_i = W, \text{ i.e. user } i \text{ did not transmit.} \end{cases} \tag{3.42}$$
We assume the following condition
$$\alpha_{i,j} > c_{i,j} > w_{i,j} > 0 \quad \forall i = 1, 2, \ldots, K \text{ and } j = 1, 2, \ldots, q, \tag{3.43}$$
which ensures that a successful transmission is preferable to no transmission, and no transmission is preferable to an unsuccessful transmission. Furthermore, to enforce that the correct reception reward and the waiting cost increase with packet priority whereas the transmission cost decreases with packet priority, the following must hold for every user:
$$\alpha_{i,j} \ge \alpha_{i,l}, \quad w_{i,j} \ge w_{i,l}, \quad c_{i,j} \le c_{i,l} \tag{3.44}$$
for all $j > l$, $j, l \in \{1, \ldots, q\}$.

From (3.15) and the definition (3.41) for a transmission policy, the expected reward of user $i$ is given by
$$G_i\big(u_i(\cdot, \cdot), \{u_{-i}(\cdot, \cdot)\}\big) = \sum_{n=1}^{q} \pi_i(n) \Big( E_{F_i}\Big[ u_{i,n}(\gamma_i) \big( \alpha_{i,n} \Psi_i(\gamma_i; \{u_{-i}(\cdot, \cdot)\}) - c_{i,n} + w_{i,n} \big) \Big] - w_{i,n} \Big), \tag{3.45}$$
where $\Psi_i(\gamma_i; \{u_{-i}(\cdot, \cdot)\})$ is the function mapping the channel gain of user $i$ to its success probability given the policies of other users:
$$\Psi_i(\gamma_i; \{u_{-i}(\cdot, \cdot)\}) = E_{F_{-i}}\Big[ \sum_{k=1}^{K} \sum_{A_k \ni i} \prod_{l \in A_k^{-i}} s_l(\gamma_l) \prod_{j \notin A_k} \big(1 - s_j(\gamma_j)\big)\, \psi_i(\gamma_i, \gamma_{A_k^{-i}}) \Big], \tag{3.46}$$
where $s_j(\gamma) = \sum_{l=1}^{q} \pi_j(l)\, u_{j,l}(\gamma)$ is the average transmit probability for the given channel state $\gamma$. By comparing (3.45) with (3.15), it is easy to see that the problem of optimizing (3.45) can be decomposed into maximizing the reward obtained for each packet priority $j = 1, 2, \ldots, q$.
3.7.2 Structured Nash Equilibrium

The expected reward of a user $i$, given by (3.45), can always be maximized by a transmission policy that is threshold in the channel state for every packet priority. The following corollary is a generalization of Theorem 3.1 to incorporate packet priorities into the network model.

Corollary 3.3. Consider a multipacket reception network of $K$ active users, where $K$ is a finite integer, as described in Section 3.7.1. Assume that the reception model (3.4) of the network satisfies (3.6), (3.8) and (3.9). There exists a transmission policy that maximizes user $i$'s expected utility (3.45) and is of the following form:
$$u_i^*(\gamma, \eta) = \sum_{n=1}^{q} I(\eta = n)\, I(\gamma > \theta_{i,n}) \tag{3.47}$$
for some $\theta_i = (\theta_{i,1}, \ldots, \theta_{i,q}) \in [0, \gamma_{\max}]^q$.

Proof. The proof follows the same lines as the proof of Theorem 3.1. Hence we omit the details. □

A transmission policy of the form (3.47) is completely defined by the transmission threshold vector $\theta_i$, where each component corresponds to a packet priority. Hence, the dimensionality of the problem of estimating the best response transmission policy is reduced from infinity to $q$ (see also Fig. 3.4).

Corollary 3.4. Consider the transmission game for the multipacket reception network of $K$ active users with packet priorities as described in Section 3.7.1. Assume the reception model (3.4) satisfies (3.6), (3.8) and (3.9). There exists a Nash equilibrium profile at which every user adopts a transmission policy of the form (3.47).

Proof. See Section 3.10. □

Besides the above analytical results, it is easy to prove the existence of a symmetric Nash equilibrium for the symmetric non-cooperative game, analogously to Theorem 3.3 for the case when there is no packet priority. These minor results are skipped here to avoid repetition.
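To make the structure of (3.47) concrete, the short Python sketch below evaluates a per-priority threshold policy and the resulting average transmit probability $s_j(\gamma)$ used in (3.46). The function names, the threshold vector and the priority distribution are hypothetical values chosen only for illustration.

```python
def priority_threshold_policy(gamma, priority, thresholds):
    """Policy of the form (3.47): thresholds[n - 1] plays the role of theta_{i,n}."""
    return 1 if gamma > thresholds[priority - 1] else 0


def average_transmit_probability(gamma, thresholds, priority_pmf):
    """s(gamma): transmit probability averaged over the packet priority."""
    return sum(
        prob * priority_threshold_policy(gamma, n, thresholds)
        for n, prob in enumerate(priority_pmf, start=1)
    )


if __name__ == "__main__":
    thresholds = [5.0, 2.0]   # hypothetical: priority 2 transmits more readily
    priority_pmf = [0.7, 0.3]  # hypothetical P(priority = 1), P(priority = 2)
    for gamma in (1.0, 3.0, 6.0):
        print(gamma, average_transmit_probability(gamma, thresholds, priority_pmf))
```

The whole policy is stored as $q$ numbers, which is the dimensionality reduction noted after Corollary 3.3.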
3.7.3 Learning Best Responses via Stochastic Approximation

It has been mentioned that a transmission policy of the form (3.47) is completely defined by the transmission threshold vector $\theta_i \in [0, \gamma_{\max}]^q$. The gradient ascent algorithm given by (3.33) is then generalized (see Algorithm 3.3) to allow any user to adaptively estimate its optimal transmission policy without knowing the transmission policies or channel state distribution functions of other users, or the number of active users in the network. Given the optimal transmission policy estimation algorithm, the implementation of the stochastic best response dynamics algorithm is the same as Algorithm 3.2 and the details are omitted.

Due to Corollary 3.3, the problem of estimating the best response transmission policy for user $i$ can be written as
$$\sup_{\theta_i \in [0, \gamma_{\max}]^q} G_i(\theta_i, \theta_{-i}),$$
where the expected reward $G_i(\theta_i, \theta_{-i})$ of user $i$ is given by
$$G_i(\theta_i, \theta_{-i}) = \sum_{n=1}^{q} \pi_i(n) \Big( \int_0^{\gamma_{\max}} I(\gamma > \theta_{i,n}) \big( \alpha_{i,n} \Psi_i(\gamma, \theta_{-i}) - c_{i,n} + w_{i,n} \big)\, dF_i(\gamma) - w_{i,n} \Big), \tag{3.48}$$
where $\Psi_i(\gamma, \theta_{-i})$ is given by (3.46). The gradient of the objective function is then given by
$$\nabla_{\theta_{i,n}} G_i(\theta_i, \theta_{-i}) = -\pi_i(n) \big( \alpha_{i,n} \Psi_i(\theta_{i,n}, \theta_{-i}) - c_{i,n} + w_{i,n} \big) f_i(\theta_{i,n}). \tag{3.49}$$
Assume unbiased estimates of $\{\Psi_i(\theta_{i,n}, \theta_{-i}) : n = 1, 2, \ldots, q\}$, which is the success probability for channel state $\gamma_i = \theta_{i,n}$, can be obtained (e.g., by counting the number of ACKs and NACKs that are sent to the user from the base station while a power control scheme is used to ensure that the received SNR (or equivalent channel state) is equal to $\theta_{i,n}$). Then the best response transmission policy of a user can be adaptively estimated by Algorithm 3.3 below.

Algorithm 3.3.
• Initialization: $l = 0$, $\theta_i^{(0)} \in [0, \gamma_{\max}]^q$; select a step size $\epsilon_i$ such that $0 < \epsilon_i \ll 1$.
• For $l = 1, 2, \ldots$ do: if the packet priority is $n$, then obtain an estimate $\hat{\Psi}_i(\theta_{i,n}^{(l)}, \theta_{-i})$ and update
$$\theta_{i,n}^{(l+1)} = \theta_{i,n}^{(l)} - \epsilon_i \big( \alpha_{i,n} \hat{\Psi}_i(\theta_{i,n}^{(l)}, \theta_{-i}) - c_{i,n} + w_{i,n} \big). \tag{3.50}$$

Algorithm 3.3 can be deployed by any user to estimate its optimal transmission threshold vector without any knowledge of the transmission policies or channel state probability distribution functions of other users. As the update step size $\epsilon_i$ is constant (over time) for every user $i$, Algorithm 3.3 can track changes in the number of active users or statistical properties of the network. Additionally, the estimates $\theta_i^{(l)}$, $l = 1, 2, \ldots$, generated by Algorithm 3.3 converge in probability (i.e. weakly) to an optimum of the user's utility [97]. Lastly, (3.50) can also be used to construct a stochastic best response dynamics algorithm, analogously to Algorithm 3.2. In the numerical studies section, the performance and tracking capability of the stochastic best response dynamics using (3.50) for wireless networks with packet priorities are illustrated.

3.8 Numerical Studies

This section presents numerical examples to illustrate the structural results and the performance of the learning algorithms derived in the chapter. The numerical examples also illustrate the distributed and adaptive implementation of the non-cooperative game theoretic MAC solutions, as well as the performance gains due to the exploitation of CSI for MAC. The numerical studies are established in the following framework. For examples of wireless networks that have the multipacket reception capability as described in Section 3.2, we simulate time slotted CDMA networks with matched filter receivers and assume the SINR threshold reception model (3.5).
In particular, it is assumed that the Signal to Noise Ratio (SNR) represents the channel state, and that in a transmission time slot, a packet from user i is received successfully if and only if

$$\frac{\gamma_i}{1 + \frac{1}{N} \sum_{j \in A_k - \{i\}} \gamma_j} > \beta,$$

where $A_k$ is the set of indexes of all transmitting users, $\gamma_i$ is the channel state, i.e. SNR, of user i and $\beta$ represents the SINR threshold beyond which a transmission is received error-free. We assume that the underlying physical channel is Rayleigh fading, hence the channel state (SNR) is exponentially distributed, i.e. the channel distribution function for user i is $F_i(\gamma) = 1 - e^{-\gamma / \mu_i}$, where $\mu_i$ is the mean value of $\gamma_i$, $i = 1, 2, \ldots, K$.

Each non-cooperative transmission scheduling game is then completely defined by 6 parameters: the number of active users in the network $K$, the means of the exponentially distributed channel states $\mu_i$ for $i = 1, 2, \ldots, K$, the spreading gain of the CDMA code sequences $N$, the threshold $\beta$ of the SINR reception model, and the transmission and waiting costs for each user.

In order to analyse the simulation results, for every numerical example, we precompute an $\epsilon$-Nash equilibrium profile using a modified version of Algorithm 3.2. In particular, we select A = 1 and a new stopping criterion for Algorithm 3.2: the iterations are stopped when no player can gain more than $\epsilon = 0.0005$ by deviating from its current policy. In the numerical examples that we investigated, the best response dynamics Algorithm 3.1 (and hence Algorithm 3.2) always converges to the same Nash equilibrium, regardless of the initial profile. However, this does not imply that the Nash equilibrium is unique. In fact, examples of simple two-player transmission games that have more than one Nash equilibrium are easy to derive.
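The simulation framework described above can be illustrated with a minimal Python sketch. It is not the code used for the thesis experiments: the function names are hypothetical, the interference term is assumed to be scaled by $1/N$ (random signature codes with a matched filter receiver), and each user is assumed to transmit exactly when its SNR exceeds its threshold.

```python
import random


def db_to_linear(x_db):
    """Convert a value in dB to linear scale."""
    return 10.0 ** (x_db / 10.0)


def packet_received(gamma_i, interferers, beta, spreading_gain):
    """SINR threshold test (assumed form): interference is scaled by 1/N."""
    sinr = gamma_i / (1.0 + sum(interferers) / spreading_gain)
    return sinr > beta


def simulate_slot(thresholds, means, beta, spreading_gain, rng):
    """Simulate one time slot; return (transmit flags, success flags)."""
    # Rayleigh fading: each SNR is exponentially distributed with mean mu.
    gammas = [rng.expovariate(1.0 / mu) for mu in means]
    transmit = [g > th for g, th in zip(gammas, thresholds)]
    success = []
    for i, g in enumerate(gammas):
        if not transmit[i]:
            success.append(False)
            continue
        others = [gammas[j] for j in range(len(gammas))
                  if j != i and transmit[j]]
        success.append(packet_received(g, others, beta, spreading_gain))
    return transmit, success


def estimate_throughput(thresholds, means, beta, spreading_gain,
                        slots=20000, seed=0):
    """Monte Carlo estimate of (throughput per slot, success rate)."""
    rng = random.Random(seed)
    sent = 0
    ok = 0
    for _ in range(slots):
        tx, succ = simulate_slot(thresholds, means, beta, spreading_gain, rng)
        sent += sum(tx)
        ok += sum(succ)
    throughput = ok / slots
    success_rate = ok / sent if sent else 0.0
    return throughput, success_rate


if __name__ == "__main__":
    # Three symmetric users, mean SNR 10, N = 16, beta = 3; the common
    # threshold 4.0 is an arbitrary illustrative value, not an equilibrium.
    print(estimate_throughput([4.0] * 3, [10.0] * 3, 3.0, 16))
```

Sweeping the common threshold in such a sketch gives a rough picture of how the throughput and the success rate trade off, which is the quantity the best response functions balance against the transmission and waiting costs.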
[Figure 3.5 (plot not reproduced). Title: Best Response Correspondences in a 3-user Transmission Game. Axes: User 1 Tx Threshold, User 2 Tx Threshold. Legend entries include: Best Response of User 1, Best Response of User 3.]

Figure 3.5: The best response functions for a simple 3-player transmission game intersect at a unique point, i.e. the game has a unique Nash equilibrium at which every player deploys a threshold transmission policy.

A unique Nash equilibrium for a 3-player Transmission Game

We compute the best response functions $H(\theta_{-i}) : [0, \gamma_{\max}]^{K-1} \to [0, \gamma_{\max}]$ (defined by (3.27)) analytically using (3.30) for a symmetric three-player transmission game. The best response functions for all three players are then plotted in Fig. 3.5. The parameters of the simulation are: the mean of the channel state probability distribution function is $\mu = 10$ for all users, and the reception model is the SINR threshold reception model (3.5) for a CDMA network with random signature codes of spreading gain $N = 16$ and matched filter receivers. Assume the SINR threshold is $\beta = 3$ (i.e. approximately 4.8 dB). It can be seen from Fig. 3.5 that $H(\theta_{-i})$ is indeed non-increasing in every component of $\theta_{-i}$. Furthermore, Fig. 3.5 suggests that the system has a unique symmetric Nash equilibrium as the best response functions intersect at a unique point in the graph. In addition, computing the eigenvalues of the gradient matrix defined in Lemma 3.2 at this Nash equilibrium reveals that it is asymptotically stable. In other words, the best response dynamics algorithms (Algorithms 3.1, 3.2) will converge to the Nash equilibrium, given suitable initialization.

Convergence and tracking capability of stochastic approximation algorithm

Fig. 3.6 illustrates the convergence and adaptive capability of Algorithm 3.2 with A = 1 (i.e. the stochastic best response dynamics algorithm, where users take turns to update their policies using the gradient-ascent method).
The fixed parameters of the simulation are: the spreading gain is $N = 16$, the SINR threshold parameter is $\beta = 4$ dB, and the waiting and transmission costs are fixed at $w_i = 0$ and $c_i = 0.2$ for all users $i = 1, 2, \ldots, K$. Initially, the network has six active users that have a common exponential channel distribution function with mean $\mu = 8$. Every user uses (3.33) with the constant step size $\epsilon = 0.008$ to adaptively estimate its optimal transmission threshold. The chosen initial transmission threshold is $\theta^{(0)} = 2$. Each user is programmed to send a pilot signal once every 10 transmission time slots. A pilot signal is simply a signal transmitted with exact power control. A user then updates its policy immediately based on the feedback (i.e. ACK or NACK) obtained from this pilot signal, i.e. $m = 1$. The time slots at which a user sends a pilot signal are scheduled as follows: user i sends a pilot signal at transmission time slot n if and only if $n \bmod 10 = i$, i.e. the users take turns to update their policies. In other words, the users are updating their policies according to Algorithm 3.2 with $m = 1$ and A = 1.

[Figure 3.6 (plot not reproduced). Title: SA learning algorithm can track changes in the number of active nodes. Horizontal axis: Transmission Time Slot ($\times 10^5$). Curves: Nash Eq Tx Threshold; Estimates of the Nash Eq Tx Threshold. Annotations: A user becomes inactive; A user becomes active; Two users become active.]

Figure 3.6: Algorithm 3.2 with A = 1, m = 1 can adaptively estimate the Nash equilibrium transmission thresholds as the number of active users fluctuates. This is due to the tracking capability of the stochastic approximation algorithm given by (3.33).

A user is scheduled to leave the system at time slot t = 100000. A new user with the same channel distribution function as other users in the system (i.e.
exponential distribution with mean $\mu = 8$) arrives at time slot t = 280000. Two more users become active at time slot t = 500000. One of these two users has an exponential channel distribution function with mean $\mu_i = 6$, and the other user has an exponential channel distribution with mean $\mu_i = 14$. Fig. 3.6 shows that every time a user becomes active or inactive, the users in the system update their transmission thresholds to the new equilibrium in approximately 10000 transmission time slots. In real time, 10000 transmission time slots are on the order of seconds assuming a bandwidth of 1 MHz and a packet size of 1000 bits. It can also be seen that the two users that become active at time slot t = 500000 have two different transmission thresholds at the equilibrium point. This is predictable as the channel distribution functions of these two users differ from each other and from those of the other users. In comparison, the users that have the same channel distribution function deploy the same transmission threshold at equilibrium.

Emergence of a Nash equilibrium via Best Response Dynamics

We illustrate the convergence of Algorithm 3.2 via an additional numerical example. Specifically, we consider a symmetric network of five users with a common exponential channel distribution with mean $\mu = 8$, the spreading gain is $N = 16$, the QoS requirement is $\beta = 4$ dB, and the waiting and transmission costs are fixed at 0 and 0.2 respectively. The mapping (3.28) is numerically computed using (3.33). The stopping criterion is $\max_i |\theta_i^{(l+1)} - \theta_i^{(l)}| < \epsilon = 0.01$. The initial points are randomly generated. Algorithm 3.2 stops after 49 iterations. In Fig. 3.7, the estimates of the best response transmission thresholds in all 49 rounds of updating transmission policies are plotted. It can be observed that the best response transmission thresholds are quickly updated to a good neighbourhood of a Nash equilibrium.

System performance

Fig.
3.8 illustrates the gain in the system throughput at a Nash equilibrium point, which is estimated using Algorithm 3.2 with A = 1, in comparison to other decentralized transmission scheduling schemes. The network model parameters are: the users have the same exponential channel distribution with mean $\mu_i = 10$ for all $i = 1, 2, \ldots, K$, where $K$ is the number of active users in the network, the spreading gain is $N = 32$, the SINR threshold beyond which a packet is received error-free is $\beta = 4$ dB, and the waiting and transmission costs are fixed at 0 and 0.2 respectively. The empirical equilibrium system throughput is estimated for $K = 2, 3, \ldots, 40$. Fig. 3.8 shows that the Nash equilibrium system throughput improves with the number of active users. In comparison, when there is no transmission control, i.e. each user is greedy and always attempts to transmit, the system throughput degrades as the system becomes more heavily loaded. In addition, when decentralized CSI is not available (or not exploited) and each user transmits with a constant probability as proposed in [25], the system throughput also slightly decreases when the network size increases to close to and above the spreading gain $N = 32$.

[Figure 3.7 (plot not reproduced). Title: Best Response Dynamic for a 5-player Transmission Game. Horizontal axis: Number of rounds of updating Policies. Annotations: Pre-estimated Symmetric $\epsilon$-Nash Equilibrium Transmission Threshold; Initial Tx Thresholds.]

Figure 3.7: Algorithm 3.2 converges quickly to a good neighbourhood of a Nash equilibrium. In each iteration, the best response transmission policy of a user i is estimated using the gradient ascent method (3.33).

Furthermore, Fig. 3.9 depicts the successful transmission rates of the game theoretic optimal MAC, in comparison to other decentralized transmission scheduling schemes that do not exploit CSI.
The successful transmission rates are computed by the following formula:

$$R = \frac{T}{\sum_{i=1}^{K} p_i}, \qquad (3.51)$$

where $p_i$ is the average transmit probability of user i, given its transmission policy, and $T$ is the system throughput.

[Figure 3.8 (plot not reproduced). Horizontal axis: Number of active users. Curves: No Tx. Control; Optimal Decentralized Tx. Control without CSI; Game Theoretic Tx. Control with CSI.]

Figure 3.8: System throughput comparison between the game theoretic cross layer optimal MAC scheme in this chapter, the optimal decentralized transmission control scheme without CSI in [25] and the case when there is no transmission control.

[Figure 3.9 (plot not reproduced). Horizontal axis: Number of active users. Curves: Success Rate - No Tx. Control; Optimal Decentralized Tx. Sched. without CSI; Game Theoretic Tx. Sched. with CSI.]

Figure 3.9: Comparison of successful transmission rates between the game theoretic cross layer optimal MAC protocol in this chapter, the optimal decentralized transmission control scheme without CSI in [25] and the case when there is no transmission control.

[Figure 3.10 (plot not reproduced). Horizontal axis: Number of active users. Curves: Optimal Tx. Sched. with CSI; Game Theoretic Tx. Sched. with CSI.]

Figure 3.10: Price of anarchy in system throughput.

[Figure 3.11 (plot not reproduced). Horizontal axis: Number of active users. Curves: Optimal Tx. Sched. with CSI; Game Theoretic Tx. Sched. with CSI.]

Figure 3.11: Price of anarchy in successful transmission rate.

The above success rates directly measure the energy efficiency: a higher success rate implies fewer wasted transmissions. It can be seen from Fig.
3.9 that the game theoretic optimal MAC protocol has the highest successful transmission rates, which degrade very gradually with the system size. In other words, the exploitation of CSI improves both system throughput and energy efficiency. Furthermore, the system performance gains are significant, especially when the number of active users increases.

Lastly, in Fig. 3.10 and Fig. 3.11, the system performance of the game theoretic optimal MAC protocol is compared to that of the optimal MAC protocol derived in Chapter 2. The two figures show that the optimal MAC protocol outperforms the game theoretic MAC, but only by a small margin. In particular, even though the system throughput is almost identical for both MAC protocols, the successful transmission rates are higher for the optimal MAC protocol. The price of anarchy is more apparent when the system is heavily loaded with many active users. This is intuitive because when multiuser interference increases, mediating the conflicting interests is more important for maintaining the system performance, hence the optimality of the centralized design approach is more obvious.

Characterization of symmetric Nash equilibrium

In this example, a symmetric random access network is considered. The parameters of the simulation are: the users are symmetric and have a common exponential channel distribution with mean $\mu = 10$, the spreading gain is $N = 32$, the QoS requirement parameter is $\beta = 4$ dB, the waiting cost is fixed at 0, and the transmission cost is varied from 0.1 to 0.3. The symmetric Nash transmission threshold is estimated for network sizes from 5 to 32. It can be seen from Fig. 3.12 that as the number of active users or the transmission cost increases, the symmetric Nash equilibrium transmission threshold also increases. This is intuitive since an increase in the transmission cost implies that it is more expensive to transmit and each user should transmit with a lower probability.
The monotonic dependency of the symmetric Nash equilibrium transmission threshold on the number of active users has been proved in Theorem 3.4.

[Figure 3.12 (plot not reproduced). Title: Nash Equilibrium Transmission Thresholds versus Number of Active Sensors. Horizontal axis: Number of Active Sensors in the Network.]

Figure 3.12: The Nash Equilibrium transmission threshold increases with the number of active users in the network as proved in Theorem 3.4 for symmetric non-cooperative transmission games.

[Figure 3.13 (plot not reproduced). Title: Symmetric Nash Eq. Transmission Thresholds for Symmetric Transmission Game. Axes: Number of active users, Packet priority.]

Figure 3.13: Example of a Nash Eq. transmission policy.

[Figure 3.14 (plot not reproduced). Title: Adaptive Estimation of Nash Eq while the number of active user fluctuates. Axes: Tx. Time Slot, Packet Priority. Annotations: A user becomes active; A user becomes inactive; Two users become active.]

Figure 3.14: Adaptive estimation of the Nash Eq. transmission threshold vector.

Transmission Game with Packet Priorities

Here we study the analytical results and the performance of Algorithm 3.3 for the transmission game with packet priorities. Consider the same symmetric random access CDMA network with the SINR threshold reception model (3.5), where a transmission is successful if the SINR is greater than a threshold $\beta$. Assume the following parameters: spreading gain $N = 32$, SINR threshold $\beta = 4$ dB, packet priorities $n = 1, \ldots, 4$, transmission and waiting costs $c = 0.2$, $w = 0$ for all packet priorities, and the correct reception rewards $\alpha_1 = 0.7$, $\alpha_2 = 0.9$, $\alpha_3 = 1.1$, $\alpha_4 = 1.3$ for $n = 1, \ldots, 4$ respectively. Further assume that the underlying physical channel is Rayleigh fading, and the channel state (SNR) is exponentially distributed with the mean equal to 8 for all users.
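For illustration, the threshold update (3.50) can be simulated with a short Python sketch under the parameters listed above. This is a hedged sketch rather than the thesis implementation: the function names are hypothetical, packet priorities are assumed to be equiprobable, the bound `gamma_max = 30.0` is an arbitrary choice, the interference is assumed to be scaled by $1/N$, and users are assumed to take turns sending one pilot each.

```python
import random


def ack_for_pilot(pilot_snr, other_thresholds, mean_snr, beta,
                  spreading_gain, rng):
    """Return 1.0 (ACK) or 0.0 (NACK) for a pilot received at pilot_snr.

    Every other user draws an exponential SNR and transmits only if that
    SNR exceeds the threshold it is currently using.
    """
    interference = 0.0
    for th in other_thresholds:
        g = rng.expovariate(1.0 / mean_snr)
        if g > th:
            interference += g
    sinr = pilot_snr / (1.0 + interference / spreading_gain)
    return 1.0 if sinr > beta else 0.0


def update_threshold(theta, ack, reward, tx_cost, wait_cost, step, gamma_max):
    """One step of the update (3.50), projected onto [0, gamma_max]."""
    new_theta = theta - step * (reward * ack - tx_cost + wait_cost)
    return min(max(new_theta, 0.0), gamma_max)


def run_algorithm(num_users=7, rewards=(0.7, 0.9, 1.1, 1.3), tx_cost=0.2,
                  wait_cost=0.0, mean_snr=8.0, beta_db=4.0, spreading_gain=32,
                  step=0.008, gamma_max=30.0, pilots=20000, seed=1):
    """Users take turns sending pilots and updating one threshold each."""
    rng = random.Random(seed)
    beta = 10.0 ** (beta_db / 10.0)
    q = len(rewards)
    theta = [[2.0] * q for _ in range(num_users)]  # initial thresholds
    for t in range(pilots):
        i = t % num_users          # user whose turn it is
        n = rng.randrange(q)       # priority of its packet (assumed uniform)
        # Each other user applies the threshold of its own random priority.
        others = [theta[j][rng.randrange(q)]
                  for j in range(num_users) if j != i]
        ack = ack_for_pilot(theta[i][n], others, mean_snr, beta,
                            spreading_gain, rng)
        theta[i][n] = update_threshold(theta[i][n], ack, rewards[n],
                                       tx_cost, wait_cost, step, gamma_max)
    return theta


if __name__ == "__main__":
    for row in run_algorithm():
        print([round(v, 2) for v in row])
```

Because a single ACK or NACK is used as the estimate of the success probability, individual trajectories are noisy; the constant step size trades estimation accuracy for the ability to follow changes in the set of active users.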
Characterization of the symmetric Nash equilibrium: Fig. 3.13 illustrates that as the number of active users increases, the Nash equilibrium transmission threshold vector increases component-wise. Additionally, for a fixed number of active users, the Nash equilibrium transmission threshold decreases with packet priority, i.e. packets of higher priorities access the medium with higher probabilities.

Tracking Capability of Algorithm 3.3

In Fig. 3.14 the number of active users is initialized to be 7. A user becomes active at time t = 400000, inactive at time t = 1440000 and two users become active at time t = 2000000. Every user is programmed to send a pilot signal once every 10 transmissions, where a pilot signal is a signal transmitted with exact power control. Fig. 3.14 shows that the estimates generated by Algorithm 3.3 with the constant step size $\epsilon = 0.008$ converge in less than 100000 transmission time slots, i.e. on the order of seconds assuming a bandwidth of 1 MHz and a packet size of 1000 bits, to a good neighborhood of the new Nash Equilibrium transmission threshold vector each time a user becomes active or inactive. At the equilibrium, all users deploy the same transmission threshold vector.

3.9 Conclusions

Channel state information, which is estimated at the PHY layer, is exploited for transmission scheduling at the MAC layer for wireless networks with the MPR capability. The transmission scheduling problem is formulated as a non-cooperative game, assuming that all users are selfish and rational. It is proved that a user can always maximize its expected utility by a transmission policy that is threshold in the channel state. Furthermore, there exists a Nash equilibrium profile at which every user adopts a threshold transmission policy.
The optimality of threshold transmission policies converts the problem of estimating the optimal transmission policy, which is a function of the channel state, to that of estimating a single scalar. This estimation problem can be easily solved by the gradient-based stochastic approximation method. In the chapter, an algorithm that uses the gradient ascent method (with a constant step size to enhance the tracking capability) to estimate the best response transmission policy for each user is proposed. This algorithm has the capability to track slowly time varying networks.

This chapter studied the deterministic best response dynamics algorithm and its convergence to analyse the system behavior of the transmission game. However, except for the two-user transmission game, the convergence of the best response dynamics algorithm is not analytically proved. Nevertheless, the chapter contains an explicit characterization of the best response functions for the transmission game of an arbitrary number of users assuming Rayleigh fading channels. This characterization offers a means to numerically, but deterministically, verify local asymptotic stability of a Nash equilibrium, which is related to convergence of the best response dynamics algorithm. In simulations, the stochastic best response dynamics algorithm, where users take turns to update their policies using the proposed gradient ascent algorithm, always converges to a Nash equilibrium. Furthermore, the stochastic best response dynamics algorithm can track changes in the system, e.g., the arrival or departure of users, and adapt to a new Nash equilibrium.

Lastly, the existence of a structured Nash equilibrium and the stochastic-approximation-based learning algorithm are generalized for networks with packet priorities.

3.10 Analytical Proofs

This section contains proofs of most analytical results in the chapter.
Game Theoretic Approach for Medium Access Control 3.10.1 Proof of Theorem 3.2 When every user considers threshold transmission policies only, the space of pure policies is the set of possible values of the transmission thresholds, i.e. [0, -y max ], which is a compact, convex subset of IR + . A classical result on the existence of a Nash equilibrium for games with continuous utility functions ([105], [106], [107], also see Theorem 1.2 in [43]) states that if the space of pure policies of each player is a non-empty, compact, convex subset of an Euclidean space, in order to show the existence of a Nash equilibrium, it suffices to show that the utility function for each user is continuous with respect to all policies deployed by other users in the network and quasi-concave with respect to its policy. Consider user i. Rewrite the utility function of user i, given by (3.21), as Gi(ei, 0—i) 7max [ f''max f7max K =^ E EH si 0^k=1^Dit eAr Ien > 01) H 1 _^> oi )) ( J o Ar K dFi( Yi) — + wi] dFi('Yi) — wi - joi,j=i (3.52) Op j = 1,2, ... , K in (3.52) is to change the limits 0,(,),,,,7Ar _ fil ) is Lebesgue integrable since the function is It can be easily seen that the only role of of the integrals. Also note that Lebesgue measurable, defined over a bounded domain and has a bounded range. By the Fundamental Theorem of Calculus [109], GA 7 0_0 is absolutely continuous with respect to all 0i, j = 1, 2, ... , K . We now prove that G i (0 i , O_i) is quasi-concave with respect to O i . Indeed, we have G i (t9 i , 0_0 = [tIfi^O_i) J^ e "Vmax wit dFi (ryi ) —^ i — wi, (3.53) where tIi i (yi ,e9_i) is given by (3.13) with ui (-y i ) = I(7.) > 0i ): ^H^ > 0/) H ( 1 - I > Oi ))0i^- ) I • Wi (71, 0—i) = EfF_ ^k=1 Ak :ieAr icAr^JoAr Due to (3.6), (3.9) and (3.11), it is clear that T i (-yi , 0_,) is non-decreasing in yi and 105 Chapter 3. Game Theoretic Approach for Medium Access Control 0_,) — c, + w, < 0. Consider two cases: 1. 
If $\Psi_i(\gamma_{\max}, \theta_{-i}) - c_i + w_i \le 0$ then

$$\Psi_i(\theta_i, \theta_{-i}) - c_i + w_i \begin{cases} < 0 & \text{for all } \theta_i < \Delta, \\ = 0 & \text{for all } \theta_i \ge \Delta, \end{cases}$$

for some $\Delta \in [0, \gamma_{\max}]$. Consequently, $G_i(\theta_i, \theta_{-i})$ is strictly increasing for $\theta_i < \Delta$ and constant for $\theta_i \ge \Delta$. Overall, $G_i(\theta_i, \theta_{-i})$ is a non-decreasing, continuous function of $\theta_i$ over the bounded domain $[0, \gamma_{\max}]$. That is, $G_i(\theta_i, \theta_{-i})$ is quasi-linear and hence quasi-concave in $\theta_i$.

2. If $\Psi_i(\gamma_{\max}, \theta_{-i}) - c_i + w_i > 0$ then there exist $\Delta_1 \le \Delta_2$ in $[0, \gamma_{\max}]$ such that

$$\Psi_i(\theta_i, \theta_{-i}) - c_i + w_i \begin{cases} < 0 & \text{for all } \theta_i < \Delta_1, \\ = 0 & \text{for all } \Delta_1 \le \theta_i \le \Delta_2, \\ > 0 & \text{for all } \theta_i > \Delta_2. \end{cases}$$

$G_i(\theta_i, \theta_{-i})$ is then increasing for $\theta_i \in [0, \Delta_1)$, constant for $\theta_i \in [\Delta_1, \Delta_2]$, and decreasing for $\theta_i \in (\Delta_2, \gamma_{\max}]$. Recall that $G_i(\theta_i, \theta_{-i})$ is also continuous in $\theta_i$. Hence, $G_i(\theta_i, \theta_{-i})$ is quasi-concave in $\theta_i$.

Therefore, a Nash equilibrium exists. $\Box$

Remarks:
• From the properties that are listed above for $G_i(\theta_i, \theta_{-i})$, it can be seen that even though the function is only quasi-concave in $\theta_i$, any local maximizer $\theta_i^*$ of $G_i(\theta_i, \theta_{-i})$ is also a global maximizer.
• If (3.7) holds instead of (3.6) then the utility function $G_i(\theta_i, \theta_{-i})$ is strictly quasi-concave w.r.t. $\theta_i$.

3.10.2 Proof of Theorem 3.3

In this proof, we make use of Kakutani's fixed point theorem [111] (Lemma 20.1 in [41]).

Lemma 3.4. Let $X$ be a compact, convex subset of $\mathbb{R}^n$ and $f : X \to X$ be a set-valued function for which
• for all $x \in X$ the set $f(x)$ is nonempty and convex;
• the graph of $f$ is closed (i.e. for all sequences $\{x_n\}$ and $\{y_n\}$ such that $y_n \in f(x_n)$ for all $n$, $x_n \to x$, $y_n \to y$, we have $y \in f(x)$).
Then there exists $x^* \in X$ such that $x^* \in f(x^*)$.

Using Kakutani's fixed point theorem it will be shown that there exists a $\theta^* \in [0, \gamma_{\max}]$ such that if all users other than user i in the network deploy the threshold transmission policy $u^*(\gamma) = I(\gamma > \theta^*)$ then $u^*(\cdot)$ is also an optimal policy for user i.

Assume that users in the set $\{1, 2, \ldots, K\} - \{i\}$ use the transmission policy $u(\gamma) = I(\gamma > \theta)$.
In order to maximize its utility, user i must solve the optimization problem $\max_{\theta_i \in [0, \gamma_{\max}]} G_i(\theta_i, \theta)$, where

$$G_i(\theta_i, \theta) = \int_0^{\gamma_{\max}} I(\gamma_i > \theta_i) \big( \Psi_i(\gamma_i, \theta) - c + w \big) \, dF(\gamma_i) - w.$$

Due to the continuity and monotonicity of $\Psi_i(\gamma_i, \theta)$ and the bang-bang principle in [104] we have

$$\arg\max_{\theta_i \in [0, \gamma_{\max}]} G_i(\theta_i, \theta) = \{ \theta_i : \Psi_i(\theta_i, \theta) - c + w = 0 \}. \qquad (3.54)$$

Define a set-valued inverse function of $\Psi_i(\cdot, \cdot)$ as follows:

$$f(\theta) = \{ \gamma : \Psi_i(\gamma, \theta) - c + w = 0 \}, \qquad f : [0, \gamma_{\max}] \to [0, \gamma_{\max}]. \qquad (3.55)$$

Equation (3.54) can then be rewritten as $\arg\max_{\theta_i \in [0, \gamma_{\max}]} G_i(\theta_i, \theta) = f(\theta)$. We now prove that $f(\theta)$ satisfies the conditions of Kakutani's fixed point theorem.

Recall that $\Psi_i(\gamma_i, \theta)$ is continuous and non-decreasing in $\gamma_i$. Note that the conditions (3.9) and (3.26) imply that

$$\Psi_i(0, \theta) - c + w < 0 \quad \text{and} \quad \Psi_i(\gamma_{\max}, \theta) - c + w > 0. \qquad (3.56)$$

It follows from the inequalities in (3.56) and the continuity of $\Psi_i(\gamma_i, \theta)$ w.r.t. $\gamma_i$ that $f(\theta) \neq \emptyset$ for every $\theta \in [0, \gamma_{\max}]$. In addition, due to the continuity and monotonicity of $\Psi_i(\gamma_i, \theta)$ w.r.t. $\gamma_i$, $f(\theta)$ defined as in (3.55) is either a singleton set containing only one real number, or a closed, bounded interval of $\mathbb{R}_+$. Hence $f(\theta)$ is a compact, convex set for all $\theta \in [0, \gamma_{\max}]$. In addition, the closedness of the graph of $f(\theta)$ follows straightforwardly from the continuity of $\Psi_i(\gamma, \theta)$. Therefore, by Kakutani's fixed point theorem there exists a $\theta^*$ such that $\theta^* \in f(\theta^*)$. Therefore, a symmetric Nash equilibrium exists. $\Box$

3.10.3 Proof of Theorem 3.4

To distinguish between the two numbers of active users, the superscripts $(K)$ and $(K+1)$ are added to refer to the system of $K$ and $K+1$ active users respectively. From (3.55), it is clear that $\zeta$ and $\xi$ are the Nash equilibrium transmission thresholds for the transmission game of $K$ and $K+1$ players respectively if and only if $\Psi_i^{(K)}(\zeta, \zeta) - c + w = 0 \ \forall i = 1, \ldots, K$ and $\Psi_i^{(K+1)}(\xi, \xi) - c + w = 0 \ \forall i = 1, \ldots, K+1$.
Due to condition (3.8), the success probability of a user decreases when there is one more user in the system and all the users still use the transmission threshold (, i.e. 'I'' () > tp(K44)(C, C). It then follows that W,C, K+1) (C, C) — c w < c + w = 0. Consider two possibilities: Tr c, - _ ( ( ) • W (K+1) ( () — c + w = 0 then = e is a Nash equilibrium transmission threshold for the K + 1-user system. The statement of the theorem holds. • ec+1) (C, C) — c w < 0. In light of Theorem 4, there exists a Nash equilibrium transmission threshold e for the transmission game of K + 1 user. This transmission threshold satisfies if, i( K+1) (e, e) — c + w = 0 > W ic+1) ((, C) — c + w. This inequality implies that e > C (due to the continuity and monotonicity (non-decreasing) of .)). ^ 3.10.4 Proof of Lemma 3.1 For a two player game, in the n-th round of updating policies, the deterministic best response dynamic can be written as below. Or = H(B2 n-1) ) )^ ♦ W1 = 108 0 Chapter 3. Game Theoretic Approach for Medium Access Control e n) = H(Or) = 7 : Wi(-y, (9 91) ) — c2 + w2 = 0 Due to the monotonicity of the best response function H(.), it is easy to see that 0 7/±1) > 4,(n) _ dn+1) < a, (n) __„, a (n+2) > n (n+1) n+1) > n ( n) __, n (n) < '1^'2 ^'2 = '1 0(n)^n( ^'1 ^and O 1(n+1) <^1^ '2 ^v^ 2 ' '1 0;7 44) . Hence, after the first iteration, the best response dynamic proceeds monotonically in both directions. Since the transmission thresholds only take values in [0, ry max ], the best response dynamic converges. 3.10.5 Proof of Lemma 3.3 We prove (3.30) by mathematical induction. The proof of (3.31) follows the same steps but is longer and its details are omitted. Rewrite (3.24) as follows: (7, 0_ i ) =E E II (1 - e —AeOP > E k=1 AL Di 10Ar^ jEAr-fil -yj and -yi > 19j Vj E A k — {i} ) In order to prove (3.30) we need to prove for all n > 1 that P (ct > > 91, • • •^> on) = I (a >^j) x j=1 =1 C -A^Oi e 3=1 n n-1 A/,-Aa (a —^0. 3 j=1 (3.57) 1=1 Indeed, (3.57) holds for n = 1. 
Assume that (3.57) holds for n, we now prove that it holds for n 1. n+1^ P (a > n+1 E•y3 ,'yi > 01, • • • 77/±1 > On+1) =^> E03 ) A, 3=13=1 109 Chapter 3. Game Theoretic Approach for Medium Access Control where a— E ei A =^j=1 P — 772+1 > en+1^ 7 n 7j, 71 > 9 1 / - / / / 7n > O a ) j=1 Ae —A-Yn+ 1 CPYrt+ 1 E^—A Ee; e —A-rn + i e^ j=1 A — ^e —A a ^a— fe n+i E n 1/,—A(a--yn d-i) (a — 7n+1 —^ E^ n-1 ' Ae —A-Yn+1 1=1 ^n+1^n =e —A E ej j =1 — e —Aa )1 j=1 (_1)11!^d7n+1 E A i e — Aa (a — n+1 )l E j=1 1= 1^ (_1)11! The proof of (3.31) follows exactly the same steps. Instead of (3.57), we have to prove the inequality below by mathematical induction. n P (a > j=1 7j) 7 1 > el, • • • , 7n > on ) =I (a > =1 n n Ell ( j=1 m=1 .), j=1 ^(1+ 3 - 1) n A m A— rn — A; (a— e^1=1 00) 7710:7 3.10.6 Proof of Corollary 3.4 The proof follows similar lines to the proofs of Theorem 3.2. Let consider the case when users can only deploy policies of the form (3.47). Then every transmission policy is determined by a vector: O .' = E [0, 'ymax ]q, which a compact, convex subset of K. By Theorem 1.2 in [43], in order to show the existence of a Nash equilibrium, it suffices to show that the utility function for each user is continuous w.r.t policies of other users and quasi-concave w.r.t its policy. 110 Chapter 3. Game Theoretic Approach for Medium Access Control Rewrite the expected reward (3.45) for user i as 7ri,n = n=1 f0:,max Cki ji Ti (7i^wi,ndFi(7i) — ci,n, ^(3.58) where Wi(yi, Li) is given by (3.46) with ui,n(eY) =^> 91,n) for all / = 1, 2, ... ,K; n = 1, . . . , q. • Continuity: It is easy to see that when (3.46) is rewritten in the integral form, the only role of 01, n is changing the limits of the integrals. By the fundamental theorem of calculus [109], G (6 Li) is a continuous function of 9l for all 1 E {1, 2,^, K} . • Quasi-concavity: We now prove that G(B Li) is quasi-concave w.r.t every component Oi, n of Bi . 
Indeed, due to (3.6), (3.9) and (3.11), $\Psi_i(\gamma, \theta_{-i})$ is non-decreasing in $\gamma$ and $\alpha_{i,n} \Psi_i(0, \theta_{-i}) - c_{i,n} + w_{i,n} < 0 \ \forall n = 1, 2, \ldots, q$. There are two cases. First, if $\alpha_{i,n} \Psi_i(\gamma_{\max}, \theta_{-i}) - c_{i,n} + w_{i,n} \le 0$ then $\alpha_{i,n} \Psi_i(\gamma, \theta_{-i}) - c_{i,n} + w_{i,n} \le 0 \ \forall \gamma \in [0, \gamma_{\max}]$. Then $G_i(\theta_i, \theta_{-i})$ given by (3.58) is non-decreasing and continuous and hence quasi-concave w.r.t. $\theta_{i,n}$. Second, if $\alpha_{i,n} \Psi_i(\gamma_{\max}, \theta_{-i}) - c_{i,n} + w_{i,n} > 0$ then there exists a $\Delta \in [0, \gamma_{\max}]$ such that

$$\alpha_{i,n} \Psi_i(\gamma, \theta_{-i}) - c_{i,n} + w_{i,n} \begin{cases} \le 0 & \text{for all } \gamma < \Delta, \\ \ge 0 & \text{for all } \gamma \ge \Delta. \end{cases}$$

$G_i(\theta_i, \theta_{-i})$ given by (3.58) is non-decreasing in $\theta_{i,n}$ for $\theta_{i,n} \in [0, \Delta)$ and non-increasing in $\theta_{i,n}$ for $\theta_{i,n} \in [\Delta, \gamma_{\max}]$. $G_i(\theta_i, \theta_{-i})$ is also continuous in $\theta_{i,n}$. Hence $G_i(\theta_i, \theta_{-i})$ is quasi-concave in $\theta_{i,n}$.

Therefore, by Theorem 1.2 in [43] a Nash equilibrium exists. Due to the optimality of threshold policies (Corollary 3.3), every Nash equilibrium for the restricted policy space is also a Nash equilibrium for the original game. The proof is complete. $\Box$

Chapter 4
Finite Horizon Transmission Scheduling

4.1 Introduction

In this chapter, quantized CSI at the physical layer is utilized for single-user opportunistic transmission scheduling in wireless networks where the fading channel is correlated over time. In other words, both CSI and the channel memory that resides in the fading process are utilized for cross-layer single-user optimal opportunistic transmission scheduling.

We consider the real time transmission of packets in the uplink of a wireless network (e.g., UMTS) over a correlated fading channel. Due to the effect of shadowing, multipath/multiuser interference, and mobility, wireless communication channels typically exhibit correlated fading and, in a discrete time system, can be modeled by the Finite State Markov Chain (FSMC) channel model. We adopt the FSMC channel model, assume that retransmissions are allowed via an ARQ protocol, and analyse the structure of the optimal transmission scheduling policy.
At each time slot a user has to decide whether to transmit a packet, given its CSI, the number of transmission time slots remaining and the number of packets in the buffer. In other words, a transmission scheduling policy is a function mapping the residual transmission time, the channel state information and the buffer occupancy state into an action of whether to transmit a packet (see Def. 4.1). The optimization problem is to find a transmission scheduling policy that minimizes an expected total cost, which is the sum of accumulated transmission costs and a data loss cost, subject to the constraint that each batch of a certain number of packets must be transmitted within a finite number of transmission time slots. The expected total cost represents a trade-off between energy efficiency and packet loss rate. The problem is formulated as a finite horizon Markov Decision Process (MDP).

[Figure 4.1 (plot not reproduced). Axes: Residual transmission time, Buffer occupancy state; the plotted quantity is the optimal action.]

Figure 4.1: The optimal transmission policy is threshold in the buffer state and the residual transmission time.

In this chapter, the concept of supermodularity is applied to the dynamic programming equations to prove that the optimal transmission policy is threshold in the residual transmission time and the buffer state. In general an MDP is computationally intractable to solve unless the optimal policy has a special structure. A threshold policy is of the form (where $u_n^*(s)$ denotes the optimal action to choose at time n for system state s):

$$u_n^*(s) = \begin{cases} \text{Action 1} & \text{if state } s < T_n, \\ \text{Action 2} & \text{otherwise.} \end{cases}$$

Thus if an MDP has a threshold policy, one only needs to compute the threshold ($T_n$ in the equation above) to implement the optimal policy. It is therefore of significant interest to derive sufficient conditions for which a threshold policy is optimal. This serves as the main motivation for the work in this chapter.
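To make the threshold structure concrete, the following Python sketch solves a small toy instance by backward induction and then inspects the resulting policy. It is only an illustration under assumed modelling choices, not the formulation of Def. 4.1: the three-state channel transition matrix, the per-state success probabilities, the constant transmission cost and the quadratic terminal penalty are all hypothetical.

```python
def solve_finite_horizon(P, success_prob, tx_cost, terminal_penalty,
                         horizon, max_buffer):
    """Backward induction for a toy transmission scheduling MDP.

    Returns (V, policy), both indexed as [slots_remaining][channel][buffer];
    policy[0] is None because no decision is made when no slots remain.
    """
    num_states = len(P)
    V = [[[terminal_penalty(x) for x in range(max_buffer + 1)]
          for _ in range(num_states)]]
    policy = [None]
    for n in range(1, horizon + 1):
        prev = V[n - 1]
        value_n, policy_n = [], []
        for h in range(num_states):
            # Expected cost-to-go per buffer level, averaged over the
            # next channel state of the Markov chain.
            W = [sum(P[h][g] * prev[g][x] for g in range(num_states))
                 for x in range(max_buffer + 1)]
            row_v, row_a = [W[0]], [0]  # empty buffer: nothing to send
            for x in range(1, max_buffer + 1):
                wait = W[x]
                send = (tx_cost + success_prob[h] * W[x - 1]
                        + (1.0 - success_prob[h]) * W[x])
                if send < wait:
                    row_v.append(send)
                    row_a.append(1)
                else:
                    row_v.append(wait)
                    row_a.append(0)
            value_n.append(row_v)
            policy_n.append(row_a)
        V.append(value_n)
        policy.append(policy_n)
    return V, policy


def is_nondecreasing(seq):
    """True if the sequence never decreases."""
    return all(a <= b for a, b in zip(seq, seq[1:]))


def check_structure(policy, horizon, num_states, max_buffer):
    """Return (threshold in buffer?, threshold in residual time?)."""
    in_buffer = all(is_nondecreasing(policy[n][h])
                    for n in range(1, horizon + 1)
                    for h in range(num_states))
    in_time = all(
        is_nondecreasing([policy[n][h][x]
                          for n in range(horizon, 0, -1)])
        for h in range(num_states) for x in range(max_buffer + 1))
    return in_buffer, in_time


# Hypothetical three-state FSMC (bad, medium, good) and cost parameters.
P = [[0.7, 0.3, 0.0], [0.2, 0.6, 0.2], [0.0, 0.3, 0.7]]
SUCCESS = [0.35, 0.65, 0.9]


def penalty(x):
    """Assumed terminal penalty: increasing with increasing differences."""
    return 0.4 * x * x


if __name__ == "__main__":
    V, policy = solve_finite_horizon(P, SUCCESS, 1.0, penalty, 10, 8)
    print(check_structure(policy, 10, len(P), 8))
```

Once such a structure is known to hold, only the switching points in the buffer level and in the residual time need to be stored per channel state, rather than a full table of actions.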
4.1.1 Main Results

The context of the work in this chapter is as follows. We consider a time slotted system and assume that the wireless fading channel evolves according to a Finite State Markov Chain (FSMC). Due to multipath interference and shadowing, the fading process of a wireless communication channel is usually correlated. FSMC models are widely used to abstract the physical layer correlated fading channel in the context of link and network layer algorithm design. Papers that are devoted to deriving explicit FSMC models for different fading channels include [7, 8, 13, 114, 115].

The chapter assumes that retransmissions are allowed via a channel-aware ARQ protocol. ARQ was proposed for real-time video transmission in [116] and has been shown to be more effective than FEC in [117]. The optimal transmission scheduling problem here can be interpreted as optimizing a channel-aware ARQ protocol to conserve energy while the channel evolves according to a Markov chain, subject to a delay constraint. The structural results presented in the chapter are the following.

1. If the terminal penalty cost is increasing in the number of untransmitted packets at the end of all transmission time slots (i.e. the terminal buffer occupancy state), then the optimal transmission policy is monotonically decreasing and hence threshold in the residual transmission time.

2. If the terminal penalty cost is increasing and furthermore has increasing differences in the terminal buffer occupancy state, then the optimal transmission policy is monotonically increasing and hence threshold in the buffer occupancy state.

The above threshold results are non-trivial generalizations of the results in [118, 119] that allow for Markovian channels, as explained in the section below.
4.1.2 Related Work

In [118, 119], the authors used the concept of supermodularity to prove the monotone structure of the optimal resource allocation policy for constructing a finite number of components subject to penalty costs. The MDP formulation in this chapter can be abstracted as the problem of constructing a finite number of components (transmitting a batch of link layer packets) within a finite number of time slots in Markovian systems subject to penalty costs. In other words, the finite horizon transmission scheduling problem here is a generalization of the classical penalty cost MDP considered in [118] and [119], since we introduce into the model a Markovian system parameter (i.e. the channel). The proofs of the monotone results in [118] and [119] cannot be generalized to the case of a Markovian system state. The proofs in this chapter work straightforwardly for both cases, i.e. the case of a Markovian system state and the classical case in [118] and [119] where there is no system state (or equivalently, the system state is a constant). In this sense, the work presented in this chapter can be viewed as a generalization of the classical results in [118, 119].

The problem of designing a transmission scheme for a channel-aware ARQ protocol to balance average energy consumption against throughput over infinite time has been studied for a two-state Gilbert-Elliott channel model in [33] and for an i.i.d. channel in [120]. Furthermore, in [32] the authors analyzed the problem of transmission scheduling in Markovian fading channels for a power-delay trade-off under energy constraints with the use of a channel-aware ARQ protocol.

The remainder of the chapter is organized as follows. Section 4.2 contains the system model and the formulation of the transmission scheduling problem as a finite horizon MDP. Section 4.3 establishes the structural results on the optimal transmission policy.
Numerical examples are presented in Section 4.4. Section 4.5 contains the summary and conclusions of the chapter. Lastly, all mathematical proofs are given in Section 4.6.

4.2 Finite Horizon Transmission Scheduling Problem

This section presents the system model and the finite horizon MDP formulation of the transmission scheduling problem.

4.2.1 Packet Transmission in Correlated Fading Channels

Consider the real-time transmission of data packets in the uplink, e.g., from a mobile station to a base station, in a packet-switched cellular network (e.g., a UMTS network). Assume time is divided into slots of size equal to the time required for transmitting one (link layer) packet. At each time slot, the user has to decide whether to transmit a packet, depending on its channel state and buffer occupancy state. The dynamics of the channel state and the packet transmission model are described below, followed by the MDP formulation of the transmission scheduling problem.

Finite State Markov Chain Channel Model

In a time slotted system, a correlated fading wireless communication channel can be modelled by a Finite State Markov Chain (FSMC) [7, 8, 13, 114, 115]. The procedure for obtaining an FSMC channel model consists of partitioning the channel state domain into a finite number of ranges and computing the corresponding transition matrix.

We assume the following channel model. The channel state is divided into $M$ non-overlapping ranges, where each range is mapped into one channel state of the FSMC channel model. Specifically, let $\Gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_M\}$ be the finite channel state space, where $\gamma_i$ is a better channel state than $\gamma_j$, written $\gamma_i > \gamma_j$, for all $i > j$. At time $n$, the channel state is $X_n^c \in \Gamma$. Furthermore, $X_n^c$ evolves as a Markov chain according to a transition probability matrix $A^c = (a_{ij} : i, j = 1, 2, \ldots, M)$, where $P(X_{n+1}^c = \gamma_j \mid X_n^c = \gamma_i) = a_{ij}$.

The above FSMC model can approximate the physical dynamics or the error process of a wireless communication channel with different fading effects due to multipath time delay spread and mobility of network users. The FSMC model is popular in the literature dealing with wireless video transmission [121]. The FSMC model has been derived for Rayleigh and Rician fading channels in [7, 8, 114], for Rayleigh channels with adaptive modulation and coding in [13], and for frequency selective fading channels in [122]. Two commonly used channel models, the Binary Symmetric Channel (BSC) and the Gilbert-Elliott model, are simple special cases of the above FSMC model.

Packet Transmission with a Channel-aware ARQ Protocol

Assume that the transmitter has a buffer of finite capacity to store packets, which are to be transmitted with the use of an ARQ protocol for retransmissions. In a transmission time slot, if a packet is transmitted, a transmission cost is recorded and the probability that the transmission is successful depends on the channel state. The result of the transmission (an ACK or NACK) is received error free in the next transmission time slot. In other words, we assume an instantaneous, error-free feedback channel for the ARQ scheme. This is also an assumption in most work that considers ARQ for real-time data traffic in packet transmission systems, e.g., [123-125].

The retransmission mechanism of the ARQ protocol that we consider is as follows. Instead of using an ARQ protocol with a fixed persistence level, we assume that each batch of $L$ packets has to be transmitted within $N > L$ transmission time slots. If a packet is transmitted but not successfully received, it is kept in the buffer and will be retransmitted. However, at the end of all $N$ time slots, no more transmission is allowed. The ratio $N/L$ can be interpreted as the average persistence level of the ARQ scheme.
At the end of all $N$ transmission time slots, packets that are still in the buffer waiting to be transmitted or retransmitted are considered lost and a terminal penalty cost is incurred for these packets. The terminal penalty cost function should be application dependent and could be optimized to yield a good trade-off between energy consumption and the packet loss rate.

Remarks

• An advantage of this retransmission scheme is that it allows the link layer to have a dynamic persistence: the number of times that a packet can be retransmitted varies from 1 to $N$, i.e. ARQ can be highly persistent when necessary. As a result, more link layer packet errors can be recovered while the maximum delay introduced by ARQ remains bounded.

• The values of $N$ and $L$ vary from system to system. For instance, if we consider the transmission of video data in the QCIF format at the rate of 30 fps with a data rate of 2 Mbps and a packet size of 1024 bits, then each frame consists of about 70 packets. If each batch contains only packets from a single video frame then $L = 70$. Furthermore, if 30% of the bandwidth is reserved for retransmission then $N = 100$ for $L = 70$.

4.2.2 MDP Formulation

Here we formulate the problem of scheduling the transmissions of $L$ packets within $N$ time slots as a finite horizon MDP with a terminal data loss cost.

We first define the time axis. Let $n$ denote the residual transmission time, i.e. the number of remaining transmission time slots, $n = N, \ldots, 1, 0$. At time $n$, the system state is defined by a 2-tuple $X_n = [X_n^b, X_n^c]$, where $X_n^b$ is the buffer state, $X_n^b \in B = \{1, 2, \ldots, L\}$, which is the buffer state space, and $X_n^c \in \Gamma$ evolves as a Markov chain as described previously. In the remainder of the chapter, the system state space is denoted by $S = B \times \Gamma$. Furthermore, the action set for every buffer state $X^b > 0$ is $A = \{0, 1\}$, where 0 stands for not transmitting, and 1 for the action of transmitting a packet.
When $X^b = 0$, the action set is $A_0 = \{0\} \subset A$. At each time slot, if action $a \in A$ is selected, the user has to pay a transmission cost

\[ c(\cdot, \cdot) : A \times \Gamma \to \mathbb{R}, \tag{4.1} \]

where $c(\cdot, \cdot)$ is increasing in the action. In order to make physical sense, $c(\cdot, \cdot)$ must be decreasing in the channel state, i.e. the transmission cost is lower for a better channel state.

At the end of all $N$ time slots, packets that are still in the buffer are discarded and a data loss cost is incurred. That is, if the terminal buffer state $X_0^b > 0$, a data loss cost is recorded. The data loss cost function can be designed to meet specific application-dependent requirements. This chapter assumes the general model of a data loss cost function that is increasing in the terminal buffer state: $C_t(\cdot) : B \to \mathbb{R}$, where $C_t(x^b + 1) > C_t(x^b)$ for all $x^b \in B$ and $C_t(0) = 0$.

Furthermore, in a transmission time slot, if action $a$ is selected, the probability that a packet is successfully received is given by

\[ f_s(a, x^c) = \begin{cases} 0 & \text{if } a = 0, \\ p_s(x^c) & \text{if } a = 1, \end{cases} \tag{4.2} \]

where $p_s(\cdot) : \Gamma \to [0, 1]$ is the success probability function, given that a transmission is attempted. As a better channel state must imply a smaller error probability, $p_s(x^c)$ is assumed to be increasing in the channel state $x^c$.

The optimization problem is to minimize the expected total cost, i.e. the sum of all transmission costs and the data loss cost. A transmission scheduling policy, defined below, consists of $N$ decision rules for the $N$ decision epochs $n = 1, 2, \ldots, N$. A decision rule at time $n$ maps states to actions, and is denoted by $u_n : B \times \Gamma \to A$.

Definition 4.1. A transmission scheduling policy is a function mapping the residual transmission time, the buffer state and the channel state information into an action:

\[ \pi : \{1, \ldots, N\} \times S \to A. \tag{4.3} \]

In the chapter, a transmission scheduling policy is written as $\pi = (u_1(\cdot, \cdot), \ldots, u_N(\cdot, \cdot))$. For convenience, denote the space of admissible transmission scheduling policies by $\Upsilon$. The expected total cost of a policy $\pi = (u_n(\cdot, \cdot) : n = 1, 2, \ldots, N)$ for initial state $s = [x^b, x^c] \in S$ is given by

\[ V_\pi(x^b, x^c) = E\left\{ \sum_{n=1}^{N} c\big(u_n(X_n^b, X_n^c), X_n^c\big) + C_t(X_0^b) \,\Big|\, X_N^b = x^b, X_N^c = x^c \right\}. \tag{4.4} \]

The optimization problem is to find $\pi^* = (u_n^*(\cdot, \cdot) : n = 1, 2, \ldots, N)$ such that

\[ V_{\pi^*}(x^b, x^c) = V(x^b, x^c) = \inf_{\pi \in \Upsilon} V_\pi(x^b, x^c) \quad \text{for all initial states } [x^b, x^c] \in S = B \times \Gamma. \tag{4.5} \]

Remarks:

1. The above MDP formulation goes through for the general case of a countable action space, i.e. when $A = \{a_0, a_1, a_2, \ldots : a_i < a_j\ \forall i < j,\ a_0 = 0\}$, or even a continuous action space. The only modification required is in the definition (4.2) of the success probability function $f_s(a, x^c)$. In addition, the proofs of the monotone results in Section 4.3 hold for any countable or continuous action space. Therefore, the results presented in this chapter can be easily generalized to establish monotonicity of the optimal power control policy that trades off energy against QoS, or the monotonicity of the optimal policy of any other MDP that has the same mathematical abstraction (with a countable or continuous action space).

2. In the case when retransmissions are omitted, i.e. whenever a transmission is attempted a packet leaves the buffer with certainty, one can still consider the same transmission scheduling problem. As the transmission cost depends on the Markovian channel state, the scheduling problem is still a stochastic control problem. However, if the transmission cost is independent of the channel state, the optimization problem becomes deterministic.

3. This chapter incorporates the delay constraint into the transmission scheduling problem via the finite horizon MDP formulation with data loss costs.
In the next chapter, we propose an alternative approach, which is to formulate the transmission scheduling problem as an infinite horizon, average cost MDP with a constraint on an average delay cost. The advantage of an infinite horizon formulation is that the optimal policy is typically stationary. In Chapter 5, it is proved that the optimal transmission scheduling policy is a randomized mixture of two stationary deterministic policies that are threshold in the buffer occupancy state. This structural result makes the computation and implementation of the optimal transmission scheduling policy very efficient.

4.3 Optimality of Threshold Transmission Policies

In this section, the structure of the optimal transmission scheduling policy is studied. We first present the dynamic programming problem and then prove the structural results by applying the supermodularity concept to the recursive Bellman equation of optimality. At the end of the section, the significance of the structural results and some implementation issues are discussed.

4.3.1 Finite Horizon Dynamic Programming

The optimal transmission scheduling policy is $\pi^* = (u_n^*(\cdot, \cdot) : n = 1, 2, \ldots, N)$, where the decision rules $u_n^*(\cdot, \cdot)$, $n = 1, 2, \ldots, N$, are the solution of the following stochastic dynamic programming recursion, called Bellman's equation [59]:

\[ V_n(x^b, x^c) = \min_{a \in A} Q_n(x^b, x^c, a), \tag{4.6} \]

\[ u_n^*(x^b, x^c) = \arg\min_{a \in A} Q_n(x^b, x^c, a), \tag{4.7} \]

where

\[ Q_n(x^b, x^c, a) = c(a, x^c) + \sum_{x^{c'} \in \Gamma} p(x^{c'} \mid x^c) \Big[ f_s(a, x^c)\, V_{n-1}(x^b - 1, x^{c'}) + \big(1 - f_s(a, x^c)\big)\, V_{n-1}(x^b, x^{c'}) \Big], \tag{4.8} \]

with $V_0(x^b, x^c) = C_t(x^b)$ and $V_n(0, x^c) = 0$.

As is common in the MDP literature, we refer to $V_n(\cdot, \cdot)$ as the value (or optimal cost to go) function and $Q_n(\cdot, \cdot, \cdot)$ as the state-action cost function. The above MDP problem can be solved by a forward induction algorithm [59].
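The recursion (4.6)-(4.8) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the thesis's implementation: the function name `solve_finite_horizon`, the array layout, and the small parameter set at the bottom are assumptions made for the example. It reads (4.8) as removing one packet from the buffer on a successful transmission, and it iterates over the residual time n = 1, ..., N, starting from V_0 = C_t.

```python
import numpy as np


def solve_finite_horizon(N, L, A_c, p_s, cost, terminal_cost):
    """Induction on the residual time n = 1..N for (4.6)-(4.8).

    A_c[i, j]     : channel transition probability from state i to state j
    p_s[i]        : success probability when transmitting in channel state i
    cost[a, i]    : transmission cost c(a, x^c) for action a in channel state i
    terminal_cost : array of length L + 1 holding C_t(0), ..., C_t(L)
    Returns V and u, both of shape (N + 1, L + 1, M); u[0] is unused
    because no decision is taken at n = 0.
    """
    M = A_c.shape[0]
    V = np.zeros((N + 1, L + 1, M))
    u = np.zeros((N + 1, L + 1, M), dtype=int)
    V[0] = terminal_cost[:, None]            # V_0(x^b, x^c) = C_t(x^b)
    for n in range(1, N + 1):
        # EV[x^b, x^c] = sum over next channel states of p(x^c'|x^c) V_{n-1}(x^b, x^c')
        EV = V[n - 1] @ A_c.T
        for xb in range(1, L + 1):           # V_n(0, x^c) = 0 stays as initialised
            q_idle = cost[0] + EV[xb]
            q_tx = cost[1] + p_s * EV[xb - 1] + (1.0 - p_s) * EV[xb]
            Q = np.stack([q_idle, q_tx])     # row a holds Q_n(x^b, ., a)
            V[n, xb] = Q.min(axis=0)
            u[n, xb] = Q.argmin(axis=0)
    return V, u


# Illustrative parameters (assumptions for this sketch only).
N, L = 15, 10
A_c = np.array([[0.95, 0.05], [0.05, 0.95]])
p_s = np.array([0.3, 0.9])
alpha, beta = 0.07, 0.5
cost = beta * np.array([[0.0, 0.0], [1.0, 1.0]])   # c(a, x^c) = beta * a
terminal_cost = alpha * np.arange(L + 1) ** 2       # C_t(x^b) = alpha * (x^b)^2
V, u = solve_finite_horizon(N, L, A_c, p_s, cost, terminal_cost)
```

With a convex terminal cost such as the one chosen here, the structural results of Section 4.3.2 suggest inspecting `u[n, :, k]` for a single switch from 0 to 1 as the buffer state grows.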
In the section that follows, we study the structure of the optimal transmission scheduling policy $\pi^*$.

4.3.2 Threshold Structure of the Optimal Transmission Policies

Two threshold structural results will be established. First, Theorem 4.1 states that the optimal transmission scheduling policy is monotonically decreasing in the residual transmission time $n$. Since the action space is a two-element set, monotonicity implies that the optimal transmission policy is threshold (in the residual transmission time $n$). Furthermore, Theorem 4.2 states that if the data loss cost is integer convex (i.e. has increasing differences) in the terminal buffer state then the optimal transmission scheduling policy is monotonically increasing and hence threshold in the buffer state.

The proofs of both theorems rely on the same mathematical principle, which is that if the state-action cost function $Q_n(x^b, x^c, a)$ is supermodular (or submodular) in the action and another variable, such as $x^b$, then the optimal transmission scheduling policy is monotone in that variable. A simplified definition of supermodularity is given below. A comprehensive exposition of supermodularity and its application in sequential decision problems is given in [6].

Definition 4.2. A function $F(x, y) : X \times Y \to \mathbb{R}$ is supermodular in $(x, y)$ if

\[ F(x_1, y_1) + F(x_2, y_2) \ge F(x_1, y_2) + F(x_2, y_1) \quad \forall x_1, x_2 \in X,\ y_1, y_2 \in Y,\ x_1 > x_2,\ y_1 > y_2. \]

If the inequality is reversed, the function $F(\cdot, \cdot)$ is called submodular.

Supermodularity is a sufficient condition for optimality of monotone policies. Specifically, if $F(x, y)$ defined as in Def. 4.2 is supermodular (submodular) in $(x, y)$ then $y(x) = \arg\max_y F(x, y)$ is non-decreasing (non-increasing) in $x$ [6].
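Definition 4.2 and the monotone arg max property can be checked numerically on a small grid. The sketch below is only an illustration: the helper `is_supermodular` and the test function F(x, y) = xy - y^2 are assumptions chosen for the example. On a grid of consecutive integers it is enough to check adjacent pairs, because the general inequality is a sum of adjacent ones.

```python
import numpy as np


def is_supermodular(F):
    """Check F[i+1, j+1] + F[i, j] >= F[i+1, j] + F[i, j+1] for all adjacent pairs."""
    F = np.asarray(F)
    cross = F[1:, 1:] + F[:-1, :-1] - F[1:, :-1] - F[:-1, 1:]
    return bool(np.all(cross >= 0))


# Illustrative function on the grid {0..5} x {0..5}: rows index x, columns index y.
x = np.arange(6)[:, None]
y = np.arange(6)[None, :]
F = x * y - y ** 2

# For each x, the smallest maximiser over y; supermodularity makes it non-decreasing in x.
best_y = F.argmax(axis=1)
```

For this F the cross difference equals 1 everywhere, so the check passes, and `best_y` never decreases as x grows; negating F reverses the inequality and gives a submodular function.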
From the above definition of supermodularity, it is clear that $Q_n(x^b, x^c, a)$ being supermodular in $(a, n)$ and submodular in $(a, x^b)$ implies that the optimal transmission scheduling policy $\pi(n, x^b, x^c)$ is monotonically decreasing in the number of transmission time slots left $n$ and increasing in the buffer occupancy $x^b$, respectively. The proofs of the theorems are given in Section 4.6. The methodology used to prove the threshold results consists of two steps as follows.

1. Monotonicity of the optimal cost to go function $V_n(\cdot, \cdot)$.

2. Supermodularity: We show that the state-action cost function $Q_n(x^b, x^c, a)$ defined by (4.8) is supermodular in $(a, n)$ and submodular in $(a, x^b)$ (see Def. 4.2) using mathematical induction. The monotonicity, and hence the threshold structure, of the optimal transmission scheduling policy then follows straightforwardly.

In [126], when the system state is only a single variable, supermodularity of the state-action cost function is established by imposing a condition on the system state transition probability matrices for the two different actions. In particular, this condition is as follows. Consider the matrix obtained by taking the difference between the two transition probability matrices for the two different actions. In order to establish supermodularity of the state-action cost function, it must be the case that each row of this matrix has a tail sum that is dominated by the next row's tail sum. Here we have an augmented system state $(x^b, x^c)$ and there is no state representation under which the conditions in [126] are satisfied.

In the theorems below, instead of the restrictive conditions on the state transition probability matrices, we show that the optimal cost to go $V_n(x^b, x^c)$ has a special structure, which leads to the required modularity of $Q_n(x^b, x^c, a)$.
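To see how the structure of the cost to go carries over to the state-action cost, it may help to write out the difference between the two actions. The identity below is a sketch that assumes (4.8) with $f_s(0, x^c) = 0$ and $f_s(1, x^c) = p_s(x^c)$, as in (4.2), and a successful transmission removing one packet from the buffer:

\[ Q_n(x^b, x^c, 1) - Q_n(x^b, x^c, 0) = c(1, x^c) - c(0, x^c) - p_s(x^c) \sum_{x^{c'} \in \Gamma} p(x^{c'} \mid x^c) \Big[ V_{n-1}(x^b, x^{c'}) - V_{n-1}(x^b - 1, x^{c'}) \Big]. \]

The buffer state $x^b$ and the residual time $n$ enter the right-hand side only through the increment of $V_{n-1}$ in the buffer state, so any monotonicity of that increment in $x^b$ or in $n$ translates directly into submodularity or supermodularity of $Q_n$ in the corresponding pair.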
In particular, the optimal cost to go $V_n(x^b, x^c)$ is submodular in $(x^b, n)$ and integer convex in the buffer state $x^b$.

It should be noted that if the channel state $x^c$ does not evolve as a Markov chain but is a constant (or equivalently, when channel state information is not present), the supermodularity and submodularity of $Q_n(x^b, x^c, a)$ w.r.t. $(a, n)$ and $(a, x^b)$ respectively have been proved in [118]. However, the proofs in [118] are rather specific and cannot be generalized to the case of a Markovian channel. The proofs below are more systematic and work straightforwardly for both Markovian and constant channel states.

First, we prove that the optimal cost to go function $V_n(x^b, x^c)$ is monotone.

Lemma 4.1. The optimal cost to go function $V_n(x^b, x^c)$ defined by (4.6) is increasing in the number of remaining packets $x^b$ and decreasing in the number of remaining time slots $n$.

Proof. See Section 4.6.1.

In the theorems below, modular properties of the state-action cost function are established via mathematical induction.

Theorem 4.1. If the data loss cost $C_t(\cdot)$ is an increasing function (of the terminal buffer state) then the state-action cost function $Q_n(x^b, x^c, a)$ is supermodular in $(n, a)$, i.e.

\[ Q_n(x^b, x^c, 1) - Q_n(x^b, x^c, 0) \le Q_{n+1}(x^b, x^c, 1) - Q_{n+1}(x^b, x^c, 0). \tag{4.9} \]

As a result, the optimal transmission scheduling policy is threshold in the residual transmission time, i.e.

\[ u_n^*(x^b, x^c) = \begin{cases} 1 & \text{if } n \le n^*_{x^b, x^c}, \\ 0 & \text{otherwise,} \end{cases} \tag{4.10} \]

where $n^*_{x^b, x^c}$ is the threshold residual transmission time for channel state $x^c$ and buffer state $x^b$.

Proof. See Section 4.6.2.

In Theorem 4.2 below, we prove that if the data loss cost has increasing differences in the terminal buffer state, i.e. satisfies (4.11) below, then the optimal transmission policy is monotonically increasing and hence threshold in the buffer state, as illustrated in Figure 4.1.

Theorem 4.2. If the penalty cost $C_t(\cdot)$
is an increasing function and is integer convex as in (4.11) below,

\[ C_t(x^b + 2) - C_t(x^b + 1) \ge C_t(x^b + 1) - C_t(x^b) \quad \forall x^b \ge 0, \tag{4.11} \]

then the state-action cost function $Q_n(x^b, x^c, a)$ is submodular in $(x^b, a)$, i.e.

\[ Q_n(x^b, x^c, 1) - Q_n(x^b, x^c, 0) \ge Q_n(x^b + 1, x^c, 1) - Q_n(x^b + 1, x^c, 0). \tag{4.12} \]

As a result, the optimal transmission scheduling policy is threshold in the buffer state, i.e.

\[ u_n^*(x^b, x^c) = \begin{cases} 1 & \text{if the buffer state } x^b \ge x^{b*}_{n, x^c}, \\ 0 & \text{otherwise,} \end{cases} \tag{4.13} \]

where $x^{b*}_{n, x^c}$ is the threshold buffer state for channel state $x^c$ at residual transmission time $n$. Furthermore, $x^{b*}_{n, x^c}$ is increasing in $n$.

Proof. See Section 4.6.3.

Remark: The proofs of Theorems 4.1 and 4.2 are also valid for the case of a countable or continuous action space. Hence the monotonicity of the optimal policy holds for power/rate control problems or any other MDP with the same mathematical model but with a countable or continuous action space.

In summary, Theorems 4.1 and 4.2 mean that under assumption (4.11) the optimal transmission scheduling policy is an aggressive policy, in the sense that it is optimal to transmit more often when the residual transmission time is smaller or the buffer occupancy is larger. The result of Theorem 4.1 is intuitive. However, Theorem 4.2 on the monotonicity of the optimal policy $u_n^*(x^b, x^c)$ in the buffer state $x^b$ requires the less straightforward condition (4.11). In Section 4.4, numerical examples show that the optimal scheduling policy may or may not be threshold when (4.11) is not satisfied.

4.3.3 Implementation Issues and Discussion

In the theorems presented above, we have derived conditions under which the optimal transmission scheduling policy is monotone in two dimensions: the residual transmission time and the buffer state. The aim of this discussion is to provide more insight into the significance of these threshold results and how they can be exploited.
The results of Theorems 4.1 and 4.2 can be exploited to substantially reduce the computational complexity required to solve the stochastic dynamic programming problem given by (4.6) and (4.7) [59], [126]. In particular, in the forward induction algorithm (see [59]), the results of Theorems 4.1 and 4.2 translate to the following:

\[ u_n^*(x^b, x^c) = 0 \Rightarrow u_{n+1}^*(x^b, x^c) = 0, \qquad u_n^*(x^b, x^c) = 1 \Rightarrow u_n^*(x^b + 1, x^c) = 1, \tag{4.14} \]

for all $n = 1, 2, \ldots, N$; $x^b \in B$, $x^c \in \Gamma$. The number of times that (4.7) has to be solved is then reduced from $|\Gamma| \times |B| \times N$ to approximately $|\Gamma| \times \max(|B|, N)$, where $|\cdot|$ denotes cardinality, even in the worst case. Therefore, the forward induction algorithm becomes very efficient and can even be implemented as an online algorithm, which is necessary, for example, when the channel state transition probability matrix is time-varying and must be estimated in real time.

Given an optimal transmission scheduling policy, the packet transmission mechanism is outlined in Algorithm 4.1 below.

Algorithm 4.1.
Set the residual transmission time $n = N$ and the buffer state $x_N^b = L$.
For time $n = N, N - 1, \ldots, 1$:
    obtain the channel state information $x_n^c$ and the buffer state $x_n^b$;
    if $x_n^b \ge x^{b*}_{n, x_n^c}$ in (4.13) then TRANSMIT; if an ACK is received, $x_{n-1}^b = x_n^b - 1$, else $x_{n-1}^b = x_n^b$.

A secondary significance of the threshold results is that they reduce the memory required to store the optimal transmission scheduling policies. Consider an example with a larger state space. Assume that we need to transmit $L = 70$ packets within $N = 100$ time slots in a Markovian fading channel with $M = 10$ channel states. Without structural results, the memory required to store the optimal transmission scheduling policy is $N \times L \times M \times 1$ bit $= 70$ kbits. With the two-dimensional structural result, for each channel state $x^c$, we only have to store the increasing threshold $x^{b*}_{n, x^c}$.
It is easy to design a scheme for storing only $x^{b*}_{1, x^c}$ and the incremental values $x^{b*}_{n+1, x^c} - x^{b*}_{n, x^c}$. In this case, the memory required to store the optimal solutions is approximately $N \times M \times \Delta = \Delta$ kbits, where $\Delta$ is the number of bits required to store the incremental values. For $L = 70$, $\Delta$ should be at most 2 bits, hence the required memory is 2 kbits.

In order to further illustrate the significance of the threshold results, let us consider an example where we need to transmit $L = 10$ packets in $N = 15$ time slots in a Markovian fading channel with $M = 2$ channel states. The total number of transmission scheduling policies is $2^{NLM} = 2^{300}$. In comparison, the number of transmission scheduling policies that are threshold in both the residual transmission time and the buffer state is $(NL)^M = 150^2$. Hence, due to the threshold structure, the space of admissible policies is reduced significantly.

In summary, the two-dimensional threshold structure significantly improves the efficiency of the forward induction algorithm, which is the only method to explicitly compute the optimal transmission scheduling policy [59], and reduces the memory required to store the optimal solutions.

4.4 Numerical Studies

Having proved that the optimal transmission policies are threshold in the residual transmission time and the buffer occupancy, we now illustrate these structural results via numerical examples that arise from video data transmission.

Consider the real-time transmission of compressed video data in the uplink of a Rayleigh fading communication channel in a packet-switched 3G cellular network. Assume that the video data is transmitted in the QCIF format at the rate of 30 fps. Assume a channel bandwidth of 2 Mbps and a link layer packet size of 128 bytes per packet. Each compressed video frame then corresponds to 70 link layer packets after the encoding process. Assume that 30% of the bandwidth is reserved for retransmissions, i.e.
we have 100 time slots to transmit a batch of 70 coded packets.

Assume the underlying channel is frequency-flat Rayleigh fading with the following parameters: carrier frequency 1.9 GHz and mobile velocity $v = 5$ km/h, which leads to a maximum Doppler frequency of $f_{d,\max} = 8.7963$ Hz. Assume the SNR represents the channel state; then we can use the equal-probability SNR partitioning method in [8] to model the channel by a 5-state Markov chain (i.e. $\Gamma = \{\gamma_1, \ldots, \gamma_5\}$) with (approximately) the transition probability matrix $A^c = (a_{ij})_{5 \times 5}$, where $a_{k,k+1} = a_{k,k-1} = 0.1$ and $a_{k,k} = 0.8$ for $k = 2, \ldots, 4$, and $a_{1,2} = a_{5,4} = 0.2$, $a_{1,1} = a_{5,5} = 0.8$.

Assume that the transmission success probability function (defined by (4.2)) is given by $f_s(a = 1, \gamma_k) = p_k$, where $p_k$ is the $k$-th element of the vector $(0.3\ \ 0.4\ \ 0.55\ \ 0.8\ \ 0.98)$. Assume that each transmission has a cost of $\beta$ unit(s) for some $\beta > 0$, i.e. select $c(a, x^c) = \beta a$, $a \in A = \{0, 1\}$, for all channel states. In a video transmission system, picture quality degrades with the number of untransmitted packets, even though some data loss is acceptable. Therefore, intuitively, the data loss cost function should be increasing and integer convex in the number of untransmitted packets. For the purpose of illustrating the structural results, select the data loss cost function to be $C_t(x^b) = \alpha (x^b)^2$ for some $\alpha > 0$, so that $C_t(x^b)$ is integer convex. It can be expected that a larger value of $\beta$ will discourage a user from transmitting in order to conserve energy, whereas a large $\alpha$ will encourage a user to transmit to achieve a good throughput rate. Hence, $\alpha$ and $\beta$ should be selected for a good trade-off between energy efficiency and throughput.

[Figure 4.2: Trade-off between QoS and Energy Efficiency: as the cost of a transmission $\beta$ increases, the throughput rate decreases but the successful rate increases. Horizontal axis: transmission cost $\beta$. Curves: throughput and successful rate for the always-transmit policy and for optimal scheduling with $\alpha = 0.2$ and $\alpha = 0.45$.]

Trade-off between Throughput Rate and Energy Efficiency

The two factors that determine the trade-off between throughput rate and energy efficiency are the transmission cost function $c(\cdot, \cdot)$ and the terminal data loss penalty cost function $C_t(\cdot)$. Without a transmission cost, the optimal transmission scheduling policy would be to transmit at all times for the best throughput rate. In contrast, a reasonable transmission cost makes a user more economical and limits transmission in bad channel states. Here we still select $c(a, x^c) = \beta a$ and $C_t(x^b) = \alpha (x^b)^2$. The performance of our threshold MDP-based optimal transmission scheduling policy is then compared with the policy of always transmitting if there is at least one packet in the buffer. We refer to the latter scheme as the free-transmission scheme.

The performance statistics of the simulation depicted in Fig. 4.2 are the throughput and an efficiency factor, defined as follows:

\[ \text{Throughput} = \frac{\text{number of successful transmissions}}{\text{total number of packets: } L = 100}, \qquad \text{Efficiency} = \frac{\text{number of successful transmissions}}{\text{total number of transmissions}}. \]

It is seen from Fig.
4.2 that when the transmission cost is low, $\beta \approx 0$, the MDP-based optimal transmission scheduling policy offers almost identical throughput and energy efficiency to the free-transmission scheme. However, as the transmission cost increases, the MDP-based optimal transmission scheduling scheme offers a trade-off between throughput and energy efficiency. In addition, Fig. 4.2 shows that certain choices of $\alpha$ and $\beta$ are more (or less) beneficial. For example, $\alpha = 0.2$, $\beta = 0.4$ leads to a better trade-off than $\alpha = 0.45$, $\beta = 0.4$. The design of transmission and data loss cost functions is beyond the scope of this chapter.

Threshold Optimal Transmission Scheduling Policies

We consider the numerical example described previously with $\alpha = 0.07$ and $\beta = 0.5$ and solve the corresponding MDP for the optimal transmission scheduling policy $\pi^* = (u_n^*(x^b, x^c))$. In Fig. 4.3, $u_n^*(x^b, x^c)$ is plotted as a function of the number of packets remaining $x^b$ for $n = 20, 40, 60, 80$ and $x^c = \gamma_2$. It can be seen from the figure that the optimal transmission scheduling policies are indeed of the threshold structure and that the optimal threshold $x^{b*}_{n,x^c}$ increases with the residual transmission time $n$.

Figure 4.3: By Theorem 4.2, the optimal transmission policy is of the form (4.13): at time slot $n$, for a given channel state $x^c$, it is optimal to transmit if and only if there are at least $x^{b*}_{n,x^c}$ packets remaining. In addition, $x^{b*}_{n,x^c}$ is increasing in $n$. (Plot title: the optimal transmission policy is threshold in the buffer state; curves for $n = 20, 40, 60, 80$ time slots left; horizontal axis: number of packets in the buffer.)

In Fig. 4.4, we plot the optimal threshold $x^{b*}_{n,x^c}$ as a function of the residual transmission time $n$ for different channel states $x^c = \gamma_1$, $x^c = \gamma_2$, $x^c = \gamma_3$, and for two different channel state transition matrices: $A^c$ defined previously and $A^\epsilon = A^c + E$, where the nonzero elements of $E$ are $e_{x^b,\min(x^b+1,5)} = -e_{x^b,\max(x^b-1,1)} = \epsilon = 0.005$. It can be seen from Fig. 4.4 that $x^{b*}_{n,x^c}$ is indeed increasing in $n$. Furthermore, Fig. 4.4 also shows that the optimal transmission thresholds are only slightly larger for $A^\epsilon$. This means that the optimal transmission scheduling policies are not very sensitive to variations in the channel state transition matrix. Finally, Fig. 4.4 shows that, for this particular example, the optimal threshold buffer state $x^{b*}_{n,x^c}$ is decreasing in the channel state $x^c$, i.e. the optimal transmission scheduling policy is to transmit more aggressively when the channel is in a better state.

Figure 4.4: By Theorem 4.2, the optimal transmission policy is of the form (4.13). Here we show that the optimal buffer state threshold $x^{b*}_{n,x^c}$ in (4.13) is increasing in the number of time slots left $n$ for different channel states $x^c$. (Plot title: optimal threshold buffer state increases in the residual transmission time; curves for $x^c = \gamma_1, \gamma_2, \gamma_3$ under each of the two transition matrices; horizontal axis: number of transmission time slots remaining.)

Non-monotone optimal policy for a non-convex data loss cost function

Since supermodularity only gives sufficient conditions for the monotonicity of optimal policies, it is interesting to see whether the optimal transmission scheduling policy is threshold even when (4.11) does not hold. We now modify the data loss cost function so that (4.11) is not satisfied. Select

$$C_t(x^b) = \begin{cases} (\alpha x^b)^2 & \text{if } x^b < L/2, \\ \alpha L^2/4 + \delta x^b & \text{otherwise.} \end{cases} \tag{4.15}$$

We consider two cases, $\alpha = 0.1$, $\delta = 0.5$ and $\alpha = 0.1$, $\delta = 1$, and solve the dynamic programming problem for the optimal transmission scheduling policy $\pi^* = (u_n^*(x^b, x^c) : n = 1, 2, \ldots, N)$.
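The dynamic programming step used in these numerical examples can be sketched as backward induction on the state-action cost (4.16). The Python sketch below is illustrative only: the function and variable names are hypothetical, the two-state channel matrix and success probabilities are borrowed from the two-state example in this section, and the buffer size and horizon are reduced for brevity. It uses the convex cost $C_t(x^b) = \alpha (x^b)^2$; any other terminal cost, such as (4.15), can be passed in its place.

```python
import numpy as np

def backward_induction(P, fs, beta, terminal_cost, L, N):
    """Finite-horizon DP for the transmission scheduling MDP (sketch).

    P[i, j]          : channel transition probability from state i to state j
    fs[i]            : success probability of a transmission in channel state i
    beta             : cost of one attempt, c(a, x^c) = beta * a
    terminal_cost(x) : data loss penalty C_t for x packets left at the deadline
    Arrays are indexed by [slots remaining n, packets remaining x^b, channel].
    """
    M = len(fs)
    V = np.zeros((N + 1, L + 1, M))
    policy = np.zeros((N + 1, L + 1, M), dtype=int)
    q_gap = np.zeros((N + 1, L + 1, M))  # Q_n(., 1) - Q_n(., 0), as in (4.17)
    costs = np.array([terminal_cost(x) for x in range(L + 1)], dtype=float)
    V[0] = costs[:, None]                # V_0 = C_t for every channel state
    for n in range(1, N + 1):
        # EV[x, i] = sum_j P[i, j] * V[n-1, x, j]
        EV = V[n - 1] @ P.T
        q_wait = EV[1:]                                      # a = 0
        q_send = beta + fs * EV[:-1] + (1.0 - fs) * EV[1:]   # a = 1
        q_gap[n, 1:] = q_send - q_wait
        policy[n, 1:] = (q_send < q_wait).astype(int)
        V[n, 1:] = np.minimum(q_wait, q_send)
        V[n, 0] = EV[0]                  # empty buffer: nothing to send
    return V, policy, q_gap

# Illustrative parameters (assumptions, not the thesis simulation setup).
P = np.array([[0.95, 0.05], [0.05, 0.95]])
fs = np.array([0.3, 0.9])
V, policy, q_gap = backward_induction(
    P, fs, beta=0.5, terminal_cost=lambda x: 0.07 * x**2, L=30, N=40)
```

With a convex terminal cost, `q_gap` is nonincreasing in the buffer index, which is the submodularity condition behind the threshold structure; replacing `terminal_cost` with a non-convex function lets one check whether the resulting `policy` is still monotone.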
The optimal policy $u_n^*(x^b, x^c)$ is then depicted in Fig. 4.5 for $n = 1, 20$ and $x^c = \gamma_2$. It can be seen that $u_n^*(x^b, x^c)$ is monotone in $x^b$ for $\delta = 1$ but not for $\delta = 0.5$, even though the data loss cost functions have the same structure in both cases. Hence, when (4.11) is not satisfied, the structure of the optimal transmission scheduling policy is not intuitively predictable.

Figure 4.5: For the data loss cost function (4.15), which does not satisfy (4.11), the optimal transmission policy is monotone when $\alpha = 0.1$, $\delta = 1$, and non-monotone when $\alpha = 0.1$, $\delta = 0.5$. (Two panels titled "Optimal Transmission Scheduling Policy"; legend entries $n = 5$ and $n = 35$ for each value of $\delta$; horizontal axis: number of packets in the buffer.)

Optimal transmission scheduling for a two state FSMC channel model

We consider a two-element channel state space $\Gamma = \{\gamma_1, \gamma_2\}$ and two state transition matrices $A_1 = [0.95 \ 0.05; \ 0.05 \ 0.95]$ and $A_2 = [1 \ 0; \ 0 \ 1]$. Matrix $A_2$ corresponds to the special case of a degenerate Markov chain, where the channel remains in the initial state for the whole horizon. Assume that the function of transmission success probabilities (defined by (4.2)) is given by $f_s(a = 1, \gamma_1) = 0.3$, $f_s(a = 1, \gamma_2) = 0.9$, $f_s(a = 0, \cdot) = 0$. In Fig. 4.6, we plot the optimal buffer state threshold $x^{b*}_{n,x^c}$ as a function of the residual transmission time $n$ for $x^c = \gamma_1$ and $x^c = \gamma_2$ for both transition matrices. One would expect that the lines representing $x^{b*}_{n,\gamma_1}$ and $x^{b*}_{n,\gamma_2}$ for $A_1$ would lie somewhere between the lines representing $x^{b*}_{n,\gamma_1}$ and $x^{b*}_{n,\gamma_2}$ for $A_2$. Fig. 4.6, however, shows that there exists no intuitive relation between the two optimal solutions.
The conclusion is that even though it was shown in Fig. 4.4 that the optimal transmission scheduling policy is not highly sensitive to channel state transition probabilities, the optimal solution is completely different for a slowly time varying Markov chain in comparison to a constant (i.e. degenerate) Markov chain.

Figure 4.6: The optimal transmission policies for a slowly time varying two state Markov chain and a constant Markov chain. (Plot title: optimal buffer state transmission threshold for a two-element channel state space; curves for $x^c = \gamma_1$ and $x^c = \gamma_2$ under $[0.95, 0.05; 0.05, 0.95]$ and $[1, 0; 0, 1]$; horizontal axis: number of transmission time slots remaining.)

4.5 Conclusions

The problem of optimal transmission scheduling for point-to-point real time data transmission in a Markovian fading channel with a channel-aware ARQ protocol subject to a hard delay constraint is formulated as a finite horizon MDP with a data loss cost. By applying the concept of supermodularity to the recursive Bellman equation of optimality, we show that the optimal transmission scheduling policy is threshold in the residual transmission time and the buffer state.

The proofs of the structure of the optimal transmission scheduling policies are valid under several variations. For example, one can consider the case of a countable or continuous set of actions (e.g., for a power control problem) instead of a two-element action set. Another possible variation is to add a Markov arrival process to the model, which is likely to be useful for file transfer and information delivery in sensor networks. The proofs in this chapter generalize straightforwardly under these variations.

In this chapter, the delay constraint is embedded into the MDP formulation of the transmission scheduling problem via the finite horizon. In the next chapter, an alternative infinite horizon average cost constrained MDP formulation is considered. The structure of the constrained infinite horizon transmission scheduling policy is studied and very efficient algorithms for adaptive learning of the optimal solutions are proposed.

4.6 Analytical Proofs

This section presents the mathematical proofs of the main structural results in the chapter.

4.6.1 Proof of Lemma 4.1

We will prove that $V_n(x^b, x^c)$ is monotone in $x^b$ and $n$ by induction. $V_0(x^b, x^c)$ is increasing in $x^b$ since $C_t(\cdot)$ is increasing. From the definition of $V_n(x^b, x^c)$ given by (4.6) and (4.8), the monotonicity of $V_n(x^b, x^c)$ in $x^b$ follows immediately by induction. It is clear that $V_1(x^b, x^c) \le V_0(x^b, x^c)$ since

$$V_1(x^b, x^c) \le Q_1(x^b, x^c, 0) = \sum_{x^{c\prime} \in \Gamma} P(x^{c\prime} | x^c) V_0(x^b, x^{c\prime}) = C_t(x^b) = V_0(x^b, x^c).$$

The monotonicity of $V_n(x^b, x^c)$ in $n$ then follows straightforwardly from the definition of $V_n(x^b, x^c)$ given by (4.6), (4.8).

4.6.2 Proof of Theorem 4.1

First, note that the structures of the value function and the state-action cost function are tightly coupled. In order to see this, rewrite the state-action cost function $Q_n(x^b, x^c, a)$ given by (4.8) as:

$$Q_n(x^b, x^c, a) = c(a, x^c) + \sum_{x^{c\prime} \in \Gamma} P(x^{c\prime} | x^c) V_{n-1}(x^b, x^{c\prime}) + \sum_{x^{c\prime} \in \Gamma} P(x^{c\prime} | x^c) f_s(a, x^c) \big( V_{n-1}(x^b - 1, x^{c\prime}) - V_{n-1}(x^b, x^{c\prime}) \big) \quad \text{for } x^b > 0, \ n > 0. \tag{4.16}$$

From (4.16) it is clear that

$$Q_n(x^b, x^c, 1) - Q_n(x^b, x^c, 0) = c(1, x^c) - c(0, x^c) + \sum_{x^{c\prime} \in \Gamma} P(x^{c\prime} | x^c) f_s(1, x^c) \big( V_{n-1}(x^b - 1, x^{c\prime}) - V_{n-1}(x^b, x^{c\prime}) \big). \tag{4.17}$$

Therefore $Q_n(x^b, x^c, a)$ is supermodular in $(n, a)$ if and only if $V_n(x^b, x^c)$ is submodular in $(n, x^b)$. We then use the following lemma to prove the supermodularity of $Q_n(x^b, x^c, a)$ in $(n, a)$.

Lemma 4.2. If $C_t(\cdot)$
is an increasing function (of the terminal buffer state) then the optimal cost to go $V_n(x^b, x^c)$ satisfies the following submodularity condition:

$$V_n(x^b + 1, x^c) - V_n(x^b, x^c) \ge V_{n+1}(x^b + 1, x^c) - V_{n+1}(x^b, x^c), \tag{4.18}$$

for all $x^b \ge 0$, $x^c \in \Gamma$. As a consequence, the state-action cost function $Q_n(x^b, x^c, a)$ satisfies (4.9), i.e. is supermodular in $(n, a)$.

Proof. See Section 4.6.4. $\square$

Supermodularity of $Q_n(x^b, x^c, a)$ in $(n, a)$ implies that $u_n^*(x^b, x^c)$ defined by (4.7) is nonincreasing in $n$. Hence $u_n^*(x^b, x^c)$ is of the form (4.10).

4.6.3 Proof of Theorem 4.2

From (4.16) and (4.17) it is clear that $Q_n(x^b, x^c, a)$ is submodular in $(x^b, a)$ if and only if $V_n(x^b, x^c)$ has increasing differences in $x^b$. We then use the following lemma to complete the proof.

Lemma 4.3. If the penalty cost $C_t(\cdot)$ is an increasing function and satisfies (4.11), then the optimal cost to go $V_n(x^b, x^c)$ has increasing differences in the buffer occupancy state:

$$V_n(x^b + 2, x^c) - V_n(x^b + 1, x^c) \ge V_n(x^b + 1, x^c) - V_n(x^b, x^c), \tag{4.19}$$

for all $x^b \ge 0$, $x^c \in \Gamma$. As a consequence, the state-action cost function $Q_n(x^b, x^c, a)$ satisfies (4.12), i.e. is submodular in $(x^b, a)$.

Proof. See Section 4.6.5. $\square$

By Lemma 4.3, if $C_t(x^b)$ is nondecreasing in $x^b$ and satisfies (4.11), then $V_n(x^b, x^c)$ satisfies (4.19) and $Q_n(x^b, x^c, a)$ given by (4.16) is submodular in $(a, x^b)$. The submodularity of $Q_n(x^b, x^c, a)$ in $(a, x^b)$ implies that $u_n^*(x^b, x^c)$ defined by (4.7) is nondecreasing in $x^b$. Hence $u_n^*(x^b, x^c)$ has the structure (4.13). The result that $x^{b*}_{n,x^c}$ is increasing in $n$ follows from the fact that $u_n^*(x^b, x^c)$ is also monotonically decreasing in $n$ (see Theorem 4.1).

4.6.4 Proof of Lemma 4.2

We now prove that $V_n(x^b, x^c)$ is submodular in $(x^b, n)$ and hence $Q_n(x^b, x^c, a)$ is supermodular in $(n, a)$. The proof is by mathematical induction.
First, (4.18) holds for $n + x^b = 0$ since $V_n(x^b, x^c)$ is nonincreasing in $n$. Assume that (4.18) holds for $n + x^b = k$. We will prove that it holds for $n + x^b = k + 1$. Let $V_{n+1}(x^b + 1, x^c) = Q_{n+1}(x^b + 1, x^c, a_{11})$, $V_{n+1}(x^b, x^c) = Q_{n+1}(x^b, x^c, a_{10})$, $V_n(x^b + 1, x^c) = Q_n(x^b + 1, x^c, a_{01})$, $V_n(x^b, x^c) = Q_n(x^b, x^c, a_{00})$ for some $a_{00}, a_{01}, a_{10}, a_{11}$. We have to prove that

$$\begin{aligned}
& V_{n+1}(x^b + 1, x^c) - V_{n+1}(x^b, x^c) \le V_n(x^b + 1, x^c) - V_n(x^b, x^c) \\
\Leftrightarrow \ & Q_{n+1}(x^b + 1, x^c, a_{11}) - Q_{n+1}(x^b, x^c, a_{10}) - Q_n(x^b + 1, x^c, a_{01}) + Q_n(x^b, x^c, a_{00}) \le 0 \\
\Leftrightarrow \ & Q_{n+1}(x^b + 1, x^c, a_{11}) - Q_n(x^b + 1, x^c, a_{01}) - Q_{n+1}(x^b, x^c, a_{10}) + Q_n(x^b, x^c, a_{00}) \le 0 \\
\Leftrightarrow \ & \underbrace{Q_{n+1}(x^b + 1, x^c, a_{11}) - Q_{n+1}(x^b + 1, x^c, a_{01})}_{\le 0 \text{ (by optimality)}} + \underbrace{Q_{n+1}(x^b + 1, x^c, a_{01}) - Q_n(x^b + 1, x^c, a_{01})}_{A} \\
& - \underbrace{\big( Q_{n+1}(x^b, x^c, a_{10}) - Q_n(x^b, x^c, a_{10}) \big)}_{B} + \underbrace{\big( Q_n(x^b, x^c, a_{00}) - Q_n(x^b, x^c, a_{10}) \big)}_{\le 0 \text{ (by optimality)}} \le 0.
\end{aligned} \tag{4.20}$$

By the induction hypothesis we have

$$\begin{aligned}
A &= \sum_{x^{c\prime} \in \Gamma} P(x^{c\prime} | x^c) \Big( f_s(a_{01}, x^c) \big[ V_n(x^b, x^{c\prime}) - V_{n-1}(x^b, x^{c\prime}) \big] + \big(1 - f_s(a_{01}, x^c)\big) \big[ V_n(x^b + 1, x^{c\prime}) - V_{n-1}(x^b + 1, x^{c\prime}) \big] \Big) \\
&\le \sum_{x^{c\prime} \in \Gamma} P(x^{c\prime} | x^c) \big[ V_n(x^b, x^{c\prime}) - V_{n-1}(x^b, x^{c\prime}) \big].
\end{aligned}$$

Similarly, $B \ge \sum_{x^{c\prime} \in \Gamma} P(x^{c\prime} | x^c) \big[ V_n(x^b, x^{c\prime}) - V_{n-1}(x^b, x^{c\prime}) \big]$, which implies that $B \ge A$. Therefore, $V_n(x^b, x^c)$ satisfies (4.18), which implies that $Q_n(x^b, x^c, a)$ is supermodular in $(n, a)$ due to (4.16).

4.6.5 Proof of Lemma 4.3

We now prove (4.19) by mathematical induction. First, (4.19) holds for $n = 0$ due to (4.11). Assume (4.19) holds for $n = k$. We will prove that it holds for $n = k + 1$. Let $V_{k+1}(x^b + 2, x^c) = Q_{k+1}(x^b + 2, x^c, a_2)$, $V_{k+1}(x^b + 1, x^c) = Q_{k+1}(x^b + 1, x^c, a_1)$, $V_{k+1}(x^b, x^c) = Q_{k+1}(x^b, x^c, a_0)$ for some $a_0, a_1, a_2$.
We then have to prove that

$$\begin{aligned}
& V_{k+1}(x^b + 2, x^c) - V_{k+1}(x^b + 1, x^c) - V_{k+1}(x^b + 1, x^c) + V_{k+1}(x^b, x^c) \ge 0 \\
\Leftrightarrow \ & Q_{k+1}(x^b + 2, x^c, a_2) - Q_{k+1}(x^b + 1, x^c, a_1) - Q_{k+1}(x^b + 1, x^c, a_1) + Q_{k+1}(x^b, x^c, a_0) \ge 0 \\
\Leftrightarrow \ & \underbrace{Q_{k+1}(x^b + 2, x^c, a_2) - Q_{k+1}(x^b + 1, x^c, a_2)}_{A} + \underbrace{Q_{k+1}(x^b + 1, x^c, a_2) - Q_{k+1}(x^b + 1, x^c, a_1)}_{\ge 0 \text{ (by optimality)}} \\
& \underbrace{- \, Q_{k+1}(x^b + 1, x^c, a_1) + Q_{k+1}(x^b + 1, x^c, a_0)}_{\ge 0 \text{ (by optimality)}} - \underbrace{\big( Q_{k+1}(x^b + 1, x^c, a_0) - Q_{k+1}(x^b, x^c, a_0) \big)}_{B} \ge 0.
\end{aligned}$$

In addition, it follows from the induction hypothesis that

$$\begin{aligned}
A &= \sum_{x^{c\prime} \in \Gamma} P(x^{c\prime} | x^c) \Big[ f_s(a_2, x^c) \big( V_k(x^b + 1, x^{c\prime}) - V_k(x^b, x^{c\prime}) \big) + \big(1 - f_s(a_2, x^c)\big) \big( V_k(x^b + 2, x^{c\prime}) - V_k(x^b + 1, x^{c\prime}) \big) \Big] \\
&\ge \sum_{x^{c\prime} \in \Gamma} P(x^{c\prime} | x^c) \big( V_k(x^b + 1, x^{c\prime}) - V_k(x^b, x^{c\prime}) \big).
\end{aligned} \tag{4.21}$$

Similarly, $B \le \sum_{x^{c\prime} \in \Gamma} P(x^{c\prime} | x^c) \big( V_k(x^b + 1, x^{c\prime}) - V_k(x^b, x^{c\prime}) \big)$. Hence, $A - B \ge 0$. Therefore, $V_n(x^b, x^c)$ satisfies (4.19), which implies that $Q_n(x^b, x^c, a)$ is submodular in $(x^b, a)$ due to (4.16).

Chapter 5

Constrained Infinite Horizon Transmission Scheduling

5.1 Introduction

Channel memory and perfect quantized CSI at the PHY layer are again exploited for optimal opportunistic transmission scheduling at the MAC layer, assuming correlated fading channels. As mentioned previously, due to mobility and multipath propagation, the fading process of a wireless channel typically has short-term memory. In this chapter, channel memory as well as CSI are exploited to devise adaptive transmission scheduling policies, which adjust the transmission process based on the current and predicted states of the channel. A structural result that the optimal opportunistic transmission scheduling policy is monotone in the buffer occupancy state is proved.
The monotone result is then exploited to derive efficient optimal policy estimation algorithms.

Similarly to the previous chapter, we consider the real-time transmission of packets in the uplink of a wireless network over a correlated fading channel. The FSMC channel model is used to represent the correlated fading channel. Furthermore, it is also assumed that packets can be retransmitted via an ARQ protocol. Nevertheless, the framework of this chapter has important differences from that of the previous chapter. First, the trade-off between energy efficiency and delay is incorporated into the MDP formulation by setting the objective to be minimizing an average transmission cost subject to a constraint on an average delay (or waste of bandwidth) cost. Then the opportunistic transmission scheduling problem is formulated as a countable state, infinite horizon, average cost constrained Markov Decision Process (CMDP). Infinite horizon MDPs usually have stationary optimal policies, which can be computed by many dynamic programming or reinforcement learning methods [67]. In comparison, finite horizon MDPs, such as the formulation in the previous chapter, rarely have stationary optimal policies, and can be explicitly solved via forward/backward induction only [59].

The structure of the optimal transmission scheduling policy is analysed by the Lagrange dynamic programming method that was introduced in [74], and by applying the supermodularity concept to Bellman's equation of optimality. In particular, the Lagrange dynamic programming method uses a Lagrange multiplier to convert the constrained average cost MDP to an unconstrained average cost MDP. The supermodularity method is then used to prove that the unconstrained optimal transmission scheduling policy is threshold in the buffer state (see Fig. 5.1(a)).
It then follows from [73-75, 129] that the constrained optimal transmission policy is a randomized mixture of two deterministic transmission scheduling policies that are threshold in the buffer state, as outlined in Fig. 5.1(b).

As discussed in the previous chapter, threshold structural results are particularly useful since they reduce the computational complexity required to compute and implement the optimal solutions. For an infinite horizon MDP, where the optimal transmission policy is stationary, threshold-type structural results are even more significant as they change the problem of computing the optimal policy fundamentally.

In this chapter, the structural results are proved and then exploited to derive two algorithms for estimating the constrained optimal policy. The first algorithm is based on the Lagrange dynamic programming framework and consists of two steps. In the inner step, the (threshold) unconstrained optimal transmission scheduling policy is estimated via a threshold-policy Q-learning algorithm. In the outer step, the correct Lagrange multiplier is estimated via the Robbins-Monro method. In comparison, the second algorithm estimates the constrained optimal policy directly via parameterization and stochastic approximation. Specifically, the optimal policy is approximated by the sum of two logistic functions, which are defined by a finite number of parameters. Then the optimal parameters are estimated via a gradient based stochastic approximation algorithm. Neither algorithm requires knowledge of the channel state transition probability matrix, and both can adapt to changes in the state transition probability matrix or other system statistical properties.

The context of the work in this chapter is as follows. Similarly to Chapter 4, this chapter considers a time slotted (i.e. discrete time) point-to-point packet transmission system and assumes that the wireless channel evolves as a Finite State Markov Chain (FSMC). As discussed previously, FSMC models are widely used to abstract the physical layer correlated fading channel in the context of link and network layer algorithm design. Papers that are devoted to deriving explicit FSMC models for different fading channels include [7, 8, 13, 114, 115]. The commonly used Binary Symmetric Channel and Gilbert-Elliott models are special cases of the FSMC model.

Figure 5.1: The constrained optimal policy $u^*(x^b, x^c, x^y)$ is a randomized mixture of two policies that are threshold in the buffer occupancy state. (a) A policy that is threshold in the buffer state. (b) A randomized mixture of two threshold policies. (Both panels plot $\Pr(a = 1; x^b, x^c, x^y)$ against the buffer occupancy state.)

The transmitter has an infinite-capacity buffer to store packets for transmission and retransmission. Hence, the transmission scheduling MDP has a countable state space. At each time slot, new packets arrive at the buffer. Furthermore, given the instantaneous CSI, a decision on whether to transmit a packet must be made at each time slot. It is assumed that there is an error-free feedback channel so that the outcome of every transmission is known at the transmitter in the next time slot, and packets that are not successfully received remain in the buffer for later retransmission via an infinitely-persistent ARQ protocol. The assumption of such an error-free instantaneous feedback channel is common when ARQ is deployed for real time data traffic [123-125]. Furthermore, the packet arrival process is assumed to be i.i.d. This assumption is purely for convenience. The analytical results proved in this chapter only require that the packet arrival process is independent of the channel state and the buffer occupancy.

5.1.1 Main Results

The opportunistic transmission scheduling problem is formulated as an infinite horizon, countable state, average cost CMDP.
The structure of the optimal policy is analysed and algorithms are derived for adaptive estimation of the optimal policy. The results presented in this chapter include the following.

• Buffer stability: Lemma 5.1 presents a condition on the average packet arrival rate, the instantaneous delay cost and the success probability functions, for which every policy that satisfies the average delay constraint will induce a stable buffer and a recurrent Markov chain.

• We use the Lagrange multiplier method to convert the CMDP into an unconstrained MDP parameterized by a Lagrange multiplier. The unconstrained average cost optimal policy is proved to be deterministic and monotonically increasing (and hence threshold) in the buffer state in two steps. First, the unconstrained MDP satisfies the conditions given in [130] for which the average cost optimal policy is a limit point of a sequence of discounted cost optimal policies with discount factors approaching 1. Second, using the concept of supermodularity, we prove that for any discount factor $v \in (0, 1)$, there exists a stationary discounted cost optimal policy that is pure and monotonically increasing, hence threshold in the buffer state (see Theorem 5.1). Therefore, the unconstrained average Lagrangian cost optimal transmission policy is also threshold in the buffer state (see Fig. 5.1(a), Corollary 5.1).

• Then it follows from a result in Lagrange dynamic programming [73-75, 129] that the constrained average cost optimal transmission policy is a randomized mixture of two transmission policies that are threshold in the buffer state (see Fig. 5.1(b), Corollary 5.2).

• A reinforcement learning algorithm: The constrained optimal transmission policy can be estimated by the Lagrange Q-learning method outlined in Fig. 5.2. At the inner step of the method, the unconstrained average Lagrangian cost optimal policy is estimated for any given Lagrange multiplier. At the outer step, the Lagrange multiplier is updated by the Robbins-Monro method. For the inner step, we propose a threshold-policy Q-learning algorithm, which can estimate the unconstrained average Lagrangian cost optimal transmission policy much more efficiently than the non-structured Q-learning algorithm.

• A novel adaptive monotone-policy search algorithm that is based on stochastic approximation: Each monotone policy is approximated by a parameterized smooth function and the optimal policy estimation problem is converted to estimating some optimal parameters (see Fig. 5.3). We propose a simultaneous perturbation stochastic approximation (SPSA) algorithm that can estimate the optimal parameters in real time. The resulting learning algorithm does not require knowledge of system parameters and has tracking capability.

The analysis of the CMDP relies on several classical results on the existence of a stationary average cost optimal policy for countable state MDPs and the relation between average cost and discounted cost optimal policies. In addition, the monotone structure proof relies on the supermodularity concept.

5.1.2 Related Work

Many papers in the literature deploy an MDP approach to address optimal transmission, rate and power control problems. In [33], a finite state Partially Observed MDP formulation has been used to exploit CSI for optimal point-to-point file transmission, and optimality of a threshold policy has been proved for a Gilbert-Elliott communication model. In [32] the authors used a finite state, finite horizon MDP formulation for a fixed-size file transfer problem. It is observed in [32] that the probability that the file is successfully transmitted increases if CSI is exploited.
Another related work is [131], where the authors also utilized the Lagrange multiplier method to analyse the problem of optimal delay scheduling policies, subject to a constraint on the average power cost for a countable state space. The closest work to this chapter is [132]. In [132] the problem of optimal rate control for a MIMO system is formulated as a finite state, average cost CMDP and it has been proved that the optimal policy is monotone. This chapter uses the same Lagrange dynamic programming approach as in [132] to prove monotonicity. Mathematically, the major difference is the nature of the transition probabilities and hence the proof of supermodularity of the state-action cost function. In particular, in [132], retransmissions are not allowed, and hence the channel state does not affect the packet departure rate. In other words, channel state and buffer state are completely decoupled. In comparison, in this chapter the evolution of the channel state affects the evolution of the buffer state via the packet departure rate, hence the transmission model is more general and more complicated to analyse. Furthermore, in [132], it is assumed that the buffer size is finite and buffer overflow is avoided by an infinite penalty cost. Here we relax this assumption by allowing the buffer to be of infinite capacity and provide a condition for buffer stability instead.

The Lagrange multiplier method was first used in [74] to characterize the optimal policy for CMDPs for the case of a single constraint. In [75, 76, 129], the method in [74] was extended to the countable state space. A comprehensive reference for CMDPs is [73]. Studies on the existence of (unconstrained) optimal policies and computing methods for countable state MDPs include [83, 130, 133, 134]. Lastly, the monotone proof in this chapter is an application of supermodularity.
[126] and [6] are excellent references for supermodularity and its use in sequential decision problems. Other papers that illustrate the use of supermodularity to prove structural results for MDPs include [118, 119, 132].

Part of the chapter focuses on deriving optimal policy estimation algorithms. The two methods that we propose rely on Q-learning and gradient based stochastic approximation. Q-learning is a "model-free" method that is extremely useful for solving infinite horizon MDPs, where the transition probabilities that characterize how the system evolves may not be available. The principle of Q-learning is to gather information about the transition probabilities, as well as costs, through sampled trajectories of the system [135]. In [66, 136, 137] the authors provided an overview of the research revolving around Q-learning. In [138], the authors proposed modified Q-learning algorithms, which treat the structure of the Q-factors and the optimal policy as constraints, to solve the inner step of the Lagrange dynamic programming formulation they considered in [132]. The threshold-policy Q-learning algorithm proposed in this chapter differs from the algorithms in [138], and simply explores the set of threshold policies more extensively than the set of non-threshold policies.

The second method for learning the constrained optimal transmission scheduling policy utilizes a parameterization technique to convert the (discrete) policy optimization problem into a continuous optimal parameter estimation problem. We then use an adaptive simultaneous perturbation stochastic approximation (SPSA) algorithm for estimating the optimal parameters. Details on the general SPSA technique are given in [5].

The chapter is organized as follows. Section 5.2 contains a description of the system model and the CMDP formulation of the transmission scheduling problem.
Results on buffer stability and the Lagrangian formulation, which converts the CMDP to an unconstrained MDP, are established in Section 5.3. Section 5.4 proves the threshold structure of the unconstrained discounted cost optimal policy and analyses the structure of the unconstrained and constrained average cost optimal policies. The optimal policy learning algorithms are given in Section 5.5. Section 5.6 contains numerical examples and all analytical proofs are presented in Section 5.9.

5.2 Infinite Horizon Transmission Scheduling in Correlated Fading Channels

This section describes the system model and the formulation of the transmission scheduling problem as an infinite horizon average cost constrained MDP.

Consider the transmission of packets in the uplink of a wireless network, for example from a user to a local base station, in the following setting. Assume that time is divided into slots of equal size, indexed by $n = 0, 1, 2, \ldots$. During each time slot a decision on whether to transmit a packet is made, unless the packet storage buffer is empty. The decision is based on the current CSI, the buffer occupancy state and the number of new packets that arrive at the buffer at the beginning of the time slot. If a packet is transmitted but not successfully received, the packet remains in the buffer for (not necessarily immediate) retransmission via an ARQ protocol. Inherently, it is assumed that there is no limit on the number of times that a packet may be retransmitted, and that there is an instantaneous, error-free feedback channel, so that the result (ACK or NACK) of each transmission is perfectly known at the transmitter side.

Below, the FSMC channel model is briefly described for clarity, and to make the chapter self-contained. Then the CMDP formulation of the infinite horizon transmission scheduling problem is explained in detail.
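The slot-by-slot operation described above can be sketched as a short simulation. This is a minimal illustration under stated assumptions, not the simulation setup of the thesis: the function name, the two-state channel with a fixed probability of staying in the current state, the Bernoulli arrival probability, the success probabilities and the per-channel buffer thresholds are all hypothetical choices made for the example.

```python
import random

def simulate(thresholds, num_slots, arrival_prob=0.3,
             success_prob=(0.3, 0.9), stay_prob=0.95, seed=0):
    """Simulate slotted transmission with ARQ under a threshold policy.

    thresholds[c] : transmit in channel state c iff the buffer holds at
                    least this many packets (and is not empty)
    All numeric defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    buffer_len, channel = 0, 0
    stats = {"arrivals": 0, "attempts": 0, "successes": 0, "waiting_slots": 0}
    for _ in range(num_slots):
        transmit = buffer_len > 0 and buffer_len >= thresholds[channel]
        if transmit:
            stats["attempts"] += 1
            if rng.random() < success_prob[channel]:
                # ACK: the packet leaves the buffer.
                stats["successes"] += 1
                buffer_len -= 1
            # NACK: the packet stays in the buffer for later retransmission.
        elif buffer_len > 0:
            # Packets are waiting but no transmission is attempted.
            stats["waiting_slots"] += 1
        if rng.random() < arrival_prob:
            # A new packet arrives; it can be sent from the next slot on.
            stats["arrivals"] += 1
            buffer_len += 1
        if rng.random() >= stay_prob:
            channel = 1 - channel  # two-state Markov channel switches state
    stats["final_buffer"] = buffer_len
    return stats

idle = simulate((float("inf"), float("inf")), num_slots=1000)
greedy = simulate((1, 1), num_slots=1000)
```

Here `idle` never transmits and `greedy` transmits whenever the buffer is non-empty; comparing `attempts` and `waiting_slots` between such runs gives a rough feel for the trade-off between transmission cost and waiting cost that the CMDP formalizes.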
Finite State Markov Chain Model: Due to the effect of shadowing, multipath/multiuser interference, and mobility, the communication channel is correlated fading. For a discrete time system, a correlated fading wireless communication channel can be modeled by a Finite State Markov Chain (FSMC). The procedures for obtaining a FSMC channel model include properly partitioning the channel state domain into a finite number of ranges and computing the corresponding transition matrix [7, 8, 13, 114, 115]. We assume the following channel model.

Assume that the channel state is divided into $M$ non-overlapping ranges, where each range is mapped into one channel state of the FSMC channel model. Specifically, let $\Gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_M\}$ be the finite channel state space, where $\gamma_i$ is a better channel state than $\gamma_j$, written $\gamma_i > \gamma_j$, for all $i > j$. At time $n$, the channel state is $X_n^c \in \Gamma$. Furthermore, $X_n^c$ evolves as a Markov chain according to a transition probability matrix $A^c = (a_{ij} : i, j = 1, 2, \ldots, M)$, where $P(X_{n+1}^c = \gamma_j | X_n^c = \gamma_i) = a_{ij}$. It is assumed that $A^c$ satisfies the following first order stochastic dominance condition:

$$\sum_{k \ge l} a_{ik} \ge \sum_{k \ge l} a_{jk} \quad \text{for all } i > j \text{ and } l = 1, 2, \ldots, M. \tag{5.1}$$

The above condition implies that when the system is in the better state ($\gamma_i > \gamma_j$), its evolution will be in the tail region of the channel state space, i.e. the region of better states, with a larger probability. This condition is satisfied by most non-trivial channel models.

5.2.1 The Transmission Scheduling Problem

At each time slot $n = 0, 1, \ldots$, the system state is defined by a 3-tuple $X_n = [X_n^b, X_n^c, X_n^y]$, where:

• $X_n^c \in \Gamma$ is the channel state, which evolves as a Markov chain as described above.

• $X_n^b$ is the buffer state; $X_n^b \in \mathcal{B} = \{0, 1, 2, \ldots\}$, which is the infinite buffer state space.

• $X_n^y$ denotes the number of new packets, which can only be transmitted in time slot $n + 1$ or later; $X_n^y \in \mathcal{Y}$, where $\mathcal{Y}$ denotes the packet arrival event space. For simplicity, assume that $\mathcal{Y} = \{0, 1\}$ and $X_n^y$ has an i.i.d. probability mass function, i.e. $P(X_n^y = 1) = \delta$, $P(X_n^y = 0) = 1 - \delta$ for all time $n$. These assumptions on $\mathcal{Y}$ and the i.i.d. condition on the probability distribution of $X_n^y$ are purely for convenience: the analytical results in this chapter hold for any arrival process as long as $X_n^y$ is always finite and the probability distribution of $X_n^y$ does not depend on the buffer state $X_n^b$ and the channel state $X_n^c$.

Then the system state space is the countable set $\mathcal{S} = \mathcal{B} \times \Gamma \times \mathcal{Y}$.

At each time slot, the user has to decide whether to transmit a packet, unless the packet storage buffer is empty. That is, for all buffer states $X^b > 0$ the action set is $\mathcal{A} = \{0, 1\}$, where 0 stands for not transmitting and 1 for transmitting a packet. When $X^b = 0$, the action set is $\mathcal{A}_0 = \{0\} \subset \mathcal{A}$. There are costs associated with each action; these costs are defined below.

• At time slot $n$, the transmission cost depends on the action and the current channel state $X_n^c$ through the transmission cost function

$$c(\cdot, \cdot) : \mathcal{A} \times \Gamma \to \mathbb{R}_+. \tag{5.2}$$

In order to encourage transmissions in better channel states, $c(a, x^c)$ must be decreasing in the channel state $x^c$ for every action $a$.

• Furthermore, there is a waiting (or waste of bandwidth) cost, which is applicable when the buffer is not empty, i.e. when $X^b > 0$ (see footnote 8):

$$w(\cdot) : \mathcal{A} \to \mathbb{R}_+. \tag{5.3}$$

The waiting/waste of bandwidth cost defines the constraint in the optimal opportunistic transmission scheduling problem.

The definition above gives a waiting cost that is constant for all positive buffer occupancy states. For an infinite buffer, a constant waiting cost for all positive buffer occupancy states makes physical sense as each delayed transmission should be penalized by an equal cost. Furthermore, if the waiting cost indeed increases with the buffer occupancy state, each time a transmission is omitted, the average waiting cost is incremented.
That is, two policies that differ at only one time point will yield two different average waiting costs, which is not desirable in an infinite horizon MDP formulation.

In what follows, without loss of generality, it is assumed that when a transmission is not attempted, there is no transmission cost, i.e. $c(0, x^c) = 0$ for all $x^c \in \Gamma$. In comparison, when a transmission is attempted, there is no waiting cost, i.e. $w(1) = 0$.

Footnote 8: Some applications may require a cost $w_{x^b = 0}(0)$, for example to penalize the waste of bandwidth for the buffer state $X^b = 0$. In such a case, as long as $0 \le w_{x^b = 0}(0) \le w_{x^b > 0}(0) = w(0)$, where $w(\cdot)$ is defined in (5.3), all the results in the chapter hold.

Transmission success probabilities: If the transmission of a packet over the channel is attempted, i.e. action $a = 1$ is selected, a packet will be successfully received (and hence removed from the buffer) with a probability that depends on the channel state. The transmission success probability function is given by

$$f_s : \mathcal{A} \times \Gamma \to [0, 1], \tag{5.4}$$

where $f_s(\cdot, \cdot)$ is increasing in the channel state (as a better channel state leads to a higher success probability) and $f_s(0, x^c) = 0$ for every channel state $x^c$. It is assumed that the outcome of every transmission is known at the transmitter due to the error-free feedback channel. When a packet is not successfully received, it is kept in the buffer for later retransmission via an ARQ protocol. The Markovian evolution of the system is then given by

$$\begin{aligned}
\mathbb{P}(X_{n+1} \mid X_n, a = 0) &= \mathbb{P}(X_{n+1}^c \mid X_n^c)\, \mathbb{P}(X_{n+1}^y)\, I(X_{n+1}^b = X_n^b + X_n^y), \\
\mathbb{P}(X_{n+1} \mid X_n, a = 1) &= \mathbb{P}(X_{n+1}^c \mid X_n^c)\, \mathbb{P}(X_{n+1}^y)\, f_s(1, X_n^c)\, I(X_{n+1}^b = X_n^b + X_n^y - 1) \\
&\quad + \mathbb{P}(X_{n+1}^c \mid X_n^c)\, \mathbb{P}(X_{n+1}^y)\, \big(1 - f_s(1, X_n^c)\big)\, I(X_{n+1}^b = X_n^b + X_n^y),
\end{aligned} \tag{5.5}$$

where $I(\cdot)$ is the indicator function.

5.2.2 CMDP Formulation

The infinite horizon transmission scheduling problem is formulated as an average cost constrained MDP as follows.
A transmission scheduling policy consists of a decision rule for every time slot. A decision rule is a function that probabilistically maps the system states to actions. Formally, denote the collection of probability distributions on (Borel) subsets of $\mathcal{A}$ by $\mathcal{P}(\mathcal{A})$; then a decision rule is a function of the form [59]

$$u : \mathcal{B} \times \Gamma \times \mathcal{Y} \to \mathcal{P}(\mathcal{A}). \tag{5.6}$$

Furthermore, a deterministic decision rule is a function mapping system states to actions, $u : \mathcal{B} \times \Gamma \times \mathcal{Y} \to \mathcal{A}$. A transmission scheduling policy can then be written as

$$\pi = (u_n(\cdot, \cdot, \cdot) : n = 1, 2, \ldots), \tag{5.7}$$

where $u_n(\cdot, \cdot, \cdot)$ is the (deterministic or probabilistic) decision rule for time slot $n$. For an infinite horizon MDP, the only case of interest is when there exists a stationary optimal policy. A stationary policy is defined by a single decision rule, $\pi = (u_n(\cdot, \cdot, \cdot) = u(\cdot, \cdot, \cdot))$ for all $n = 1, 2, \ldots$. In this chapter we focus on stationary policies. A condition for the existence of an optimal stationary policy will be given in the next section. In the remainder of the chapter, a general policy is denoted by $\pi$ and a stationary policy by $u$. For convenience, denote the space of all policies by $\mathcal{T}$, the space of stationary policies by $\mathcal{T}_s$, and the space of stationary deterministic policies by $\mathcal{T}_{sd}$.

The average transmission and delay costs associated with a transmission policy $\pi$ for initial state $x_0 \in \mathcal{S}$ are respectively given by [67]

$$C_{x_0}(\pi) = \limsup_{N \to \infty} \frac{1}{N} \mathbb{E}_\pi \Big[ \sum_{n=1}^{N} c(a_n, X_n^c) \,\Big|\, X_0 = x_0 \Big], \tag{5.8}$$

$$W_{x_0}(\pi) = \limsup_{N \to \infty} \frac{1}{N} \mathbb{E}_\pi \Big[ \sum_{n=1}^{N} w(a_n) \,\Big|\, X_0 = x_0 \Big]. \tag{5.9}$$

The problem of optimal transmission scheduling is formally given by

$$C^*_{x_0} = \inf_{\pi \in \mathcal{T}} C_{x_0}(\pi) \tag{5.10}$$

such that

$$W_{x_0}(\pi) \le D \quad \forall x_0 \in \mathcal{S}, \tag{5.11}$$

where $D \in \mathbb{R}_+$ is the limit on the average waiting cost.

Remark: As in the previous chapter, the above CMDP formulation holds for the general case of a countable action space, i.e. when $\mathcal{A} = \{a_0, a_1, a_2, \ldots : a_i < a_j \ \forall i < j,\ a_0 = 0\}$, or even a continuous action space, with a straightforward modification of the definition (5.4) of the success probability function $f_s(a, x^c)$ (see Section 5.7).

5.3 CMDP Parameterization using Lagrange Multiplier Method

This section establishes a condition for recurrence of the Markov chains and presents the formulation of the unconstrained transmission scheduling MDP, which is parameterized by a Lagrange multiplier. The condition for recurrence of the Markov chains is essential for the analysis of the constrained and unconstrained MDPs, as the existence of an average cost stationary optimal policy always requires some condition on ergodicity of the corresponding Markov chains.

5.3.1 Buffer Stability and Recurrence of Markov Chains

The role of the average delay constraint is to ensure reasonable data transmission and throughput rates. Here we present a condition under which the buffer is stable for all feasible policies, i.e. policies that satisfy the average delay constraint.

Lemma 5.1. For convenience denote the smallest success probability by $f = \min_{x^c \in \Gamma} f_s(1, x^c)$. If

$$\frac{\delta}{f} < 1 - \frac{D}{w(0)}, \tag{5.12}$$

then every policy $\pi$ satisfying the constraint (5.11) induces a stable buffer and a recurrent Markov chain.

Proof. Let $\pi$ be a transmission policy that satisfies the constraint (5.11), i.e.

$$D \ge \limsup_{N \to \infty} \frac{1}{N} \mathbb{E}_\pi \Big[ \sum_{n=1}^{N} w(0)\, I(a_n = 0, X_n^b > 0) \,\Big|\, X_0 = x_0 \Big].$$

Consider two cases. First, if

$$\limsup_{N \to \infty} \frac{1}{N} \mathbb{E}_\pi \Big[ \sum_{n=1}^{N} I(X_n^b = 0) \,\Big|\, X_0 = x_0 \Big] > 0, \tag{5.13}$$

then it is clear that the buffer is stable. Furthermore, $\pi$ yields a recurrent Markov chain, as states with $X_n^b = 0$ are recurrent and communicate with all other states. Second, if (5.13) does not hold, i.e. $\limsup_{N \to \infty} \frac{1}{N} \mathbb{E}_\pi \big[ \sum_{n=1}^{N} I(X_n^b = 0) \mid X_0 = x_0 \big] = 0$, stability of the buffer and recurrence of the Markov chain are shown as follows.
The average successful transmission rate of policy $\pi$ is given by

$$\begin{aligned}
r_{x_0}(\pi) &= \liminf_{N \to \infty} \frac{1}{N} \mathbb{E}_\pi \Big[ \sum_{n=1}^{N} f_s(1, X_n^c)\, I(a_n = 1) \,\Big|\, X_0 = x_0 \Big] \\
&\ge f \liminf_{N \to \infty} \frac{1}{N} \mathbb{E}_\pi \Big[ \sum_{n=1}^{N} I(a_n = 1) \,\Big|\, X_0 = x_0 \Big] \\
&= f \Big( 1 - \limsup_{N \to \infty} \frac{1}{N} \mathbb{E}_\pi \Big[ \sum_{n=1}^{N} I(a_n = 0, X_n^b > 0) + I(X_n^b = 0) \,\Big|\, X_0 = x_0 \Big] \Big) \\
&= f \Big( 1 - \limsup_{N \to \infty} \frac{1}{N} \mathbb{E}_\pi \Big[ \sum_{n=1}^{N} I(a_n = 0, X_n^b > 0) \,\Big|\, X_0 = x_0 \Big] \Big) \\
&\ge f \Big( 1 - \frac{D}{w(0)} \Big) > \delta
\end{aligned}$$

by (5.12). Therefore the buffer is stable almost surely for policy $\pi$, i.e. the Markov chain is recurrent due to Foster's Theorem [139]. □

Throughout the chapter, we assume (5.12) and aperiodicity of the Markov chains. Then for every feasible policy, a steady state distribution and the limits in (5.8) and (5.9) exist [59]. Therefore, the average costs are well-defined and independent of the initial states. Lastly, (5.12) can be modified (by replacing $D$ with $D + \epsilon$) to ensure a stability margin for policies that "almost" satisfy the average delay constraint, up to some margin $\epsilon$.

In what follows, we drop the initial state in the notation of the average costs, except for a few cases when it is unclear whether the average costs depend on the initial states. First, given the stability condition in Lemma 5.1, the constrained average cost MDP formulated above can be rewritten as [67]

$$C^* = \inf_{\pi \in \mathcal{T}} C(\pi) \quad \text{s.t.} \quad W(\pi) \le D, \tag{5.14}$$

where

$$C(\pi) = \limsup_{N \to \infty} \frac{1}{N} \mathbb{E}_\pi \Big[ \sum_{n=1}^{N} c(a_n, X_n^c) \Big], \tag{5.15}$$

$$W(\pi) = \limsup_{N \to \infty} \frac{1}{N} \mathbb{E}_\pi \Big[ \sum_{n=1}^{N} w(a_n) \Big]. \tag{5.16}$$

5.3.2 Unconstrained Transmission Scheduling with Lagrangian Cost

Our approach to proving the structure of the constrained optimal transmission policy uses the Lagrange multiplier method to reformulate the CMDP as a parameterized unconstrained MDP.
For each Lagrange multiplier $\lambda$, define the instantaneous Lagrangian cost at time $n$ by

$$c(a, X_n; \lambda) = c(a, X_n^c) + \lambda w(a). \tag{5.17}$$

The Lagrangian average cost for a policy $\pi$ is then defined as

$$J_{x_0}(\pi; \lambda) = \limsup_{N \to \infty} \frac{1}{N} \mathbb{E}_\pi \Big[ \sum_{n=1}^{N} c(a_n, X_n; \lambda) \,\Big|\, X_0 = x_0 \Big]. \tag{5.18}$$

The corresponding unconstrained MDP is to minimize the above Lagrangian average cost:

$$J^*_{x_0}(\lambda) = \inf_{\pi \in \mathcal{T}} J_{x_0}(\pi; \lambda), \qquad \pi^*_\lambda = \arg\inf_{\pi \in \mathcal{T}} J_{x_0}(\pi; \lambda). \tag{5.19}$$

Assume that condition (5.12) of Lemma 5.1 holds and that the Lagrange multiplier is selected so that the constraint is met in a fashion consistent with the CMDP given by (5.10) and (5.11), in the sense that the average Lagrangian cost optimal policy satisfies the constraint (5.11). Then the unconstrained average Lagrangian cost optimal policy $\pi^*_\lambda$, defined by (5.19), induces a recurrent Markov chain. In that case, the Lagrangian optimal average cost $J^*_{x_0}(\lambda)$ is independent of the initial state $x_0$ and is denoted by $J^*(\lambda)$.

5.4 Structure of Optimal Transmission Scheduling Policy

The existence and structure of the constrained optimal transmission scheduling policy are established via several steps, which are outlined in the list below.
• Given the instantaneous Lagrangian cost (5.17), a discounted (Lagrangian) cost MDP is formulated. For discounted cost MDPs, value iteration algorithms converge and provide a basis for the analysis of the optimal policy structure. In this chapter, the concept of supermodularity is applied to the recursive value iteration equation to prove that the discounted cost optimal policy is deterministic and monotone (and hence threshold) in the buffer occupancy state.
• Under suitable conditions the average Lagrangian cost MDP can be viewed as the limit of a sequence of discounted cost MDPs with discount factors $\nu \to 1$ [59, 119, 130].
We use this method to show that there exists a stationary threshold transmission scheduling policy that minimizes the average Lagrangian cost defined by (5.18).
• The structure of the constrained optimal transmission policy then follows from the results in Lagrange dynamic programming for CMDPs with a single constraint in [74, 75, 129]. In particular, the results in [74, 75, 129] imply that there exists a solution to the CMDP (5.10) that is of the form
$$u^* = q\, u^*_{\lambda_1} + (1 - q)\, u^*_{\lambda_2}, \tag{5.20}$$
where $0 \le q \le 1$ and $u^*_{\lambda_1}$, $u^*_{\lambda_2}$ are the stationary unconstrained average cost optimal policies for some Lagrange multipliers $\lambda_1$ and $\lambda_2$.

In [74], it is assumed that the state space is finite and the action space is compact. These assumptions lead to the existence of a stationary (unconstrained) optimal policy. In this chapter, even though the state space is countable, the existence of an unconstrained optimal policy is verified by viewing the average cost MDP as the limit of some sequence of discounted cost MDPs as outlined above, using Lemma 5.1 and the theory provided in [130]. In [75, 129], the Lagrangian methodology was extended to the case of a countable state space.

5.4.1 Threshold Structure of Discounted Cost Optimal Policy

A well-known method for studying average cost MDPs is to relate them to discounted cost MDPs [59, 119, 130]. In this section we formulate the discounted (Lagrangian) cost MDP and analyse the structure of the discounted cost optimal policy. In the next section, the existence and structure of the average cost unconstrained optimal policy are established by viewing the average cost MDP model as the limit of a sequence of discounted cost MDPs with discount factors $\nu \to 1$.

Given the instantaneous Lagrangian cost (5.17), define the discounted Lagrangian cost associated with policy $u$ and discount factor $\nu \in (0, 1)$ by

$$J^\nu_{x_0}(u; \lambda) = \lim_{N \to \infty} \mathbb{E}_u \Big[ \sum_{n=1}^{N} \nu^{n-1} c(a_n, X_n; \lambda) \,\Big|\, X_0 = x_0 \Big], \tag{5.21}$$

where $0 < \nu < 1$ is the discount factor. The discounted cost MDP is to minimize the above cost. The existence of a stationary, deterministic discounted cost optimal policy follows straightforwardly from [59].

Define the optimal discounted cost by $V^\nu(x_0) = \inf_{u \in \mathcal{T}_{sd}} J^\nu_{x_0}(u; \lambda)$ (for notational convenience we omit the Lagrange multiplier $\lambda$ in the notation $V^\nu(x_0)$). $V^\nu(x_0)$ then satisfies Bellman's equation of optimality:

$$\begin{aligned}
V^\nu(x) &= \min_{a \in \mathcal{A}_x} \Big\{ c(a, x; \lambda) + \nu \sum_{x' \in \mathcal{S}} \mathbb{P}(x' \mid x, a) V^\nu(x') \Big\}, \\
u^*_\nu(x) &= \arg\min_{a \in \mathcal{A}_x} \Big\{ c(a, x; \lambda) + \nu \sum_{x' \in \mathcal{S}} \mathbb{P}(x' \mid x, a) V^\nu(x') \Big\},
\end{aligned} \tag{5.22}$$

where $x = [x^b, x^c, x^y] \in \mathcal{S}$. The discounted cost MDP can be solved via the value iteration algorithm given by

$$V^\nu_{n+1}(x) = \inf_{a \in \mathcal{A}_x} \Big\{ c(a, x; \lambda) + \nu \sum_{x' \in \mathcal{S}} \mathbb{P}(x' \mid x, a) V^\nu_n(x') \Big\}, \tag{5.23}$$

$$Q^\nu_{n+1}(x, a) = c(a, x; \lambda) + \nu \sum_{x' \in \mathcal{S}} \mathbb{P}(x' \mid x, a) V^\nu_n(x'). \tag{5.24}$$

Specifically, for any initialization of $V^\nu_0(x)$ for all $x \in \mathcal{S}$, the sequence $V^\nu_n(x)$ generated by (5.23) and (5.24) converges to the optimal discounted cost $V^\nu(x)$ for all states $x \in \mathcal{S}$.

The monotone structure of discounted cost optimal policies (for any discount factor $\nu \in (0, 1)$) is now established using the concept of supermodularity, which was introduced in Chapter 4 (also see [6]). Due to convergence of (5.23), (5.24) for all initial conditions, in order to show that the discounted cost optimal policy is monotonically increasing in the buffer state $x^b$ it suffices to show that $Q^\nu_\infty([x^b, x^c, x^y], a)$ is submodular in $(x^b, a)$ for all $x^b \ge 1$ for some initial condition (see footnote 9). This is completed via mathematical induction in the theorem below.

Footnote 9: See Chapter 4 for the definition of supermodularity.

Theorem 5.1. For any discount factor $\nu \in (0, 1)$ (and, implicitly, a Lagrange multiplier $\lambda$), the discounted (Lagrangian) cost optimal policy is deterministic and monotonically increasing in the buffer state component $x^b$ of the system state, i.e. $u^*_\nu(\cdot, \cdot, \cdot)$ given by (5.22) is of the form

$$u^*_\nu(x^b, x^c, x^y) = \begin{cases} 0 & \text{if } 0 \le x^b < b(x^c, x^y), \\ 1 & \text{if } b(x^c, x^y) \le x^b, \end{cases} \tag{5.25}$$

where $b(x^c, x^y) : \Gamma \times \mathcal{Y} \to \mathcal{B}$ defines the threshold for the pair of channel state and packet arrival event $(x^c, x^y)$.

Proof. See Section 5.9. □

Another analytical result that will be useful for the analysis of the average cost MDP based on discounted cost MDPs is the monotonicity of the value function $V^\nu([x^b, x^c, x^y])$ defined by (5.22) above.

Lemma 5.2. For any discount factor $\nu \in (0, 1)$, the optimal discounted cost $V^\nu([x^b, x^c, x^y])$ is increasing in the buffer state component $x^b$ and in the number of new packets $x^y$, and decreasing in the channel state $x^c$.

Proof. See Section 5.9. □

5.4.2 Threshold Structure of Unconstrained Average Cost Optimal Policy

In [119, 130], the authors provide the theory for relating average cost optimal policies to discounted cost optimal policies. Specifically, under suitable conditions given in [119, 130], the optimal average cost policy inherits the structure of the discounted cost optimal policies for some sequence of discount factors that approach 1. The following lemma from [130] defines the concept of limit points for a sequence of discounted cost optimal policies. It will be shown later that any such limiting policy is a candidate for average cost optimality.

Lemma 5.3. [130] Let $\nu_k$ be any increasing sequence of discount factors with $\lim_{k \to \infty} \nu_k = 1$. Let $(u^*_{\nu_k})$ be the associated sequence of discounted optimal stationary policies. There exist a subsequence $\beta_k$ of $\nu_k$ and a stationary policy $u \in \mathcal{T}_{sd}$ that is a limit point of $u^*_{\beta_k}$. That is, for every state $x = [x^b, x^c, x^y]$, there exists an integer $N(x)$ such that $u^*_{\beta_k}(x^b, x^c, x^y) = u(x^b, x^c, x^y)$ for all $k \ge N(x)$.

Any stationary policy $u$ given by Lemma 5.3 is an average cost optimal policy under Assumptions 1, 2, 3 in [130]. We verify these assumptions in Theorem 5.2.
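As a numerical aside to Section 5.4.1, the value iteration (5.23), (5.24) can be sketched on a truncated buffer. This is only an illustration under stated assumptions, not the procedure used in the thesis: the buffer limit `B`, the discount factor `nu = 0.9` and the multiplier `lam = 2.0` are arbitrary choices, the function names are hypothetical, and the channel, cost and arrival values are borrowed from the numerical example of Section 5.6. Truncating the buffer may distort the policy near the buffer limit, so the monotonicity check at the end is meant for inspection only.

```python
from itertools import product

# Values borrowed from the numerical example in Section 5.6.
A_C = [[0.85, 0.15, 0.0], [0.15, 0.70, 0.15], [0.0, 0.15, 0.85]]
F_S = [0.3, 0.5, 0.95]    # success probabilities f_s(1, gamma)
C_TX = [1.7, 0.8, 0.2]    # transmission costs c(1, gamma)


def value_iteration(A, f_s, c_tx, w0, lam, delta, nu, B, iters=300):
    """Value iteration (5.23)-(5.24) with the buffer truncated at B (assumption)."""
    M = len(A)
    states = list(product(range(B + 1), range(M), (0, 1)))  # x = (x_b, x_c, x_y)
    p_arr = (1.0 - delta, delta)                            # i.i.d. arrivals

    def actions(s):
        return (0, 1) if s[0] > 0 else (0,)                 # only a = 0 when empty

    def q(V, s, a):
        b, c, y = s
        # Lagrangian cost (5.17) with c(0, .) = 0, w(1) = 0; w(0) applies if b > 0.
        cost = c_tx[c] * a + lam * (w0 if (a == 0 and b > 0) else 0.0)
        succ = f_s[c] if a == 1 else 0.0
        b_keep = min(B, b + y)                              # no packet departs
        b_sent = min(B, b + y - 1) if a == 1 else b_keep    # one packet departs
        future = 0.0
        for c2, y2 in product(range(M), (0, 1)):
            p = A[c][c2] * p_arr[y2]
            future += p * (succ * V[(b_sent, c2, y2)]
                           + (1.0 - succ) * V[(b_keep, c2, y2)])
        return cost + nu * future

    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: min(q(V, s, a) for a in actions(s)) for s in states}
    policy = {s: min(actions(s), key=lambda a: q(V, s, a)) for s in states}
    residual = max(abs(V[s] - min(q(V, s, a) for a in actions(s))) for s in states)
    return V, policy, residual


def is_monotone_in_buffer(policy, B, M):
    """True if the action never decreases as the buffer state grows."""
    return all(policy[(b, c, y)] <= policy[(b + 1, c, y)]
               for b in range(B) for c in range(M) for y in (0, 1))


V, policy, residual = value_iteration(A_C, F_S, C_TX, w0=0.6, lam=2.0,
                                      delta=0.25, nu=0.9, B=20)
print("Bellman residual:", residual)
print("monotone in buffer state:", is_monotone_in_buffer(policy, 20, 3))
```

The returned residual is the sup-norm gap between the final iterate and one further Bellman update, which gives a quick check that the iteration has converged for the chosen discount factor.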
Theorem 5.2. Any stationary deterministic policy $u \in \mathcal{T}_{sd}$ given by Lemma 5.3 is an average cost optimal policy. In particular, there exist a constant $g = \lim_{\nu \to 1} (1 - \nu) V^\nu(x)$ for every $x$, and a function $h(x)$ with $-N \le h(x) \le M_x$, such that

$$g + h(x) = \min_{a \in \mathcal{A}_x} \Big\{ c(a, x; \lambda) + \sum_{x' \in \mathcal{S}} \mathbb{P}(x' \mid x, a) h(x') \Big\}. \tag{5.26}$$

Furthermore, the stationary policy $u$ is average cost optimal with an average cost $g$.

Proof. For any stationary policy $u$ given by Lemma 5.3 to be average cost optimal, the following assumptions have to be satisfied [130] (also see Theorem 12.1 in [73]):
• Assumption 1: For every state $x$ and discount factor $\nu$, the optimal discounted cost $V^\nu(x)$ is finite.
• Assumption 2: Assume a reference state 0. There exists a nonnegative $N$ such that $-N \le h_\nu(x) = V^\nu(x) - V^\nu(0)$ for all $x$ and $\nu$.
• Assumption 3: There exists a nonnegative $M_x$ such that $h_\nu(x) \le M_x$ for every $x$ and $\nu$. For every $x$ there exists an action $a$ such that $\sum_{x' \in \mathcal{S}} \mathbb{P}(x' \mid x, a) M_{x'} < \infty$.
• Assumption 3': Assumption 3 holds and, in addition, $\sum_{x' \in \mathcal{S}} \mathbb{P}(x' \mid x, a) M_{x'} < \infty$ for all $x$ and $a$.

Define the reference state $0 = [0, \gamma_M, 0]$. Consider the stationary policy $u(x^b, x^c, x^y) = I(x^b > 0)$, i.e. the policy of always transmitting. Under condition (5.12), policy $u$ induces a stable buffer, hence the expected time and cost for first passage from any state $x = [x^b, x^c, x^y] \in \mathcal{S}$ to state 0 are finite. Hence by Propositions 5(i) and 4(ii) in [130], Assumptions 1 and 3 above are satisfied. Assumption 3' then holds as the system evolution is given by (5.5). Assumption 2 is satisfied because, as shown in Lemma 5.2, $V^\nu([x^b, x^c, x^y])$ is increasing in $x^b$ and $x^y$ and decreasing in $x^c$, which implies that $V^\nu([x^b, x^c, x^y]) \ge V^\nu(0)$ for all $[x^b, x^c, x^y] \in \mathcal{S}$.

Due to Lemma 5.3 and Theorem 5.2, it can be claimed that the average cost optimal policy inherits the monotone structure of the discounted cost optimal policy [119].

Corollary 5.1.
For any Lagrange multiplier $\lambda$, there exists a stationary average (Lagrangian) cost optimal policy that is deterministic and monotonically increasing in the buffer state component $x^b$ of the system state $x = [x^b, x^c, x^y]$. That is, (5.18) can be minimized by a stationary policy of the form

$$u^*_\lambda(x^b, x^c, x^y) = \begin{cases} 0 & \text{if } 0 \le x^b < b(x^c, x^y; \lambda), \\ 1 & \text{if } b(x^c, x^y; \lambda) \le x^b, \end{cases} \tag{5.27}$$

where $b(x^c, x^y; \lambda) : \Gamma \times \mathcal{Y} \to \mathcal{B}$ defines the threshold for the pair of channel state and packet arrival event $(x^c, x^y)$ and the given Lagrange multiplier $\lambda$.

Proof. Follows from Lemma 5.3, Theorem 5.2 and Theorem 5.1. □

5.4.3 Monotonicity of Constrained Average Cost Optimal Policy

Having shown the threshold structure of the unconstrained average cost optimal policy, here we use the Lagrangian dynamic programming method introduced in [74, 75, 129] to prove that the constrained optimal transmission scheduling policy is a randomized mixture of two threshold policies.

Corollary 5.2. There exists a stationary transmission scheduling policy $u^*$ that is the optimal solution of the CMDP given by (5.10) such that $u^*$ is a randomized mixture of threshold transmission scheduling policies as follows:

$$u^*(x^b, x^c, x^y) = q\, u^*_{\lambda_1}(x^b, x^c, x^y) + (1 - q)\, u^*_{\lambda_2}(x^b, x^c, x^y), \tag{5.28}$$

where $u^*_{\lambda_1}(x^b, x^c, x^y)$ and $u^*_{\lambda_2}(x^b, x^c, x^y)$ are two unconstrained average (Lagrangian) cost optimal policies that are of the form (5.27).

Proof. This result follows directly from [74, 75, 129]. A proof is given in Section 5.9 for clarity. □

5.5 Policy Optimization Algorithms

In this section two methods are proposed for learning the constrained optimal transmission scheduling policies. The first method follows the Lagrange dynamic programming framework and is named the Lagrange Q-learning method. In this method, the estimation of the constrained optimal policy is divided into two steps.
At the inner step, a Q-learning-based algorithm is deployed to estimate the unconstrained optimal policy. At the outer step, the Lagrange multiplier is updated using the Robbins-Monro method. At convergence of both steps, an estimate of the constrained optimal policy is obtained. In comparison, the second method exploits the proved structural results (see Fig. 5.1 and Corollary 5.2) to estimate the constrained optimal transmission scheduling policy directly. In particular, the monotone constrained optimal policy is approximated by a parameterized smooth function and the parameters are estimated by the SPSA algorithm (Alg. 5.2).

Figure 5.2 (block diagram; recoverable labels: Alg. 5.1, monotone-policy Q-learning; Lagrange multiplier update by (5.33); estimate $u^*_{\lambda_1}$, $u^*_{\lambda_2}$ by Alg. 5.1; $u^* = q\, u^*_{\lambda_1} + (1 - q)\, u^*_{\lambda_2}$, where $q$ is given by (5.34)): The process of estimating the constrained optimal policy includes iteratively estimating the Lagrange multiplier and the corresponding unconstrained optimal policy. After convergence to the correct Lagrange multiplier, the constrained optimal transmission policy is estimated.

5.5.1 Policy Optimization via Q-learning

The constrained optimal transmission policies can be estimated by the Lagrange Q-learning method outlined in Fig. 5.2 with the use of a threshold-policy Q-learning algorithm (Algorithm 5.1). The procedure starts with an initial Lagrange multiplier. Then, for each Lagrange multiplier, the (threshold) unconstrained optimal policy is estimated via Algorithm 5.1. After the estimation of the unconstrained optimal policy, the Lagrange multiplier is updated via the Robbins-Monro method [5]. The two steps are repeated until the Lagrange multiplier estimates converge to $\lambda^*$, which is given by (5.32) [74, 75, 129]. Then the randomization factor and the constrained optimal transmission policy are computed.
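The outer step of this procedure can be sketched as follows. This is a minimal illustration under stated assumptions rather than the thesis implementation: the inner step (Algorithm 5.1) is replaced by a purely synthetic, noise-free stand-in `toy_avg_wait` that returns a waiting cost decreasing in the multiplier, and all function names and numerical values are hypothetical. The update follows the Robbins-Monro recursion (5.33) with step size $1/l$, and the randomization factor follows (5.34).

```python
def robbins_monro_multiplier(avg_wait, D, lam0=0.0, iters=2000):
    """Iterate lam <- lam + (1/l) * (W(u*_lam) - D), as in (5.33)."""
    lam = lam0
    for l in range(1, iters + 1):
        lam = lam + (avg_wait(lam) - D) / l
    return lam


def mixing_factor(w1, w2, D):
    """Randomization factor (5.34); requires w1 != w2."""
    return (D - w2) / (w1 - w2)


def toy_avg_wait(lam):
    """Synthetic stand-in for the average waiting cost of the policy that is
    optimal for multiplier lam (hypothetical; a real system would estimate
    this quantity by running the inner learning step)."""
    return max(0.0, 0.6 - 2.0 * lam)


D = 0.1          # illustrative delay constraint
pert = 0.01      # illustrative perturbation around lambda*
lam_star = robbins_monro_multiplier(toy_avg_wait, D)
w1 = toy_avg_wait(lam_star - pert)   # waiting cost at lambda_1 = lambda* - pert
w2 = toy_avg_wait(lam_star + pert)   # waiting cost at lambda_2 = lambda* + pert
q = mixing_factor(w1, w2, D)
print(lam_star, q, q * w1 + (1 - q) * w2)
```

By construction of (5.34), the mixture of the two waiting costs with weights $q$ and $1 - q$ equals $D$, which is the sense in which the randomized policy meets the constraint with equality.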
The threshold structure is exploited in the modified (threshold-policy) Q-learning algorithm in the sense that, in the process of learning the Q-factors, monotone policies are used most of the time, and at the end of the learning (i.e. exploration) period, the estimate of the optimal policy is projected onto the space of monotone policies. It will be shown via numerical studies in Section 5.6 that the threshold-policy Q-learning algorithm converges to a good neighbourhood of the optimal policy at a much faster rate than the standard non-structured Q-learning algorithm. Furthermore, when a constant step size is used, Algorithm 5.1 can track changes in the statistical properties of the system.

Unconstrained Optimal Policy Estimation via Q-learning

We approximate the infinite-capacity buffer by a buffer of finite size and assume that when there is a buffer overflow, the packet that has been in the buffer for the longest period of time is discarded. As a result, the state space is of finite cardinality and Q-learning can be used to numerically solve the average cost MDP. In what follows, we describe how the unconstrained optimal policy can be estimated for a given Lagrange multiplier.

The unconstrained average cost optimal transmission policy satisfies Bellman's optimality equation

$$\begin{aligned}
g + h(x) &= \min_{a \in \mathcal{A}_x} \Big[ c(a, x; \lambda) + \sum_{x' \in \mathcal{S}} \mathbb{P}(x' \mid x, a) h(x') \Big], \\
u^*(x) &= \arg\min_{a \in \mathcal{A}_x} \Big[ c(a, x; \lambda) + \sum_{x' \in \mathcal{S}} \mathbb{P}(x' \mid x, a) h(x') \Big].
\end{aligned} \tag{5.29}$$

Define the state-action cost function $Q(x, a; \lambda)$ so that

$$g + Q(x, a; \lambda) = c(a, x; \lambda) + \sum_{x' \in \mathcal{S}} \mathbb{P}(x' \mid x, a) h(x'). \tag{5.30}$$

Then Bellman's optimality equation (5.29) can be rewritten as

$$g + Q(x, a; \lambda) = c(a, x; \lambda) + \sum_{x' \in \mathcal{S}} \mathbb{P}(x' \mid x, a) \min_{a' \in \mathcal{A}_{x'}} Q(x', a'; \lambda).$$

It is well-known that the Q-factors $Q(x, a; \lambda)$ for all $x \in \mathcal{S}$, $a \in \mathcal{A}_x$ can be estimated iteratively via the model-free Q-learning algorithm, which is based on relative value iteration [67].
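To make the relative value iteration idea concrete, the following sketch runs a non-structured Q-learning loop on a simulated, truncated-buffer version of the system. It is an illustration under stated assumptions, not the thesis code: actions are explored uniformly at random, the reference state is taken to be the empty buffer in the best channel state with no arrival, the buffer limit `B = 10`, step size and seed are arbitrary, the function names are hypothetical, and the channel, cost and arrival values are borrowed from the numerical example of Section 5.6.

```python
import random

# Values borrowed from the numerical example in Section 5.6.
A_C = [[0.85, 0.15, 0.0], [0.15, 0.70, 0.15], [0.0, 0.15, 0.85]]
F_S = [0.3, 0.5, 0.95]
C_TX = [1.7, 0.8, 0.2]
W0, DELTA = 0.6, 0.25


def actions(state):
    """Admissible actions: no transmission is possible with an empty buffer."""
    return (0, 1) if state[0] > 0 else (0,)


def step(state, a, B, rng):
    """Sample the next state following (5.5), with the buffer truncated at B."""
    b, c, y = state
    success = a == 1 and rng.random() < F_S[c]
    b_next = min(B, b + y - (1 if success else 0))
    c_next = rng.choices(range(len(A_C)), weights=A_C[c])[0]
    y_next = 1 if rng.random() < DELTA else 0
    return (b_next, c_next, y_next)


def rvi_q_learning(lam, B=10, steps=50000, eps=0.01, seed=0):
    """Relative value iteration Q-learning with a constant step size eps."""
    rng = random.Random(seed)
    M = len(A_C)
    states = [(b, c, y) for b in range(B + 1) for c in range(M) for y in (0, 1)]
    Q = {(s, a): 0.0 for s in states for a in actions(s)}
    x_ref = (0, M - 1, 0)        # reference state (an assumption of this sketch)
    x = (0, 0, 0)
    for _ in range(steps):
        a = rng.choice(actions(x))                       # uniform exploration
        cost = C_TX[x[1]] * a + lam * (W0 if (a == 0 and x[0] > 0) else 0.0)
        x_next = step(x, a, B, rng)
        target = (cost
                  + min(Q[(x_next, a2)] for a2 in actions(x_next))
                  - min(Q[(x_ref, a2)] for a2 in actions(x_ref)))
        Q[(x, a)] = (1.0 - eps) * Q[(x, a)] + eps * target
        x = x_next
    policy = {s: min(actions(s), key=lambda a: Q[(s, a)]) for s in states}
    return Q, policy


Q, policy = rvi_q_learning(lam=2.0)
print([policy[(b, 2, 0)] for b in range(11)])   # greedy actions in the best channel
```

Subtracting the minimum Q-factor at the reference state keeps the iterates bounded, which is what distinguishes the relative value iteration form from the discounted Q-learning update.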
In our specific problem (5.19) (now with a finite state space), for a given Lagrange multiplier, the Q-factors can be estimated by the recursion

$$Q_{n+1}(X_n, a_n) = (1 - \epsilon_n)\, Q_n(X_n, a_n) + \epsilon_n \Big( c(a_n, X_n; \lambda) + \min_{a \in \mathcal{A}_{X_{n+1}}} Q_n(X_{n+1}, a) - \min_{a \in \mathcal{A}_{x^0}} Q_n(x^0, a) \Big), \tag{5.31}$$

where $x^0$ is the reference state, $n = 1, 2, \ldots$ is the time index, $a_n = u_n(X_n)$ and $\epsilon_n$ is the step size at time $n$. For a finite state space the estimates generated by (5.31) converge to $Q(x, a; \lambda)$ [67].

In Algorithm 5.1, (5.31) is used for updating the Q-factors. At time slot $n$, the action $a_n$ is determined by a policy $u_n$, which is changed every batch of $L$ time slots during the learning period. If $u_n$ is generated according to a uniform probability mass function then we obtain a non-structured Q-learning algorithm. In Algorithm 5.1, $u_n$ is picked from the set of monotone policies much more frequently (e.g., with probability $\beta > 0.8$). Furthermore, at the end of the learning period, the estimate of the optimal policy, which is obtained by applying (5.29) to the estimated Q-factors, is projected in the minimum Euclidean distance sense onto the set of monotone policies.

Algorithm 5.1. Let $\mathcal{T}_m \subset \mathcal{T}$ be the set of policies that are deterministic and monotonically increasing (hence threshold) in the buffer state. Set time $n = 0$, initialize the step size $\epsilon$, the policy $u_n(x)$ and $Q_n(x, a)$ for all $x \in \mathcal{S}$, $a \in \mathcal{A}_x$. Assume that $N_l$ time slots are used for learning the optimum. Let $L$ (time slots) be the period at which the policy $u_n(\cdot)$ is changed. For $n = 1, 2, \ldots, N_l$ repeat
• $a_n = u_n(X_n)$
• Update $Q(X_n, a_n)$ by (5.31) using a constant step size $\epsilon_n = \epsilon$
• If $\mathrm{mod}(n, L) = 0$, generate $\alpha \sim U[0, 1]$, where $U[0, 1]$ is the uniform distribution on $[0, 1]$.
  - If $\alpha < \beta$, pick (in a uniformly random manner) a new policy $u_n \in \mathcal{T}_m$.
  - Else, pick in a uniformly random manner a new policy $u_n \in \mathcal{T}$.
At the end of the learning period, set $u^*(x) = P_{\mathcal{T}_m}\big(\arg\min_{a \in \mathcal{A}_x} Q(x, a)\big)$, where $P_{\mathcal{T}_m}$ denotes the projection of a policy onto $\mathcal{T}_m$.

The optimal policy $u^*(\cdot)$ may then be deployed for the entire exploitation period, e.g., $N_e$ time slots. If the system parameters vary, e.g., when the transition probability matrix changes, the estimate of $u^*(\cdot)$ will adapt to the new optimal solution due to the use of a constant step size $\epsilon_n = \epsilon$ in (5.31).

It will be shown via numerical examples that the above threshold-policy Q-learning algorithm performs substantially better than the standard Q-learning algorithm. The intuition is that in the threshold-policy Q-learning algorithm, the set of policies that contains the potentially optimal policies (i.e. the set of threshold policies) is explored more extensively. Therefore, it can be expected that the threshold Q-learning algorithm would perform better when the number of learning time slots is not too large. However, asymptotically, the two algorithms would give identical performance.

Lagrange Multiplier Estimation

To obtain a numerical estimate of the constrained optimal transmission scheduling policy, the Lagrange multipliers that relate the unconstrained optimal policies to the constrained optimal policies by (5.28) are set to $\lambda_{1,2} = \lambda^* \mp \delta$, where $\delta$ is some small constant and $\lambda^*$ is the Lagrange multiplier that satisfies the following equation [74, 75, 129] (also see the proof of Corollary 5.2 in Section 5.9):

$$\lambda^* = \inf\{\lambda : W(u^*_\lambda) \le D\}, \tag{5.32}$$

where $u^*_\lambda$ is the optimal transmission policy corresponding to the Lagrange multiplier $\lambda$ and $W(u^*_\lambda)$ is the average delay cost given by (5.9). Since $W(u^*_\lambda)$ is monotonically decreasing and continuous in $\lambda$ [74], $\lambda^*$ can be estimated iteratively by (see [5, 82])

$$\lambda_{l+1} = \lambda_l + \frac{1}{l} \big( W(u^*_{\lambda_l}) - D \big). \tag{5.33}$$

In the above update law, the step size has been chosen to be $\epsilon_l = 1/l$, and hence satisfies the conditions for convergence of the Robbins-Monro algorithm. That is, the sequence of Lagrange multipliers $(\lambda_1, \lambda_2, \ldots)$ generated by (5.33) converges in probability to $\lambda^*$ defined by (5.32) [5, 82].

Once $\lambda^*$ is determined, a numerical estimate of the constrained optimal transmission policy is given by (5.28), where $\lambda_1 = \lambda^* - \delta$, $\lambda_2 = \lambda^* + \delta$ for some perturbation parameter $\delta$. Furthermore, due to the fact that $\lim_{\lambda \uparrow \lambda^*} W(u^*_\lambda) \ge D \ge \lim_{\lambda \downarrow \lambda^*} W(u^*_\lambda)$ (see the proof of Corollary 5.2 and [74, 129]), the randomization factor $q$ is given by

$$q = \frac{D - W(u^*_{\lambda_2})}{W(u^*_{\lambda_1}) - W(u^*_{\lambda_2})}. \tag{5.34}$$

Therefore, (5.31) and (5.33) complete the estimation of the constrained optimal transmission policy by the Lagrange Q-learning algorithm outlined in Fig. 5.2 (see footnote 10).

Footnote 10: At the inner step of the Lagrange Q-learning algorithm, Alg. 5.1 can also be replaced by Alg. 5.2, with the randomization factor $q = 1$.

Figure 5.3 (plot titled "Optimal policy"; horizontal axis ticks from 10 to 40): The optimal transmission policy, which is of the structure outlined by Fig. 5.1(b), can be approximated as the sum of two sigmoid functions as in (5.36). The optimal parameters $\eta_1$, $\eta_2$, $q$ can be estimated via SPSA.

5.5.2 Policy Optimization via Stochastic Approximation

An alternative method for estimating the constrained optimal transmission policy is presented in Algorithm 5.2, which is a policy search algorithm, where the search domain is the set of policies of the form (5.28) and the search algorithm is a simultaneous perturbation stochastic approximation (SPSA) algorithm. In particular, we use a parameterized representation for policies of the structure given by (5.28) (see Fig. 5.3) and an SPSA algorithm to estimate the optimal parameters.
Rewrite a policy of the form (5.28) as follows:

$$u(x^b, x^c, x^y) = \begin{cases} 0 & \text{if } 0 \le x^b < b_1(x^c, x^y), \\ q & \text{if } b_1(x^c, x^y) \le x^b < b_2(x^c, x^y), \\ 1 & \text{if } b_2(x^c, x^y) \le x^b. \end{cases} \tag{5.35}$$

From the above representation, it is clear that a policy that is a randomized mixture of two threshold policies is well-defined given three parameters: the randomization factor $q$ and the two buffer state thresholds $b_1(x^c, x^y)$ and $b_2(x^c, x^y)$. The last two parameters $b_1(x^c, x^y)$ and $b_2(x^c, x^y)$ only take integer values, and cannot be directly estimated using gradient-based stochastic approximation.

In what follows, each policy of the form (5.35) is approximated by a parameterized smooth function. The discrete policy search problem is then converted into estimating some optimal continuous parameters, which can be numerically solved by a constrained stochastic approximation algorithm [5, 77]. The motivation for this conversion is the fact that continuous stochastic approximation is in general much more efficient than discrete search algorithms.

In particular, consider the parameterized sigmoid function

$$\hat{u}(x^b, x^c, x^y; p, \eta_1, \eta_2) = \frac{p}{1 + e^{-(x^b - \eta_1)/\tau}} + \frac{1 - p}{1 + e^{-(x^b - \eta_2)/\tau}}. \tag{5.36}$$

When $\tau \to 0$, $p \to q$, $\eta_1 \to b_1(x^c, x^y)$ and $\eta_2 \to b_2(x^c, x^y)$, $\hat{u}(x^b, x^c, x^y; p, \eta_1, \eta_2)$ converges pointwise to $u(x^b, x^c, x^y)$ given by (5.35), as illustrated by Fig. 5.3. Therefore, we use the approximation (5.36) and convert the problem to estimating the optimal continuous parameters $p^* = q$, $\eta_1^* = b_1(x^c, x^y)$, $\eta_2^* = b_2(x^c, x^y)$ (for every pair of channel state and packet arrival event $(x^c, x^y)$), assuming $\tau$ is sufficiently small. $\tau$ can be a fixed constant or can be gradually decreased with the number of iterations [94]. The total number of parameters to be estimated is $3 \times |\Gamma| \times |\mathcal{Y}|$, where $|\cdot|$ denotes the cardinality of a set. For convenience, let $\alpha$ denote the vector of parameters that will be estimated.
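The approximation of (5.35) by (5.36) can be checked numerically with a short sketch. The function names and the parameter values below (thresholds 10 and 20, $q = 0.4$, $\tau = 0.05$) are hypothetical choices for illustration. One assumption deserves emphasis: the sigmoid midpoints are placed half a packet below the integer thresholds, so that the smooth function does not take the value $1/2$ of its step exactly at an integer buffer state.

```python
import math


def threshold_mixture(xb, q, b1, b2):
    """Randomized mixture of two threshold policies, as in (5.35)."""
    if xb < b1:
        return 0.0
    if xb < b2:
        return q
    return 1.0


def _sigmoid(z):
    """Numerically stable logistic function (avoids overflow in exp)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_policy(xb, p, eta1, eta2, tau):
    """Smooth approximation (5.36): weighted sum of two sigmoids."""
    return (p * _sigmoid((xb - eta1) / tau)
            + (1.0 - p) * _sigmoid((xb - eta2) / tau))


# Illustrative parameters: midpoints sit half a packet below the thresholds.
q, b1, b2, tau = 0.4, 10, 20, 0.05
worst_gap = max(abs(sigmoid_policy(xb, q, b1 - 0.5, b2 - 0.5, tau)
                    - threshold_mixture(xb, q, b1, b2))
                for xb in range(0, 41))
print("largest gap over integer buffer states:", worst_gap)
```

Decreasing `tau` shrinks the gap further, at the price of steeper gradients with respect to the midpoints, which is one reason for decreasing $\tau$ gradually rather than fixing it at a very small value.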
$\alpha$ can be updated using the constrained stochastic gradient algorithm given by (see footnote 11)

$$\alpha(n + 1) = \alpha(n) - \epsilon \Big( \hat{\nabla}_\alpha C(\alpha(n)) + \hat{\nabla}_\alpha W(\alpha(n)) \max\big[0,\ \lambda(n) + \rho\big(\hat{W}(\alpha(n)) - D\big)\big] \Big), \tag{5.37}$$

$$\lambda(n + 1) = \max\Big[ \Big(1 - \frac{\epsilon}{\rho}\Big) \lambda(n),\ \lambda(n) + \epsilon \big(\hat{W}(\alpha(n)) - D\big) \Big], \tag{5.38}$$

where $\epsilon$ is the constant step size, $\rho$ is a large positive constant, and $\hat{\nabla}_\alpha C(\alpha(n))$, $\hat{\nabla}_\alpha W(\alpha(n))$, $\hat{W}(\alpha(n))$ are the estimates of the gradient of the average transmission cost $C(\cdot)$, the gradient of the average waiting cost $W(\cdot)$ and the waiting cost itself, respectively.

There are several methods for estimating the gradients of the costs with respect to the parameters. Here we propose to use the SPSA method [5, 140, 141], which is a generalization of the finite difference method to the case of multiple variables. The efficiency of the SPSA algorithm is due to the underlying gradient approximation, which requires only two measurements of the objective function regardless of the dimension of the optimization problem [5, 140, 141]. The resulting algorithm for estimating the constrained optimal transmission scheduling policy via the parameterization (5.36) and SPSA is Algorithm 5.2.

Footnote 11: Here the fact that the optimal parameters must be within certain regions, e.g. $p \in [0, 1]$, is neglected. These boundary constraints can be dealt with by using the spherical parameterization method on the parameters (see Chapter 2), i.e. by putting $p = \sin^2(\theta)$ for example.

Algorithm 5.2. Set time $n = 0$. Let $m$ be the total number of parameters to be estimated: $m = 3 \times |\Gamma| \times |\mathcal{Y}|$. Select proper initializations for the parameter vector $\alpha(n)|_{n=0}$, the perturbation step size $\sigma_n$ and the gradient update step size $\epsilon_n$. For $n = 1, 2, \ldots$ repeat
• Generate the random perturbation vector $\Delta_n = (\Delta_1, \ldots, \Delta_m)$, e.g. according to a symmetric binomial (Bernoulli) distribution, where each $\Delta_l$ takes the values $-1$ and $+1$ with probability $p = 0.5$.
• Simultaneously perturb all $m$ parameters by $\sigma_n \Delta_n$, where $\sigma_n$ is the scale factor:
\[
\alpha_+(n) = \alpha(n) + \sigma_n \Delta_n, \qquad \alpha_-(n) = \alpha(n) - \sigma_n \Delta_n.
\]
• Estimate the average transmission and waiting costs for $\alpha_+(n)$ and $\alpha_-(n)$ using observations from $T$ time slots by
\[
\hat{C}\big(\hat{u}(x^b, x^c, x^y; \alpha_+(n))\big) = \frac{1}{T} \sum_{t=1}^{T} c(a_t, X_t^c),
\]
and similarly for $\hat{C}\big(\hat{u}(x^b, x^c, x^y; \alpha_-(n))\big)$ and for the average waiting costs $\hat{W}\big(\hat{u}(\cdot, \cdot, \cdot; \alpha_+(n))\big)$, $\hat{W}\big(\hat{u}(x^b, x^c, x^y; \alpha_-(n))\big)$, given that at time $t = 1, \ldots, T$ the action is selected with the probability mass function $\mathbb{P}(a_t = 1) = \hat{u}(X_t^b, X_t^c, X_t^y; \alpha_\pm(n))$.
• Obtain gradient estimates by the finite difference method:
\[
\hat{\nabla}_{\alpha_l} C(\alpha(n)) = \frac{\hat{C}\big(\hat{u}(x^b, x^c, x^y; \alpha_+(n))\big) - \hat{C}\big(\hat{u}(x^b, x^c, x^y; \alpha_-(n))\big)}{2 \sigma_n \Delta_l}
\tag{5.39}
\]
for $l = 1, 2, \ldots, m$.
• Update $\alpha(n)$ using (5.37) and (5.38) with a small constant step size $\epsilon$.

The use of a small constant step size in (5.37) and (5.38) implies that Algorithm 5.2 converges in probability to a local constrained optimum and has tracking capability [5, 82]. In other words, Algorithm 5.2 can estimate the constrained optimal transmission scheduling policy in slowly time-varying systems without requiring explicit knowledge of the system parameters. In Section 5.6 the performance of Algorithm 5.2 is illustrated via numerical examples.

5.6 Numerical Examples

In this section we illustrate the performance of Algorithms 5.1 and 5.2 via numerical examples. As in the previous chapter, we consider the real-time transmission of compressed video/voice data in the uplink of a Rayleigh fading communication channel in a packet-based 3G cellular network. Here, the infinite capacity buffer is approximated by a finite size (i.e. truncated) buffer, and the characteristics of the solutions as well as the performance of the proposed algorithms are studied. Assume that the video data is transmitted in the QCIF format at the rate of 30 frames per second.
Assume a channel bandwidth of 2 Mbps and a link layer packet size of 128 bytes per packet. Then each frame corresponds to 67 packets and the packet transmission rate is 2000 packets per second. Assume that time is divided into slots of size 0.5 ms, so that at each time slot exactly one packet can be transmitted. At each time slot, a decision has to be made on whether to transmit a packet, i.e. $\mathcal{A} = \{0, 1\}$ as usual. Assume the underlying channel is frequency-flat Rayleigh fading with the following parameters: carrier frequency $f_c = 1.9$ GHz and mobile velocity $v = 5$ km/h, which leads to a maximum Doppler frequency of $f_{d,\max} = 8.8$ Hz. Assume that the SNR represents the channel state; then the equal-probability SNR partitioning method [8] can be used to model the channel by a 3-state Markov chain, i.e. $\Gamma = \{\gamma_0, \gamma_1, \gamma_2\}$, with approximately the transition probability matrix
\[
P^c = \begin{pmatrix} 0.85 & 0.15 & 0 \\ 0.15 & 0.7 & 0.15 \\ 0 & 0.15 & 0.85 \end{pmatrix}.
\tag{5.40}
\]
Assume that if a transmission is attempted, the success probabilities defined in (5.4) are given by $f_s(1, \gamma_0) = 0.3$, $f_s(1, \gamma_1) = 0.5$ and $f_s(1, \gamma_2) = 0.95$.

Assume that the user has a buffer of size $B = 50$ to store packets for transmission and retransmission, and that the average arrival rate is $\delta = 0.25$. Let the transmission cost $c(X^c, a)$ and the delay cost $w(a)$ be defined by $c(\gamma_0, a) = 1.7a$; $c(\gamma_1, a) = 0.8a$; $c(\gamma_2, a) = 0.2a$ for all $a \in \mathcal{A}$, and $w(1) = 0$ and $w(0) = 0.6$.

[Figure 5.4 omitted. Legend: optimal policy success rate, aggressive policy success rate, optimal policy throughput. Horizontal axis: Lagrange multiplier $\lambda$.]
Figure 5.4: The optimal opportunistic transmission scheduling substantially improves the transmission success rates while maintaining a throughput comparable to the aggressive policy.
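The figures quoted in this setup can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the variable names and the rounded speed of light are assumptions, and the stationary distribution is obtained by plain power iteration rather than by any method described in the thesis.

```python
# Maximum Doppler frequency f_d = v * f_c / c for the stated mobility.
SPEED_OF_LIGHT = 3.0e8              # m/s, rounded (assumption)
carrier_hz = 1.9e9                  # f_c = 1.9 GHz
speed_mps = 5.0 * 1000.0 / 3600.0   # v = 5 km/h expressed in m/s
doppler_hz = speed_mps * carrier_hz / SPEED_OF_LIGHT

# One packet per 0.5 ms slot, and 30 video frames per second.
slot_seconds = 0.5e-3
packets_per_second = 1.0 / slot_seconds
packets_per_frame = packets_per_second / 30.0

# Raw link rate in packets: 2 Mbps over 128-byte packets (about 1953,
# which the slot model rounds to 2000 packets per second).
raw_packets_per_second = 2.0e6 / (128 * 8)

# Channel transition matrix (5.40).
P_C = [
    [0.85, 0.15, 0.00],
    [0.15, 0.70, 0.15],
    [0.00, 0.15, 0.85],
]


def stationary(P, n_iter=500):
    """Stationary distribution of a small ergodic chain via power iteration."""
    n = len(P)
    pi = [1.0] + [0.0] * (n - 1)
    for _ in range(n_iter):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


pi = stationary(P_C)

if __name__ == "__main__":
    print(f"max Doppler frequency: {doppler_hz:.2f} Hz")
    print(f"packets per second (slot model): {packets_per_second:.0f}")
    print(f"packets per frame: {packets_per_frame:.1f}")
    print(f"raw packets per second: {raw_packets_per_second:.0f}")
    print("stationary channel distribution:", [round(p, 4) for p in pi])
```

Because the matrix in (5.40) is symmetric, its stationary distribution is uniform over the three channel states, which agrees with the equal-probability SNR partitioning used to build the model.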
In what follows, we illustrate the tradeoff between energy efficiency and throughput that is achieved by the opportunistic transmission scheduling policies in this chapter. Then the performance of Algorithm 5.1 and Algorithm 5.2 is studied.

5.6.1 Energy Efficiency Improvement

In this section the tradeoff between energy efficiency and system throughput is illustrated via a numerical example. Fig. 5.4 depicts the average throughput and the transmission success rates that are associated with the optimal transmission scheduling policies for different constraints. The transmission success rates directly represent the energy efficiency of a transmission scheduling policy.

The optimal transmission scheduling policies used in Fig. 5.4 are generated based on the following principle. From the proof of Corollary 5.2 (also see [74, 75, 129]), it is clear that if there exists a Lagrange multiplier $\lambda$ such that the unconstrained optimal policy $u^*_\lambda$ satisfies the constraint on the average delay cost with equality, then $u^*_\lambda$ is also optimal for the constrained problem. In other words, by solving the unconstrained MDP with the Lagrange multiplier $\lambda$, we obtain a policy $u^*_\lambda$ that is optimal for the CMDP with the constraint that the average waiting cost must be less than or equal to $D = W_\lambda = W(u^*_\lambda)$. Therefore we vary the Lagrange multiplier $\lambda$ and depict the costs associated with $u^*_\lambda$ in Fig. 5.4. It should be noted that a larger $\lambda$ implies a tighter constraint on the average waiting cost, i.e. $D = W_\lambda$ decreases in $\lambda$ (see the proof of Corollary 5.2 and [74]).

It can be seen from Fig. 5.4 that as $\lambda$ increases, the success rate of the optimal policy decreases. This is intuitive because when the constraint on the average waiting cost is more restrictive, the optimal scheduling policy will be closer to the policy of always transmitting, and will be less energy-efficient. Nevertheless, it can be seen from Fig.
5.4 that the opportunistic transmission scheduling policy is always much more energy-efficient than the aggressive policy of always transmitting, while the throughput obtained by optimal opportunistic transmission is only slightly lower than that of the aggressive policy.

5.6.2 Optimal Policy Estimation via Lagrange Q-learning

In Figs. 5.5-5.11 the performance of the monotone-policy Q-learning algorithm (i.e. Algorithm 5.1) is compared with that of the non-structured Q-learning algorithm. Fig. 5.5 depicts the unconstrained optimal policy obtained by Linear Programming for $\lambda = 2$. The estimates obtained via $10^5$ iterations of Algorithm 5.1 and $10^5$ iterations of the non-structured Q-learning algorithm are plotted in Fig. 5.6 and Fig. 5.7 respectively. It can be seen that the estimate by Algorithm 5.1 is much closer to the optimal policy than the estimate by the non-structured Q-learning algorithm.

Figure 5.5: The unconstrained optimal policy $u^*_\lambda(\cdot)$ obtained by Linear Programming for $\lambda = 2.0$. (Horizontal axis: buffer occupancy state $x^b$.)

Figure 5.6: The estimate of $u^*_\lambda(\cdot)$ for $\lambda = 2.0$ by $10^5$ iterations of Algorithm 5.1. (Horizontal axis: buffer occupancy state $x^b$.)

Tracking capability of the threshold-policy Q-learning algorithm

In Figs. 5.8, 5.9 and 5.10, the tracking capability of Algorithm 5.1 with a constant step size is illustrated. In particular, the Lagrange multiplier is changed to $\lambda = 2.2$ while the other parameters (i.e. the channel state transition matrix and the costs) remain unchanged. The new unconstrained optimal transmission policy is plotted in Fig. 5.8. Algorithm 5.1 and the non-structured Q-learning algorithm are initialized at the estimates obtained for $\lambda = 2$ (see Figs. 5.6, 5.7). The Q-factors are then updated by (5.31) for 2000 iterations using Algorithm 5.1 and the non-structured Q-learning algorithm, with the constant step size $\epsilon_n = \epsilon = 0.01$. It can be seen from Figs.
5.9 and 5.10 that the estimate by Algorithm 5.1 adapts reasonably well to the new optimal transmission policy. In comparison, the non-structured Q-learning algorithm does not converge fast enough to a good neighborhood of the new optimum. The tracking capability of Algorithm 5.1 allows for adaptive estimation in slowly time-varying systems, e.g. when the channel state transition matrix varies as users join or leave the network.

Figure 5.7: The estimate of $u^*_\lambda(\cdot)$ for $\lambda = 2.0$ by $10^5$ iterations of the non-structured Q-learning algorithm. (Horizontal axis: buffer occupancy state $x^b$.)

Figure 5.8: The unconstrained optimal policy $u^*_\lambda(\cdot)$ obtained via Linear Programming for $\lambda = 2.2$. (Horizontal axis: buffer occupancy state $x^b$.)

Figure 5.9: When the Lagrange multiplier changes from $\lambda = 2.0$ to $\lambda = 2.2$, the estimate by Algorithm 5.1 adapts to the new optimal point. (Horizontal axis: buffer occupancy state $x^b$.)

Figure 5.10: When the Lagrange multiplier changes from $\lambda = 2.0$ to $\lambda = 2.2$, the estimate by the non-structured Q-learning algorithm does not adapt to the new optimal point. (Horizontal axis: buffer occupancy state $x^b$.)

[Figure 5.11, upper panel omitted: average Lagrangian costs for (i) the optimal policy, (ii) the estimates by monotone-policy Q-learning and (iii) the estimates by non-structured Q-learning.]
[Figure 5.11, lower panel omitted: average delay costs for (i) the optimal policy, (ii) the estimates by monotone-policy Q-learning and (iii) the estimates by non-structured Q-learning. Horizontal axis: time $t$.]
Figure 5.11: The average Lagrangian and delay costs for the estimates generated by the monotone-policy Q-learning algorithm (Algorithm 5.1) converge quickly to the costs corresponding to the optimal policy.

Convergence of Q-learning in terms of average costs

In Fig. 5.11, the estimates by Algorithm 5.1 and the non-structured Q-learning algorithm are compared to the optimal transmission scheduling policy in terms of the average costs. The Lagrange multiplier was set to $\lambda = 2$. It can be seen that after approximately 6000 iterations, the average costs corresponding to Algorithm 5.1 are already very close to the optimal costs. In comparison, even after $10^5$ iterations, the costs for the non-structured Q-learning algorithm have still not converged to the optimal costs. The convergence in terms of costs is important because the estimate of the correct Lagrange multiplier by (5.33) depends on the estimated average delay costs.

If the simulation is run for infinitely long, both monotone-policy and non-structured Q-learning should converge. However, as the number of iterations is limited, the monotone-policy Q-learning algorithm outperforms the non-structured Q-learning algorithm because its exploration is concentrated on the set of policies that are potentially optimal. Simulations show that for a larger state/action space, the improvement of the monotone-policy Q-learning algorithm over the non-structured Q-learning algorithm is even more
substantial.

Figure 5.12: The Lagrangian Q-learning algorithm converges after a finite number of iterations. (Horizontal axis: number of iterations.)

Convergence of Lagrange Multiplier Updates

Here the monotone-policy Q-learning algorithm is combined with the update of the Lagrange multiplier using the Robbins-Monro method to estimate the constrained optimal transmission policy by the Lagrange Q-learning algorithm outlined in Fig. 5.2. In particular, the average waiting cost constraint is given by (5.11) with $D = 0.4$. In this case, the correct Lagrange multiplier, given by (5.32), is approximately 2.2316.

The Lagrange multiplier is initialized at $\lambda_0 = 1.8$. The unconstrained optimal policy for $\lambda_0$ is estimated using 100000 iterations of Algorithm 5.1 (i.e. using (5.31) with a decreasing step size). Then the two-level Lagrange Q-learning algorithm starts. After each completion of the outer step, i.e. after the Lagrange multiplier is updated by (5.33):
\[
\lambda_{l+1} = \lambda_l + \big(W(u_l) - D\big),
\]
an estimate of the new unconstrained optimal policy $u^*_{\lambda_{l+1}}$ is obtained by updating $u^*_{\lambda_l}$ using 2000 iterations of (5.31) with a constant step size $\epsilon = 0.01$ for adaptation. An estimate of the constrained optimal policy can be obtained when the Lagrange multiplier estimates converge. In the simulations, the stopping criterion is
\[
|\lambda_{l+1} - \lambda_l| < 0.0001.
\tag{5.41}
\]
In Fig. 5.12, the convergence of the Lagrange Q-learning method is illustrated and compared with the case when the inner step is solved exactly by linear programming, which requires knowledge of the system model parameters. It should be noted that when the inner step is solved exactly by linear programming, the evolution of the algorithm is deterministic.
In comparison, when the inner step is solved using the monotone-policy Q-learning algorithm (Algorithm 5.1), the estimates of the unconstrained optimal policy and the average waiting cost are stochastic approximations. In Fig. 5.12, two sample paths of the Lagrange Q-learning algorithm are depicted. In one case, the convergence is very fast, even faster than the Lagrange linear programming method. In the other case, which is more typical, the Lagrange Q-learning algorithm converges more slowly. However, when (5.41) is achieved, the estimate of the Lagrange multiplier is very close to $\lambda^*$ given by (5.32).

5.6.3 Performance of the Gradient-based Learning Algorithm

This section describes the simulations that illustrate the performance of the SPSA algorithm (Algorithm 5.2). Here we assume buffer size $B = 20$, average arrival rate $\delta = 0.25$, and an FSMC channel model with the channel state space and transition probabilities as in the previous numerical examples. Furthermore, define the transmission costs and success probability functions by $c(1, x^c) = 1.5, 0.8, 0.2$ and $f_s(1, x^c) = 0.3, 0.45, 0.95$ for $x^c = \gamma_1, \gamma_2, \gamma_3$ respectively. The delay cost function is still given by $w(0) = 0.6$. Lastly, the constraint limit on the average delay cost is $D = 0.4$. For these parameters, the constrained optimal policy is $u^* = 0.45 u_1 + 0.55 u_2$, where $u_1(x^b, x^c, x^y)$ and $u_2(x^b, x^c, x^y)$ are two monotone, deterministic policies.

In Fig. 5.13 the optimal policy $u^*(x^b, x^c, x^y)$ is plotted together with the estimates obtained after 3000 iterations of (5.37), (5.38). Furthermore, Fig. 5.14 shows the convergence of the estimates to the optimal (transmission) policy $u^*(x^b, x^c, x^y)$. At each iteration, the
gradients are estimated using a batch size $N = 100$, and $\tau$, which is initialized to $\tau_0 = 0.3$, is decreased by the formula $\tau_{n+1} = 0.9999\,\tau_n$. It can be seen from the figure that even though the state space is large, $|\mathcal{S}| = 120$, the obtained estimates are very close to the true optimal policy.

Figure 5.13: Estimates of the constrained average cost optimal transmission policy $u^*(x^b, x^c, x^y)$ for $(x^c, x^y) = (\gamma_1, 0)$ and $(x^c, x^y) = (\gamma_2, 1)$; $u^*(x^b, x^c, x^y)$ is a randomized mixture of two deterministic policies that are monotone in the buffer occupancy state $x^b$. (Each panel shows the initial policy, the estimate of the optimal policy and the optimal policy against the buffer occupancy state.)

Figure 5.14: Convergence of the estimates of the constrained optimal transmission policy $u^*(x^b, x^c, x^y)$ for a fixed channel state $x^c$ and $x^y = 0$. (Axes: buffer occupancy state and iteration index.)

5.7 Extension to the Multiple-action Case

The results presented in the previous sections can be easily generalized to the case when the action set has more than two actions, such as in power and rate control problems. For example, consider the power control problem. The CMDP formulation of the transmission scheduling can be generalized with suitable modification of the costs as follows.

Define the system state as before, i.e. at time $n$ the system state is denoted by $X_n = [X_n^b, X_n^c, X_n^y] \in \mathcal{S}$, where $\mathcal{S} = \mathcal{B} \times \Gamma \times \mathcal{Y}$. Assume now that the action set contains more than two actions: $\mathcal{A} = \{0, 1, \ldots, L\}$, where $L$ is a finite integer and $i$ stands for the action of allocating $i$ power units to the current transmission. When the buffer is empty, however, the action set is $\mathcal{A}_0 = \{0\} \subset \mathcal{A}$. The transmission cost $c(a, x^c)$ and waiting cost $w(a)$ can still be defined as in (5.2) and (5.3) respectively. For the power control problem, the success probability function can still be defined by (5.4):
\[
f_s : \{0, 1, \ldots, L\} \times \Gamma \to [0, 1],
\tag{5.42}
\]
where $f_s(\cdot, \cdot)$
is increasing in both the channel state and the action (see footnote 12). The formulation of the power control problem for point-to-point packet transmission as a CMDP then follows exactly the same steps as in Section 5.2.2. Furthermore, analogous structural results can be established using the same method. In particular, it can be proved that the unconstrained stationary optimal power allocation policy (for the unconstrained MDP that is parameterized by a Lagrange multiplier) is deterministic and monotonically increasing in the buffer occupancy state, as illustrated in Fig. 5.15. Additionally, the constrained optimal power allocation policy is a randomized mixture of two policies that are deterministic and monotone in the buffer occupancy state.

Footnote 12: For the rate control problem, a further modification must be made to allow for the possibility of receiving multiple packets in one time slot.

Figure 5.15: For the case of multiple actions the unconstrained average Lagrangian cost optimal transmission policy $u^*_\lambda(x^b, x^c, x^y)$ is deterministic and monotonically increasing in the buffer occupancy state $x^b$. The constrained average cost optimal transmission policy is a mixture of two such policies. (Horizontal axis: buffer occupancy state.)

The only analytical result that is not immediately straightforward to generalize is Lemma 5.1 on the stability of the buffer and recurrence of the Markov chains. In the multiple-action context, Lemma 5.1 translates to Lemma 5.4. As the modifications are subtle, the proof of Lemma 5.4 is also given below.

Lemma 5.4. Denote the smallest success probability (over all channel states and all actions $a > 0$) by
\[
\min_{x^c \in \Gamma,\ a \in \mathcal{A},\ a > 0} \{ f_s(a, x^c) \} = \underline{f}.
\]
If
\[
\frac{\delta}{\underline{f}} < 1 - \frac{D}{w(0)}
\tag{5.43}
\]
then every policy $\pi$ satisfying the constraint (5.11) induces a stable buffer and a recurrent Markov chain.

Proof. Let $\pi$ be a transmission policy that satisfies the constraint (5.11), i.e.
\[
D \ge \limsup_{N \to \infty} \frac{1}{N} \mathbb{E}_\pi \Big[ \sum_{n=1}^{N} w(a_n) I(X_n^b > 0) \,\Big|\, X_0 = x_0 \Big]
\ge \limsup_{N \to \infty} \frac{1}{N} \mathbb{E}_\pi \Big[ \sum_{n=1}^{N} w(0) I(a_n = 0, X_n^b > 0) \,\Big|\, X_0 = x_0 \Big].
\]
Consider two cases. First, if
\[
\limsup_{N \to \infty} \frac{1}{N} \mathbb{E}_\pi \Big[ \sum_{n=1}^{N} I(X_n^b = 0) \,\Big|\, X_0 = x_0 \Big] > 0
\tag{5.44}
\]
then it is clear that the buffer is stable. Furthermore, $\pi$ yields a recurrent Markov chain, as states with $X_n^b = 0$ are recurrent and communicate with all other states. Second, if (5.44) does not hold, i.e.
\[
\limsup_{N \to \infty} \frac{1}{N} \mathbb{E}_\pi \Big[ \sum_{n=1}^{N} I(X_n^b = 0) \,\Big|\, X_0 = x_0 \Big] = 0,
\]
stability of the buffer and recurrence of the Markov chain are shown as follows. The average successful transmission rate of policy $\pi$ is given by
\[
\begin{aligned}
r_{x_0}(\pi) &= \liminf_{N \to \infty} \frac{1}{N} \mathbb{E}_\pi \Big[ \sum_{n=1}^{N} f_s(a_n, X_n^c) I(a_n > 0) \,\Big|\, X_0 = x_0 \Big] \\
&\ge \liminf_{N \to \infty} \frac{1}{N} \mathbb{E}_\pi \Big[ \sum_{n=1}^{N} \underline{f}\, I(a_n > 0) \,\Big|\, X_0 = x_0 \Big] \\
&= \underline{f} \Big( 1 - \limsup_{N \to \infty} \frac{1}{N} \mathbb{E}_\pi \Big[ \sum_{n=1}^{N} \big( I(a_n = 0, X_n^b > 0) + I(X_n^b = 0) \big) \,\Big|\, X_0 = x_0 \Big] \Big) \\
&= \underline{f} \Big( 1 - \limsup_{N \to \infty} \frac{1}{N} \mathbb{E}_\pi \Big[ \sum_{n=1}^{N} I(a_n = 0, X_n^b > 0) \,\Big|\, X_0 = x_0 \Big] \Big) \\
&\ge \underline{f} \Big( 1 - \frac{D}{w(0)} \Big) > \delta \quad \text{by (5.12)}.
\end{aligned}
\]
Therefore the buffer is stable almost surely for policy $\pi$, i.e. the Markov chain is recurrent due to Foster's Theorem [139]. $\square$

The generalization of the remaining results is straightforward. The only remark is that for the rate control problem, where the action indicates the number of packets to be sent in a transmission time slot, the generalization of the CMDP formulation and structural results is slightly more complicated. The underlying reason is that in such a problem, the action set varies with the buffer occupancy state.

5.8 Conclusions

The chapter formulates the problem of opportunistic transmission scheduling in correlated fading channels as an infinite horizon, average cost, countable state MDP with a constraint on the average waiting cost. The stability issue is resolved. Furthermore, the Lagrange multiplier method and the supermodularity concept are used to prove that the constrained optimal transmission policy is a randomized mixture of two pure transmission policies
that are threshold in the buffer state. The monotone randomized mixture result is exploited to improve the performance and tracking capability of a Q-learning algorithm. The monotone randomized mixture result is also exploited to parameterize the optimal transmission scheduling policies to allow for the use of an SPSA-based policy search algorithm. This algorithm can adaptively estimate the constrained optimal transmission scheduling policy very efficiently.

5.9 Analytical Proofs

This section contains the proofs of some analytical results in the chapter.

5.9.1 Proof of Lemma 5.2

Since all instantaneous costs are bounded, the value iteration algorithm converges for any discount factor $\nu \in (0, 1)$. Hence, the sequence $\{V_n^\nu([x^b, x^c, x^y]) : n = 0, 1, \ldots\}$ generated by
\[
\begin{aligned}
V_{n+1}^\nu([x^b, x^c, x^y]) = \min_{a \in \mathcal{A}_x} \Big[ c(a, x; \lambda) + \nu \sum_{x^{c\prime} \in \Gamma,\ x^{y\prime} \in \mathcal{Y}} \mathbb{P}(x^{c\prime} | x^c)\, \mathbb{P}(x^{y\prime}) \Big( & f_s(a, x^c)\, V_n^\nu([x^b + x^y - a, x^{c\prime}, x^{y\prime}]) \\
& + (1 - f_s(a, x^c))\, V_n^\nu([x^b + x^y, x^{c\prime}, x^{y\prime}]) \Big) \Big]
\end{aligned}
\tag{5.45}
\]
converges for any initialization of $V_0^\nu([x^b, x^c, x^y])$. Select $V_0^\nu([x^b, x^c, x^y])$ that is increasing in $x^b$, $x^y$ and decreasing in $x^c$. Note that $c(a, x; \lambda)$ (where $x = [x^b, x^c, x^y]$) is decreasing in $x^c$ and non-decreasing in $x^b$ (due to the fact that the waiting cost $w(\cdot)$ does not apply to an empty buffer state; see footnote 13). Therefore, it is clear by induction that $V_n^\nu([x^b, x^c, x^y])$ is increasing in $x^b$, $x^y$. Furthermore, due to condition (5.1), it is also easy to see by induction that $V_n^\nu([x^b, x^c, x^y])$ is decreasing in $x^c$ for all $n \ge 1$. Therefore, $V^\nu([x^b, x^c, x^y]) = V_\infty^\nu([x^b, x^c, x^y])$ is increasing in $x^b$, $x^y$ and decreasing in $x^c$.

Footnote 13: In fact, as in the rest of the chapter, it is only required that $w_{x^b = 0}(a = 0) \le w_{x^b > 0}(a = 0) = w(a = 0)$.
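The monotonicity claimed in Lemma 5.2 can be observed numerically by running the value iteration (5.45) on a truncated buffer. The following is a minimal sketch under stated assumptions, not the thesis implementation: it reuses the example parameters of Section 5.6 for the two-action case, takes the Lagrangian cost as $c(a, x; \lambda) = c(a, x^c) + \lambda\, w(a)\, I(x^b > 0)$, models arrivals as i.i.d. Bernoulli with probability $\delta$, caps the buffer at $B$, and uses hypothetical function names and an arbitrary discount factor $\nu = 0.95$.

```python
# Example parameters taken from the numerical section (two-action case).
P_C = [
    [0.85, 0.15, 0.00],
    [0.15, 0.70, 0.15],
    [0.00, 0.15, 0.85],
]
F_S = [0.3, 0.5, 0.95]   # success probability when a = 1, per channel state
C_TX = [1.7, 0.8, 0.2]   # transmission cost when a = 1, per channel state
W_WAIT = 0.6             # waiting cost w(0), charged only if the buffer is non-empty
DELTA = 0.25             # probability of a packet arrival (assumed Bernoulli)
B = 50                   # truncated buffer size


def value_iteration(lam, nu=0.95, n_iter=300):
    """Discounted value iteration in the form of (5.45); returns V[b][c][y]."""
    n_c = len(P_C)
    p_y = [1.0 - DELTA, DELTA]
    V = [[[0.0, 0.0] for _ in range(n_c)] for _ in range(B + 1)]
    for _ in range(n_iter):
        # EV[b][c]: expected next value given next buffer level b and current channel c.
        EV = [
            [
                sum(P_C[c][c2] * p_y[y2] * V[b][c2][y2]
                    for c2 in range(n_c) for y2 in (0, 1))
                for c in range(n_c)
            ]
            for b in range(B + 1)
        ]
        new_V = [[[0.0, 0.0] for _ in range(n_c)] for _ in range(B + 1)]
        for b in range(B + 1):
            for c in range(n_c):
                for y in (0, 1):
                    stay = min(b + y, B)  # no departure (a = 0 or failed transmission)
                    q_wait = (lam * W_WAIT if b > 0 else 0.0) + nu * EV[stay][c]
                    if b == 0:
                        new_V[b][c][y] = q_wait  # only a = 0 is allowed when empty
                        continue
                    served = min(b + y - 1, B)   # one packet leaves on success
                    q_send = C_TX[c] + nu * (
                        F_S[c] * EV[served][c] + (1.0 - F_S[c]) * EV[stay][c]
                    )
                    new_V[b][c][y] = min(q_wait, q_send)
        V = new_V
    return V


def is_nondecreasing_in_buffer(V, tol=1e-9):
    """Check V[b+1][c][y] >= V[b][c][y] for every channel state and arrival event."""
    return all(
        V[b + 1][c][y] >= V[b][c][y] - tol
        for b in range(len(V) - 1)
        for c in range(len(V[0]))
        for y in (0, 1)
    )


if __name__ == "__main__":
    V = value_iteration(lam=2.0)
    print("value function non-decreasing in buffer state:", is_nondecreasing_in_buffer(V))
```

Starting from the zero function, which trivially satisfies the monotonicity requirements on the initialization, every iterate stays non-decreasing in the buffer state, mirroring the induction argument of the proof.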
5.9.2 Proof of Theorem 5.1

The optimal discounted costs $V^\nu(x)$ given by (5.22) can be computed by the value iteration algorithm given by (5.23) and (5.24), which converges for any initial condition. In order to show that the discounted cost optimal policy is monotonically increasing in the buffer state, we will prove inductively that $Q_\infty([x^b, x^c, x^y], a)$ is submodular in $(x^b, a)$ for all $x^b \ge 1$. That is, it will be proved via mathematical induction that, for suitable initialization,
\[
\begin{aligned}
& Q_{n+1}([x^b, x^c, x^y], 1) - Q_{n+1}([x^b, x^c, x^y], 0) = c(1, x^c; \lambda) - c(0, x^c; \lambda) \\
& \quad + \nu \sum_{x^{c\prime} \in \Gamma,\ x^{y\prime} \in \mathcal{Y}} \mathbb{P}(x^{c\prime} | x^c)\, \mathbb{P}(x^{y\prime})\, f_s(1, x^c) \big[ V_n^\nu([x^b + x^y - 1, x^{c\prime}, x^{y\prime}]) - V_n^\nu([x^b + x^y, x^{c\prime}, x^{y\prime}]) \big]
\end{aligned}
\]
is monotonically decreasing in the buffer state component $x^b$ for $x^b \ge 1$, for all $n \ge 0$. This holds if $V_n^\nu([x^b, x^c, x^y])$ has increasing differences in $x^b$. The rest of the proof follows the same method as the proof of Lemma 4.2 in Chapter 4.

Consider the value iteration algorithm given by (5.23), (5.24). Select $V_0^\nu([x^b, x^c, x^y])$ with increasing differences in $x^b$. Assume that $V_n^\nu([x^b, x^c, x^y])$ has increasing differences in $x^b$; then $Q_{n+1}([x^b, x^c, x^y], a)$ is submodular in $(x^b, a)$. We will now prove that $V_{n+1}^\nu([x^b, x^c, x^y])$ also has increasing differences in $x^b$, i.e.,
\[
V_{n+1}^\nu([x^b + 1, x^c, x^y]) - V_{n+1}^\nu([x^b, x^c, x^y]) - \big( V_{n+1}^\nu([x^b, x^c, x^y]) - V_{n+1}^\nu([x^b - 1, x^c, x^y]) \big) \ge 0.
\tag{5.46}
\]
Assume that $V_{n+1}^\nu([x^b + 1, x^c, x^y]) = Q_{n+1}([x^b + 1, x^c, x^y], a_2)$, $V_{n+1}^\nu([x^b, x^c, x^y]) = Q_{n+1}([x^b, x^c, x^y], a_1)$, and $V_{n+1}^\nu([x^b - 1, x^c, x^y]) = Q_{n+1}([x^b - 1, x^c, x^y], a_0)$ for some actions $a_2, a_1, a_0 \in \mathcal{A}$. Due to the induction hypothesis, $Q_{n+1}([x^b, x^c, x^y], u)$ is submodular in $(x^b, u)$ (as $V_n^\nu([x^b, x^c, x^y])$ has increasing differences in $x^b$). Therefore $a_2 \ge a_1 \ge a_0$.
Rewrite (5.46) as
\[
Q_{n+1}([x^b + 1, x^c, x^y], a_2) - Q_{n+1}([x^b, x^c, x^y], a_1) - \big( Q_{n+1}([x^b, x^c, x^y], a_1) - Q_{n+1}([x^b - 1, x^c, x^y], a_0) \big) \ge 0
\]
\[
\begin{aligned}
\Leftrightarrow\ & \underbrace{Q_{n+1}([x^b + 1, x^c, x^y], a_2) - Q_{n+1}([x^b, x^c, x^y], a_2)}_{A}
+ \underbrace{Q_{n+1}([x^b, x^c, x^y], a_2) - Q_{n+1}([x^b, x^c, x^y], a_1)}_{\ge 0 \text{ by optimality}} \\
& + \underbrace{\big( - Q_{n+1}([x^b, x^c, x^y], a_1) + Q_{n+1}([x^b, x^c, x^y], a_0) \big)}_{\ge 0 \text{ by optimality}}
- \underbrace{\big( Q_{n+1}([x^b, x^c, x^y], a_0) - Q_{n+1}([x^b - 1, x^c, x^y], a_0) \big)}_{B} \ge 0.
\end{aligned}
\]
Furthermore, it is clear that
\[
\begin{aligned}
A = \nu \sum_{x^{c\prime} \in \Gamma,\ x^{y\prime} \in \mathcal{Y}} \mathbb{P}(x^{c\prime} | x^c)\, \mathbb{P}(x^{y\prime}) \Big[ & f_s(1, x^c) I(a_2 = 1) \big( V_n^\nu([x^b + 1 + x^y - 1, x^{c\prime}, x^{y\prime}]) - V_n^\nu([x^b + x^y - 1, x^{c\prime}, x^{y\prime}]) \big) \\
& + \big( 1 - f_s(1, x^c) I(a_2 = 1) \big) \big( V_n^\nu([x^b + 1 + x^y, x^{c\prime}, x^{y\prime}]) - V_n^\nu([x^b + x^y, x^{c\prime}, x^{y\prime}]) \big) \Big] \\
\ge \nu \sum_{x^{c\prime} \in \Gamma,\ x^{y\prime} \in \mathcal{Y}} \mathbb{P}(x^{c\prime} | x^c)\, \mathbb{P}(x^{y\prime}) \big[ & V_n^\nu([x^b + x^y, x^{c\prime}, x^{y\prime}]) - V_n^\nu([x^b + x^y - 1, x^{c\prime}, x^{y\prime}]) \big] \ge B.
\end{aligned}
\]
The second last inequality is due to the induction hypothesis on $V_n^\nu([x^b, x^c, x^y])$. The last inequality follows from the expansion of $B$ and the induction hypothesis that $V_n^\nu([x^b, x^c, x^y])$ has increasing differences in the buffer state $x^b$.

5.9.3 Proof of Corollary 5.2

This result follows directly from [74, 75, 129]. The proof below is given for clarity only. Let $u^*_\lambda$ denote the stationary optimal policy of the unconstrained average cost MDP with parameter $\lambda$. Denote the optimal average Lagrangian cost by $J^*_\lambda$. Furthermore, denote the average waiting cost associated with $u^*_\lambda$ by $W_\lambda$, which is given by (5.16). Then $W_\lambda$ is monotonically decreasing in $\lambda$ [74].

If there exists a $\lambda$ such that $W_\lambda = D$, then $u^*_\lambda$ is also optimal for the constrained MDP [74, 75, 129]. This is due to the following. Consider any feasible policy $u \in T_f$. Then the following inequalities hold:
\[
\begin{aligned}
& J(u; \lambda) \ge J(u^*_\lambda; \lambda) \\
\Leftrightarrow\ & C(u) + \lambda W(u) \ge C(u^*_\lambda) + \lambda W(u^*_\lambda) \\
\Leftrightarrow\ & C(u) + \lambda W(u) \ge C(u^*_\lambda) + \lambda D \\
\Leftrightarrow\ & C(u) \ge C(u^*_\lambda) + \lambda (D - W(u)) \ge C(u^*_\lambda).
\end{aligned}
\]
The last inequality is due to the fact that if $u$ is a feasible policy then $W(u) \le D$.
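Since $C(u) \ge C(u^*_\lambda)$ for any feasible $u$, $u^*_\lambda$ is the constrained optimal policy.

If there exists no $\lambda$ such that $W_\lambda = D$, define
\[
\kappa = \inf \{ \lambda : W_\lambda \le D \}.
\tag{5.47}
\]
Due to monotonicity of $W_\lambda$, $\kappa$ is well-defined. Define
\[
\lim_{\lambda \uparrow \kappa} W_\lambda = \overline{D}, \qquad \lim_{\lambda \downarrow \kappa} W_\lambda = \underline{D}.
\]
Then we have $\underline{D} < D < \overline{D}$. Let $\{\lambda_n\}$ be a sequence that increases to $\kappa$ such that the corresponding $u^*_{\lambda_n}$ converges to $\overline{u} \in T_f$. This is possible due to the $\delta$-stability margin and the fact that $J(u; \lambda)$ is continuous in $\lambda$ for all feasible $u$. Then $J(\overline{u}; \kappa) = J^*_\kappa$ and $W(\overline{u}) = \overline{D}$. Similarly, we can construct $\underline{u} \in T_f$ such that $J(\underline{u}; \kappa) = J^*_\kappa$ and $W(\underline{u}) = \underline{D}$. Take $u_q = q \underline{u} + (1 - q) \overline{u}$. We will show that there exists a $q \in [0, 1]$ such that $J(u_q) = J^*_\kappa$ and $W(u_q) = D$.

First, $J(u_q) = J^*_\kappa$ for all $q \in [0, 1]$. Indeed, as both $\underline{u}$ and $\overline{u}$ satisfy the average cost optimality equation (5.26) for the Lagrange multiplier $\kappa$, we have
\[
J^*_\kappa + \underline{h}(x) = c(\underline{u}(x), x; \kappa) + \sum_{x' \in \mathcal{S}} \mathbb{P}(x' | x, \underline{u}(x))\, \underline{h}(x')
= \min_{a \in \mathcal{A}_x} \Big\{ c(a, x; \kappa) + \sum_{x' \in \mathcal{S}} \mathbb{P}(x' | x, a)\, \underline{h}(x') \Big\},
\tag{5.48}
\]
and
\[
J^*_\kappa + \overline{h}(x) = c(\overline{u}(x), x; \kappa) + \sum_{x' \in \mathcal{S}} \mathbb{P}(x' | x, \overline{u}(x))\, \overline{h}(x')
= \min_{a \in \mathcal{A}_x} \Big\{ c(a, x; \kappa) + \sum_{x' \in \mathcal{S}} \mathbb{P}(x' | x, a)\, \overline{h}(x') \Big\}.
\tag{5.49}
\]
Furthermore, $\underline{h}(x)$ and $\overline{h}(x)$ are equal up to a constant vector [59, 67] (see footnote 14). Therefore it follows that for all $q \in [0, 1]$
\[
J^*_\kappa + h(x) = c(u_q(x), x; \kappa) + \sum_{x' \in \mathcal{S}} \mathbb{P}(x' | x, u_q(x))\, h(x')
= \min_{a \in \mathcal{A}_x} \Big\{ c(a, x; \kappa) + \sum_{x' \in \mathcal{S}} \mathbb{P}(x' | x, a)\, h(x') \Big\},
\tag{5.50}
\]
for some bias function $h(x)$. In other words, $u_q$ is an optimal policy for the Lagrange multiplier $\kappa$.

Second, observe that $W(u_q)$ is continuous in $q$. This is because $u_q$ is continuous in $q$, which implies the continuity of $W(u_q)$ in $q$ given that a stationary distribution exists for the policy $u_q$ due to the buffer stability condition [74, 129]. Additionally, $W(u_q) = \overline{D} > D$ for $q = 0$ and $W(u_q) = \underline{D} < D$ for $q = 1$. Hence, there exists a $q \in (0, 1)$ such that $W(u_q) = D$. This completes the proof.

Footnote 14: This can be seen from the fact that $h(x) = \lim_{\nu \to 1} [V^\nu(x) - V^\nu(0)]$ for some reference state $0$, up to a constant.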
Chapter 6

Discussion and Conclusions

6.1 Overview of Thesis

The thesis considered the topic of PHY-MAC cross-layer optimal transmission scheduling in wireless networks. In many ways, the thesis has highlighted how CSI can be exploited by the simplest transmission controllers in wireless networks. The specific problems analysed were distributed medium access control using decentralized channel-aware transmission policies and opportunistic transmission scheduling. In most cases, the optimal transmission scheduling policy is proved to have a special structure, which makes the estimation and implementation of the optimal solutions very efficient.

The time-varying nature of wireless communication channels introduces interdependencies between the PHY layer and the MAC layer. Additionally, due to communication technologies such as code division multiple access and multi-antenna communications, as well as other advances in signal processing and communications, users can share a wireless medium using non-orthogonal channels. The two issues arising in wireless network optimization are then to exploit channel state information for opportunistic transmission and to balance multiuser interference against multiple access efficiency in medium access control. Chapter 2 and Chapter 3 considered these issues by two approaches: optimizing the distributed channel-aware MAC protocol to maximize the system stable throughput, and analysing the non-cooperative game theoretic channel-aware MAC protocol. Chapter 4 and Chapter 5 of the thesis analysed optimal opportunistic transmission scheduling policies for the correlated fading channel model. In each chapter of the thesis, the treatment of the optimization problems is complete in the sense that both analytical structural results and real-time adaptive optimization algorithms are derived.

The thesis makes use of several mathematical tools. For example, game theory is
used in Chapter 3 to derive a MAC protocol that can be implemented and optimized in a completely distributed and adaptive fashion. The concept of supermodularity has been used together with dynamic programming to prove the threshold structure of the optimal transmission scheduling policy in Chapter 4 and Chapter 5. Several parameterization methods, including the spherical coordinate representation and the logistic function approximation, have been used to parameterize the structured optimal policies. Stochastic approximation algorithms have been deployed extensively in the thesis for estimating/learning the transmission scheduling solutions. A contribution of the thesis has been the various optimization methodologies. The framework of this thesis is cross-layer wireless network optimization; nevertheless, some of the structural results proved here can find applications in resource allocation problems in other areas, such as operations research.

In the remainder of the chapter, various aspects of the results and the algorithms derived in this thesis will be discussed, and possible directions for future research will be proposed.

6.2 Discussion on Results and Algorithms

The thesis contains several structural results and learning algorithms, which are discussed below.

The first part of the work, which consists of Chapters 2 and 3 of the thesis, was on distributed medium access control. The essence of the distributed channel-aware MAC protocol considered is that every user accesses the channel via a transmission policy mapping its channel state to transmit probabilities. The channel-aware MAC analysis is done using both a centralized design approach and a non-cooperative game theoretic approach. The former approach is theoretically important as it determines the maximum performance gains (in the system throughput sense) due to decentralized CSI exploitation.
In comparison, the latter approach offers robust, distributed solutions that can be estimated in real time via very efficient adaptive learning algorithms. Furthermore, it is shown via simulations that the system throughput obtained by the game theoretic MAC solutions is in fact only slightly lower than that of the optimal distributed MAC.

In particular, the major results in the first part of the thesis include the structure of the optimal transmission policies. Notably, it is proved in Chapter 3 that the best response of a user to the strategies played by other users is always a threshold (in the channel state) transmission policy. Additionally, there exists a Nash equilibrium profile at which every user adopts a threshold transmission policy. Optimality of threshold transmission policies converts the problem of estimating the optimal transmission policy into that of estimating a single threshold, which can be solved via a gradient-based stochastic approximation algorithm with tracking capability. It is shown via simulations that as users update their policies using the proposed algorithm, the system converges to a Nash equilibrium. A minor contribution of Chapter 3 is the generalization of the game theoretic MAC structural results and algorithms to incorporate user/packet priorities into the network model.

The second part of the thesis concerns opportunistic transmission scheduling in correlated fading channels. In Chapter 4 a finite horizon MDP formulation is used. In comparison, in Chapter 5, the optimal transmission scheduling problem is formulated as an infinite horizon average cost MDP with a global (average cost) constraint. Opportunistic transmission scheduling (or power/rate control) is important for networks where both energy efficiency and delay are critical issues, for example, sensor networks or multimedia data transmission systems.
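As an illustration of the single-threshold estimation problem that arises in the first part of the thesis, the following is a minimal sketch of a constant-step-size, finite-difference stochastic approximation iteration for one user. It is not the algorithm of Chapter 3. The channel model (i.i.d. states uniform on [0, 1]), the reward (channel state minus a fixed transmission cost, with no interference from other users), and all function names and parameter values are assumptions chosen so that the optimal threshold is known to equal the cost.

```python
import random


def threshold_policy(channel_state, threshold):
    """Transmit (1) when the channel state is at or above the threshold, else wait (0)."""
    return 1 if channel_state >= threshold else 0


def average_utility(threshold, channel_samples, cost):
    """Empirical utility of a threshold policy under the toy reward (assumption):
    a transmission in channel state h earns h - cost, and waiting earns 0."""
    total = 0.0
    for h in channel_samples:
        if threshold_policy(h, threshold):
            total += h - cost
    return total / len(channel_samples)


def estimate_threshold(cost, iterations=500, batch=200, step=0.05, delta=0.05, seed=0):
    """Finite-difference stochastic approximation with a constant step size."""
    rng = random.Random(seed)
    theta = 0.1  # arbitrary initial threshold
    for _ in range(iterations):
        # Toy channel model (assumption): states are i.i.d. uniform on [0, 1].
        samples = [rng.random() for _ in range(batch)]
        # The same samples are used on both sides to reduce gradient noise.
        gradient = (average_utility(theta + delta, samples, cost)
                    - average_utility(theta - delta, samples, cost)) / (2 * delta)
        # Gradient ascent step, projected so that theta +/- delta stays in [0, 1].
        theta = min(1 - delta, max(delta, theta + step * gradient))
    return theta
```

Under these assumptions the expected utility is quadratic in the threshold and is maximized when the threshold equals the cost, so a call such as `estimate_threshold(0.6)` should settle near 0.6. The constant step size is used because it lets the iterate keep moving if the cost or the channel statistics drift slowly, which is the tracking behaviour referred to in the text.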
Chapter 4 formulates the transmission scheduling optimization problem as a finite horizon MDP with a terminal data loss penalty cost. In the chapter, it is proved that the optimal transmission scheduling policy is deterministic and monotonically increasing (and hence threshold) in the buffer occupancy state and the transmission time. The two-dimensional threshold structural result reduces the space of admissible policies and makes the computation and implementation of the optimal solutions substantially more efficient. The finite horizon reflects the delay constraint of the optimization problem, which imposes that every batch of packets must be transmitted within a finite number of time slots. The finite horizon MDP formulation is suitable for applications where an instantaneous (rather than average) delay constraint is important. A drawback of the finite horizon approach is that the optimal solution is non-stationary. Online adaptive learning is therefore less efficient than in an infinite horizon MDP formulation.

In Chapter 5, the transmission scheduling problem is formulated as an infinite horizon average cost constrained MDP. It is well known that for such constrained MDPs, the optimal policy can be a randomized function of the state. In [73] and [74], the authors introduced a powerful methodology based on Lagrangian dynamic programming, which comprises a two-level optimization: at the inner step, one solves a dynamic programming problem with a fixed Lagrange multiplier to obtain a pure policy; at the outer step, one optimizes the Lagrange multipliers. These Lagrange multipliers determine the randomization coefficients for switching between the pure policies. In Chapter 5, the supermodularity method is applied to prove a threshold structural result for the inner step. Then, as a result of the outer optimization step, the constrained optimal policy is a randomized mixture of two threshold policies.
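The structure of such a randomized mixture can be sketched as follows. This is only an illustration under stated assumptions: the channel state is held fixed, so each pure policy reduces to a single buffer occupancy threshold; the randomization is applied independently at each decision; and the function names, thresholds, and mixing probability `q` are hypothetical rather than values from Chapter 5.

```python
import random


def threshold_action(buffer_occupancy, threshold):
    """Pure threshold policy: transmit (1) once the buffer holds at least `threshold` packets."""
    return 1 if buffer_occupancy >= threshold else 0


def transmit_probability(buffer_occupancy, threshold_a, threshold_b, q):
    """Probability of transmitting when policy A is used with probability q and policy B otherwise."""
    return (q * threshold_action(buffer_occupancy, threshold_a)
            + (1 - q) * threshold_action(buffer_occupancy, threshold_b))


def mixed_action(buffer_occupancy, threshold_a, threshold_b, q, rng=random):
    """Draw one action from the randomized mixture (per-decision randomization is assumed)."""
    threshold = threshold_a if rng.random() < q else threshold_b
    return threshold_action(buffer_occupancy, threshold)
```

With thresholds 3 and 5 and `q = 0.25`, the mixture never transmits when the buffer holds fewer than 3 packets, always transmits when it holds 5 or more, and transmits with probability 0.25 in between. The two pure policies therefore disagree only on the buffer states between the two thresholds, which is where the randomization has any effect.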
The application of the supermodularity method for the inner optimization step depends on a condition on the stability of the packet storage buffer and recurrence of the Markov chains. Furthermore, the analysis of the inner optimization step also relies on the theory in [59, 119], which relates average cost optimal policies to discounted cost ones. In fact, we proved that the discounted cost optimal policies are threshold and that the average cost optimal policies inherit this structure.

A natural method for estimating the optimal policy then also comprises two levels, as in the Lagrange Q-learning algorithm proposed in Chapter 5. At the inner step of the algorithm, the unconstrained threshold optimal policy is obtained via a Q-learning based algorithm that exploits the threshold structural result. At the outer step, the Lagrange multiplier is updated by the Robbins-Monro method. Besides this two-level algorithm, Chapter 5 also proposes an algorithm for direct estimation of the parameterized optimal scheduling policy. In particular, due to the randomized mixture structure, the optimal policy can be approximated by a parameterized smooth function. The optimal parameters are then estimated in real time by an adaptive simultaneous perturbation stochastic approximation algorithm. This algorithm is very efficient but only converges to a local optimum.

6.3 Summation

CSI at the physical layer has been exploited for transmission scheduling at the MAC layer. In particular, in the multiuser context, a channel-aware MAC protocol is optimized using a centralized design approach and a non-cooperative game theoretic approach. Due to the exploitation of CSI, multiuser diversity is enhanced and the system throughput improves with the number of users. Furthermore, the transmission success rates are also substantially improved, due to opportunistic transmission.
In the single-user context, CSI and the channel memory are exploited for opportunistic transmission scheduling using MDP approaches. The optimal opportunistic transmission scheduling policy achieves favorable tradeoffs between energy efficiency and throughput/delay.

The thesis has provided analytical results on the structure of the optimal policies for all optimization problems considered. In most cases, it has been shown that the optimal scheduling policy has a threshold structure. The threshold results substantially reduce the complexity involved in computing and implementing the optimal policies. Besides the analytical results, the treatment of the optimization problems is completed by the derivation of adaptive stochastic-approximation-based learning algorithms. These algorithms can estimate the optimal policies in real time and track slowly time-varying systems. The applications of the work in the thesis range from cellular networks, to WLANs, to networks of primitive sensors.

6.4 Future Research

The thesis has developed some systematic methods for deriving cross-layer transmission scheduling algorithms. In most cases, the results are also applicable to more general resource allocation problems in wireless networks. The theme of the thesis has been to derive structural results and adaptive, efficient learning algorithms for relatively general models. In other words, the analysis in this thesis has assumed slightly abstract frameworks. For future research, the work in this thesis can be extended in several directions. Either the theoretical results can be strengthened/generalized, or further analysis can be carried out for more specific network models. In what follows, we propose a few research problems related to the work in this thesis.

Correlated equilibrium for game theoretic MAC

A correlated equilibrium is an outstanding solution concept in game theory that was introduced in [42] and further studied in [142-144].
The correlated equilibrium is a generalization of the Nash equilibrium concept. Correlated equilibria arise from situations where users select their strategies according to some distribution with inherent coordination. If no user has an incentive to deviate unilaterally, the distribution constitutes a correlated equilibrium.

In the thesis, we have analysed the structure of the Nash equilibrium of the game theoretic formulation of the MAC distributed optimization problem. It is known that all convex combinations of mixed Nash equilibria are correlated equilibria [41, 43]. Therefore, given the Nash equilibrium analysis, there should be methods that provide some form of coordination between users and lead to correlated equilibria. Correlated equilibria can offer superior system performance and are typically easier to characterize [142]. Furthermore, while in general it is hard to prove the convergence of a learning algorithm to a Nash equilibrium, it is much easier to obtain a correlated equilibrium; see [55, 56]. The reason is that the set of correlated equilibria is a convex polytope that contains all Nash equilibria on its boundary. These facts serve as a strong motivation to analyse the structure of the correlated equilibria and the system performance at correlated equilibria. The next phase of our research will concern devising innovative, simple coordination techniques aiming for high-efficiency correlated equilibria for the transmission game considered in this thesis.

In [145], the authors studied the correlated equilibrium in random access networks, where users access the common collision channel via a synchronized slotted ALOHA protocol with simple coordination mechanisms. It is shown that channel utilization can be (but is not always) improved in comparison to the case when there is no coordination.
It can be expected that the study of the correlated equilibrium and the derivation of coordination mechanisms will be much more challenging for multipacket reception networks. Another example of the use of correlated equilibrium for multiuser scheduling is given in [146].

Power allocation and transmission scheduling via a dynamical game approach

A possible extension of the work in this thesis is to use a dynamical game approach. Dynamical games are a natural extension of MDPs to include multiple agents. In this thesis, we have considered game theoretic transmission scheduling in the multiple user context, and used MDP approaches for opportunistic transmission scheduling in the single user context. Hence, a dynamical game approach is, roughly, the combination of the methodologies deployed in our current work. Dynamical games are difficult to analyse, but they are a very interesting and realistic model for many scenarios in wireless communication networks. Furthermore, since learning in dynamical games can be notoriously difficult, structural results will be particularly useful. It is not clear in what context the threshold structural results proved in this thesis can be generalized to a dynamical game formulation. However, it is a research problem that is of interest to us.

Sensor medium access and sleep scheduling

Most of the work in this thesis has been based on quite general wireless communication system models. These models work well for the purpose of developing systematic analysis methods, which, to some extent, provide guidelines on how different mathematical tools can be used in wireless network optimization.
In the next phase, it will be interesting to incorporate more specific features of sensor networks into the models, e.g., data correlation, sensor collaboration, and sensor duty cycles, and to study how the current approach can be applied, or how the opportunistic transmission scheduling algorithms perform in terms of network lifetime or with regard to specific tasks of the sensor network. The work in this thesis will also be more complete if further research is carried out on lower-level issues, such as how the costs can be designed to best reflect the relation between transmission energy consumption, delay and throughput, or how to derive the generalized multipacket reception model for communication systems other than CDMA networks. In short, even though the structured optimal transmission scheduling policies can be implemented very efficiently and offer substantial performance gains, lower-level analysis tasks need to be done in detail so that the transmission algorithms can actually be incorporated in a complete design of sensor networks.

Opportunistic transmission scheduling with partially observable states

This thesis has made the assumption that perfect quantized channel state information is available to each user. In many situations, channel states can only be observed in noise and perfect quantized CSI is difficult to obtain. An immediate extension of the work here is to consider the case when the channel state cannot be observed perfectly. In such a situation, the opportunistic transmission scheduling problem can be formulated as a partially observable MDP (POMDP). The motivation for the POMDP formulation is that POMDPs are a closer approximation to the real system. However, the drawback is that POMDPs are much harder, and typically infeasible, to solve optimally. As a result, analysing a POMDP will eventually require some sort of approximation.
The interesting question is then whether the necessary approximation will link the POMDP to a completely observable MDP similar to the model in this thesis. If so, it will be possible to derive threshold structural results and numerical algorithms for the POMDP formulation. Additionally, even if the structural results cannot be generalized, the POMDP is still an interesting approach toward opportunistic transmission scheduling for wireless networks.

Bibliography

[1] S. Shakkottai, T. S. Rappaport, and P. C. Karlsson, "Cross-layer design for wireless networks," IEEE Communications Magazine, vol. 41, no. 10, pp. 74-80, 2003.
[2] G. Dimic, N. D. Sidiropoulos, and R. Zhang, "Medium access control - physical cross-layer design," IEEE Signal Processing Magazine, vol. 21, no. 5, pp. 40-50, 2004.
[3] V. Srivastava and M. Motani, "Cross-layer design: a survey and the road ahead," IEEE Communications Magazine, vol. 43, no. 12, pp. 112-119, 2005.
[4] V. Naware, G. Mergen, and L. Tong, "Stability and delay of finite-user slotted aloha with multipacket reception," IEEE Transactions on Information Theory, vol. 51, no. 7, pp. 2636-2656, 2005.
[5] J. C. Spall, Introduction to Stochastic Search and Optimization, ser. Wiley-Interscience series in Discrete Mathematics and Optimization. Wiley-Interscience, 2003.
[6] D. M. Topkis, Supermodularity and Complementarity. Princeton University Press, 1998.
[7] H. S. Wang and N. Moayeri, "Finite-state Markov channel - A useful model for radio communications channels," IEEE Trans. Vehicular Technology, vol. 44, no. 1, pp. 163-171, 1995.
[8] A. Zhang and S. A. Kassam, "Finite-state Markov model for Rayleigh fading channels," IEEE Transactions on Communications, vol. 47, pp. 1688-1692, November 1999.
[9] R. A. Berry and E. M. Yeh, "Cross-layer wireless resource allocation," IEEE Signal Processing Magazine, vol. 21, no. 5, pp. 59-68, 2004.
[10] S. Adireddy and L.
Tong, "Exploiting decentralized channel state information for random access," IEEE Transactions on Information Theory, vol. 51, no. 2, pp. 537-561, February 2005.
[11] T. S. Rappaport, Wireless Communications: Principles and Practice, 2nd ed. Prentice Hall, 2001.
[12] L. Tong, V. Naware, and P. Venkitasubramaniam, "Signal processing in random access," IEEE Signal Processing Magazine, vol. 21, no. 5, pp. 29-39, 2004.
[13] Q. Liu, S. Zhou, and G. B. Giannakis, "Queuing with adaptive modulation and coding over wireless links: cross-layer analysis and design," IEEE Transactions on Wireless Communications, vol. 4, no. 3, pp. 1142-1153, 2005.
[14] M. A. Haleem and R. Chandramouli, "Adaptive downlink scheduling and rate selection: a cross-layer design," IEEE Journal on Selected Areas in Communications, vol. 23, no. 6, pp. 1287-1297, 2005.
[15] G.-M. Su, Z. Han, M. Wu, and K. J. R. Liu, "Multiuser cross-layer resource allocation for video transmission over wireless networks," IEEE Network, vol. 20, no. 2, pp. 21-27, 2006.
[16] M. Bohge, J. Gross, A. Wolisz, and M. Meyer, "Dynamic resource allocation in OFDM systems: an overview of cross-layer optimization principles and techniques," IEEE Network, vol. 21, no. 1, pp. 53-59, 2007.
[17] S. Shamai and E. Telatar, "Some information theoretic aspects of decentralized power control in multiple access fading channels," in Proceedings of Information Theory and Networking Workshop, Metsovo, Greece, June 1999.
[18] X. Qin and R. Berry, "Exploiting Multiuser Diversity in Wireless ALOHA Networks," in Proceedings of 2001 Allerton Conference on Communication, Control and Computing, October 2001.
[19] X. Qin and R. Berry, "Exploiting Multiuser Diversity for Medium Access Control in Wireless Networks," in Proceedings of IEEE INFOCOM, San Francisco, CA, 2003.
[20] M. K. Tsatsanis, R. Zhang, and S. Banerjee, "Network-assisted diversity for random access wireless networks," IEEE Transactions on Signal Processing, vol. 48, no. 3, pp. 702-711, 2000.
[21] L. Tong, Q. Zhao, and G. Mergen, "Multipacket reception in random access wireless networks: from signal processing to optimal medium access control," IEEE Communications Magazine, vol. 39, no. 11, pp. 108-112, 2001.
[22] R. Gallager, "A perspective on multiaccess channels," IEEE Transactions on Information Theory, vol. 31, no. 2, pp. 124-142, 1985.
[23] A. Ephremides and B. Hajek, "Information theory and communication networks: an unconsummated union," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2416-2434, 1998.
[24] S. Ghez, S. Verdu, and S. C. Schwartz, "Stability properties of slotted ALOHA with Multipacket reception capability," IEEE Transactions on Automatic Control, vol. 33, no. 7, pp. 640-649, July 1988.
[25] S. Ghez, S. Verdu, and S. C. Schwartz, "Optimal decentralized control in the random access Multipacket channel," IEEE Transactions on Automatic Control, vol. 34, no. 11, pp. 1153-1163, November 1989.
[26] Q. Zhao and L. Tong, "A multiqueue service room mac protocol for wireless networks with multipacket reception," IEEE/ACM Transactions on Networking, vol. 11, no. 1, pp. 125-137, 2003.
[27] R.-H. Gau and K.-M. Chen, "Predictive multicast polling for wireless networks with multipacket reception and queuing," IEEE Transactions on Mobile Computing, vol. 5, no. 6, pp. 725-737, 2006.
[28] X. Wang and J. K. Tugnait, "A bit-map-assisted dynamic queue protocol for multiaccess wireless networks with multiple packet reception," IEEE Transactions on Signal Processing, vol. 51, no. 8, pp. 2068-2081, 2003.
[29] R. Knopp and P. A. Humblet, "Information capacity and power control in single-cell multiuser communications," in IEEE International Conference on Communications (ICC 95), Seattle, Gateway to Globalization, vol. 1, 1995, pp. 331-335.
[30] X. Lin, N. B. Shroff, and R. Srikant, "A tutorial on cross-layer optimization in wireless networks," IEEE Journal on Selected Areas in Communications, vol. 24, no. 8, pp. 1452-1463, 2006.
[31] A. J.
Goldsmith and P. P. Varaiya, "Capacity of fading channels with channel side information," IEEE Transactions on Information Theory, vol. 43, no. 6, pp. 1986-1992, 1997.
[32] H. Wang and N. B. Mandayam, "Opportunistic file transfer over a fading channel under energy and delay constraints," IEEE Transactions on Communications, vol. 53, no. 4, pp. 632-644, April 2005.
[33] L. Johnston and V. Krishnamurthy, "Opportunistic file transfer over a fading policies," IEEE Transactions on Wireless Communications, vol. 5, no. 2, pp. 394-405, February 2005.
[34] B. Collins and R. Cruz, "Transmission policies for time varying channels with average delay constraints," in Proc. 1999 Allerton Conf. on Commun., Control, and Comp., Monticello, IL, 1999.
[35] M. H. Ngo and V. Krishnamurthy, "Game Theoretic Cross-Layer Transmission Policies in Multipacket Reception Wireless Networks," IEEE Transactions on Signal Processing, vol. 55, no. 5, pp. 1911-1926, 2007.
[36] M. H. Ngo, V. Krishnamurthy, and L. Tong, "Optimal Distributed Channel-aware MAC protocol for Wireless Networks with Multipacket Reception and Decentralized Channel State Information," IEEE Transactions on Signal Processing, 2007, accepted.
[37] M. H. Ngo and V. Krishnamurthy, "Game theoretic optimal transmission strategies in multipacket reception sensor networks," in Proc. Asilomar 2006, California, USA, October-November 2005, pp. 225-229.
[38] M. H. Ngo and V. Krishnamurthy, "Non-cooperative transmission game in wireless networks with multipacket reception and packet priority," in Proc. International Conference on Communications, Istanbul, Turkey, vol. 9, June 2006, pp. 4351-4356.
[39] A. B. MacKenzie and S. B. Wicker, "Game theory and the design of self-configuring, adaptive wireless networks," IEEE Communications Magazine, vol. 39, no. 11, pp. 126-131, Nov. 2001.
[40] J. von Neumann and O. Morgenstern, Theory of Games and Economic Behaviour.
Princeton University Press, 1944.
[41] M. J. Osborne and A. Rubinstein, A Course in Game Theory. The MIT Press, 1994.
[42] R. J. Aumann, "Subjectivity and correlation in randomized strategies," Journal of Mathematical Economics, vol. 1, no. 1, pp. 67-96, 1974.
[43] D. Fudenberg and J. Tirole, Game Theory. Cambridge, MA: MIT Press, 1996.
[44] K. G. Binmore, Fun and games: a text on game theory. Lexington, Mass.: D.C. Heath, 1992.
[45] T. Heikkinen, "A potential game approach to distributed power control and scheduling," Computer Networks, vol. 50, no. 13, pp. 2295-2311, 2006.
[46] A. B. MacKenzie and S. B. Wicker, "Selfish users in ALOHA: A game theoretic approach," in Proceedings of IEEE Vehicular Technology Conference, 2001, vol. 3, Nov. 2001, pp. 1354-1357.
[47] A. B. MacKenzie and S. B. Wicker, "Game theory in communications: Motivation, explanation, and application to power control," in Proceedings of IEEE Global Telecommunications Conference, 2001, vol. 2, Nov. 2001, pp. 821-826.
[48] A. B. MacKenzie and S. B. Wicker, "Stability of multipacket slotted ALOHA with selfish users and perfect information," in Proceedings of Twenty-Second Annual Joint Conference of the IEEE Computer and Communications Societies, INFOCOM 2003, vol. 3, April 2003, pp. 1583-1590.
[49] F. Meshkati, M. Chiang, H. V. Poor, and S. C. Schwartz, "A game-theoretic approach to energy-efficient power control in multicarrier cdma systems," IEEE Journal on Selected Areas in Communications, vol. 24, no. 6, pp. 1115-1129, 2006.
[50] T. Heikkinen, "Distributed scheduling via pricing in a communication network," p. 850, 2002.
[51] E. Altman, T. Basar, and R. Srikant, "Nash equilibria for combined flow control and routing in networks: asymptotic behavior for a large number of users," IEEE Transactions on Automatic Control, vol. 47, no. 6, pp. 917-930, 2002.
[52] D. Grosu and A. T. Chronopoulos, "Algorithmic mechanism design for load balancing in distributed systems," Systems, Man and Cybernetics, Part B, IEEE Transactions on, vol.
34, no. 1, pp. 77-84, 2004.
[53] C. U. Saraydar, N. B. Mandayam, and D. J. Goodman, "Efficient power control via pricing in wireless data networks," IEEE Transactions on Communications, vol. 50, no. 2, pp. 291-303, 2002.
[54] D. Fudenberg and D. K. Levine, The theory of learning in games. MIT Press, 1998.
[55] D. P. Foster and R. V. Vohra, "Calibrated learning and correlated equilibrium," Games and Economic Behavior, vol. 21, no. 1-2, pp. 40-55, 1997.
[56] S. Hart and A. Mas-Colell, "A simple adaptive procedure leading to correlated equilibrium," Econometrica, vol. 68, no. 5, pp. 1127-1150, Sep. 2000.
[57] R. Bellman, Dynamic Programming. Princeton University Press, 1957.
[58] R. A. Howard, Dynamic Programming and Markov Processes. Cambridge, Mass: The MIT Press, 1960.
[59] M. Puterman, Markov Decision Processes. John Wiley, 1994.
[60] R. S. Sutton, "On the significance of markov decision processes," in ICANN '97: Proceedings of the 7th International Conference on Artificial Neural Networks. London, UK: Springer-Verlag, 1997, pp. 273-282.
[61] K. R. Krishnan, "Joining the right queue: a state-dependent decision rule," IEEE Transactions on Automatic Control, vol. 35, no. 1, pp. 104-108, 1990.
[62] P. D. Roorda and V. C. M. Leung, "Dynamic control of time slot assignment in multiaccess reservation protocols," IEE Proceedings-Communications, vol. 143, no. 3, pp. 167-175, 1996.
[63] G. del Angel and T. L. Fine, "Optimal power and retransmission control policies for random access systems," IEEE/ACM Transactions on Networking, vol. 12, no. 6, pp. 1156-1166, 2004.
[64] F. Yu, V. Krishnamurthy, and V. C. M. Leung, "Cross-layer optimal connection admission control for variable bit rate multimedia traffic in packet wireless cdma networks," IEEE Transactions on Signal Processing, vol. 54, no. 2, pp. 542-555, 2006.
[65] Y. Chen, Q. Zhao, V. Krishnamurthy, and D.
Djonin, "Transmission scheduling for optimizing sensor network lifetime: A stochastic shortest path approach," IEEE Transactions on Signal Processing, vol. 55, no. 5, pp. 2294-2309, 2007.
[66] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, Massachusetts: The MIT Press, 1998.
[67] D. Bertsekas, Dynamic Programming and Optimal Control. Belmont, Massachusetts: Athena Scientific, 1995, vol. 1 and 2.
[68] D. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA.: Athena Scientific, 1996.
[69] X.-R. Cao and H.-F. Chen, "Perturbation realization, potentials, and sensitivity analysis of markov processes," IEEE Transactions on Automatic Control, vol. 42, no. 10, pp. 1382-1393, 1997.
[70] S. Bhatnagar and S. Kumar, "A simultaneous perturbation stochastic approximation-based actor-critic algorithm for markov decision processes," IEEE Transactions on Automatic Control, vol. 49, no. 4, pp. 592-598, 2004.
[71] F. V. Abad, V. Krishnamurthy, K. Martin, and I. Baltcheva, "Self learning control of constrained markov chains - a gradient approach," in Proceedings of the 41st IEEE Conference on Decision and Control, 2002, vol. 2, 2002, pp. 1940-1945.
[72] J. Baxter and P. L. Bartlett, "Direct gradient-based reinforcement learning," in The 2000 IEEE International Symposium on Circuits and Systems, 2000, vol. 3, 2000, pp. 271-274.
[73] E. Altman, Constrained Markov Decision Processes. London: Chapman and Hall, 1999.
[74] F. J. Beutler and K. W. Ross, "Optimal Policies for Controlled Markov Chains with a Constraint," Journal of Mathematical Analysis and Applications, vol. 112, pp. 236-252, 1985.
[75] L. I. Sennott, "Constrained average cost Markov decision chains," Probability in the engineering and informational sciences, vol. 7, pp. 69-83, 1993.
[76] L. I. Sennott, "Constrained discounted cost Markov decision chains," Probability in the engineering and informational sciences, vol. 5, pp. 463-475, 1991.
[77] H. Kushner and G.
Yin, Stochastic Approximation Algorithms and Applications, 2nd ed. New York: Springer-Verlag, 2003.
[78] A. Benveniste, M. Metivier, and P. Priouret, Adaptive Algorithms and Stochastic Approximations, ser. Applications of Mathematics. Springer-Verlag, 1990, vol. 22.
[79] H. Robbins and S. Monro, "A stochastic approximation method," The Annals of Mathematical Statistics, vol. 22, no. 3, pp. 400-407, Sep. 1951.
[80] J. Kiefer and J. Wolfowitz, "Stochastic estimation of the maximum of a regression function," The Annals of Mathematical Statistics, vol. 23, no. 3, pp. 462-466, Sep. 1952.
[81] V. Krishnamurthy, M. Maskery, and M. H. Ngo, Wireless Sensor Networks: Signal Processing and Communications Perspectives. Wiley, 2007, ch. Game Theoretic Activation and Transmission Scheduling in Unattended Ground Sensor Networks: A Correlated Equilibrium Approach, pp. 349-388.
[82] V. Krishnamurthy and G. G. Yin, "Recursive algorithms for estimation of hidden markov models and autoregressive models with markov regime," IEEE Transactions on Information Theory, vol. 48, no. 2, pp. 458-476, 2002.
[83] L. I. Sennott, "Value iteration in countable state average cost Markov Decision Processes with unbounded costs," Annals of Operations Research, vol. 28, pp. 261-272, 1991.
[84] M. H. Ngo and V. Krishnamurthy, "On Optimal Transmission Algorithms for Slotted ALOHA Sensor Networks with Multi-Packet Reception," in Proc. International Conference on Acoustics, Speech, and Signal Processing, Philadelphia, USA, March 2005, pp. 669-672.
[85] W. Szpankowski, "Stability conditions for some distributed systems: buffered random access systems," Advances in Applied Probability, vol. 26, pp. 498-515, 1994.
[86] B. Tsybakov and V. Mikhailov, "Ergodicity of slotted ALOHA system," Problemy Peredachi Informatsii, vol. 15, pp. 73-87, Oct-Dec 1979.
[87] R. Rao and A.
Ephremides, "On the stability of interacting queues in a multiple-access system," IEEE Transactions on Information Theory, vol. 34, pp. 918-930, September 1988.
[88] W. Luo and A. Ephremides, "Stability of n interacting queues in random access systems," IEEE Transactions on Information Theory, vol. 45, pp. 1579-1587, July 1999.
[89] M. Zorzi and R. Rao, "Capture and retransmission in mobile radio," IEEE Journal on Selected Areas in Communications, vol. 12, no. 8, pp. 1289-1298, October 1994.
[90] B. Hajek, A. Krishna, and R. LaMaire, "On the capture probability for a large number of stations," IEEE Transactions on Communications, vol. 45, no. 2, pp. 254-260, February 1997.
[91] D. Tse and S. Hanly, "Linear multiuser receivers: effective interference, effective bandwidth and user capacity," IEEE Transactions on Information Theory, vol. 45, pp. 641-657, March 1999.
[92] G. Mergen and L. Tong, "On the asymptotic stable throughput of opportunistic random access," in Conference on Information Sciences and Systems, Princeton University, March 2004.
[93] D. Luenberger, Optimization by Vector Space Methods. New York: Wiley, 1969.
[94] G. Pflug, Optimization of Stochastic Models: The Interface between Simulation and Optimization. Kluwer Academic Publishers, 1996.
[95] F. V. Abad and V. Krishnamurthy, "Self learning control of constrained Markov decision processes - a valued gradient approach," GERAD-HEC Montreal, http://www.gerad.ca/fichiersjcahiers/G-2003-51.pdf, Tech. Rep. G-2003-51, August 2003.
[96] F. V. Abad, V. Krishnamurthy, I. Baltcheva, and K. Martin, "Self learning control of constrained Markov decision processes," in IEEE Conference on Decision and Control, Las Vegas, 2002.
[97] H. Kushner and G. Yin, Stochastic Approximation Algorithms and Applications. New York: Springer-Verlag, 1997.
[98] A. Pakes, "Some conditions for ergodicity and recurrence of markov chains," Operations Research, vol. 17, pp. 1058-1061, 1969.
[99] V. Krishnamurthy and M.
Ngo, "A game theoretical approach for transmission strategies in slotted ALOHA networks with multi-packet reception," in Proc. International Conference on Acoustics, Speech, and Signal Processing, Philadelphia, USA, March 2005, pp. 653-656.
[100] F. Davoli and P. Maryni, "A two-level stochastic approximation for admission control and bandwidth allocation," IEEE Journal on Selected Areas in Communications, vol. 18, no. 2, pp. 222-233, 2000.
[101] B. Hajek, "Stochastic approximation methods for decentralized control of multiaccess communications," IEEE Transactions on Information Theory, vol. 31, no. 2, pp. 176-184, 1985.
[102] H. J. Kushner and P. A. Whiting, "Convergence of proportional-fair sharing algorithms under general conditions," IEEE Transactions on Wireless Communications, vol. 3, no. 4, pp. 1250-1259, 2004.
[103] H. Zhang, W. S. Wong, W. Ge, and P. E. Caines, "A stochastic approximation approach to the power-control problem," IEEE Transactions on Communications, vol. 55, no. 5, pp. 878-886, 2007.
[104] D. E. Kirk, Optimal Control Theory: An Introduction. Prentice-Hall, Inc., 1970.
[105] G. Debreu, "A social equilibrium existence theorem," Proceedings of the National Academy of Sciences of the United States of America, vol. 38, no. 10, pp. 886-893, October 1952.
[106] K. Fan, "Fixed-point and minimax theorems in locally convex topological linear spaces," Proceedings of the National Academy of Sciences of the United States of America, vol. 38, no. 2, pp. 121-126, February 1952.
[107] I. L. Glicksberg, "A further generalization of the kakutani fixed point theorem, with application to nash equilibrium points," Proceedings of the American Mathematical Society, vol. 3, no. 1, pp. 170-174, February 1952.
[108] V. Solo and X. Kong, Adaptive Signal Processing Algorithms - Stability and Performance. N.J.: Prentice Hall, 1995.
[109] G. B. Folland, Real Analysis: modern techniques and their applications. Wiley-Interscience, 1999.
[110] G. Yin and V.
Krishnamurthy, "LMS algorithm for tracking slow markov chains with applications to hidden markov estimation and adaptive multiuser detection," IEEE Transactions on Information Theory, vol. 51, no. 7, pp. 2475-2490, July 2005. [111] S. Kakutani, "A generalization of brouwer's fixed point theorem," Duke Mathematical, vol. 8, pp. 457-459, 1941. [112] M. H. Ngo and V. Krishnamurthy, "Optimality of Threshold Policies for Transmission Scheduling in Correlated Fading Channels," IEEE Transactions on Communications, 2007, submitted. [113] , "On the optimality of threshold scheduling policies for video transmission in markovian fading wireless channels with channel-aware arq," in Proc. Globecom, San Francisco, USA, December 2006, pp. 1-5. [114] Y. Guan and L. Turner, "Generalised FSMC model for radio channels with correlated fading," in IEE Proceedings Communications, vol. 146, 1999, pp. 133-137. [115] J. Razavilar, K. J. R. Liu, and S. Marcus, "Jointly optimized bit-rate/delay control policy for wireless packet networks with fading channels," IEEE Transactions on Communications, vol. 50, pp. 484-494, March 2002. [116] Q. Zhang and S. A. Kassam, "Hybrid ARQ with Selective Combining for Video Transmission over Wireless Channels," in International Conference on Image Processing, October 1997, pp. 692-695. 198 Bibliography [117] Q. Zhang, W. Zhu, and Y.-Q. Zhang, "Channel-adaptive resource allocation for scalable video transmission over 3g wireless network," IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 8, pp. 1049-1063, 2004. [118] C. Derman, G. Lieberman, and S. Ross, "Optimal system allocations with penalty cost," Management Science, vol. 23, no. 4, pp. 399-403, December 1976. [119] S. Ross, Introduction to Stochastic Dynamic Programming. San Diego, California.: Academic Press, 1983. [120] N. Arulselvan and R. Berry, "Energy-throughput optimization for wireless ARQ protocols," in Proceedings of ICASSP, IEEE 2005, March 2005, pp. V797—V800. 
[121] J. Lu, K. B. Letaief, and M. L. Liou, "Robust video transmission over correlated mobile fading channels," IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 5, pp. 737-751, 1999. [122] C. Iskander and P. Mathiopoulos, "Fast simulation of diversity nakagami fading channels using finite-state markov models," IEEE Trans. Broadcasting, vol. 49, no. 3, pp. 737-751, August 1999. [123] F. Borgonovo and A. Capone, "Efficiency of Error-Control Schemes for Real-Time Wireless Applications on the Gilbert Channel," IEEE Trans. Vehicular Technology, vol. 54, no. 1, pp. 246-258, January 2005. [124] H. Jiang, W. Zhuang, and X. Shen, "Dynamic Resource Allocation for Video Traffic over Time-Varying CDMA Wireless Channels," in Proceedings of IEEE WCNC'05, March 2005, pp. 1323-1328. [125] P. Bucciol, E. Masala, and J. D. Martin, "Perceptual ARQ for H.264 Video Streaming over 3G Wireless Networks," in Proceedings of ICC'04, June 2004, pp. 1288-1292. [126] D.P.Heyman and M.J.Sobel, Stochastic Models in Operations Research. McGrawHill, 1984, vol. 2. [127] M. H. Ngo and V. Krishnamurthy, " Monotonicity of Constrained Optimal Transmission Policies in Correlated Fading Channels with ARQ," IEEE Transactions on Mobile Computing, 2007, submitted. [128] M. Ngo and V. Krishnamurthy, "On optimality of monotone channel-aware transmission policies: A constrained markov decision process approach," in Proc. International Conference on Acoustics, Speech, and Signal Processing, Honolulu, Hawaii, USA, April 2007, pp. 653-656. [129] D.-J. Ma, A. M., and A. Shwartz, "Estimation and optimal control for constrained markov chains," in Decision and Control, .1986 25th IEEE Conference on, vol. 25, 1986, pp. 994-999. 199 Bibliography [130] L. I. Sennott, "Average Cost Optimal Stationary Policies in Infinite State Markov Decision Processes with Unbounded Costs," Operations Research, vol. 37, no. 4, pp. 626-633, July-August 1989. [131] M. Goyal, A. Kumar, and V. 
Sharma, "Power constrained and delay optimal policies for scheduling transmission over a fading channel," in INFOCOM 2003, vol. 1, 2003, pp. 311-320 void. [132]D. Djonin and V. Krishnamurthy, " MIMO Transmission Control in Fading Channels A Constrained Markov Decision Process Formulation with Monotone Policies," IEEE Transactions on Signal Processing, 2006, to appear. [133] L. I. Sennott, "On Computing Average Cost Optimal Policies with Application to Routing to Parallel Queues," Mathematical Methods of Operations Research, vol. 45, pp. 45-62, 1997. [134]R. Cavazos-Cadena, "Undiscounted Value Iteration in Stable Markov Decision Chains with Bounded Rewards," Journal of Mathematical Systems, Estimation, and Control, vol. 6, no. 2, pp. 1-34, 1996. [135] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3, pp. 279-292, 05/01 1992. [136] A. G. Barto, S. J. Bradtke, and S. P. Singh, "Learning to act using real-time dynamic programming," Artificial Intelligence, vol. 72, no. 1-2, pp. 81-138, 1995. [137]L. P. Kaelbling, M. L. Littman, and A. P. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237-285, 1996. [138] D. V. Djonin and V. Krishnamurthy, "Q-learning algorithms for constrained markov decision processes with randomized monotone policies: Application to mimo transmission control," IEEE Transactions on Signal Processing, vol. 55, no. 5, pp. 2170-2181, 2007. [139] P. Bremaud, Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Springer-Verlag, 1999. [140] J. C. Spall, "Multivariate stochastic approximation using a simultaneous perturbation gradient approximation," IEEE Transactions on Automatic Control, vol. 37, no. 3, pp. 332-341, 1992. [141] , "Stochastic optimization and the simultaneous perturbation method," in Winter Simulation Conference, 1999, pp. 101-109. [142] R. J. Aumann, "Correlated equilibrium as an expression of bayesian rationality," Econometrica, vol. 55, no. 
1, pp. 1-18, Jan. 1987. 200 Bibliography [143] S. Hart and D. Schmeidler, "Existence of correlated equilibria," Mathematics of Operations Research, vol. 14, no. 1, pp. 18-25, Feb. 1989. [144] A. Neyman, "Correlated equilibrium and potential games," International Journal of Game Theory, vol. 26, no. 2, pp. 223-227,1997. [145] E. Altman, N. Bonneau, and M. Debbah, "Correlated equilibrium in access control for wireless communications," in 5th IFIP International Conferences on Networking, May 15-19, 2006, Coimbra, Portugal, May 2006. [146] M. Maskery and V. Krishnamurthy, "Decentralized activation in a zigbee-enabled unattended ground sensor network: A correlated equilibrium game theoretic analysis," in IEEE International Conference on Communications, 2007, 2007, pp. 3915-3920. 201
Cross-layer adaptive transmission scheduling in wireless networks Ngo, Minh Hanh 2007
Item Metadata
Title | Cross-layer adaptive transmission scheduling in wireless networks |
Creator | Ngo, Minh Hanh |
Publisher | University of British Columbia |
Date Issued | 2007 |
Extent | 10249140 bytes |
Subject | Wireless network optimization; Channel state information; Markov Decision Process |
Genre | Thesis/Dissertation |
Type | Text |
File Format | application/pdf |
Language | eng |
Date Available | 2008-09-05 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0066600 |
URI | http://hdl.handle.net/2429/1626 |
Degree | Doctor of Philosophy - PhD |
Program | Electrical and Computer Engineering |
Affiliation | Applied Science, Faculty of; Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2008-05 |
Campus | UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |