Game Theoretic Methods for Networked Sensors and Dynamic Spectrum Access

by

Michael Maskery

B.Sc., Queen’s University, 2000
M.Sc., The University of Ottawa, 2003

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

The Faculty of Graduate Studies
(Electrical and Computer Engineering)

The University Of British Columbia
December, 2007

© Michael Maskery 2007

Abstract

Automated devices enabled by wireless communications are deployed for a variety of purposes. As they become more ubiquitous, their interaction becomes increasingly important for coexistence when sharing a scarce resource, and for leveraging potential cooperation to achieve larger design goals. This thesis investigates the use of game theory as a tool for design and analysis of networked systems of automated devices in the areas of naval defence, wireless environmental monitoring through sensor networks, and cognitive radio wireless communications.

In the first part, decentralized operation of naval platforms deploying electronic countermeasures against missile threats is studied. The problem is formulated as a stochastic game in which platforms independently plan and execute dynamic strategies to defeat threats in two situations: where coordination is impossible due to lack of communications, and where platforms hold different objectives but can coordinate, according to the military doctrine of Network Enabled Operations. The result is a flexible, robust model for missile deflection for advanced naval groups.

Next, the problem of cooperative environmental monitoring and communication in energy-constrained wireless sensor networks is considered from a game-theoretic perspective. This leads to novel protocols in which sensors cooperatively trade off performance with energy consumption with low communication and complexity overhead.
Two key results are an on-line adaptive learning algorithm for tracking the correlated equilibrium set of a slowly varying sensor deployment game, and an analysis of the equilibrium properties of threshold policies in a game with noisy, correlated measurements.

Finally, the problem of dynamic spectrum access for systems of cognitive radios is considered. A game theoretic formulation leads to a scheme for competitive bandwidth allocation which respects radios’ individual interests while enforcing fairness between users. An online adaptive learning scheme is again proposed for negotiating fair, equilibrium resource allocations, while dynamically adjusting to changing conditions.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
Statement of Co-Authorship

1 Introduction
  1.1 Principles of Game Theoretic Design and Analysis
    1.1.1 Game Theoretic Equilibrium
    1.1.2 Stochastic Games
    1.1.3 Benefits and Challenges of Game-Theoretic Design and Analysis
  1.2 Main Contributions
    1.2.1 Contributions to Decentralized Force Protection
    1.2.2 Contributions to Decentralized Sensor Activation Control
    1.2.3 Contributions to Decentralized Dynamic Spectrum Access
    1.2.4 Contributions to Self-Configuration of Sensors with Global Games
  1.3 Review of Technology
    1.3.1 Missile Deflection and Network Centric Warfare
    1.3.2 Wireless Sensor Networks
    1.3.3 Cognitive Radio Dynamic Spectrum Access
  1.4 Contents of this Thesis
  Bibliography

2 Decentralized Force Protection: Nash Equilibrium
  2.1 Introduction
  2.2 Motivation and Review of Stochastic Games
    2.2.1 Decentralized Architecture for Implementing Countermeasures with a Common Air Picture
    2.2.2 Motivation: Pulse Sequence Jamming over a Link 16 Network
    2.2.3 Review of Stochastic Shortest Path Games
  2.3 Decentralized Missile Deflection as a Stochastic Shortest Path Game
    2.3.1 State Space of the Multiple Missile Threat
    2.3.2 Dynamics of a Radar-Guided Multiple Missile Threat
    2.3.3 Countermeasure Actions
    2.3.4 Platform Objectives and State Costs
  2.4 Computation of Decentralized Missile Deflection Policies
    2.4.1 Computation of an Optimal Response for a Single Ship
    2.4.2 Exploiting Transient Classes in the MDP
    2.4.3 Nash Equilibrium Solution for Multiple Ships
    2.4.4 Pareto Optimality and Nash Equilibrium
  2.5 Introducing CM Limitations via Constraints
    2.5.1 Limiting CM availability via State Space Augmentation
    2.5.2 Limiting CM Directly in the Linear Program
    2.5.3 Constraining CM Sequences
  2.6 Numerical Results
    2.6.1 Maximizing Minimum Approach of a Missile with Simple Targeting
    2.6.2 Minimizing Probability of Hit for a Missile with Forward Momentum
  2.7 Conclusion
  Bibliography

3 Decentralized Force Protection: Correlated Equilibrium
  3.1 Introduction
    3.1.1 Network Enabled Operations
    3.1.2 The Case for Correlated Equilibrium in NEOps
  3.2 Network Enabled Operations in Dynamic Missile Deflection
    3.2.1 Single Missile Process
    3.2.2 Multiple Missile Process
  3.3 Decentralized Missile Deflection as a Transient Stochastic Game
  3.4 Characterization of the NEOps Equilibrium Strategy
    3.4.1 Correlated Equilibrium as the Solution to a Bilinear Program
  3.5 Information Theoretic Metrics in NEOps
    3.5.1 Optimizing Information Flow in NEOps
    3.5.2 Implementation Architectures for NEOps Equilibrium Strategies
  3.6 Numerical Example of NEOps
    3.6.1 Inner Game (Single Missile in Attack Mode)
    3.6.2 Outer Game (Selection from Three Missiles)
    3.6.3 Outer Game (Myopic Selection from Ten Missiles)
  3.7 Conclusion
  Bibliography

4 Decentralized UGSN Activation Control
  4.1 Introduction
  4.2 Decentralized Energy-Aware Node Activation
    4.2.1 Measurement Model
    4.2.2 Activation Control Utility Function
    4.2.3 Nonlinear Adaptive Filtering for Sensor Activation
    4.2.4 Performance of Regret Tracking
  4.3 Correlated Equilibrium and Convergence
    4.3.1 Definition of Correlated Equilibrium
    4.3.2 Convergence Analysis of Algorithm 4.2.2
    4.3.3 Tracking Performance of Algorithm 4.2.2
    4.3.4 Algorithm Improvements and Pricing
  4.4 Test Bed and Numerical Results for ZigBee-enabled UGSN
    4.4.1 Tracking Dynamic Targets: Comparison of Performance
    4.4.2 Parameter Selection and Sensor Estimation
  4.5 Discussion
  Bibliography

5 Game Theoretic Dynamic Spectrum Access
  5.1 Introduction
  5.2 Opportunistic Spectrum Access Model
  5.3 Estimating Channel Characteristics
    5.3.1 Channel Contention Estimate
    5.3.2 Other Channel Characteristics
  5.4 System Performance and Radio Utility
    5.4.1 Global System Utility
    5.4.2 Local Radio Utility
  5.5 Decentralized Adaptive Channel Allocation
    5.5.1 Convergence of Adaptive Channel Allocation
    5.5.2 Other Adaptive Strategies
  5.6 Numerical and Simulation Results
    5.6.1 Effect of Utility Pricing on Performance of Regret Tracking
    5.6.2 Comparison of Algorithms in a Static Environment
    5.6.3 Performance of Regret Tracking in a Dynamic Environment
  5.7 Conclusions
  Bibliography

6 Self-Configuration of Sensors: Global Games
  6.1 Introduction
  6.2 A Global Game for Sensor Mode Selection
    6.2.1 Characteristics of Global Games
    6.2.2 Model Description
    6.2.3 Predicting the Proportion of Active Sensors
  6.3 Nash Equilibrium Threshold Strategies
    6.3.1 Nash Equilibrium Threshold Strategies for Uniform Observation Noise
    6.3.2 Threshold Strategies for Gaussian Priors and Noise
  6.4 Further Analysis of Mode Selection Thresholds
    6.4.1 Adaptive Self-Configuration of Mode Selection Thresholds
    6.4.2 Robustness of Threshold under Measurement Averaging
    6.4.3 Robustness of Threshold under Variance-based Measurement Quality
  6.5 Numerical Examples
    6.5.1 Single Utility, Weak Congestion
    6.5.2 Multiple Utilities, Mixed Congestion
    6.5.3 Convergence of Algorithm 6.4.1
  6.6 Conclusions
  Bibliography

7 Conclusion
  7.1 Common Approach to Game Theoretic Design and Analysis
    7.1.1 System Objectives and Capabilities
    7.1.2 Component Awareness
    7.1.3 Local Incentives
    7.1.4 Equilibrium Analysis
  7.2 Evaluation of Results
    7.2.1 Analytical Tools Used
    7.2.2 Strengths and Weaknesses
    7.2.3 Characteristics of Appropriate Applications
  7.3 Future Work
  Bibliography

Appendices

A UGSN Architecture
  A.1 Sensor Network Model and Architecture
  A.2 ZigBee/IEEE 802.15.4 Communication Model
  Bibliography

B Global Games Proofs
  Bibliography

List of Tables

3.1 Missile hit probabilities
3.2 Missile deflection game costs
4.1 Simulation Parameters
4.2 Regret-based vs. greedy sensor activation
A.1 IEEE 802.15.4 Standard Values
A.2 IEEE 802.15.4 Power Consumption

List of Figures

1.1 Organization of Thesis.
2.1 Overview of decentralized missile deflection system
2.2 Missile transition probabilities for example 1
2.3 Expected Nash equilibrium costs for example 1
2.4 Expected Pareto optimal costs for example 1
2.5 Sample path of Nash equilibrium for missile deflection example 1
2.6 Missile transition probabilities for example 2
2.7 Sample path of Nash equilibrium for missile deflection example 2
2.8 Performance of Nash equilibrium for missile deflection
3.1 Decentralized missile deflection using NEOps
3.2 Communication architecture for NEOps
3.3 Missile transition probabilities for NEOps example
3.4 Mutual Information for NEOps example
4.1 Activation control as adaptive filtering
4.2 Sensor utility and target tracking
4.3 Sensor lock time and target tracking
4.4 Sensor network pricing performance
4.5 Sensor network timer performance
5.1 Block diagram of cognitive radio system.
5.2 Performance of contention estimate for cognitive radio.
5.3 Cognitive radio performance vs. utility parameters
5.4 Performance of cognitive radio decentralized channel allocation
5.5 Regret analysis of cognitive radio decentralized channel allocation
5.6 Performance of cognitive radio channel allocation for dynamic conditions
6.1 Block diagram for global game
6.2 Global game example 1: sensor utility
6.3 Global game example 1: threshold vs. noise
6.4 Global game example 1: thresholds for diverse sensors
6.5 Global game example 2: sensor utility
6.6 Global game example 2: equilibrium conditions
6.7 Convergence of iterative threshold computation
A.1 IEEE 802.15.4 Transmission Energy
A.2 IEEE 802.15.4 Success Probability

Acknowledgements

This work would not have been possible without the support, enthusiasm and guidance of my supervisor, Vikram Krishnamurthy, to whom I owe a debt of thanks. I would also like to acknowledge the support and feedback from my fellow graduate students at UBC, and from the scholars who visited UBC or whom I met while working on this material. Finally, I would like to thank Christina O’Regan, Qing Zhao and George Yin, dedicated researchers and scholars with whom I collaborated to produce the manuscripts contained herein.

The work produced here was funded in part by Defence Research and Development Canada, and by an NSERC Postgraduate Doctoral Scholarship.

Dedicated to my family, my friends, and to Jacqueline.

Statement of Co-Authorship

This thesis is based on several manuscripts, resulting from collaboration between multiple researchers.

A version of Chapter 2 will appear in 2007 in IEEE Transactions on Aerospace and Electronic Systems. This was co-authored with Vikram Krishnamurthy and Christina O’Regan.

A version of Chapter 3 will also appear in 2007 in IEEE Transactions on Aerospace and Electronic Systems, and was co-authored with Vikram Krishnamurthy.

A version of Chapter 4 was submitted to IEEE Transactions on Signal Processing. This was co-authored with Vikram Krishnamurthy and George Yin.

A version of Chapter 5 was submitted to IEEE Transactions on Communications. This was co-authored with Vikram Krishnamurthy and Qing Zhao.

A version of Chapter 6 was submitted to IEEE Transactions on Signal Processing. This was co-authored with Vikram Krishnamurthy.

The author’s specific contributions in these collaborations are detailed below.

• Identification and design of research program: The research program was designed jointly based on ongoing discussion between the author and Vikram Krishnamurthy.
Identification and design in specific research topics also depended on significant input from collaborators as follows:
  – The overall problem formulation and physical scenario for decentralized missile deflection in Chapters 2 and 3 depended on input from Christina O’Regan and other researchers at Defence Research and Development Canada.
  – The results on strong and weak convergence of the regret-based learning algorithms in Chapters 4 and 5 are due in part to George Yin.
  – The formulation of the cognitive radio model in Chapter 5 was performed in collaboration with Qing Zhao.

• Performing the research: All major research, including detailed problem specification, model design, performance of analysis and identification of results was performed by the author, with assistance from Vikram Krishnamurthy.

• Data analyses: All numerical examples, simulation, and data analysis were performed by the author.

• Manuscript preparation: The author prepared the majority of all manuscripts, with the exception of the following:
  – Qing Zhao wrote the section on related work in the introduction of Chapter 5, and was responsible for suggestions and improvements throughout the chapter.
  – George Yin was responsible for the technical treatment of convergence and tracking for the regret-based learning algorithms in Chapter 4.
  – Vikram Krishnamurthy outlined the proposed extensions in the conclusion of Chapter 4. He was also responsible for various suggestions and improvements throughout this thesis.
  – Christina O’Regan and Pierre Lavoie were responsible for editing and suggestions in Chapters 2 and 3.

Chapter 1

Introduction

Since the first microcontroller was released in 1976, embedded electronics have become an integral part of modern civilization. Even today, some thirty years later, new uses are still being found for embedded electronics, which interact ever more seamlessly with their environment and their human users.
The Case for Inter-Device Interaction: With electronics embedded everywhere, it is unavoidable that embedded devices will encounter one another as they perform their core functions. The prevalence of these interactions is compounded by the recent trend of incorporating wireless communications into devices to enhance their functionality. Indeed, the wireless environment is inherently far-reaching, linking together devices that may be separated by large physical distances.

This thesis comprises a study of the interaction among such “networked” devices, which must coordinate and coexist in this wireless landscape. In particular, interaction is studied in networks of wireless-enabled missile deflection systems, wireless sensor networks for environmental monitoring, and cognitive radio wireless communication networks.

As each device we consider is an independent entity, these networks require new tools for engineering design and analysis. For example, although devices might be capable of wireless communication, the bandwidth originally allocated for coordination may be co-opted to satisfy other, more pressing needs. In addition, decentralization may be required to support practical features such as self-organization and robustness, since the device network may be too complex or dispersed to coordinate centrally. Consequently, devices comprising modern systems may exhibit a high level of autonomy, and may need to operate based only on what local information can be easily gathered.

Decentralized Systems Approach: Since the independently operating devices considered in this thesis interact, they may be considered as a group to comprise a decentralized system, which can be analyzed for global patterns of behaviour. One natural and emerging theoretical framework for studying the behaviour of such “multi-agent systems” [55] is game theory, and indeed the major theme of this thesis is the use of noncooperative game theory in a cooperative way, to analyze independent and locally aware devices in a decentralized system set up to achieve desirable overall behaviour.

Two important tasks must be completed for a proper game theoretic study. First, the individual device incentives must be identified, and analyzed to determine whether they favour a fair balance between the device’s own objectives and the objectives of others (or the system). This is the problem of determining an appropriate utility function for each device, mapping its awareness and action to a real value. Second, recognizing that awareness may be limited and incentives may differ, a mechanism must be found which allows devices to adapt to each other, such that their combined effort reaches a stable operating point, in which each device’s actions are an optimal response (from its own perspective) to its perceived environment and the actions of others. This is the problem of specifying a mechanism for action that leads to game theoretic equilibrium.

The rest of this introductory chapter provides a brief overview of the proposed game theoretic approach to decentralized systems, along with the various applications appearing in this thesis. Since subsequent chapters are intended to be self-contained, more detailed literature reviews, motivations, and descriptions of each topic are contained within each chapter. Notation is also self-contained within each chapter.

1.1 Principles of Game Theoretic Design and Analysis

Game theory is a complexity-based theory, along with percolation theory, cellular automata and some aspects of advanced physics and ecology modeling. Such theories derive rich dynamics through the interaction of simple components, and are used either as descriptive tools, to predict the outcome of complex interactions, or as prescriptive tools, to design systems around given interaction rules.
Game theory is used for both design and analysis in this thesis, to contribute to a new, rich theory for decentralized systems engineering.

The essential goal of game theory is to describe the “optimal” outcome of a process involving multiple decision makers, or players. Each player’s outcome is affected not only by their own decision, but also by the decisions of others. Solving the game, therefore, involves a complex process of players guessing what each other will do, while revealing as little information as possible about their own strategies. Canonical examples are numerous, including the famous “Prisoner’s Dilemma,” strictly competitive games, such as “Rock Paper Scissors,” coordination games, and mixed games [25].

Tools of Game Theory: The basic game theoretic approach in this research is well documented in the mathematics, economics and engineering literature. Comprehensive texts on the subject include [25, 46]. The theory was originally developed by mathematicians and economists, notably von Neumann and Morgenstern [56], as a tool for studying spontaneous organization of behaviour among self-interested agents in an interactive environment. Each player is treated as a decision maker with a set of possible actions and a utility for each action, which is to be maximized. The utility, however, is also a function of others’ actions, which makes the maximization problem nontrivial.

1.1.1 Game Theoretic Equilibrium

The central concept in noncooperative game theory is the equilibrium, which represents a joint strategy for selecting actions such that no agent can improve their utility by deviating from their part of the strategy. Several variations on the equilibrium concept stand out, including the Nash [44], correlated [5, 6], and communication equilibria [23, 29]. Each variation has its own place, and both the Nash and correlated equilibria are central to this thesis.¹

¹ The communication equilibrium is a straightforward extension to the correlated equilibrium, but typically requires more coordination than is allowed for in the applications considered here.

The Nash equilibrium describes situations where each player must select their actions independently; the covariance between the random variables representing individual player actions is zero. The correlated equilibrium relaxes this condition, allowing nonzero covariance, so that players may select actions based on some common information. Several mechanisms have been proposed to realize the correlated equilibrium, using either a central authority who recommends or verifies proposed actions [5, 25, 37], or using a verification scheme among the players [7, 9, 10, 22, 27, 54, 60].

Relationship between Nash and Correlated Equilibrium: The correlated equilibrium is an interesting generalization of the Nash equilibrium, and we study it extensively in Chapters 3, 4 and 5. In fact, the correlated equilibria comprise a convex polytope in the space of all randomized strategies, with the Nash equilibria lying on its boundary [45]. Consequently, correlated equilibria are particularly simple to compute, as they are solutions to a linear feasibility problem, or a linear programming problem if one seeks to find the “best” correlated equilibrium according to some linear criterion.

Learning in Games: An equilibrium strategy is a probability distribution over the joint actions of all players. However, in practical situations awareness is often distributed such that no player can possibly compute such an equilibrium on their own. In this case, researchers have proposed adaptive procedures to approximate the equilibrium, with each player learning their part of the equilibrium by repeated interactive play and strategy adjustment. The most well-known such procedure, fictitious play, was introduced in 1951 [48], and has been extensively studied, see [24].
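As an illustration of the kind of adaptive procedure described above, the following is a minimal sketch of fictitious play for a two-player matrix game. It rests on assumptions that are not taken from this thesis: the game is Matching Pennies, ties are broken in favour of the lowest-indexed action, and the name fictitious_play and its arguments are hypothetical. In each round, every player best-responds to the empirical counts of the opponent's past actions. For two-player zero-sum games such as this one, the empirical frequencies are known to approach an equilibrium, here the uniform mixed strategy.

```python
def fictitious_play(payoff_row, payoff_col, rounds):
    """Run fictitious play and return the empirical action frequencies.

    payoff_row[a][b] and payoff_col[a][b] are the utilities of the row and
    column player when the row player uses action a and the column player b.
    """
    n_row, n_col = len(payoff_row), len(payoff_row[0])
    row_counts = [0] * n_row  # how often the row player has used each action
    col_counts = [0] * n_col  # how often the column player has used each action

    for _ in range(rounds):
        # Score every action against the opponent's empirical action counts
        # (unnormalized counts give the same best response as frequencies).
        row_scores = [
            sum(payoff_row[a][b] * col_counts[b] for b in range(n_col))
            for a in range(n_row)
        ]
        col_scores = [
            sum(payoff_col[a][b] * row_counts[a] for a in range(n_row))
            for b in range(n_col)
        ]
        # Best response; max() returns the lowest index when scores tie.
        a = max(range(n_row), key=row_scores.__getitem__)
        b = max(range(n_col), key=col_scores.__getitem__)
        row_counts[a] += 1
        col_counts[b] += 1

    return ([c / rounds for c in row_counts],
            [c / rounds for c in col_counts])


# Matching Pennies (illustrative choice): the row player wins on a match,
# the column player wins on a mismatch.
MATCHING_PENNIES_ROW = [[1, -1], [-1, 1]]
MATCHING_PENNIES_COL = [[-1, 1], [1, -1]]

if __name__ == "__main__":
    row_freq, col_freq = fictitious_play(
        MATCHING_PENNIES_ROW, MATCHING_PENNIES_COL, rounds=10000)
    print(row_freq, col_freq)  # both should be close to [0.5, 0.5]
```

Note that each player only needs its own payoff matrix and a record of the opponent's observed actions, which is the sense in which such procedures suit settings with distributed awareness.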
More recently, generalizations have been proposed in the form of regret-based learning (e.g. [11, 16, 30, 31]), and it is these that are the focus of our work. We use tools from stochastic approximation [36] to further modify these regret-based procedures to an adaptive version, which is more useful in practical applications.

Game theory can also be interpreted as an analytical tool for determining the effectiveness and stability of simple, adaptive learning rules. This is the approach taken in Chapters 4 and 5, which set up rules of behaviour, then analyze them to show convergence to equilibrium, implying competitive optimality between devices.

1.1.2 Stochastic Games

Traditional game theory addresses a static decision problem. However, many practical problems require a sequence of decisions, each of which affects future performance. The problem of trading off present gain and future reward is captured, for a single decision maker, by the well-known Hamilton-Jacobi-Bellman equation of dynamic programming and optimal control theory [13, 47]. The analogous dynamic programming model for multiple decision makers is the subject of the theory of stochastic games [21, 51], which, interestingly, was first published slightly before the famous single-decision-maker theory of Bellman.

In a stochastic game, multiple players face a sequence of static noncooperative games according to a Markov chain. Each joint action provides some reward (or cost) to each player, and determines a probability distribution, according to which the next game is selected. Standard practice is to formulate and simultaneously solve a number of coupled “stage games” which give the expected total reward associated with each game and action, given the expected actions in the other games. These stage games can be formulated for discounted, finite-horizon, and average cost variants, and the equilibrium concept extended to hold for each stage game simultaneously.
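To make the stage-game construction concrete, the discounted case can be sketched as follows. The notation is an assumption introduced here for illustration and is not the notation of later chapters: S is the state space (the set of static games), a = (a_1, ..., a_N) is a joint action, r_i(s, a) is the immediate reward to player i, P(s' | s, a) is the transition probability to the next game, beta in [0, 1) is a discount factor, pi(a | s) is a stationary joint strategy, and V_i(s) is the expected total discounted reward to player i starting from state s under pi.

```latex
% Assumed notation (illustrative only): state s, joint action a = (a_1, ..., a_N),
% reward r_i, transition kernel P, discount factor \beta, stationary strategy \pi.
\begin{align*}
Q_i(s, a) &= r_i(s, a) + \beta \sum_{s' \in S} P(s' \mid s, a)\, V_i(s'), \\
V_i(s)    &= \sum_{a} \pi(a \mid s)\, Q_i(s, a).
\end{align*}
```

Under these assumptions, a stationary strategy pi is an equilibrium of the stochastic game when, in every state s, the distribution pi(. | s) is an equilibrium of the static game whose payoffs are Q_i(s, .). The coupling between stage games arises because each Q_i depends on the values V_i of the other states, which in turn depend on pi.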
Solving Stochastic Games: Solutions of stochastic games are extremely complex and generally less well understood than their static counterparts. For special, simplified cases, including zero-sum games, several methods for computing a stationary Nash equilibrium based on linear programming are proposed in [21], and a homotopy method is proposed in [33]. In Chapters 2 and 3, we propose other computational methods, including a sequence of linear programs which converges for “team” games, and a bilinear programming formulation for computing correlated equilibria in general stochastic games.

1.1.3 Benefits and Challenges of Game-Theoretic Design and Analysis

The game theoretic approach has several appealing features for system design and analysis. Simple devices with limited awareness can be equipped with pre-configured or configurable utility functions and routines for maximizing their utility in an interactive, even unknown environment. Such devices can then be deployed to organize themselves effectively in a dynamic and unknown environment. As long as the utility functions are properly specified, these “networked” devices can be made to exhibit many desired behaviors.

This game theoretic design approach echoes natural systems in a form of biomimicry [12]; biological agents such as insects have long since evolved the proper procedures and utility preferences to accomplish collective tasks. The same can be said of humans, who orient their utilities according to economics; the proper specification of utilities in this case is dictated by the structure of the economic system and realized through such mechanisms as pricing.

The self-organizing feature of game-theoretic (decentralized) systems results in a specific set of benefits and challenges.
While centralized systems can achieve better overall performance when the application is known and information can be easily distributed, there is a growing interest in developing decentralized systems for situations where this is not the case. By spreading intelligence throughout the system, game theoretic systems can eliminate the need to transport information to and from a central point, while still allowing local information exchange to any desired degree.

Benefits: The key benefits of game theoretic design and analysis are listed as follows:

• Robustness: Since each device in a decentralized system is independent, the failure of a few devices does not necessarily result in the failure of the entire system. This allows systems to continue functioning in difficult environments, where devices may be maliciously disabled, temporarily cut off from communication, or drained of energy reserves.
• Scalability: Likewise, game theoretic systems naturally support the addition of devices with little impact on performance. Moreover, the complexity of each device does not increase with the number of devices in the system, whereas in centralized systems complexity may increase rapidly.
• Flexible Self-Organization: Since awareness and decision capabilities are co-located, devices in a decentralized system naturally and independently optimize their own behaviour without outside assistance. Since device awareness includes the actions (or their effects) of others, this results in a spontaneous self-organizing behaviour for the system, even in unknown environments. Moreover, awareness and decisions are local in nature, making the overall complexity of the environment irrelevant to the reaction of the system. This feature also facilitates quick response to local environmental changes, since new information is adapted to at its source.
• Low Communication Overhead: The self-organization feature eliminates the need for awareness and decision data to be routed through the system to a central point, thus reducing overhead. However, since the end goal may be to report data, communication may still be needed for transferring (aggregated) information through the system. There is also the possibility of “tuning” system parameters on a slow time scale, which requires low but nonzero bandwidth. A final overhead reduction arises since it is not necessary for mutual awareness to be achieved through explicit communication; devices may simply derive mutual awareness by observing their neighbours’ impact on the immediate environment (e.g. wireless MAC interaction in Chapters 4 and 5).
• Modular Analysis: Game theory is a beneficial analytical tool, since it allows for a decomposition of a large-scale problem into a number of sub-problems. Specifically, one can find an equilibrium operating point by considering the optimization of each component (player) in turn, assuming the others are acting optimally. (This is similar to dynamic programming, in which one chooses an optimal action assuming optimal actions will be taken in the future.) This replaces the complex problem of global optimization with a number of smaller optimizations, which are solved simultaneously to provide an equilibrium point, representing a competitive version of optimality. Such an approach is often useful from a computational perspective, and also provides insight into the fundamental mechanics of the problem at hand.

Challenges: The main drawback of game theoretic systems derives from the lack of consensus and coordination among system devices. This gives rise to some key challenges, listed as follows:

• Limited Awareness: A typical device in a game theoretic system will only be aware of its immediate environment. (If it were aware of everything, a centralized system would also be feasible.) Thus, the system often behaves in a myopic fashion, since distant conditions cannot be anticipated.
• Limited Coordination: Similarly, only the behaviour of other nearby devices can be observed, and often not directly. Moreover, there is a coupled prediction problem when devices repeat actions over time; to work well together, each device must predict the actions of its neighbours.
• Limited Intelligence: Because there is a potentially large number of devices, financial constraints in manufacturing may become involved, making it infeasible to equip system devices with the best computing technology. Much of the time, only very simple algorithms can be realistically implemented.
• Analytical Assumptions: Game theoretic analysis requires certain assumptions such as common knowledge of rationality [25] to justify the use of an equilibrium in games involving human players. For analysis of automated devices, this translates into a requirement for common knowledge of rationality for the device designers. That is, each device optimizes its own behaviour, based on the assumption that its counterparts operate similarly, and that this fact is common knowledge. In real systems, it may often be difficult to find a model utility function that fits the problem at hand under these assumptions. This is because many systems combine rational behaviour with irrational behaviour in the form of automaton-like stimulus/response mappings. Moreover, if many different types of devices interact, the analyst must provide a rational “player” model for each one, a daunting task for open systems.

These challenges lead to suboptimal performance, or ill-fitting analytical models. Performance issues may be partially overcome by choosing a utility function at each sensor which makes the best use of the information at hand, and by choosing a mechanism which quickly leads to coordinated joint behaviour.
Although the decentralized nature of awareness and action generally keeps optimality out of reach, this is often more than compensated for by the benefits described above. Analytical issues must be overcome by an increase in complexity, by considering multiple game scenarios, combining players, or using hybrid analysis techniques.

1.2 Main Contributions

In this section, we briefly describe the major novel contributions of the chapters comprising this thesis. These contributions fall into two general categories: novel modeling approaches, and novel analytical approaches applied to the game theoretic models defined in each chapter. For the reader’s convenience, the contributions are reviewed below according to the chapters that comprise the body of this thesis. A more detailed description of the contribution of each chapter, and of related work, is given in the individual chapters.

1.2.1 Contributions to Decentralized Force Protection

Chapter 2 introduces the idea of game theory as a tool to analyze the problem of decentralized force protection among Naval platforms. Although coordinated force protection (wherein multiple Naval platforms communicate in real time to cooperatively defend against missile threats) is a known problem, such systems are vulnerable to communications failures, which often intentionally accompany a missile attack. Therefore, there exists a need for solution techniques in scenarios where communication is compromised, which is addressed by the game theoretic approach of this chapter. This allows for consideration of platforms utilizing pre-computed defensive policies for automatic deployment, which can be implemented in the absence of coordination capabilities. Decentralized force protection is a novel area of research, inspired by recent interest in network centric warfare.
The main contributions of this chapter are:

• A novel formulation of the problem of decentralized force protection against anti-ship missiles using the theory of stochastic games, such that game players (countermeasure-equipped ships) are not strictly adversaries, but rather cooperative, yet independent entities.
• Specification of appropriate defensive objectives for the game, either to maximize the minimum distance of approach of a missile, or to minimize the expected number of missile hits, based on a shortest path (transient) stochastic game formulation.
• Augmentation of the state space of the stochastic game to capture the objective of maximizing the minimum distance of approach, along with a method of exploiting the resulting evolution of state trajectories so as to improve efficiency in computing the solution via linear programming techniques.
• A Nash equilibrium solution based on simple yet effective algorithms for solving Markovian games, representing the best available solution when communication between ships is infeasible.
• Examples, with illustration of the sensitivity of the solution to changes in the missile dynamics.

Chapter 3 extends Chapter 2 by considering the case in which Naval platforms are now capable of communicating while responding to missile threats. However, the recently proposed doctrines of Network Centric Warfare [4] and Network Enabled Operations (NEOps) [8] specify that each participating Naval platform be self-empowered, that is, an independent decision maker which may possess its own objective, possibly due to unique local conditions or information at each platform. Since there is currently no encompassing theoretical framework for considering such situations, we propose a game theoretic approach incorporating the concept of a correlated equilibrium, which allows game players to coordinate while still retaining their autonomy.
Moreover, this correlated equilibrium concept is inserted into the stochastic game framework in a more definite manner than previously found in the literature. The main contributions here are:

• The formulation of a decentralized missile deflection problem as a transient stochastic game with communication. Ships are considered to be self-interested decision makers able to coordinate countermeasure deployment against a fixed number of missiles with known dynamics in real time.
• A theoretical foundation for Network Enabled Operations through the definition of a NEOps equilibrium strategy, based on a novel definition of correlated equilibrium in a stochastic game.
• A novel characterization of the correlated equilibrium as the solution to an optimization problem with bilinear constraints.
• An information theoretic analysis of the communication overhead required for implementation of the NEOps equilibrium strategy. Communication overhead is shown to be proportional to the mutual information between the random variables describing player actions in the correlated equilibrium strategy. The mutual information may also be used as a constraint in the optimization above.
• A comparison of communication architectures for implementation of the correlated equilibrium.
• Numerical examples illustrating the NEOps equilibrium strategy for a moderate-scale missile deflection game, and the trade-off between communication and performance.

1.2.2 Contributions to Decentralized Sensor Activation Control

Decentralized operation is particularly important in wireless sensor networks, since central coordination of a large number of sensors is infeasible due to bandwidth and energy constraints. In Chapter 4, we present a game theoretic analysis for decentralized sensor activation control, based on energy and performance characteristics of the ZigBee communications protocol.
The model allows individual sensors to trade off their contribution to end-user awareness with the energy cost of activation. Besides the relative novelty of the model (existing game theoretic treatments in sensor networks tend to focus on activities such as communication and routing), the contributions of this chapter are as follows:

• A computationally efficient (see Section 4.2.3), decentralized, adaptive filtering algorithm is given for activation control. By adapting behaviour to a moving average performance history, the algorithm allows for tracking of the equilibrium as the environment changes (e.g. due to moving targets or sensor outage).
• Analysis of the satisfactory overall performance of the adaptive filtering rule, by showing that it converges to the set of correlated equilibria. Differential inclusions (rather than ordinary differential equations) are employed to analyze the convergence and tracking properties of the algorithm.
• An algorithm that exploits the CSMA structure of ZigBee/IEEE 802.15.4 for computing a maximum likelihood estimate of the number of nearby active sensors, to be taken as input by each adaptive filter. This is critical for estimating sensor interaction for the decentralized algorithm.
• A timer-based pricing mechanism for load-sharing between sensors, which prolongs effective network lifetime by reducing energy consumption. Fairness is enforced by forcing active sensors into a standby mode after a sufficient period.
• Implementation of the decentralized algorithms on a ZigBee test-bed in Matlab to demonstrate performance.

1.2.3 Contributions to Decentralized Dynamic Spectrum Access

Game theoretic analysis is particularly relevant when the decentralized system under consideration consists of components which are truly self-interested, in the sense that they respond to the motivations of independent, self-interested users.
In Chapter 5, we consider such a scenario, with regard to decentralized dynamic spectrum access for a competitive system of cognitive radios. The game theoretic analysis considers cognitive radios which select a number of available channels for transmission, competing via carrier sense multiple access for bandwidth on each one in order to satisfy individual user demands. Besides implementing the decentralized, adaptive filtering algorithm of Chapter 4 in a novel, more complex setting to provide a mechanism for cooperative spectrum access between self-interested users, the chapter provides several contributions, including:

• Formulation of the decentralized channel access game for user rate satisfaction, given multiple shared channels whose availability varies in time due to the activity of primary, licensed users.
• Modeling of a simple CSMA protocol for channel sharing, including a specification of probability of channel access success, a maximum likelihood estimation procedure for determining channel contention based on channel access success history, and a specification of the probability of channel collisions.
• Formulation of a utility function which balances radio performance with a self-incurred social cost for depriving others of bandwidth, based on the quantities specified above and evaluated through the maximum likelihood estimation procedure. The utility function is motivated through analysis of its relationship with an intuitively reasonable measure of global optimality.
• Numerical analysis, including a study of how adjusting various parameters of the system affects overall performance, and a comparison with other game theoretic adaptive schemes.

1.2.4 Contributions to Self-Configuration of Sensors with Global Games

Sophisticated analysis of decentralized operation in sensor networks must account for the fact that individual sensors possess only imperfect information with which to make operational decisions.
A game theoretic analysis can account for such situations by use of either Bayesian games [25], or, more tractably, with global games [17]. Chapter 6 represents, to the best of our knowledge, the first application of global games to a decentralized engineering effort in a field such as wireless sensor networks. It models sensors making activation decisions as game players with noisy information about their environment, which predict the actions of nearby sensors and choose actions with imperfect information. The contributions in this chapter are as follows:

• Using the theory of global games, we motivate and establish conditions under which a simple threshold mode selection strategy provides competitively optimal (Nash equilibrium) behaviour for each sensor. This result holds even when each sensor receives its information in noise. Unlike previous work on global games, we consider the case where both observation noise and utility functions may vary between game players.
• Given a prior probability distribution for sensor information, which is observed in either Uniform or Gaussian noise, we formulate the Nash equilibrium mode selection thresholds as the solution to a set of integral equations. The structure of the solution is examined for limiting low noise conditions, which are particularly important when sensors accumulate and average information over time.
• Equilibrium tests for the threshold strategy are given for both the Uniform and Gaussian noise cases. We also formulate a novel “near equilibrium” condition which provides a simple test for the threshold strategy under low noise conditions.
• A procedure for decentralized computation of the Nash equilibrium threshold mode selection strategy is also given, based on an iterative best strategy response between sensor types.
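The idea of computing thresholds by iterated best response can be illustrated with a minimal sketch. This is a toy model assumed purely for illustration, not the formulation of Chapter 6: two sensors each observe y_i = x + noise with Gaussian noise of standard deviation `sigma`, the prior on x is taken to be diffuse (so that E[x | y_i] = y_i), an active sensor earns x minus a hypothetical congestion cost `cost` if the other sensor is also active, and an inactive sensor earns zero. All function and parameter names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def best_response_threshold(other_threshold, cost, sigma, iters=60):
    """Observation level at which activating and sleeping are equally good.

    Assumes the other sensor activates when its observation exceeds
    other_threshold. Under the diffuse prior, the other observation given
    ours is Gaussian with mean y and variance 2 * sigma**2.
    The expected gain is increasing in y when cost < 2 * sqrt(pi) * sigma,
    so the best response is itself a threshold rule in that regime.
    """
    def expected_gain(y):
        p_other_active = normal_cdf((y - other_threshold) / (sigma * math.sqrt(2.0)))
        return y - cost * p_other_active

    lo, hi = 0.0, cost  # expected_gain(lo) < 0 < expected_gain(hi) for cost > 0
    for _ in range(iters):  # plain bisection on the indifference condition
        mid = 0.5 * (lo + hi)
        if expected_gain(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def iterate_thresholds(cost=1.0, sigma=1.0, start=(0.0, 1.0), rounds=50):
    """Alternate best responses between the two sensors."""
    t1, t2 = start
    for _ in range(rounds):
        t1 = best_response_threshold(t2, cost, sigma)
        t2 = best_response_threshold(t1, cost, sigma)
    return t1, t2


if __name__ == "__main__":
    print(iterate_thresholds())
```

In this symmetric toy case the indifference condition t = cost · Φ(0) gives a common threshold of cost / 2, which the iteration approaches from the asymmetric starting point. Heterogeneous noise levels or utilities, as considered in the thesis, would require a separate best-response map per sensor type.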
1.3 Review of Technology

Game theoretic design and analysis can potentially be applied to a wide variety of technologies, from the Internet (search, peer-to-peer systems, online auctions) to large-scale resource extraction and manufacturing technologies. In this thesis we apply the theory to automated systems enabled by wireless communications for defence, data gathering, and communication. For the reader’s convenience, these specific technologies are discussed below.

1.3.1 Missile Deflection and Network Centric Warfare

Missile deflection and defence technology rely on advanced research in guidance and control, target tracking, and Radar systems. Advanced missile deflection techniques rely on the ability to use one’s own sensors (of which Radar is one) to determine the position and intentions of a threat, and then manipulate the missile’s sensors using physical or electromagnetic interference such as decoys or Radar jammers.

Threat Detection and Mitigation: The problem of threat detection and assessment using Radar systems is the subject of considerable research, employing a variety of disciplines such as optimal filtering, Bayesian networks, and context free grammars. A comprehensive overview of modern Radar tracking systems and threat assessment can be found in [14, 53]; given the extent of this research, this thesis assumes that a reliable system is available for use. Some advanced countermeasure techniques are outlined in [15, 20], but due to the military nature of these applications most detailed results are unavailable in the public domain. For this reason, this thesis takes a general approach to the missile deflection problem, using simplified models and missile dynamics to illustrate the underlying theory. Examples of similar approaches can be found in [18, 28, 39, 52].
Network Centric Warfare: One of the key motivations for a game theoretic treatment of missile deflection is the concept of Network Centric Warfare or Network Enabled Operations [4, 8]. This is not so much a technology as a mode of thinking about military operations as a problem in information space. In terms of missile deflection, this involves the integration of Radars and decision makers through real-time communications to maximize information and hence improve defensive capabilities. On the other hand, it prescribes that attacks should be focused not only on physical assets, but also on communication infrastructure for optimal results. This fact particularly motivates Chapters 2 and 3, which address how much communication infrastructure is necessary for operation of a missile deflection system, and what can be done if it is compromised.

1.3.2 Wireless Sensor Networks

Wireless sensor networks are a recent innovation stemming from advances in electronics miniaturization, environmental sensing technology, wireless communications, and advanced signal processing techniques [3, 26, 61]. The result is large, dispersed systems of inexpensive “sensor nodes” ideal for monitoring environments for which single-point-of-observation technologies fail. Wireless sensor networks have an advantage in that they can translate short-range signals such as ambient acoustic or magnetic activity into radio signals that can be efficiently propagated through multi-hop communication. In-network signal processing also increases efficiency by filtering and aggregating information en route to its destination.

Power Constraints: The flexibility of wireless sensor networks also provides their greatest challenges. Limited power supplies mean sensors must be extremely conservative in their activities if monitoring is to be done over long time periods.
Since wireless communication is an energy-intensive activity, sensors must communicate in a highly efficient manner, only transmitting the most relevant information and using as little overhead as possible.

Challenges to Efficiency: There are several particular challenges faced in the quest for efficient operation of sensor networks. These include collaborative target tracking with multiple sensors, energy-aware medium access control, routing, and querying, as well as sensor synchronization and tasking. Research activity in this area has been very high in recent years, with several conferences, books and journal issues being devoted entirely to the topic. Game theoretic approaches [35, 38, 40, 49, 58, 59] are attracting considerable attention here and have displayed the capacity to provide flexible, effective solutions to many challenges in sensor networks. This thesis builds on these successes by taking a further, in-depth look at the game theoretic approach and extracting key design principles.

1.3.3 Cognitive Radio Dynamic Spectrum Access

Cognitive radio [32, 41, 42, 43] is an emerging technology born out of the potential for flexible, universal wireless communication proposed by software defined radio [19]. The general definition of cognitive radio has become quite broad, encompassing any communication features that can be learned by an intelligent computing system. The fundamental purpose of a cognitive radio remains, however, to provide sufficient “always on” communication in any environment, a feature that is addressed specifically by research into spectrum agility and dynamic spectrum access [2, 50]. There are, in fact, two approaches to the spectrum access problem in cognitive radio, which seeks to assign transmission strategies to cognitive radios so as to minimize interference with each other and with other licensed users. These are the spectrum underlay and spectrum overlay approaches.
It is likely that these will eventually be combined into a flexible, working protocol for dynamic spectrum access that meets industry requirements.

Spectrum Underlay: The spectrum underlay approach involves using extremely low-power signals to avoid interference with neighbouring users, and is epitomized by ultra wideband technology [34, 57]. To ensure this, an interference temperature is defined for each user, and radios must estimate the extent to which their activity appears as noise to other users, and adjust to keep each others’ interference temperature below a threshold.

Spectrum Overlay: Spectrum overlay was introduced in [41] as “spectrum pooling” and investigated by the DARPA XG program [1] under the term “opportunistic spectrum access.” This approach involves opening up some radio frequencies to unlicensed access, under certain restrictions such as fairness and granting of precedence to licensed users. Several approaches have been advocated, including market-based allocation systems, priority assignments, and semi-regulated multiple access schemes.

Competitive Approach: The game theoretic approach to cognitive radio has a slightly different justification than for missile deflection and wireless sensor networks, as the radio environment is more competitive than cooperative. Each cognitive radio has its own goal: to obtain a required amount of bandwidth, and any overarching system goals are more or less irrelevant to this objective. While it is certainly possible to open up radio spectrum for unrestricted competition, radio spectrum is a scarce resource, and could be subject to unfair monopoly by resourceful users in this case. The careful placement of rules, along with associated penalties, is thus necessary for providing a practical cognitive radio environment, and the aim in this thesis is to appropriately mix cooperation and competition to produce an efficient, fair and flexible communication system.
1.4 Contents of this Thesis

The organization of this thesis is depicted in Figure 1.1, and discussed below. The thesis itself is written in a manuscript style as permitted by the Faculty of Graduate Studies at the University of British Columbia, with each chapter representing an independent research effort. Each chapter has its own motivation and literature review and can be read independently. For convenience, all the references are placed in a single bibliography at the end of the thesis. The unifying theme is the use of game theory as an analytical tool to predict the behaviour of systems of autonomously operating system components. The main assumption is that each component can be accurately modelled as a rational player in a game, who maximizes their expected outcome as measured by a locally held utility function, as a response to expectations about the actions of other players.

[Figure 1.1: Organization of Thesis. The diagram groups the thesis into three application areas (force protection against anti-ship missiles, sensor activation control in wireless sensor networks, and dynamic spectrum allocation for competitive cognitive radios) and lists the analytical tools used: game theory, stochastic optimization, information theory, dynamic programming, and stochastic approximation and differential inclusions.]

Networked Missile Deflection: In Chapter 2 a decentralized missile deflection model for Naval defence is analyzed as a stochastic game, with electronic countermeasure platforms filling the role of sensor-actuators. We give conditions for optimality of the equilibrium strategy when platforms are unable to communicate, and propose an optimization problem yielding the best available Nash equilibrium strategy when only offline communication and common observations are available. This Nash equilibrium solution is meant to be precomputed by the platforms, and deployed through a lookup table at each platform to defend against known missile threats. We also discuss implementation issues, such as the problem of communicating situational awareness between platforms.

Chapter 3 considers an extended scenario, in which platforms are capable of real-time coordination of effort, but may have differing objectives. In this case a correlated equilibrium best describes the optimal solution, and the improvement in using the correlated equilibrium is measured against the required communication overhead. Here the defensive solutions are deployed, for example, by a combination of pre-computed lookup tables: one for each ship, and optionally one for a coordinator which communicates with the task group. Practical schemes for implementation through real-time communication are discussed.

Wireless Sensors and Correlated Equilibrium: Chapter 4 applies a static game formulation to the problem of decentralized resource allocation in wireless, ad-hoc networks of energy-constrained sensors. We focus on the problem of activation control, by which sensors can conserve battery resources by entering a sleep mode when they are not needed for critical activities, such as target sensing. In this case, no pre-computed policy is given, but sensors learn an appropriate policy through repeated play.
Specifically, each sensor trades off the expected marginal increase in global utility due to a report with the expected energy required to transmit data using the ZigBee protocol in a crowded environment. Since wireless sensors have limited computational resources, we start with a simple adaptive filtering protocol deployed at each sensor, and analyze its global performance in game theoretic terms. The result is that the sensor network is shown to learn and track the set of correlated equilibria of the underlying game in a completely decentralized manner, and we give a rigorous convergence analysis based on stochastic approximation.

Cognitive Radio and Correlated Equilibrium: Chapter 5 uses similar game theoretic techniques to solve a cooperative, decentralized dynamic spectrum allocation problem for a cognitive radio network. In this case, cognitive radios must select and share wireless communication channels when they are temporarily vacated by primary, licensed users. Each radio’s objective is to derive a channel allocation that satisfies its demand, while minimizing interference with others. A version of the adaptive learning scheme of Chapter 4 is proposed here for channel allocation. Estimation of the level of competition on each channel is critical to this learning scheme, and is also considered in detail.

Wireless Sensors and Global Games: Chapter 6 presents a novel game theoretic approach to the problem of self-configuration of sensors in a wireless sensor network. Decentralized operation and self-configuration in wireless sensor networks are investigated for energy-efficient area monitoring using the theory of global games.
Given a large number of sensors which may operate in either an energy-efficient monitoring mode, or a costly tracking and data transmission mode, the problem of computing and executing a strategy for mode selection is formulated as a global game with diverse utilities and quality of observations. Each sensor measures a noisy version of environmental conditions, and determines whether to enter its “high-resolution” mode based on its expected contribution and energy cost. We use a game theoretic analysis to formulate Nash equilibrium conditions under which a simple threshold policy is competitively optimal for each sensor, and propose a scheme for decentralized threshold computation. The threshold level and its equilibrium properties depend on the prior probability distribution of environmental conditions and the observation noise, and we give conditions for equilibrium of the threshold policy as the noise varies.

Each chapter contains numerical simulations and analysis. Chapter 7 briefly summarizes the results of the thesis and provides concluding remarks.

Bibliography

[1] DARPA: the next generation (XG) program. www.darpa.mil/ato/programs/xg/index.htm.
[2] I. Akyildiz, W. Lee, M. Vuran, and S. Mohanty. Next generation/dynamic spectrum access/cognitive radio wireless networks: A survey. Computer Networks, 50:2127–2159, 2006.
[3] I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci. Wireless sensor networks: A survey. Computer Networks, 38(4):393–422, 2002.
[4] D. Alberts, J. Garstka, and F. Stein. Network centric warfare: developing and leveraging information superiority. CCRP Publications, second edition, 1999.
[5] R. Aumann. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, 1:67–96, 1974.
[6] R. Aumann. Correlated equilibrium as an expression of bayesian rationality. Econometrica, 55(1):1–18, 1987.
[7] R.J. Aumann and S. Hart. Long cheap talk. The Hebrew University of Jerusalem, Center for Rationality, 2002.
[8] S. Babcock. Canadian network enabled operations initiatives. Proceedings of the 9th International Command and Control Research and Technology Symposium, 2004.
[9] I. Barany. Fair distribution protocols or how the players replace fortune. Mathematics of Operations Research, 17(2):327–340, 1992.
[10] Elchanan Ben-Porath. Correlation without mediation: Expanding the set of equilibrium outcomes by “cheap” pre-play procedures. Journal of Economic Theory, 80(1):108–122, 1998.
[11] M. Benaim, J. Hofbauer, and S. Sorin. Stochastic approximations and differential inclusions ii: Applications. Levine’s bibliography, UCLA Department of Economics, May 2005.
[12] J. Benyus. Biomimicry: Innovation Inspired by Nature. Perennial, 2002.
[13] D. Bertsekas. Dynamic Programming and Optimal Control, volume 1 and 2. Athena Scientific, second edition, 2001.
[14] S. Blackman and R. Popoli. Design and analysis of modern tracking systems. Artech House, 1999.
[15] L. Van Brunt, editor. The glossary of electronic warfare. EW Engineering, 1984.
[16] A. Cahn. General procedures leading to correlated equilibria. International Journal of Game Theory, 33(1):21–40, 2004.
[17] H. Carlsson and E. van Damme. Global games and equilibrium selection. Econometrica, 61(5):989–1018, 1993.
[18] J. Cruz and M. Simaan et al. Game-theoretic modeling and control of a military air operation. IEEE Transactions on Aerospace and Electronic Systems, 37(4), 2001.
[19] Dillinger, Madani, and Alonistioti. Software defined radio: architectures, systems, and functions. Wiley, 2003.
[20] H. Eustace, editor. The International Countermeasures Handbook. EW Communications, sixth edition, 1980.
[21] J. Filar and K. Vrieze. Competitive Markov Decision Processes. Springer-Verlag, 1997.
[22] Francoise Forges. Universal mechanisms. Econometrica, 58(6):1341–64, 1990.
[23] Francoise M Forges. An approach to communication equilibria. Econometrica, 54(6):1375–85, 1986.
[24] D. Fudenberg and D. Levine. The theory of learning in games. MIT Press, 1999.
[25] D. Fudenberg and J. Tirole. Game theory. MIT Press, 1991.
[26] A.J. Goldsmith and S.B. Wicker. Design challenges for energy-constrained ad hoc wireless networks. IEEE Wireless Comm., 9(4):8–27, 2002.
[27] Olivier Gossner. Secure protocols or how communication generates correlation. Journal of Economic Theory, 83(1):69–89, 1998.
[28] S. Gutman. On optimal guidance for homing missiles. Journal of Guidance and Control, 1979.
[29] S. Hart and A. Mas-Colell, editors. Cooperation: Game-Theoretic Approaches, pages 199–217. Springer-Verlag, 1997.
[30] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
[31] S. Hart and A. Mas-Colell. A reinforcement procedure leading to correlated equilibrium, in Economic Essays, pages 181–200. Springer, 2001.
[32] S. Haykin. Cognitive radio: Brain-empowered wireless communications. IEEE Journal on Selected Areas in Communications, 23(2):201–220, 2005.
[33] P. Herings and R. Peeters. Stationary equilibria in stochastic games: structure, selection and computation. METEOR Working Paper No. RM/00/31., 2000.
[34] W. Hirt and D. Porcino. Ultra-wideband radio technology: potential and challenges ahead. IEEE Communications Magazine, 41(7):66–74, 2003.
[35] R. Kannan, S. Sarangi, S. S. Iyengar, and Lydia Ray. Sensor-centric quality of routing in sensor networks. IEEE Infocom, 2003.
[36] H.J. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer-Verlag, New York, NY, second edition, 2003.
[37] Ehud Lehrer and Sylvain Sorin. One-shot public mediated talk. Games and Economic Behavior, 20(2):131–148, 1997.
[38] Ning Li and Jennifer Hou. Localized fault-tolerant topology control in wireless ad hoc networks. IEEE Transactions on Parallel and Distributed Computing, 17(4):307–320, 2006.
[39] Y. Lipman and J. Shinar. Mixed-strategy guidance in future ship defense.
Journal of Guidance, Control, and Dynamics, 1996. [40] A.B. MacKenzie and S.B. Wicker. Game theory and the design of self-configuring, adaptive wireless networks. IEEE Communications Magazine, pages 126–131, November 2001. 21 Bibliography [41] J. Mitola. Cognitive radio for flexible mobile multimedia communications. in: Proc. IEEE International Workshop on Mobile Multimedia Communications (MoMuC), pages 3–10, 1999. [42] J. Mitola. Cognitive radio: an integrated agent architecture for software defined radio. Ph.D Thesis, KTH Royal Institute of Technology, 2000. [43] J. Mitola. Cognitive radio for flexible mobile multimedia communications. Mob. Netw. Appl., 6(5):435–441, 2001. [44] J. Nash. Non-cooperative games. Annals of Mathematics, 54(2):286–295, 1951. [45] R. Nau, S. Gomez Canovas, and P. Hansen. On the geometry of nash equilibria and correlated equilibria. International Journal of Game Theory, 32(4):443–453, 2004. [46] M. Osborne and A. Rubenstein, editors. A course in game theory. MIT Press, 1994. [47] M. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994. [48] J. Robinson. An iterative method of solving a game. Annals of Mathematics, 54:298– 301, 1951. [49] C. Saraydar, N. Mandayam, and D.Goodman. Efficient power control via pricing in wireless data networks. IEEE Transactions on Communications, 50(2), 2002. [50] S. Shankar. Spectrum agile radios: utilization and sensing architecture,. in: Proc. IEEE DySPAN, pages 160–169, 2005. [51] L. Shapley. Stochastic games. Proceedings of the National Acadamy of Sciences USA, 39(10), 1953. [52] H. Sherali, Y. Lee, and D. Boyer. Scheduling target illuminators in naval battle-group anti-air warfare. Naval Research Logistics, 42, 1995. [53] M. Skolnik. Introduction to radar systems. McGraw-Hill, 2002. [54] V. Teague. Selecting correlated random actions. Proceedings of Financial Cryptography, 2004. [55] J. Vidal. Fundamentals of multiagent systems: using netLogo models. 
Unpublished, 2006. http://www.multiagent.com/fmas. 22 Bibliography [56] J. von Neumann and O. Morgenstern. Theory of game snad economic behavior. Princeton University Press, 1944. [57] M. Win and R. Scholtz. Impulse radio: how it works. IEEE Communications Letters, 2(3):36–38, 1998. [58] Y. Xing and R. Chandramouli. Distributed discrete power control in wireless data networks using stochastic learning. Proc. CISS, Princeton, 2004. [59] Yiping Xing and R. Chandarmouli. Distributed discrete power control for bursty transmissions over wireless data networks. In in Proc. CISS? 2004. [60] T. Rabin Y. Dodis, S. Halevi. A cryptographic solution to a game theoretic problem. Proceedings of CRYPTO, 2000. [61] F. Zhao and L. Guibas. Wireless sensor networks: an information processing approach. Elsevier, 2004. 23 Chapter 2 Decentralized Force Protection Against Anti-Ship Missiles: Nash Equilibrium Approach 2 2.1 Introduction Decentralized decision making in network centric warfare (NCW) is an area of research that seeks to balance integration and autonomy among military platforms. In [62], NCW is presented as a way to obtain and exploit an information advantage through shared awareness and communication; independent platforms share information and synchronize operations to achieve increased combat effectiveness. However, bandwidth limitations and the distributed nature of battlespace information often leaves platforms unable to fully integrate into a network centric force. This chapter considers such a scenario, in which platforms must make critical decisions independently, even while they work towards a common goal. We focus on the problem of decentralized missile deflection in a Naval setting. This involves the rapid deployment of countermeasures (CM) on an extremely compressed time scale, so is an ideal candidate for decentralization. 
This is true even though missile deflection capabilities could be improved by coordinating CM-equipped ships, since coordination on such a rapid time scale may not be feasible, especially among a large and diverse fleet.

There are three important reasons to consider decentralized decision making in NCW. First, transmitting time-critical, high-resolution battlespace data from all platforms to a central coordinator is bandwidth intensive, and the computational complexity involved in coordinating multiple ships is prohibitive. Second, decentralized architectures reduce vulnerability to attacks in the information domain (jamming or spoofing), since isolated platforms are able to coordinate their actions even without communication. Finally, decentralization ensures robustness, since there is no reliance on communication equipment or on a central coordinator.

(A version of this chapter has been accepted for publication: Maskery, M., O’Regan, C. and Krishnamurthy, V. (2007), Decentralized Algorithms for Netcentric Force Protection against Anti-ship Missiles, IEEE Transactions on Aerospace and Electronic Systems 43:4.)

In this chapter we propose game-theoretic decision making algorithms for decentralized force protection against anti-ship missiles. We consider the following scenario. A task group, communicating over a common data network, faces an attack by a fixed number of missiles. The missiles appear randomly on the perimeter of a fixed physical space over time, and move toward the ships (targets) under known targeting and guidance laws. Each CM platform has hardkill and/or softkill weapons (countermeasures) that can be dynamically employed to deflect or destroy missiles in a statistically predictable way. Each platform observes missile states (e.g. bearing, range, etc.) and deploys countermeasures dynamically and independently, in light of the fact that other ships are doing the same.
During an attack, weapons may act concurrently and must act in a complementary way. This scenario is formulated as a transient stochastic game, for which ships may compute a joint countermeasure (CM) policy that is in Nash equilibrium. This policy is executable in a decentralized fashion, thereby making it suitable for area defence.

The main contributions of this chapter are:
• A novel collaborative game theoretic formulation for force protection against anti-ship missiles.
• Formulation of a stochastic shortest path (transient stochastic) game and its Nash equilibrium solution.
• A Nash equilibrium solution based on state-of-the-art algorithms for solving Markovian games.
• Simulation examples, with illustration of the sensitivity of the solution to changes in the missile dynamics.

The theory of Markov decision processes [64, 81], and its extension to stochastic games [72, 84], is used to formulate and solve the missile deflection problem. For a single platform, Markov decision theory provides the basis for computing optimal, dynamic missile deflection policies. These types of problems are well-studied, particularly in the Operations Research community. Of particular interest are [77, 78], which consider partially observed processes, and derive solutions by considering a fully observed approximate process. This suggests that our approach, which assumes the missile process is observed without error, is a useful one. Stochastic game theory, in turn, prescribes dynamic policies for multiple platforms, with optimality being replaced by the Nash equilibrium concept, which requires that each platform’s policy is optimal given the policies of others.

There is a key difference between our approach and the more standard platform-centric approach considered in several papers in the literature.
In the simplest instance of the platform-centric model, one attacker and one defender play a pursuit-evasion game between defensive and anti-ship missiles [66, 73, 76]. In this scenario, two missiles are viewed as players with clear targets, but no allowance is made for multiple defensive missiles to coordinate their actions. More generally, [70] models two forces (attacking and defending), engaging in a well-defined general sum stochastic game. Each force consists of multiple platforms, completely controlled by two respective commanders, with unlimited communication. However, since each force is considered a single unit from a decision making perspective, a decentralized model is again inapplicable. Finally, [85] considers a target illumination problem very similar to our multiple-ship, multiple-missile deflection scenario and computes an optimal scheduling solution for defeating the missile threat, which is implemented through full coordination.

In contrast, we consider, here and in [79], a stochastic game with multiple platforms, with each platform being an independent decision maker, unable to fully communicate with its allies in real time. That is, ships play a game with each other, instead of with a missile. This perspective naturally allows us to consider decentralized solutions, which may be implemented when full coordination is impossible.

This chapter is organized as follows. In Section 2.2, we motivate the decentralized missile deflection problem and review the theory of stochastic shortest path games. In Section 2.3, we model the missile deflection problem as a stochastic game and describe the various dynamics and objectives of the game. In Section 2.4, we define the Nash equilibrium solution to the problem and propose an iterative solution method useful when all players have identical objectives. Section 2.5 considers the addition of CM constraints to the problem.
In Section 2.6, two simplified examples are given to demonstrate the effectiveness of a Nash equilibrium policy. Finally, Section 2.7 summarizes the chapter. A glossary of variable definitions is provided in the appendix.

2.2 Motivation and Review of Stochastic Games

In this section, we motivate the decentralized missile deflection problem by presenting a decentralized architecture for CM deployment. We then briefly review the theory of stochastic games which will provide the framework for the decentralized scheme.

2.2.1 Decentralized Architecture for Implementing Countermeasures with a Common Air Picture

We begin with an overview of the decentralized architecture, in which CM platforms react independently to common observations of the missile threats. In our system, each CM platform uses local sensors to track missile threats, and shares this information with other platforms through a common database (the Track Cache). Using this common information, platforms deploy CM independently, using a local CM control system. The proposed scheme is outlined in Figure 2.1.

The use of local sensors to track missile threats allows for a significant reduction in communication. Platforms track the threat that they are currently engaging, and periodically store their information in a track cache, which provides common information to all agents. External track information, e.g. from aircraft or sensor networks, is also reported to the track cache. A platform can switch threat engagements rapidly by looking up the required information in the track cache. Communication is limited to track information, and is not present in decision making. This allows each platform to make time-critical decisions without having to wait for consensus or instruction from a centralized controller.
The game theoretic approach allows coordination to emerge even though platforms act independently, since each platform attempts to predict and respond to the others’ actions in an optimal fashion.

For the reader’s convenience, we summarize the chapter in terms of Figure 2.1. The CM array contains the countermeasures described in Section 2.3.3, while the tracking systems are referred to in this section as well as Section 2.2.2. Missile dynamics will be discussed in Section 2.3.2. The main focus of this chapter, however, is on the CM control units. This includes the definition of the problem in game-theoretic terms (Section 2.3), the formulation of objectives (Section 2.3.4), and the computation of Nash equilibrium policies (Section 2.4 and Algorithm 2.4.1).

[Figure 2.1: Overview of the decentralized missile deflection system. Each ship (Ship 1, Ship 2, ..., Ship L) has its own Local Tracking, CM Array and CM Control; the ships and External Tracking are connected through the Track Cache (common air picture). Agents merge local and external tracking information to ensure a common air picture, but otherwise make all countermeasure (CM) decisions locally.]

2.2.2 Motivation: Pulse Sequence Jamming over a Link 16 Network

We motivate our decentralized architecture through an example implementation. Assume a group of three ships implement a centralized CM system using a Link 16 network. Link 16 is a common tactical data link, used by some naval forces for sharing threat observations and coordinating actions. We assume a nominal bandwidth of 100 kbps for the data link. Each platform is targeted by a single missile, and records a threat radar pulse sequence from that missile. The CM mechanism operates by reflecting this pulse sequence after a small intentional delay, so as to confuse the missile targeting system.
This strategy is based on a range gate pulloff CM strategy. The delay value is computed by a central CM system and transmitted to the ships. Assuming incoming missile threats travel toward the task group at a speed of 500 m/sec, we assume threat measurements and CM decisions are made every 0.1 seconds. We also assume that each platform has eight possible delay actions.

The bandwidth requirement for sending commands from a central CM decision system in the above scenario is about 30 bits/sec/user (eight actions require 3 bits per command, and a command is sent 10 times per second). This seemingly small data rate is significant, given that the Link 16 network may be supporting 1000-2000 processes of varying priority in an engagement scenario, but is nevertheless achievable using current and future networks. However, if communication between ships is reduced or eliminated due to fleet damage, jamming of the communication channel, or emission control limitations, or if the central CM control unit itself is damaged, the resulting loss of the centralized CM system will leave the ships isolated and unable to coordinate their CM actions to achieve maximum effectiveness. A decentralized architecture eliminates these vulnerabilities, making it much more robust in a realistic engagement setting.

2.2.3 Review of Stochastic Shortest Path Games

In this section we provide a brief overview of the components of a stochastic shortest path game. This combines the theory of stochastic games [72, 84] with the theory of transient Markov decision processes [64].

Definition of a Stochastic Shortest Path Game

A stochastic shortest path game, derived from stochastic shortest path problems as in [64], is essentially a Markov decision process which reaches a terminal state in an almost surely finite number of steps. The objective is to choose an action policy which guides the game to the terminus with minimum cost (wherein cost is regarded as path length), when each player has independent input into the joint action at each step.
An $L$-player stochastic shortest path game is a 4-tuple $(S, \{U^l, g^l : l = 1, \ldots, L\}, p)$ with the following interpretation. The state space $S$ is a finite collection of states, including a terminal state $\tau$. The action set $U^l$ consists of the choices of player $l$. For each state $i \in S$, define $U^l(i) \subset U^l$ to be the set of actions available to player $l$ in state $i$. Denote the joint action set by $U = U^1 \times U^2 \times \cdots \times U^L$, with typical element $u = (u^1, u^2, \ldots, u^L) = (u^l, u^{-l})$ for any $l \in L$. Player $l$’s one-period cost function is given by $g^l : S \times U \to \mathbb{R}$. The transition probability $p : S \times U \to \Delta(S)$ specifies a probability distribution over next states, given the current state and actions, where $\Delta(S)$ denotes the space of all probability distributions on $S$. We write $p(i, j; u)$ for the probability that the next state is $j$, given that the current state and action are $(i, u)$, and require that $p(\tau, \tau; u) = 1$ for all $u$, and that $p(i, \tau; u) > 0$ for some reasonable pairs $(i, u)$ in a sense to be made more precise below. We also require that the cost $g^l(\tau, u) = 0$ for all $l, u$.

Stationary Policies

Each player in a game implements a policy for choosing actions in each state. As is standard in stochastic shortest path games, we restrict our attention to stationary randomized policies that are independently executable (which are known to be optimal for the problem at hand [63, 64]), as follows:

Definition 2.2.1 Let $\Delta(U^l)$ represent the set of probability distributions over $U^l$. An independently executable stationary randomized policy for platform $l$, denoted $\mu^l$, is a function $\mu^l : S \to \Delta(U^l)$. For state $i$, the probability of deploying CM $u$ is $\mu^l(i, u)$, and each distribution $\mu^l(i, \cdot)$ has support contained in $U^l(i)$.

Since the cost and transition probabilities are a function of joint actions, we also define policy profiles as follows:

Definition 2.2.2 Given a collection of policies $\{\mu^1, \mu^2, \ldots, \mu^L\}$, a policy profile $\mu$ is a function $\mu : S \to (\Delta(U^1) \times \Delta(U^2) \times \cdots \times \Delta(U^L))$. For state $i$, the probability of deploying joint CM $u = (u^1, u^2, \ldots, u^L)$ is
\[ \mu(i, u) = \prod_{l \in L} \mu^l(i, u^l). \]
Also define $\mu^{-l}$ to be the policy profile of all agents but $l$, so that for any $\nu \in \Delta(U^l)$, $(\nu, \mu^{-l})$ denotes a policy profile, with $\mu = (\mu^l, \mu^{-l})$.

Total Expected Costs

Each policy profile $\mu$ induces expected one-period costs and transition probabilities given by:
\[ g^l(i, \mu) = \sum_{u \in U(i)} g^l(i, u)\, \mu(i, u), \quad \text{and} \tag{2.1} \]
\[ p(i, j; \mu) = \sum_{u \in U(i)} p(i, j; u)\, \mu(i, u). \tag{2.2} \]
Let $g^l(\mu)$ denote a (column) vector of the $g^l(i, \mu)$ over all $i \neq \tau$, and let $P(\mu)$ denote a matrix of the $p(i, j; \mu)$ over all $i, j \neq \tau$. Let $\{X_n : n = 1, 2, \ldots\}$ denote the evolution of a transient Markov chain with transition matrix $P(\mu)$. For any $K \subset S$, the expected number of visits to $K$ given initial distribution $\alpha \in \Delta(S)$ is given by the occupation measure [63]:
\[ f_{tc}(\alpha, \mu; K) = \sum_{n=1}^{\infty} \Pr(X_n \in K). \tag{2.3} \]
The total expected cost to Player $l$ under initial distribution $\alpha$ is defined as:
\[ v^l_\alpha(\mu) = \sum_{j \in S} f_{tc}(\alpha, \mu; j)\, g^l(j, \mu). \tag{2.4} \]
For initial distribution $\delta_i$ with unit mass at $i \in S$, we write:
\[ v^l(i, \mu) = \sum_{j \in S} f_{tc}(\delta_i, \mu; j)\, g^l(j, \mu). \tag{2.5} \]
To ensure $v^l$ is finite, we restrict our attention to proper policies $\mu$ such that
\[ \max_{i \in S} \Pr(X_{|S|-1} \neq \tau \mid X_0 = i, \mu) < 1. \tag{2.6} \]
It follows that $P(\mu)$ is substochastic (at least one row sum is less than one) and hence that $I - P(\mu)$ is invertible. Define $v^l(\mu)$ to be a column vector of the $v^l(i, \mu)$ over all $i \neq \tau$. This leads to the following theorem, which gives closed form expressions for the total expected cost under a policy $\mu$:

Theorem 2.2.1 The total expected costs under policy profile $\mu$ can be expressed as:
\[ v^l(\mu) = (I - P(\mu))^{-1} g^l(\mu), \quad \text{and} \tag{2.7} \]
\[ v^l_\alpha(\mu) = \alpha (I - P(\mu))^{-1} g^l(\mu). \tag{2.8} \]
Moreover, the total expected cost $v^l(\mu)$ satisfies the recursion
\[ v^l(\mu) = g^l(\mu) + P(\mu)\, v^l(\mu). \tag{2.9} \]

Proof See Appendix.
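As a rough illustration of (2.1), (2.2), (2.7) and (2.9), the following sketch averages costs and transitions over a product policy and then solves $(I - P(\mu))v = g(\mu)$ for a toy two-player, two-state game. The game data, the function names `induced`, `solve` and `total_expected_cost`, and the choice of plain Gauss-Jordan elimination are all assumptions made for this example; they are not taken from the thesis.

```python
from itertools import product


def induced(g, p, mu):
    """Average costs and transitions over a product policy, as in (2.1)-(2.2).

    g[i][u]     one-period cost of one player in non-terminal state i, joint action u
    p[i][u][j]  probability of moving from i to non-terminal state j under u
                (the missing mass 1 - sum_j p[i][u][j] goes to the terminal state)
    mu[l][i][a] probability that player l plays action a in state i
    """
    n = len(g)
    g_mu = [0.0] * n
    P_mu = [[0.0] * n for _ in range(n)]
    for i in range(n):
        action_sets = [range(len(mu_l[i])) for mu_l in mu]
        for u in product(*action_sets):
            w = 1.0
            for l, a in enumerate(u):
                w *= mu[l][i][a]  # product form of Definition 2.2.2
            g_mu[i] += w * g[i][u]
            for j in range(n):
                P_mu[i][j] += w * p[i][u][j]
    return g_mu, P_mu


def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small system A x = b."""
    n = len(b)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[r][n] / M[r][r] for r in range(n)]


def total_expected_cost(g_mu, P_mu):
    """Closed form (2.7): solve (I - P) v = g instead of inverting explicitly."""
    n = len(g_mu)
    A = [[(1.0 if i == j else 0.0) - P_mu[i][j] for j in range(n)] for i in range(n)]
    return solve(A, g_mu)


# Hypothetical toy game: two non-terminal states, two players, two actions each.
actions = [(a, b) for a in (0, 1) for b in (0, 1)]
g = [{u: 1.0 for u in actions} for _ in range(2)]  # unit cost = path length
p = [
    {u: [0.0, 0.5] for u in actions},  # state 0: to state 1 w.p. 0.5
    {u: [0.0, 0.5 if u[0] == u[1] else 0.0] for u in actions},  # state 1
]
mu = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]  # uniform policies

g_mu, P_mu = induced(g, p, mu)
v = total_expected_cost(g_mu, P_mu)

# Recursion (2.9): v = g + P v should hold up to rounding.
residual = [v[i] - (g_mu[i] + sum(P_mu[i][j] * v[j] for j in range(2))) for i in range(2)]
```

With unit one-period costs, each entry of `v` is the expected number of steps before termination from that state, and `residual` should be zero up to floating-point error. For larger state spaces a library linear solver would be the natural replacement for the hand-written elimination.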
Nash Equilibrium

In the single player case, the theory of Markov decision processes yields an optimal policy that minimizes the total expected cost $v(\mu)$. For multiple players, the equivalent condition is that the policy profile $\mu$ is a Nash equilibrium, defined as follows:

Definition 2.2.3 A policy profile $\mu$ is a Nash equilibrium if, for each $l \in L$, $i \in S$ and every alternative $\nu^l$,
\[ v^l(i, \mu) \leq v^l(i, (\nu^l, \mu^{-l})). \tag{2.10} \]

The Nash equilibrium profile is a collection of independently executable policies that are optimal from each player’s perspective, and hence computable and implementable in a decentralized fashion. Thus, players may operate without explicit coordination, but their behaviour is still “competitively optimal” in the sense that each player’s policy is an optimal response to the policies of the others. In Section 2.3, we place the elements of the tuple $(S, \{U^l, g^l : l = 1, \ldots, L\}, p)$ into the context of the missile deflection problem. Computation of a Nash equilibrium policy profile is discussed in Section 2.4.

2.3 Decentralized Missile Deflection as a Stochastic Shortest Path Game

In this section we carefully construct the components of a stochastic game for the missile deflection problem. The game is described briefly as follows:

1. A multiple missile threat evolves in discrete time on the discrete state space $S$, with the state at time $n = 0, 1, 2, \ldots$ given by $X_n$.

2. A set of CM-equipped platforms $L$ observe the threat process, and deploy countermeasures to protect the task group. At time $n$, platform $l \in L$ deploys countermeasure $U_n^l$ from a given set $U^l(X_n)$, without knowledge of the CM actions chosen by others. This results in a joint countermeasure $U_n = (U_n^1, U_n^2, \ldots, U_n^L)$, which belongs to an appropriate product space $U(X_n)$.

3. Each platform $l \in L$ associates a cost $g^l(i, u)$ with each non-terminal state $i \in S$ and action $u \in U(i)$, and seeks a competitively optimal policy $\mu^l$ in the form of a Nash equilibrium.

4. The missile threat reacts to the joint countermeasure, defining a stochastic process $\{X_n, n = 0, 1, 2, \ldots\}$ on $S$, such that
\[ \Pr(X_{n+1} = j \mid X_n = i, U_n = u) = p(i, j; u). \tag{2.11} \]
These probabilities are assumed known for every $i, j \in S$ and $u \in U(i)$.

We assume that the missile threat process reaches a terminal state $\tau$ in (almost surely) finite time, which corresponds to a termination of the game through missile hits, successful evasions, or a combination of both.

The key assumptions of our model are as follows:
• The theater of operations is modeled as a discrete space. Since a platform obtains its awareness through digital sensors which necessarily quantize threat information, this assumption is, in a sense, natural. Moreover, if sensor measurements are noisy, a coarser discrete state space serves (heuristically) to reduce this noise by “rounding” measurements to their nearest discrete state.
• Missiles move in discrete time. This reflects the fact that known sensor systems return threat observations only at discrete, periodic intervals.
• The task group is considered to be stationary, due to the relative speed difference between missiles and ships, the short engagement time, and the fact that ship maneuvers may be restricted.
• Sensors are assumed to yield accurate information about the missile position and velocity. This is reasonable considering the quality of service of sophisticated radar tracking systems, and the fact that tracking often occurs on a faster time scale than decision making, thus averaging out noise.
• Missiles do not coordinate their actions. This is a reasonable simplifying assumption given the scope of this chapter, as it allows us to avoid complications in the missile dynamics.
It is quite possible that missiles will coordinate their actions in reality, in which case the techniques in this chapter still apply, but the dynamics become more complex.

When sensor measurements are subject to noise, the scenario presented here provides us with an upper performance bound. In addition, one can use the results here to construct a certainty equivalent (suboptimal) controller for partially observed problems [64]. We also refer the reader to the sensitivity analysis of Section 2.6.2, which suggests that measurement errors are not necessarily detrimental to performance. In the remainder of this section, we describe the components $(S, U^l, g^l, p)$ of the stochastic game in detail.

2.3.1 State Space of the Multiple Missile Threat

We begin by describing the state space $S$ of the stochastic missile deflection game. The state space of a single missile $m$ is described first, and is then extended to the multiple missile threat. At time $n$, missile $m$ has state given by:
\[ \begin{pmatrix} Y_n^m \\ Z_n^m \end{pmatrix} = \begin{pmatrix} \text{Position of missile } m \text{ at time } n \\ \text{Target of missile } m \text{ at time } n \end{pmatrix}. \tag{2.12} \]
This state is an element of the space:
\[ S^m = \{B \times T, r, \tau\}. \tag{2.13} \]
Here, either $Y_n^m \in B$ and $Z_n^m \in T$, or the missile is in one of two special states $r, \tau$. The battlespace $B$ represents the physical missile location, while $T$ represents its target. Since the missile target is also a location, we take $T \subset B$. We also define $G \subset T$ to be the set of true target positions occupied by the task group. Finally, the pre-engagement state $r$ corresponds to missile $m$ being a potential but as-yet untracked threat, while the terminal state $\tau$ corresponds to a missile that has reached a target, terminated its flight, or acquired a trajectory that ensures it is no longer a threat. For states $i \in B \times T$, we denote $i$ by $(i_y, i_z)$, where $i_y$ and $i_z$ denote position and target components, respectively.
The space $B$ is taken to be a two-dimensional grid, sufficient for sea-skimming missiles. A 600 × 600 grid can describe the position of a missile over a 30 km range with 50 m precision, suitable for speeds of 500 meters per second and threat measurements made 10 times per second (a missile then advances 50 m, or one grid cell, per measurement).

To extend the threat to multiple missiles, let $M$ denote the number of missile threats in a given scenario, including potential missiles (in state $r$). For ease of notation, $M$ also denotes the set of missiles. The threat state at time $n$ is defined as
\[ X_n = (Y_n^1, Z_n^1, Y_n^2, Z_n^2, \ldots, Y_n^M, Z_n^M). \tag{2.14} \]
This is simply an element of the product state space $S$, defined as
\[ S = S^1 \times S^2 \times \cdots \times S^M. \tag{2.15} \]
For simplicity, we label state $(\tau, \tau, \ldots, \tau) \in S$ as simply $\tau$, corresponding to the termination of all missile threats. A generic state $i \in S$ is written as $(i_y^1, i_z^1, i_y^2, i_z^2, \ldots, i_y^M, i_z^M)$, or simply as $(i^1, i^2, \ldots, i^M)$ when positions and targets do not need to be distinguished.

2.3.2 Dynamics of a Radar-Guided Multiple Missile Threat

The dynamics of the missile process, given by the transition probability function $p$, are described in this section. We assume that each missile evolves independently, so the state $(Y_n^m, Z_n^m)$ as in (2.12) evolves according to a Markov process for each missile $m$, controlled in part by CM actions. The transition probabilities specify the dynamics of the process, which can be separated into two parts: the physical dynamics and the targeting dynamics. These are defined as follows:

Definition 2.3.1 For each joint CM action $U_n$ at time $n$, the physical and targeting transition probabilities of missile $m$ are separately given by
\[ \begin{aligned} p_y^m(i, j; z, u) &= \Pr(Y_{n+1}^m = j \mid Y_n^m = i, Z_n^m = z, U_n = u), \\ p_z^m(i, j; y, u) &= \Pr(Z_{n+1}^m = j \mid Z_n^m = i, Y_n^m = y, U_n = u). \end{aligned} \tag{2.16} \]

These probabilities are assumed to be conditionally independent, hence the overall transition probabilities are given by
\begin{align} p^m(i, j; u) &= \Pr((Y_{n+1}^m, Z_{n+1}^m) = (j_y, j_z) \mid (Y_n^m, Z_n^m) = (i_y, i_z), U_n = u) \tag{2.17} \\ &= p_y^m(i_y, j_y; i_z, u)\, p_z^m(i_z, j_z; i_y, u). \tag{2.18} \end{align}

The transition probabilities are assumed given, and the following natural conditions can be imposed on them, for all controls $u$, and all active states $i, j$:
\[ \begin{aligned} p^m(i, \tau; u) &= 1 \text{ whenever } i_y \in G && \text{(missile terminates upon reaching a target)}, \\ p^m(\tau, \tau; u) &= 1 && \text{(terminated missiles do not recover)}, \\ p^m(r, j; u) &= \beta^m(j) && \text{(activation probabilities are independent of control)}. \end{aligned} \tag{2.19} \]
Here $\beta^m$ is the transition probability distribution out of the pre-engagement state $r$. For example, $1 - \beta^m(r)$ is the probability that a potential missile will become active at the next time index. Three example probabilistic models for the missile processes are given below.

The proper policy assumption (2.6) requires that the missile may reach state $\tau$ from any state with positive probability. This is not too restrictive a condition, and we assume it holds here. It simply states that the missile threat cannot carry on forever without terminating.

Example: Probabilistic model for the physical guidance system. Assume the missile operates according to proportional guidance [65]; the missile moves on an intercept course towards a target by predicting its position at the time of impact. However, because the missile may only make noisy position and velocity estimates of its target, occasional deviations from the proper course may occur. Suppose a missile at $a$ is on a straight-line intercept course that should take it next to point $b(a, z)$ on its way to target position $z$.
A simple model for introducing random guidance errors is to specify:

\[
\begin{aligned}
p_y^m(a, b(a, z); z, u) &= 1 - q, \\
p_y^m(a, b^+(a, z); z, u) &= q/2, \\
p_y^m(a, b^-(a, z); z, u) &= q/2,
\end{aligned} \tag{2.20}
\]

for a given $z$, $u$ and some small error probability $q$, where $b^+$ and $b^-$ are points in $B$ to the immediate left and right of $b$.

Example Probabilistic model for the targeting system. The targeting transition probabilities depend on some programmed decision rules of the missile, but targets can also "drift" if the missile is viewed as a receiver operating in a noisy environment. Suppose the missile has chosen a target at location $z$, and the radar receives information about the target position over a discrete memoryless channel [69]. The channel randomly reports the position as $z'$ when in fact it is $z$ with probability $p_z^m(z, z'; y, u)$, and the missile updates $z$ accordingly. The distortion may or may not vary with the missile position. Effective countermeasures may be modeled as systems in which $p_z^m(z, z; y, u)$ decreases with $u$.

Example Probabilistic model for missile decoys. It is conceivable that some of the missiles in $M$ may be decoys, intended to reduce the effectiveness of CM by forcing agents to spread their resources too thinly. If decoys cannot be discriminated, then agents will be forced to treat every missile as real. However, if decoys are detectable, then agents may remove them from consideration as they are detected, thereby increasing the effectiveness of the defence system. Assume that the dynamics of the missile deflection problem are given by the transition probabilities $p^m(i, j; u)$ for all $m \in M$, $i, j \in S^m$, and $u \in U(i)$, and assume that all agents independently discover missile $m$ to be a decoy with probability

\[ \Pr(\text{missile } m \text{ is detected to be a decoy in state } i \mid \text{missile } m \text{ is a decoy}) = f(i). \tag{2.21} \]

It is assumed that no false detections occur. Suppose missile $m$ has an a priori probability of being a decoy given by

\[ \Pr(\text{missile } m \text{ is a decoy}) = g(m). \tag{2.22} \]

Then, we have the following proposition:

Proposition 2.3.1 Given that decoy detections are independent, and assuming that false decoy detections are negligible, the dynamics of the missile deflection problem can be replaced with:

\[
\begin{aligned}
p^m(i, j; u) &:= p^m(i, j; u)\,(1 - g(m) f(i)), \quad j \neq \tau, \text{ and} \\
p^m(i, \tau; u) &:= 1 - (1 - p^m(i, \tau; u))\,(1 - g(m) f(i)).
\end{aligned} \tag{2.23}
\]

Proof If a missile is revealed to be a decoy, it ceases to be a threat, and hence jumps to state $\tau$. Since no false detections occur, we have

\[ \Pr(\text{missile } m \text{ is detected to be a decoy in state } i) = g(m) f(i). \tag{2.24} \]

The missile transitions to $j$ only if it is not revealed to be a decoy. Hence the transition probabilities are discounted to

\[ \Pr(Y_{t+1}^m = j \mid Y_t^m = i) = p^m(i, j; u)\,(1 - g(m) f(i)). \tag{2.25} \]

The second part of (2.23) is then derived from the requirement that probabilities sum to one. Since $\sum_{j \neq \tau} p^m(i, j; u) = 1 - p^m(i, \tau; u)$,

\[ 1 - \sum_{j \neq \tau} \left[ p^m(i, j; u)\,(1 - g(m) f(i)) \right] = 1 - (1 - p^m(i, \tau; u))\,(1 - g(m) f(i)). \tag{2.26} \]

Once decoy detection is incorporated into the model via the modified transition probabilities (2.23), agents will automatically account for the possibility of decoys in their optimal policies. In essence, agents will adjust their optimal policies to slightly favor states in which detection probability is high, in hopes of driving some missiles to termination.

We assume that each missile evolves independently (they do not communicate). In this case the transition probabilities of the threat process may be written as

\[ p(i, j; u) = \prod_{m=1}^{M} p^m(i^m, j^m; u). \tag{2.27} \]

This formulation allows us to consider a variety of threat situations. Not only does each distinct value of $M$ represent a fundamentally different threat situation, but the potential for mixtures of different missile types leads to a combinatorially large number of situations for each $M$.
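As a rough illustration of how (2.23) and (2.27) fit together, the following Python sketch discounts one row of a single-missile transition matrix for decoy detection and then forms the corresponding row of the joint threat process. The state labels, the numerical values, and the function names `discount_for_decoys` and `joint_row` are hypothetical and chosen only for this example; a fixed control $u$ is assumed throughout.

```python
from itertools import product

TAU = "tau"  # label for the absorbing termination state (illustrative)


def discount_for_decoys(row, g_m, f_i):
    """Apply (2.23) to one row {next_state: prob} of a single missile.

    g_m is the prior decoy probability g(m); f_i is the detection
    probability f(i) in the current state i.
    """
    keep = 1.0 - g_m * f_i  # probability the missile is not revealed as a decoy
    new_row = {j: p * keep for j, p in row.items() if j != TAU}
    new_row[TAU] = 1.0 - (1.0 - row.get(TAU, 0.0)) * keep
    return new_row


def joint_row(rows):
    """Combine independent per-missile rows into a joint row, as in (2.27)."""
    joint = {}
    for combo in product(*(r.items() for r in rows)):
        state = tuple(j for j, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        joint[state] = prob
    return joint


# Hypothetical single-missile row for some state i and control u.
row = {"a": 0.7, "b": 0.2, TAU: 0.1}
decoy_row = discount_for_decoys(row, g_m=0.5, f_i=0.4)

# Two-missile threat: missile 1 may be a decoy, missile 2 is treated as real.
joint = joint_row([decoy_row, row])
```

Both `decoy_row` and `joint` remain probability distributions, which is exactly the requirement used to derive the second part of (2.23).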
In practice, it is recommended that only a small number of threat situations be considered, and that these be chosen to be representative of the types of situations anticipated.

2.3.3 Countermeasure Actions

The action set $U^l$ of the stochastic game is described in this section. The action set of each ship depends on the type and number of CM capabilities available. We assume that the CM actions available for any threat state are enumerated from one to $|U^l(i)|$. The effects of each CM action are specified by the transition probabilities $p$ as in Section 2.3.2. In this section, we consider the CM actions in a qualitative way.

There is a wide range of countermeasures (hardkill and softkill weapons) for deflecting active radar homing anti-ship missiles. Softkill weapons target the missile radar, deceiving or disorienting the missile so as to counteract its targeting system. Chaff, offboard active or passive decoys, and onboard active jammers are some of the examples considered, the details of which can be found in [68, 71, 80]. Hardkill weapons are directed to intercept the missile and actively destroy (kill) it through direct impact or explosive detonation in the proximity of the threat missile. In this section we briefly outline how each of these can be modeled, in a general way, as a CM choice affecting the transition probabilities of the missile process.

Passive Decoys and Chaff

Chaff consists of thin metallic strips launched into the air, which supply false target echoes to the missile radar. Similarly, discrete passive decoys are lightweight structures that float and have a high radar cross section. The decoys are designed with a number of corner reflectors which mimic target radar signatures. In this model, chaff is restricted in the sense that once a ship launches a chaff cloud in a desired location, no further control choices can be made, although the chaff may drift with the wind.
This is a CM constraint, which may be incorporated into the model by the methods of Section 2.5.3. Another model would allow a series of chaff launches while the ship moves away from the resulting blooms. Chaff bloom points are range limited. If a chaff or passive decoy launch is part of the current CM joint action, the transition probabilities $p(i, j; u)$ are affected by increasing the probability that missiles will target the chaff cloud or decoy.

Off-board Active Decoys and Onboard Active Jammers

These countermeasures also affect the missile transition probabilities by providing false targets, but can be dynamically controlled. Off-board active decoys may be placed on moving platforms and contain radio frequency repeaters to radiate a ship-like radar signal. The decoy is directed by control of the platform motion. The decoy moves slowly away from the vessel, with the aim of drawing the attacking anti-ship missile away from its intended target at the same time. The limits on velocity can be modeled by the control constraints discussed in Section 2.5.3.

Active onboard jammers emit electromagnetic radiation designed to confuse a missile radar, so that its target position is lost or effectively moved. For example, cross-polarized and cross-eye jamming cause angle errors. Range gate pulloff acts similarly but may be modeled as causing only range errors instead of angle errors.

Hardkill Measures

Hardkill measures, such as anti-missile rockets and close-in weapon systems, affect a threat missile by driving the missile directly to termination. To model a hardkill measure against missile $m$, the transition probability $p^m(i, \tau; u)$ is set roughly equal to the probability of success of the hardkill measure, in a similar manner to (2.23). To reduce the number of possible control choices, hardkill measures can be assumed to be directed at the most immediate threat.
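The qualitative descriptions above can be made concrete with a small sketch in which each CM action simply selects a modified pair of physical and targeting rows. This is only one possible parameterization, not a model taken from the text: the functions `apply_hardkill` and `apply_chaff`, the parameters `kill_prob` and `lure_prob`, and all state labels and numbers are assumptions introduced for illustration. The hardkill adjustment follows the same algebra as (2.23), with the detection term replaced by an assumed kill probability.

```python
TAU = "tau"  # label for the absorbing termination state (illustrative)


def apply_hardkill(physical_row, kill_prob):
    """Send the missile to TAU with an assumed success probability kill_prob.

    Surviving probability mass is scaled by (1 - kill_prob), by analogy
    with the decoy discount in (2.23).
    """
    survive = 1.0 - kill_prob
    out = {j: p * survive for j, p in physical_row.items() if j != TAU}
    out[TAU] = 1.0 - (1.0 - physical_row.get(TAU, 0.0)) * survive
    return out


def apply_chaff(targeting_row, chaff_cell, lure_prob):
    """Shift an assumed fraction lure_prob of targeting mass onto the chaff cloud."""
    out = {z: p * (1.0 - lure_prob) for z, p in targeting_row.items()}
    out[chaff_cell] = out.get(chaff_cell, 0.0) + lure_prob
    return out


# Hypothetical baseline rows for one missile in one state with no CM deployed.
physical_row = {"closer": 0.9, "same": 0.1}
targeting_row = {"ship": 1.0}

# Each CM action indexes a (physical, targeting) pair of rows.
rows_by_action = {
    "none": (physical_row, targeting_row),
    "chaff": (physical_row, apply_chaff(targeting_row, "cloud", lure_prob=0.3)),
    "hardkill": (apply_hardkill(physical_row, kill_prob=0.6), targeting_row),
}
```

In this sketch a softkill action changes only the targeting row and a hardkill action changes only the physical row; a fuller model could let one action modify both.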
2.3.4 Platform Objectives and State Costs

In this section we formulate two defensive objectives for deploying countermeasures, and associate these with the cost function $g^l$ of the stochastic game. Two special categories of objectives are positional objectives, in which $g^l(i, u)$ is independent of $u$, and collaborative objectives, in which $g^l(i, u)$ is identical for all $l$. For simplicity, we consider only positional objectives, and later we present an algorithm designed for collaborative ones. Collaborative objectives are especially attractive, since agents will not be tempted to save themselves by endangering others. Positional objectives are sufficient when the goal of defending targets outweighs considerations of the types of CM used. For active missiles, we define the following:

Definition 2.3.2 For $l \in L$, define $G(l)$ to be the set of targets defended by platform $l$. We assume all targets are protected by at least one platform, so that

\[ G = \bigcup_{l \in L} G(l). \tag{2.28} \]

The target sets give us a context in which to consider the defensive objectives below.

Minimizing the Probability of Hit

A reasonable defensive goal for platform $l$ is to minimize the probability that the missiles will hit its target set $G(l)$ from any initial state. Consider the cost function

\[ g^l(i) = I(i_y^m \in G(l), \text{ some } m), \tag{2.29} \]

where $I$ is the indicator function. Minimizing (2.29) results in a policy that minimizes the probability of hit. This is established by the following proposition:

Proposition 2.3.2 If $g^l$ is as in (2.29), then the $i$th entry of $(I - P(\mu))^{-1} g^l$ is exactly the probability of reaching $G(l)$ from state $i$ under joint policy $\mu = (\mu^1, \mu^2, \ldots, \mu^L)$.

Proof As in Theorem 2.2.1, the $ij$th entry of $(I - P(\mu))^{-1}$ gives the expected number of visits to state $j$ starting from state $i$ and following joint policy $\mu$.
By (2.19), states $j$ such that $g^l(j) = 1$ are visited at most once, so the expected number of visits to such states is also the probability that they will ever be reached. The matrix multiplication $(I - P(\mu))^{-1} g^l$ sums these probabilities, which correspond to mutually exclusive events, hence the result is simply the probability of reaching $G(l)$ under policy $\mu$.

The quantity $(I - P(\mu))^{-1} g^l$ can be obtained by solving the equation

\[ q_\mu^l = P(\mu)\, q_\mu^l + g^l \tag{2.30} \]

for $q_\mu^l$. If the $G(l)$ are disjoint, summing the probabilities of hit over all agents will give the overall hit probability (the probability that the target set $G$ is reached). Thus, a measure of global performance can be obtained even when the platform objectives are not collaborative, for example as the solution $q_\mu$ to

\[ q_\mu = P(\mu)\, q_\mu + \sum_{l \in L} g^l. \tag{2.31} \]

Multiplication of $q_\mu$ by the initial distribution $\alpha$ gives a real-valued performance measure for a given policy.

Maximizing the Minimum Distance

Another objective is to maximize the minimum distance achieved between missiles and targets over the entire history of the process. Under this objective, agents endeavor to keep each missile at least as far away from its target set as it has been in the past. This is not immediately a Markovian objective; if a missile threatens to enter some state $j$, platform $l$ will react differently depending on the threat history. If missiles have previously come closer than $j$, then the platform will not expend as much effort repelling this particular threat, since it has survived worse threats and likely has more immediate ones.

The practice of reacting differently depending on the threat history requires some explanation. A policy which seeks to prevent missiles from coming closer than they ever have in the past is prone to defend ships less aggressively than a policy which simply seeks to prevent missiles from coming closer than they currently are.
(This second policy is easily implementable without the state augmentation of this section, simply by penalizing a player whenever a missile moves closer than it was in the previous state.) However, a policy which depends on past as well as present states opens up rich possibilities for stochastic games. For example, as a missile attack progresses, the primary objective may not be to keep missiles as far away as possible, but to maintain a previously established safe perimeter, in which maneuvers can be performed. The practice of maximizing the minimum distance can serve to spread defensive efforts among different missile threats, providing a more robust perimeter at the possible cost of a temporary loss of space.

A further justification for maximizing the minimum distance is that there may be a perimeter at which countermeasures are particularly effective; however, this perimeter may be relatively close to the task group. By relaxing the objective to account for historical distances, the task group can "draw in" missiles to this effective range before engaging aggressive countermeasures, which may be a more strategic use of countermeasures than the alternative, non-history-dependent policy.

To account for historical distances, we must augment the state space of the problem. For $i \in S$ and all sets $A \subset B$, where $B$ is the battlespace as in (2.13), define the distance from $i$ to $A$ by $d(i, A)$. A particularly convenient distance function is defined as follows.

Definition 2.3.3 Fix $\varepsilon > 0$. Partition $S$ into $\{D_0^l, D_1^l, \ldots\}$, with $i \in D_k^l$ if and only if $G(l)$ can be reached from $i$ in $k$ steps with some minimal probability $\varepsilon^k$. Define $d(i, G(l))$ to be the minimum number of transitions it takes to reach $G(l)$ from $i$ with this minimal probability. That is,

\[
\begin{aligned}
d(i, G(l)) &= k : i \in D_k^l \\
&= \min_{\mu} \left( k : \Pr(X_k \in G(l) \mid X_0 = i, \mu) > \varepsilon^k \right).
\end{aligned} \tag{2.32}
\]

Note that the partition can be defined recursively, with $D_0^l = \{ i \in S : i_y^m \in G(l), \text{ for some } m \}$, and

\[ D_k^l = \left\{ i \in S : p_y^m(i_y^m, j_y^m; z, u) > \varepsilon, \text{ for some } m, z, u, \text{ some } j \in D_{k-1}^l \right\}. \tag{2.33} \]

If $d(i, G(l)) = \infty$, set $d(i, G(l)) = 0$ instead. Let $D^l = \{0, 1, 2, \ldots, K^l\}$ represent the possible values that $d(i, G(l))$ can take on.

Remark From the proper policy assumption (2.6) in Section 2.2.3, it is always possible to reach $\tau$ in at most $|S|$ steps. It follows that the maximum distance cannot exceed $|S|$, since the case $d(i, G(l)) = \infty$ is excluded. There is no harm in this exclusion, since platform $l$ is safe whenever $d(i, G(l)) = \infty$, so the policy here does not matter.

Remark The distance $d(i, G(l))$ measures the minimum number of steps it would take for one of the missiles in state $i$ to reach either the set $G(l)$ or the terminal state. It depends on the transition probabilities of the missile process, but not on the particular CM sequence used. The distance function defined in this way also has the attractive property of being integer valued.

Define a new stochastic process on the state space

\[ S := S \times D^1 \times \ldots \times D^L. \tag{2.34} \]

We define a new process to track both the missile state $X_n$ and the previously achieved minimum distance to each target set $G(l)$. Writing $i_{D^l}$ for the component of $i$ on $D^l$, the new process evolves such that

\[ j_{D^l} = \min(d(i, G(l)), i_{D^l}) \tag{2.35} \]

for all $i, j$ such that $p(i, j; u) > 0$.

To efficiently formulate the new process, we group agents with the same objective together to form a reduced state space. That is, we define $\bar{L}$ by removing from $L$ any platform $l$ if $G(l) = G(m)$ for some $m < l$.
If we denote the original state space by $\bar{S}$, the corresponding states by $\bar{i}$, and the original transition probabilities in (2.27) by $\bar{p}(\bar{i}, \bar{j}; u)$, then the transition probabilities on the new state space $S$ reflecting the history of minimum distances are given by

\[ p(i, j; u) = \bar{p}(\bar{i}, \bar{j}; u) \prod_{l \in \bar{L}} I\left[ j_{D^l} = \min(d(i, G(l)), i_{D^l}) \right]. \tag{2.36} \]

Here $i_{D^l}$ is interpreted as the historic minimum distance achieved by the process before it reached state $i$, and $d(i, G(l))$ is the current distance of the process in state $i$, so the indicator functions in (2.36) simply update the historic minimum distance to target set $l$. The new stochastic process is non-increasing on each new dimension $D^l$.

To maximize the minimum distance, the costs are defined so that a platform is penalized whenever the missile reaches a new historic minimum distance. A platform will then compute a policy which minimizes these penalties, and hence maximizes the minimum distance. Specifically, we set

\[
g^l(i, u) =
\begin{cases}
\max(0,\, i_{D^l} - d(i, G(l))), & i \neq \tau, \\
0, & i = \tau.
\end{cases} \tag{2.37}
\]

Since the number of states in $S$ increases exponentially with both the number of target sets and the number of missiles, the objective of maximizing the minimum distance is most useful in a purely collaborative game setting, so that (2.34) becomes simply $S = S \times D$.

Remark The distance concept suggests a useful heuristic for constructing candidate optimal policies under multiple missile threats. If a platform is able to resolve the distance from each individual missile to its target set, then a likely good decision is to concentrate countermeasures first on the missile that will be first to arrive. This results in a significant reduction in problem complexity, without significantly degrading performance.

As a final note, we mention that since the extra dimensions are artificial, the distance function can be coarsened, refined, or changed without losing information on the missile dynamics, although the optimal policies will in general be changed.
The resulting trade-off between complexity and performance is a model design decision.

Priority and Sacrifice

In some situations, it may be desirable to draw missiles toward a less valuable or more robust asset (e.g. a frigate), so as to protect a more valuable asset (e.g. an aircraft carrier). One way to assign value to assets is to include them in the target sets $G(l)$ of more or fewer platforms, so that higher valued assets are protected by more platforms. Another way is to assume each platform $l$ assigns value $v^l(i)$ to each asset location $i \in G(l)$ and to use the probability-of-hit objective of Section 2.3.4. Agent $l$ can then use the weighted cost function

\[ g^l(i) = v^l(i)\, I(i_y^m \in G(l), \text{ some } m) \tag{2.38} \]

to produce a policy that defends assets differently in proportion to their value. Hence some assets will be readily sacrificed for the benefit of others.

The situation is more complex when the minimum-distance objective of Section 2.3.4 is considered. In this case, assets may be discarded from the target set $G(l)$ as the minimum distance decreases. However, the location of assets within the task force will then play an important role in how ships are sacrificed.

Regardless of the method used, it should be emphasized that task groups will almost always have assets of differing priorities. Sacrificing some assets to defend others is a valuable tactic in task force protection, and is the inevitable outcome of any priority assignment.

2.4 Computation of Decentralized Missile Deflection Policies

In this section we consider the decentralized computation and execution of a joint CM policy. In particular, we consider policies that are optimal responses to each other, and hence in Nash equilibrium. In this case, the joint policy will also be independently executable, so agents need not communicate during a missile attack.
Section 2.4.3 presents the main algorithm for computing Nash equilibrium policies, while Sections 2.4.1 and 2.4.2 describe the efficient linear programming techniques used to carry out the main algorithm.

2.4.1 Computation of an Optimal Response for a Single Ship

Since a Nash equilibrium is a collection of optimal responses, we first consider the single-platform optimization problem. If $\mu^{-l}$ is held fixed, platform $l$ can compute an optimal stationary policy $\mu^{*l}$ by viewing the problem as a stochastic shortest path Markov decision process. This will specify which CM actions should be used for a given threat state, in order to minimize the total expected cost to the platform. Such problems have been extensively studied, for example in [64, 72, 81], and dynamic programming-based solution methods are well known.

We have already discussed stationary policies and total expected costs in Section 2.3. In this section we summarize known results for computing optimal CM policies in terms of stochastic shortest path problems. We focus on the linear programming solution, but many other methods, such as value iteration, policy iteration, Monte Carlo methods and approximation, can also be used to obtain suitable policies. The optimal response can be obtained via the following proposition.

Proposition 2.4.1 Let $\alpha$ be an initial probability distribution on $S$, such that $\alpha(i) > 0$ for all $i \neq \tau$, and fix $\mu^{-l}$. The optimal response of $l$, $\mu^{*l}$, has costs $v^l(i, (\mu^{*l}, \mu^{-l}))$ given by the solution to

\[
\begin{aligned}
\text{maximize} \quad & \sum_{i \neq \tau} \alpha(i) J^l(i) \\
\text{subject to} \quad & (\forall\, u \in U^l(i),\ i \neq \tau) \\
& J^l(i) \le g^l(i, (u, \mu^{-l})) + \sum_{j \neq \tau} p(i, j; (u, \mu^{-l}))\, J^l(j).
\end{aligned} \tag{2.39}
\]

Furthermore, $\mu^{*l}$ can be obtained by solving the dual linear program for $\pi^l$:

\[
\begin{aligned}
\text{minimize} \quad & \sum_{i \neq \tau,\, u \in U^l(i)} g^l(i, (u, \mu^{-l}))\, \pi^l(i, u) \\
\text{subject to} \quad & (\forall\, j \in S) \\
& \sum_{u \in U^l(j)} \pi^l(j, u) = \sum_{i \neq \tau,\, u \in U^l(i)} p(i, j; (u, \mu^{-l}))\, \pi^l(i, u) + \alpha(j), \\
& \pi^l(j, u) \ge 0 \quad \forall\, u \in U^l(j).
\end{aligned} \tag{2.40}
\]

Setting

\[ \mu^{*l}(i, u_r) = \frac{\pi^{*l}(i, u_r)}{\sum_{u \in U^l(i)} \pi^{*l}(i, u)}, \tag{2.41} \]

where $\pi^{*l}$ solves (2.40), yields the optimal policy.

It is important that the linear programs (2.39) and (2.40) are duals of each other [64, 72, 81]; they essentially solve the same problem from different directions. This is a fundamental observation common to most Markov decision process analysis. The following proof is due to [72].

Proof For any feasible vector $J^l$ of (2.39) and any policy $\nu^l$, we can combine the constraints of (2.39) (since $\nu^l(i, \cdot)$ is a probability distribution) to obtain:

\[ J^l(i) \le g^l(i, (\nu^l, \mu^{-l})) + \sum_{j \neq \tau} p(i, j; (\nu^l, \mu^{-l}))\, J^l(j). \tag{2.42} \]

That is, $J^l \le g^l(\nu^l, \mu^{-l}) + P(\nu^l, \mu^{-l}) J^l$. From Theorem 2.2.1, it follows that the solution $J^{*l}$ of (2.39) satisfies

\[ J^{*l} \le v^l(\nu^l, \mu^{-l}) \quad \text{for all } \nu^l. \tag{2.43} \]

That the linear program (2.40) is dual to (2.39) implies the following complementary slackness condition on the solutions of (2.39) and (2.40) (see [72], Appendix G):

\[ \pi^{*l}(i, u) > 0 \;\Rightarrow\; J^{*l}(i) = g^l(i, (u, \mu^{-l})) + \sum_{j \neq \tau} p(i, j; (u, \mu^{-l}))\, J^{*l}(j). \tag{2.44} \]

Since $\mu^{*l}(i, u) > 0$ if and only if $\pi^{*l}(i, u) > 0$, multiplying (2.44) by $\mu^{*l}(i, u)$ and summing over $u \in U^l(i)$ yields

\[ J^{*l}(i) = g^l(i, (\mu^{*l}, \mu^{-l})) + \sum_{j \neq \tau} p(i, j; (\mu^{*l}, \mu^{-l}))\, J^{*l}(j). \tag{2.45} \]

That is, $J^{*l} = g^l(\mu^{*l}, \mu^{-l}) + P(\mu^{*l}, \mu^{-l}) J^{*l} = v^l(\mu^{*l}, \mu^{-l})$. Therefore, policy $\mu^{*l}$ has cost $J^{*l}$, which is optimal by (2.43). A few details are omitted.

Proposition 2.4.1 presents the primal-dual linear programming method for solving a stochastic shortest path problem. The fundamental concept of duality arises often in optimization, and helps to provide much insight into the problem at hand. Moreover, knowing the solution to a linear programming problem automatically tells us much about the solution to the dual problem. For example, if (2.40) is solved, the solution to (2.39) is readily obtained through (2.45).
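For readers who prefer an executable illustration, the sketch below computes a single platform's optimal response on a toy stochastic shortest path problem. It uses value iteration, which the text mentions as an alternative, rather than the linear programs (2.39) and (2.40), and then checks that the resulting costs satisfy the constraints of (2.39). The three-state model, the action names, and all probabilities are hypothetical; the other platforms' policies are assumed to be already folded into the transition probabilities. With the indicator cost (2.29), the computed cost is the minimal probability of hit.

```python
TAU = "tau"  # absorbing termination state; its cost is zero by convention


def backup(P, g, J, i, u):
    """One-step cost: g(i) plus expected cost-to-go over non-terminal states."""
    return g[i] + sum(p * J[j] for j, p in P[i][u].items() if j != TAU)


def value_iteration(P, g, tol=1e-12, max_iter=10_000):
    """Return optimal costs J and a greedy deterministic policy."""
    J = {i: 0.0 for i in P}
    for _ in range(max_iter):
        new_J = {i: min(backup(P, g, J, i, u) for u in P[i]) for i in P}
        delta = max(abs(new_J[i] - J[i]) for i in P)
        J = new_J
        if delta < tol:
            break
    policy = {i: min(P[i], key=lambda u: backup(P, g, J, i, u)) for i in P}
    return J, policy


# Hypothetical single-missile model: P[state][action] = {next_state: prob}.
P = {
    "far": {
        "wait": {"near": 0.9, TAU: 0.1},
        "jam": {"near": 0.6, TAU: 0.4},
    },
    "near": {
        "wait": {"hit": 0.8, TAU: 0.2},
        "jam": {"hit": 0.5, "far": 0.2, TAU: 0.3},
    },
    "hit": {"any": {TAU: 1.0}},  # missile terminates after reaching a target
}
g = {"far": 0.0, "near": 0.0, "hit": 1.0}  # indicator cost as in (2.29)

J, policy = value_iteration(P, g)
```

Here `J["near"]` converges to $0.5/0.88$ and `J["far"]` to $0.6$ times that value, with jamming preferred in both states. A linear programming solver applied to (2.39) on the same data should return the same costs.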
Note that (2.45) is critical; it shows that $\mu^{*l}$ implements the optimal policy.

In Proposition 2.4.1, $J^{*l}(i)$ gives the total expected cost to player $l$ starting from state $i$ and following the optimal policy. $\pi^{*l}(i, u)$ gives the expected number of times the threat state is $i$ and player $l$ uses control $u$. It follows that $\sum_u \pi^l(i, u)$ represents the expected number of times state $i$ is visited, so the policy $\mu^{*l}$ implements $\pi^{*l}$ by randomizing amongst control choices in an appropriate way.

The objective of (2.40) has a natural interpretation in terms of minimizing the total expected cost. Its constraints ensure feasibility, by requiring that $\pi$ is a stationary occupation measure for the induced Markov chain. Optimization problem (2.39) is less intuitive. The values $J^l(i)$ represent the total expected cost in state $i$, and the constraint reflects the fact that these costs must be minimized by the choice of control. The maximization objective ensures that the $J^l(i)$ do in fact represent the cost associated with some policy, ruling out trivial solutions.

If a single platform has control over all countermeasures, then Proposition 2.4.1 yields an optimal centralized policy. This corresponds to the case where all agents yield their autonomy to a central authority, which dictates actions in real time. As previously mentioned, such a solution is expensive to implement, and leaves the task group vulnerable in case of communication breakdown. Hence the need for decentralized, game-theoretic solutions.

2.4.2 Exploiting Transient Classes in the MDP

When the objective is to maximize the minimum distance, the state space augmentation of Section 2.3.4 introduces special properties into the Markov decision process, which we may take advantage of.
In this section we assume, for simplicity, that each platform shares the common objective of maximizing the minimum distance to a target set $G$, and show how to exploit the existence of transient subsets of $S$ to increase computational efficiency. Transience may also arise in other ways (see Section 2.6.2), in which case a similar exploit may be used. The ideas of this section are based on the following lemma:

Lemma 2.4.1 Suppose the state space $S$ of a transient Markov decision process can be subdivided into $N$ transient classes, $\{S_1, S_2, \ldots, S_N\}$, and one absorbing state $\tau$, such that

\[ p(i, j; u) = 0 \quad \forall\, u \in U(i) \text{ if } i \in S_k,\ j \in S_l, \text{ and } k < l. \tag{2.46} \]

Define $\mathcal{S}_n = \{S_1, S_2, \ldots, S_n, \tau\}$. Then the optimal policy $\mu^*$ for the entire problem is equivalent to the optimal policy $\mu^*_{\mathcal{S}_n}$ obtained by considering the Markov decision process only on $\mathcal{S}_n$.

Equation (2.46) denotes a requirement for a steady one-way dynamic in the Markov process; once the process reaches a set $S_k$, it is impossible to go back to any state that belongs to a "higher" set. Thus, for planning purposes, a decision maker faced with a state in $S_k$ can ignore these higher states. For example, in the state space of Section 2.3.4, if a missile threat has reached a historic minimum distance of 1000 metres to the task group, it is impossible that in the future it will reach a historic minimum distance of 2000 metres, since the historic minimum distance is nonincreasing.

Proof This is simply a consequence of Bellman's principle of optimality [64]. Consider the problem on $S$. We have $v^l(i, \mu^*) \le v^l(i, \mu^*_{\mathcal{S}_n})$ for $i \in S$ by optimality. Similarly, for the subproblem on $\mathcal{S}_n$, define the total expected cost $v^l_{\mathcal{S}_n}(i, \mu)$ (with $\mu$ restricted to $\mathcal{S}_n$ if appropriate). Again by optimality we have $v^l_{\mathcal{S}_n}(i, \mu^*_{\mathcal{S}_n}) \le v^l_{\mathcal{S}_n}(i, \mu^*)$ for $i \in \mathcal{S}_n$.
Now, by Theorem 2.2.1, we have $v^l_{\mathcal{S}_n}(\mu^*) = g^l_{\mathcal{S}_n}(\mu^*) + P_{\mathcal{S}_n}(\mu^*)\, v^l_{\mathcal{S}_n}(\mu^*)$, where $g^l_{\mathcal{S}_n}$ and $P_{\mathcal{S}_n}$ are just $g^l$ and $P$ suitably restricted to $\mathcal{S}_n$. That is, for $i \in \mathcal{S}_n$,

\begin{align}
v^l_{\mathcal{S}_n}(i, \mu^*) &= g^l(i, \mu^*) + \sum_{j \in \mathcal{S}_n} p(i, j; \mu^*)\, v^l_{\mathcal{S}_n}(j, \mu^*) \tag{2.47} \\
&= g^l(i, \mu^*) + \sum_{j \in S} p(i, j; \mu^*)\, v^l_{\mathcal{S}_n}(j, \mu^*), \tag{2.48}
\end{align}

where the last equality follows by hypothesis. By uniqueness of the total expected cost, it follows that $v^l_{\mathcal{S}_n}(i, \mu^*) = v^l(i, \mu^*)$ for all $i \in \mathcal{S}_n$. Repeating the argument for $v^l_{\mathcal{S}_n}(i, \mu^*_{\mathcal{S}_n})$, we have $v^l(i, \mu^*_{\mathcal{S}_n}) \le v^l(i, \mu^*)$ for all $i \in \mathcal{S}_n$. Therefore $v^l(i, \mu^*_{\mathcal{S}_n}) = v^l(i, \mu^*)$ on $\mathcal{S}_n$. Since two policies are equivalent if they have the same cost, the proof is completed.

Since the previous minimum distance of the missile process of Section 2.3.4 is nonincreasing, each set $\{s \in S : d(s, G) = k\}$ constitutes a transient class for the process, and each set $\{s \in S : d(s, G) < k\}$ likewise constitutes an absorbing class. Lemma 2.4.1 suggests a recursive method for computing an optimal policy when faced with these transient classes. First, we find an optimal policy over states such that $d(i, G) = 1$. Then we consider the problem over states such that $d(i, G) \le 2$; since the optimal policy and costs are fixed for some states by Lemma 2.4.1, we may simply substitute these values into the new optimization problem. Continuing in this fashion we compute optimal policies for all states up to the maximum possible distance $d(i, G) = N$. This is summarized in the following corollary:

Corollary 2.4.1 Suppose the Markov decision process for a single platform in the missile deflection problem can be divided into transient classes as in Lemma 2.4.1. Then the optimal policy can be computed via the following algorithmic sequence of linear programs:

• Initialize $n = 1$.

• For all $i \in S_n$, set
\[ g^{*l}(i, u) = g^l(i, u) + \sum_{m < n} \sum_{j \in S_m} p(i, j; (u, \mu^{-l}))\, v^{*l}(j). \]

• Solve the linear program:
\[
\begin{aligned}
\text{minimize} \quad & \sum_{i \in S_n,\, u \in U^l(i)} g^{*l}(i, (u, \mu^{-l}))\, \pi^l(i, u) \\
\text{subject to} \quad & (\forall\, j \in S_n) \\
& \sum_{u \in U^l(j)} \pi^l(j, u) = \sum_{i \in S_n,\, u \in U^l(i)} p(i, j; (u, \mu^{-l}))\, \pi^l(i, u) + \alpha(j), \\
& \pi^l(j, u) \ge 0 \quad \forall\, u \in U^l(j).
\end{aligned} \tag{2.49}
\]

• For $j \in S_n$, compute $\mu^{*l}(j, u)$ as in (2.41), and set $v^{*l}(j) = v^l(\mu^{*l}, \mu^{-l})$ as the solution to (2.45).

• Increment $n$ and continue until $n = N + 1$.

Proof For $n = 1$, (2.49) directly gives $\pi^{*l}$ on $S_1$ by Lemma 2.4.1 and (2.40), since $S_1 = \mathcal{S}_1$. Thus (2.41) and (2.45) also yield optimal $\mu^{*l}(j, u)$ and $v^{*l}(j)$, respectively. For $n > 1$, assume that we have the optimal costs $v^{*l}(j)$ for $j \in \mathcal{S}_{n-1}$. Consider (2.39) restricted to $\mathcal{S}_n$. Since Lemma 2.4.1 fixes $J^{*l}(j) = v^{*l}(j)$ for $j \in \mathcal{S}_{n-1}$, we may rewrite this as:

\[
\begin{aligned}
\text{maximize} \quad & \sum_{i \in S_n} \alpha(i) J^l(i) + \sum_{i \in \mathcal{S}_{n-1}} \alpha(i)\, v^{*l}(i) \\
\text{subject to} \quad & (\forall\, u \in U^l(i)) \\
& J^l(i) \le g^l(i, (u, \mu^{-l})) + \sum_{m < n} \sum_{j \in S_m} p(i, j; (u, \mu^{-l}))\, v^{*l}(j) + \sum_{j \in S_n} p(i, j; (u, \mu^{-l}))\, J^l(j), \quad i \in S_n, \\
& v^{*l}(i) \le g^l(i, (u, \mu^{-l})) + \sum_{m < n} \sum_{j \in S_m} p(i, j; (u, \mu^{-l}))\, v^{*l}(j), \quad i \in \mathcal{S}_{n-1},\ i \neq \tau.
\end{aligned} \tag{2.50}
\]

Note that we have set the appropriate $p(i, j; u) = 0$ in the second constraint; it is then satisfied (with equality) by definition, so it may be dropped. In addition, the second sum in the objective is constant, so it may also be dropped. Substituting $g^{*l}(\cdot)$ into the first constraint gives us a linear program exclusively on $S_n$, the solution of which gives the optimal costs on $S_n$. Now, it can be shown that (2.49) is dual to (2.50), once the above omissions and substitutions are made. We may now repeat the argument in Proposition 2.4.1: the complementary slackness condition can be used to show that $\mu^{*l}$, obtained from (2.49), implements a policy with costs $J^{*l}$ obtained from (2.50). Thus (2.49) computes the optimal policy on $S_n$. The details are left for the reader.

Corollary 2.4.1 offers a significant reduction in computational complexity.
First, note that obtaining $v^{*l}(j)$ via (2.45) at each iteration may not be necessary, since many primal-dual methods for linear programming automatically produce the $v^{*l}(j)$ as a by-product of calculation. Next, suppose each $S_n$ is equal in size, and write $|S|$ for this common size. Then instead of solving one linear program with $N|S| \times U$ decision variables, we solve $N$ linear programs with $|S| \times U$ decision variables. Under the best known linear programming algorithms, the number of calculations is thus reduced from order $(N|S| \times U)^{3.5}$ to order $N \times (|S| \times U)^{3.5}$, a computational savings of order $N^{2.5}$. This is important for the objective of Section 2.3.4, since we initially increased the size of the state space dramatically. The results here indicate that this increase is manageable; the complexity only increases linearly with the number of different distances considered.

2.4.3 Nash Equilibrium Solution for Multiple Ships

The computation above relies on the assumption that $\mu^{-l}$ is known and held fixed, but when all agents are simultaneously attempting to perform optimally, the situation is much more complex. In this case a reasonable option is to find and execute stationary policies which are in Nash equilibrium. Recall the Nash equilibrium from Definition 2.2.3. Using Theorem 2.2.1, we rewrite this as follows:

Definition 2.4.1 A Nash equilibrium in stationary policies is a joint policy $\mu$ such that for each player $l$ and all $\nu \in \Delta(U^l)$,

\[ (I - P(\mu))^{-1} g^l(\mu) \le (I - P((\nu, \mu^{-l})))^{-1} g^l((\nu, \mu^{-l})). \tag{2.51} \]

That is, $\mu$ is a collection of individual policies $\{\mu^l, l \in L\}$ that are optimal responses to each other. Such Nash equilibrium policies always exist (see [72]), and are locally optimal but generally inferior to centralized (Pareto optimal) policies because they suffer from coordination issues, essentially due to their restriction to the space of product distributions.
However, this restriction is beneficial insofar as it requires no communication during the execution of the policy, and effectively distributes computation among all agents.

Nash policies are based on the notion that each platform’s actions are a “best response” to the others’ actions, so computation is naturally viewed as an iterative process, with candidate policies being improved upon by one or more players at each iteration until a fixed equilibrium point is reached. Indeed, iteration is easily seen to lead to equilibrium in the purely collaborative case, as the following proposition shows.

Proposition 2.4.2 Iterated sequential best reply leads to Nash equilibria in collaborative games.

Proof Since the expected costs coincide for each platform, and since each iteration necessarily reduces these costs (which are bounded below by zero), Monotone Convergence immediately yields the result.

Proposition 2.4.2 suggests the following algorithm for calculating a Nash equilibrium in a purely collaborative game.

Algorithm 2.4.1 Agents adopt initial policies $\mu_0^l$ and $i$ is initialized to zero. Repeat the following:
1. For $l = 1, \ldots, |L|$ repeat:
2. Agent $l$ computes a stationary policy $\mu_{i+1}^l$ via optimization problem (2.40), assuming the others use policies $(\mu_{i+1}^1, \ldots, \mu_{i+1}^{l-1}, \mu_i^{l+1}, \ldots, \mu_i^{|L|})$.
3. If $l = |L|$, check whether the resultant joint policy is a Nash equilibrium. If it is not, reset $l$, increment $i$, and continue.

Although a Nash equilibrium policy does not require communication during implementation, its computation does require that players communicate before an attack situation, as in Algorithm 2.4.1. This offline communication is unavoidable in this type of game, but since a task group is not so tightly bound by time and communication constraints before a missile attack, such communication is feasible.
A new Nash equilibrium needs to be negotiated only when the task group configuration, threat type, or defensive capabilities change.

That said, we observe that the efficiency of Algorithm 2.4.1 is rather poor, since players must continuously communicate their latest policies, and each player spends most of its time waiting for the others to finish their computations. To increase efficiency, players can initially compute best responses in parallel until a suitably good policy is found, so that Algorithm 2.4.1 is initialized close to equilibrium and fewer serial iterations are required.

In the non-collaborative case, the best reply dynamic is not guaranteed to converge to a Nash equilibrium, but may still yield acceptable results. Guaranteed methods for computing Nash equilibria in general sum stochastic games are still being developed. For example, a stochastic tracing procedure using homotopy theory has recently been developed in [74], while new nonlinear programming methods have been put forward in [83]. More established nonlinear programming methods, such as those in [75] and [67], are also useful, although not guaranteed to converge.

2.4.4 Pareto Optimality and Nash Equilibrium

The Nash equilibrium policy is optimal from each individual perspective, but what can be said about optimality from the collective point of view? The accepted measure of collective performance is Pareto optimality, which requires us to define joint policies as follows:

Definition 2.4.2 Let U(i) = U1(i) × U2(i) × . . . × UL(i) be the vector space of possible joint actions given threat state i, and let ∆(U(i)) represent the set of probability distributions over U(i). A joint policy µ is a function µ(i) : i ∈ S → ∆(U(i)); for each i ∈ S, the probability of deploying joint CM vector u is µ(i, u).

Definitions 2.4.2 and 2.2.1 are very similar.
Moreover, policy profiles, as in Definition 2.2.2, are a subset of joint policies; they are simply joint policies restricted to a product space. The concept of Pareto optimality, in general, can be stated as follows:

Definition 2.4.3 A joint policy µ ∈ ∆(U) is Pareto optimal if there is no alternative joint policy ν that improves the utility of one individual without decreasing the utility of another. That is, if:

\[
v^l(i,\mu) \le v^l(i,\nu) \tag{2.52}
\]

for all i ∈ S, l ∈ L and all ν ∈ ∆(U).

A Nash equilibrium need not be Pareto optimal. In fact, in the presence of constraints, the Pareto optimal policy may not even belong to the set of policy profiles, so that no independently executable policies can implement it, simply for coordination reasons. (See the example in Section 2.6.1.) Thus, Pareto optimality may be out of reach of the class of independently executable policies. We are thus led to consider the weaker property of Pareto optimality restricted to independently executable policies, as follows:

Definition 2.4.4 A joint policy µ ∈ ∆(U1) × ∆(U2) × . . . × ∆(Ul) is Pareto optimal among independently executable policies if there is no alternative independently executable policy ν that improves the utility of one individual without decreasing the utility of another. That is, if:

\[
v^l(i,\mu) \le v^l(i,\nu) \tag{2.53}
\]

for all i ∈ S, l ∈ L and all ν ∈ ∆(U1) × ∆(U2) × . . . × ∆(Ul).

This definition leads us to the following proposition:

Proposition 2.4.3 In the collaborative game case, at least one Nash equilibrium is Pareto optimal among independently executable policies.

Proof Since all player objectives are the same, Pareto optimality is equivalent to the condition that no policy exists that improves the common utility. Consider the independently executable policy with the highest utility (such a policy exists since the utility is bounded).
This policy is Pareto optimal among independently executable policies by definition. Also, since the utility is maximal, no unilateral deviation can improve any player’s utility, hence the policy is also a Nash equilibrium.

When the game is not collaborative, a pricing scheme may be useful to promote Pareto optimality. Essentially, another phase would be added to the problem in which agents would choose which targets in G to include in their target sets G(l). The targets would represent assets which would be traded on an open market by the agents. Once the targets were set, a Nash equilibrium would be computed. This process would be continued with further rounds of bidding and policy computation. It is well known that such free market schemes lead to Pareto optimality. However, this is beyond the scope of the current chapter, and is relegated to future work.

2.5 Introducing CM Limitations via Constraints

So far, we have formulated the missile deflection problem under just one simple constraint type; i.e., that l’s CM choices are restricted by the threat state to U l (i), which may be a subset of all available CM choices. This may be used to keep ships from interfering with each other by violating doctrinal rules (e.g. deploying CMs in undesired directions), but cannot handle dynamic constraints. For example, we may wish to model limitations on the number of countermeasures available, or limitations on allowable sequences of countermeasures. The purpose of this section is to outline methods for adding such constraints.

We outline three such approaches. The first two approaches yield an optimal policy when a limit is placed on the number of times each type of CM may be used, while the third approach yields an optimal policy when a limit is placed on how CM choices may be used in sequence.
Among these, the first approach yields a policy in which the actual number of uses of each CM type is bounded, while the second approach yields a policy in which only the average number of uses is bounded. Despite the weaker nature of the second approach, it is still preferred since the state space is more manageable. All three approaches can be combined as desired for use in the game-theoretic algorithms.

2.5.1 Limiting CM availability via State Space Augmentation

For each platform $l$, partition $\bigcup_{i \in S} U^l(i)$ into $\{u_k^l : k = 1, 2, \ldots, I(l)\}$. Each $u_k^l$ corresponds to a certain type of CM available to platform $l$, with set limit $\eta(u_k^l)$ on the number of times it may be deployed. To reflect the depletion of countermeasures, we perform a state space augmentation similar to (2.34) in Section 2.3.4. As in that section, define a new stochastic process on the state space

\[
S := S \times D^1 \times \ldots \times D^L, \tag{2.54}
\]

where each $D^l$ is a vector-valued set

\[
D^l = \{ d \in \mathbb{Z}^{I(l)} : 0 \le d \le (\eta(u_1^l), \ldots, \eta(u_{I(l)}^l)) \}. \tag{2.55}
\]

The $k$th component of $D^l$ represents the number of CMs of type $k$ left to platform $l$. Denote the original state space by $\bar{S}$, the corresponding states by $\bar{i}$, and the original transition probabilities in (2.27) by $\bar{p}_{\bar{i}\bar{j}}(u)$. The transition probabilities on the new state space $S$ are:

\[
p(i,j;u) = \bar{p}(\bar{i},\bar{j},u) \prod_{l \in L} I\left( u^l \in u_k^l \text{ and } j_{D^l} = i_{D^l} - e_k \right), \tag{2.56}
\]

where $i_{D^l}$ represents the portion of $i$ restricted to $D^l$, and $e_k$ is an appropriate unit vector. The quantity $i_{D^l}$ gives the number of countermeasures of each type left to player $l$ in state $i$, so the indicators in (2.56) pick out the state $j$ that represents the fact that the appropriate CM reserve of Player $l$ is decreased by one. The stochastic process so defined is non-increasing in each new dimension. We also redefine the allowable CM choices by setting

\[
U^l(i) := U^l(\bar{i}) \cap \bigcup_{k \in K(i)} u_k^l, \tag{2.57}
\]

where

\[
K(i) = \{ k \in \mathbb{Z} : k\text{th component of } i_{D^l} > 0 \}. \tag{2.58}
\]

That is, CM choices are restricted to those that have not yet run out. Since the optimal policy follows the principles of dynamic programming, Optimization Problem (2.40) will automatically account for the limited supply of countermeasures when applied to the augmented problem. That is, $\mu^l$ will yield an optimal joint or best response policy that minimizes the cost while respecting the CM limitations. Furthermore, the newly defined state space can clearly be subdivided into transient classes, in the same manner as the augmented process incorporating the previous minimum distance was in Section 2.4.2. Hence there exists a sequence of linear programs that yields the optimal solution to this problem.

Remark As the limits $\eta(u_i^l)$ increase, we expect the constrained optimal policy to approach the unconstrained policy in a smooth manner. The size of the augmented state space also increases, however, so that the problem becomes more complex. If the optimal policy does indeed change smoothly, we may approximate the optimal policies for various values of the $\eta(u_i^l)$ by first computing both the unconstrained optimal policy and the constrained optimal policy for all $\eta(u_i^l)$ small (1 or 2 perhaps), and then using some weighted mixture of these two optimal policies for all other values of the $\eta(u_i^l)$. A complete investigation of this method is beyond the scope of this chapter.

2.5.2 Limiting CM Directly in the Linear Program

In this section, we place an upper bound on the expected number of uses of different CM types, instead of the actual number of uses. Define $u_i^l$ as in Section 2.5.1, and place a limit $\eta(u_i^l)$ on the expected number of times each CM type is used. We begin with the following lemma.

Lemma 2.5.1 In the context of Optimization Problem (2.40),

\[
\sum_{u^l \in u_i^l} \sum_{j \in S} \pi(j, u^l)
\]

is equal to the expected number of times platform $l$ will use CM type $i$.
This follows given the interpretation of $\pi(i,u)$ as the expected number of times CM $u$ will be used in state $i$. However, we give a formal proof of this assertion.

Proof We drop the superscripts separating platforms for brevity. For a fixed CM $u$, define row vectors $\alpha$, $\pi_u$, and $\mu_u$ over $S$ with $i$th entry $\alpha(i)$, $\pi(i,u)$, and $\mu(i,u)$, respectively. Let $x^T$ represent the transpose of $x$. Let $A_{\mu_u} = \mathrm{diag}(\mu_u)$ be a diagonal matrix with entries $A_{\mu_u}(i,i) = \mu(i,u)$; the inverse has entries $A_{\mu_u}^{-1}(i,i) = 1/\mu(i,u)$. (Actually this is a formalization, since $\mu(i,u)$ may be zero, but the argument holds if we set $0/0 = 0$.) From (2.41), the definition of $\mu$, we have that $\pi(i,u)/\mu(i,u) = \sum_z \pi(i,z)$ for all $i, u$, or,

\[
\pi_u A_{\mu_u}^{-1} = \sum_z \pi_z, \quad \text{for all } u. \tag{2.59}
\]

Letting $P_u$ represent the transition matrix under control $u$, we may write the constraint of (2.40) as

\[
\begin{aligned}
\alpha &= \sum_u \pi_u (I - P_u) \\
&= \sum_u \pi_u A_{\mu_u}^{-1} A_{\mu_u} (I - P_u) \\
&= \Big( \sum_z \pi_z \Big) \sum_u \left( A_{\mu_u} - A_{\mu_u} P_u \right) \\
&= \Big( \sum_z \pi_z \Big) (I - P(\mu)),
\end{aligned}
\tag{2.60}
\]

where the last equality follows since $\sum_u A_{\mu_u}(i,i) = \sum_u \mu(i,u) = 1$, and $\sum_u A_{\mu_u}(i,:) P_u(:,j) = \sum_u \mu(i,u) p(i,j;u) = p(i,j;\mu)$. Now, we may find the desired expectation,

\[
\begin{aligned}
E[\text{times } u_r \text{ is used}] &= \sum_j f_{tc}(\alpha.\mu; j)\, \mu(j,u_r) \\
&= \alpha (I - P(\mu))^{-1} \mu_{u_r}^T \\
&= \Big( \sum_z \pi_z \Big) (I - P(\mu)) (I - P(\mu))^{-1} \mu_{u_r}^T \\
&= \sum_z \sum_i \pi(i,z)\, \mu(i,u_r) \\
&= \sum_z \sum_i \pi(i,z)\, \frac{\pi(i,u_r)}{\sum_{z'} \pi(i,z')} \\
&= \sum_i \pi(i,u_r).
\end{aligned}
\tag{2.61}
\]

Summing over all controls $u_r$ in a given class $u_i^l$ gives the result.

In light of Lemma 2.5.1, adding the constraints

\[
\sum_{u^l \in u_i^l,\, j \in S} \pi(j,u^l) < \eta(u_i^l), \quad \forall\, l \in L,\ i = 1, \ldots, I(l) \tag{2.62}
\]

to Optimization Problem (2.40) should yield an optimal policy that restricts the average use of each CM type appropriately.

2.5.3 Constraining CM Sequences

The final constraint addresses the possibility that the allowable CM choices at time $k + 1$ may be limited by the choice at time $k$.
Such a constraint may arise due to mechanical reasons, for example if the countermeasures are decoys which can only move at a limited rate, or if the countermeasure consists of a one-time “launch” command. The constraint may also be used to limit the rate of change in false target positions in light of the fact that a missile would most likely recognize any erratically moving target as false, thus negating its effectiveness.

We include this constraint type by incorporating the current CM action as part of the state space. That is, redefine S := S × U, so that state i ∈ S reflects not only the missile state, but also the current joint CM action in U(i). We then redefine each set U l (i) to be the new, restricted set of actions available to platform l given that the current action is incorporated into state i. Once these spaces are defined, the optimal solution is computed exactly as in the unconstrained case.

Example (Chaff.) For each possible chaff launch, the control choices can be thought of as belonging to the set

\[
\{\text{Hold}, \text{Launch}, \text{Null}\}. \tag{2.63}
\]

For any state $i = (i_x, i_z, i_u)$, where $i_u$ is the control choice in state $i$, simply set

\[
U^l(i) =
\begin{cases}
\{\text{Hold}, \text{Launch}\}, & i_u = \text{Hold}, \\
\{\text{Null}\}, & i_u \in \{\text{Launch}, \text{Null}\}.
\end{cases}
\tag{2.64}
\]

Running the optimization algorithm over the new state space S := S × U will yield a joint policy in which agents are aware of and restricted to one-time “launch” controls for chaff.

Example (Decoy velocity constraint.) For any state $i = (i_s, i_u)$, where $i_u$ is the control in state $i$ specifying the physical position of a decoy, define $d(a,b)$ to be the distance between two decoy positions and set

\[
U^l(i) = \{ j_u : d(i_u, j_u) \le d_{\max} \}. \tag{2.65}
\]

That is, future decoy positions are restricted to be sufficiently close to the present decoy position.
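The two restricted action sets (2.64) and (2.65) can be sketched in a few lines. This is an illustrative sketch only: the function names `chaff_actions` and `decoy_actions` are hypothetical, and the distance d(a, b) is assumed here to be Euclidean, which the text does not specify.

```python
import math

HOLD, LAUNCH, NULL = "Hold", "Launch", "Null"


def chaff_actions(previous):
    """Chaff controls allowed when `previous` is the control stored in the
    state, following (2.64): chaff may be launched only once."""
    if previous == HOLD:
        return {HOLD, LAUNCH}
    # After Launch (or Null) the only remaining choice is Null.
    return {NULL}


def decoy_actions(current, candidates, d_max):
    """Decoy positions allowed next, following (2.65).

    Assumption: d(a, b) is the Euclidean distance between grid points.
    """
    return [p for p in candidates if math.dist(current, p) <= d_max]


print(chaff_actions(HOLD))
print(chaff_actions(LAUNCH))
print(decoy_actions((0, 0), [(0, 0), (1, 0), (1, 1), (2, 0)], d_max=1.0))
```

With these helpers, the augmented state carries the previous control, and the optimizer is simply handed a smaller action set in each state; nothing else in the solution procedure changes.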
Running the optimization algorithm over the new state space S := S × U will again yield a joint policy in which agents are aware of and restricted to rate-limited changes in decoy positions.

The cost of such a formulation is an exponential increase in the size of the state space. However, this is at least partially offset by an increased sparsity in the transition matrices, since many state-control combinations will never occur. In addition, the number of possible controls in each state is reduced substantially in the new formulation, reducing the computational complexity involved in calculating an optimal policy significantly.

2.6 Numerical Results

In this section we investigate two simple but illustrative missile deflection examples. In the first example, two ships jointly influence the motion of a randomly moving missile by presenting a single false target. In the second, three ships jointly influence the targeting system of a missile by slowly introducing angle errors through electromagnetic jamming.

2.6.1 Maximizing Minimum Approach of a Missile with Simple Targeting

In the first example, two stationary ships are threatened by a single missile moving on an 11 × 11 battlespace $B = \{(x,y) \in \mathbb{Z}^2 : -5 \le x, y \le 5\}$. The two ships are located at (0, 0) and (0, −2), with G = G(1) = G(2) = {(0, 0), (0, −2)}. We seek a Nash equilibrium policy with the objective of Section 2.3.4; maximizing the minimum distance from the missile to the pair of ships. The CM capabilities involve the joint control of a single decoy. We set $U^1 = U^2 = \{(1,0), (-1,0), (0,1), (0,-1)\}$, indicating the x−y displacement of the decoy from the origin by each ship. The joint control is a superposition; $U_n = U_n^1 + U_n^2$.

Figure 2.2: Missile transition probabilities $p_y^m(i,j;z,u)$ relative to target position z. (Labels: 0.7 in the direction of the target; 0.1 for each of the three remaining transitions.)
The missile may target either the decoy or Ship One, hence $T = \{(x,y) \in \mathbb{Z}^2 : |x| + |y| \le 2\}$. Define $f(i,z)$ to be the next grid point along the shortest path between $i$ and $z$. Define $g(i,z) = (0,1)$ or $(1,0)$, depending on whether $|i_x - z_x| > |i_y - z_y|$ or vice-versa. Then the missile dynamics are given by:

\[
p_z^m(i,j;y,u) =
\begin{cases}
0.2, & j = (0,0), \\
0.8, & j = u.
\end{cases}
\tag{2.66}
\]

\[
p_y^m(i,\tau;z,u) =
\begin{cases}
0.5, & i = z, \\
1, & i = (0,0) \text{ or } (0,-2), \\
0.02, & \text{otherwise}.
\end{cases}
\tag{2.67}
\]

\[
p_y^m(i,j;z,u) = \left(1 - p_y^m(i,\tau;z,u)\right) \cdot
\begin{cases}
0.7, & j = f(i,z), \\
0.1, & j = f(i,z) \pm g(i,z), \\
0.1, & j = i.
\end{cases}
\tag{2.68}
\]

See Figure 2.2 for an illustration of the last equation. The missile targets either Ship One or the decoy; Ship Two is not targeted, but may still be hit if reached. At time zero, the missile appears randomly on the edge of the grid and subsequently moves according to the above dynamics, as influenced by the chosen CM policy.

Based on the transition probabilities derived from the above dynamics, Algorithm 2.4.1 was used to compute a stationary Nash equilibrium policy. This policy is characterized as follows. For states (x, y) ∈ G (the shaded “area of active control” in the figures), the agents attempt to lure the missile as far away from Ship One as possible by placing the perceived decoy at the nearest state such that |x| = 2 or |y| = 2. This maximizes the probability that the missile will hit the decoy, and minimizes the probability that it will hit a ship. Since there is no coordination, this is not always possible. For example, when the missile is at (1, 1), Ship One may decide to move the decoy east while Ship Two may move the perceived position north. Each has attempted to lure the missile to (2, 0) and (0, 2), respectively, but the outcome is a decoy position of (1, 1). This coordination problem cannot be resolved without communication or convention.
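The coordination problem in this example follows directly from the superposition rule for the joint control. As a small sketch (the helper name `joint_decoy` and the direction constants are hypothetical labels, not part of the original formulation):

```python
def joint_decoy(u1, u2):
    """Joint decoy position as the superposition U_n = U_n^1 + U_n^2."""
    return (u1[0] + u2[0], u1[1] + u2[1])


# Individual displacement choices available to each ship.
EAST, WEST, NORTH, SOUTH = (1, 0), (-1, 0), (0, 1), (0, -1)
MOVES = [EAST, WEST, NORTH, SOUTH]

# Both ships agree: the decoy reaches the edge of the target set.
print(joint_decoy(EAST, EAST))
print(joint_decoy(NORTH, NORTH))

# Ship One aims for (2, 0) and Ship Two aims for (0, 2): neither is reached.
print(joint_decoy(EAST, NORTH))

# Every reachable joint position lies in T = {(x, y) : |x| + |y| <= 2}.
reachable = {joint_decoy(a, b) for a in MOVES for b in MOVES}
print(sorted(reachable))
```

The positions with |x| = 2 or |y| = 2 are reached only when both ships pick the same displacement, which is why independent randomization can leave the decoy at (1, 1).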
For states outside the active control area, the agents randomly choose CMs that cause the missile to move toward the nearest state such that |x| = 2 or |y| = 2. For example, if the missile is at (5, 0), then the goal is to guide the missile to (2, 0), which is accomplished by placing the decoy at (x, 0), |x| ≤ 2.

This policy is quite reasonable. As a check, an optimal policy was also computed by assuming a single ship had control over both countermeasures. The Pareto optimal costs are shown in Figure 2.4. Except for the coordination problems associated with independently randomizing decisions, the Nash equilibrium policy was almost identical to the optimal one. Figure 2.3 shows the costs associated with the equilibrium policy, that is, the expected difference between the initial and minimum distances achieved over the lifetime of the missile process. Figure 2.5 shows two sample missile paths from two separate simulations, along with the false target positions used at different times during the attack.

2.6.2 Minimizing Probability of Hit for a Missile with Forward Momentum

In the second example, three stationary ships equipped with CM lie at the edges of a task group. A single missile threatens the task group, and the objective of the three ships is to minimize the probability that the missile reaches any point in the interior of the triangle defined by their positions.

This example has two simplifying features. First, the missile moves one unit from left to right at each step, so the threat terminates by a definite time, and only has one pass at the targets. Second, CM actions are independent in the sense that the jamming success probabilities of each ship are multiplied to determine the overall CM effects.

Define the battlespace $B = \{(x,y) \in \mathbb{Z}^2 : 1 \le x, y \le 11\}$ (see Figure 2.7). The true target set G is the shaded area in that figure. We deviate from our setup slightly and define the missile target set $T = \{(x,y) \in \mathbb{Z}^2 : x = 11,\ 1 \le y \le 11\}$.
That is, we assume the missile target is along the right-hand edge of the state space, instead of in the task group. Define the target angle θ as in Figure 2.6. That is, for missile state $(Y_n, Z_n) = ((a_x, a_y), (11, b_y))$,

\[
\cos(\theta(Y_n, Z_n)) = \sqrt{(11 - a_x)^2} \Big/ \sqrt{(11 - a_x)^2 + (b_y - a_y)^2}.
\]

The physical transition probabilities are given by:

\[
p_y^m(i,j;z,u) =
\begin{cases}
\cos(2\theta(i,z)), & j = (i_x + 1, i_y), \\
1 - \cos(2\theta(i,z)), & j = (i_x + 1, i_y + \mathrm{sign}(\theta(i,z))), \\
1, & j = \tau,\ i_x = 11 \text{ or } i \in G.
\end{cases}
\tag{2.69}
\]

The target angle update is the point at which the missile is vulnerable to jamming. At each time index, the agents are able to independently attempt to shift the equivalent target position up or down by one unit. Define the probability of success of each CM attempt to be $p_s$;

\[
p_s = \Pr(\text{CM action of ship } l \text{ is effective at time index } k). \tag{2.70}
\]

$p_s$ is an empirical value based on factors such as missile type, environment, etc. The CM control choices for each ship are $U^l = \{-1, 0, 1\}$, in keeping with the discussion on constrained CM sequences in Section 2.5.3. We may quantify this as follows. Define $e_i$ to be the $i$th basis vector in eleven dimensions (since there are eleven target positions). Then we may define a probability distribution:

\[
F(i,u) = \sum_{A \subset \{1,2,3\}} p_s^{|A|} (1 - p_s)^{3 - |A|}\, e_{(i + \sum_{l \in A} u^l)}, \tag{2.71}
\]

where $|A|$ denotes the number of elements of $A$, and the sum is over all possible subsets $A$ of successful players. This gives the transition probabilities; $p_z^m((11, i_y), (11, j_y); y, u)$ is the $j_y$th component of $F(i_y, u)$.

As before, a Nash equilibrium solution was obtained, using the minimal probability of hit objective of Section 2.3.4. We present a representative trajectory in Figure 2.7 for CM success probability $p_s = 0.6$. The probability of hit for this case, when the missile is initially distributed uniformly along the left edge of the state space with initial target $Z_1 = (11, 6)$, is 0.13, whereas the probability of hit is 0.91 if no countermeasures are used.

3.50  3.51  3.46  3.38  3.33  3.32  3.33  3.38  3.46  3.51  3.50
3.53  2.62  2.61  2.52  2.43  2.41  2.43  2.52  2.61  2.62  3.53
3.53  2.65  1.71  1.66  1.52  1.47  1.52  1.66  1.71  2.65  3.53
3.48  2.63  1.75  0.78  0.66  0.46  0.66  0.78  1.75  2.63  3.48
3.40  2.53  1.67  0.97  0.24  0.24  0.24  0.97  1.67  2.53  3.40
3.37  2.47  1.51  0.48  0.25  0(S1) 0.25  0.48  1.51  2.47  3.37
3.37  2.47  1.54  0.68  0.25  0.28  0.25  0.68  1.54  2.47  3.37
3.41  2.54  1.69  0.80  0.21  0(S2) 0.21  0.80  1.69  2.54  3.41
3.49  2.63  1.74  0.84  0.29  0.60  0.29  0.84  1.74  2.63  3.49
3.54  2.66  1.74  0.89  1.12  1.40  1.12  0.89  1.74  2.66  3.54
3.54  2.63  1.77  1.87  2.01  2.20  2.01  1.87  1.77  2.63  3.54

Figure 2.3: Expected costs $v^l$ for the Nash equilibrium policy computed by Algorithm 2.4.1, arranged over a spatial grid of possible missile positions. The shaded area indicates the possible false target positions and the marked squares indicate ship positions S1 and S2. The values are the expected cost associated with the process when the missile is in each position, i.e. the difference between the initial and the expected minimum future distance.

3.48  3.49  3.45  3.38  3.33  3.32  3.33  3.38  3.45  3.49  3.48
3.49  2.60  2.59  2.52  2.43  2.41  2.43  2.52  2.59  2.60  3.49
3.45  2.59  1.69  1.65  1.52  1.46  1.52  1.65  1.69  2.59  3.45
3.38  2.52  1.65  0.77  0.65  0.46  0.65  0.77  1.65  2.52  3.38
3.33  2.43  1.52  0.65  0.24  0.23  0.24  0.65  1.52  2.43  3.33
3.33  2.41  1.47  0.46  0.24  0(S1) 0.24  0.46  1.47  2.41  3.33
3.33  2.43  1.53  0.66  0.25  0.28  0.25  0.66  1.53  2.43  3.33
3.39  2.53  1.67  0.78  0.21  0(S2) 0.21  0.78  1.67  2.53  3.39
3.47  2.62  1.73  0.83  0.28  0.60  0.28  0.83  1.73  2.62  3.47
3.53  2.65  1.73  0.88  1.11  1.40  1.11  0.88  1.73  2.65  3.53
3.53  2.61  1.76  1.86  2.00  2.19  2.00  1.86  1.76  2.61  3.53

Figure 2.4: Expected costs $v^l$ for the Pareto optimal policy for the first example, arranged over a spatial grid of possible missile positions. The shaded area indicates the possible false target positions and the marked squares indicate ship positions S1 and S2. The values are the expected cost associated with the process when the missile is in each position, i.e. the difference between the initial and the expected minimum future distance.

Figure 2.5: Sample missile and CM paths under the Nash policy from two initial points. The grid represents the space of possible missile positions. The shaded area indicates the possible false target positions and the marked squares indicate ship positions. The false target positions applied for each portion of the sample missile trajectories (the bold arrows) are denoted by A, B, C for the first simulated trajectory, and by X, Y, Z for the second simulated trajectory. These letters denote both the false target positions and the portions of the missile trajectories for which they were employed.

The joint Nash equilibrium for the second example also appears reasonable.
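The distribution F(i, u) in (2.71) can be evaluated by enumerating which ships succeed. The following sketch is illustrative only: the function name `target_shift_distribution` is hypothetical, and clipping shifted positions to the range 1..11 is an assumption, since the text does not say how shifts past the grid edge are treated.

```python
from itertools import product


def target_shift_distribution(i_y, u, p_s, n_positions=11):
    """Distribution of the next target row, following (2.71).

    i_y : current target row (1..n_positions)
    u   : tuple of CM choices, one per ship, each in {-1, 0, 1}
    p_s : probability that a single ship's CM attempt succeeds
    Returns a list `dist` where dist[j] is the probability of row j
    (index 0 is unused).
    """
    dist = [0.0] * (n_positions + 1)
    # Enumerate every subset A of successful ships as a tuple of flags.
    for success in product([False, True], repeat=len(u)):
        k = sum(success)
        prob = p_s ** k * (1 - p_s) ** (len(u) - k)
        shift = sum(u_l for u_l, ok in zip(u, success) if ok)
        # Assumption: positions are clipped to stay on the grid.
        j = min(max(i_y + shift, 1), n_positions)
        dist[j] += prob
    return dist


# All three ships try to push the target down from row 6 with p_s = 0.6.
dist = target_shift_distribution(6, (-1, -1, -1), 0.6)
for row in range(3, 7):
    print(row, round(dist[row], 3))
```

When all three ships choose the same direction, the number of successful shifts is binomial with parameters 3 and p_s, which is the independence of CM effects referred to in the text.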
Roughly speaking, the agents coordinate their efforts to drive the target to the side during the initial stages of the attack, and then as the danger level decreases, they become more indifferent as to their CM choices. The iterated best response algorithm leads to an independent synchronization of actions; for example, all agents will attempt to move the target down at the beginning of the attack, as opposed to one ship working to cancel the efforts of another.

Figure 2.6: Missile transition probabilities $p_y^m(i,j;z,u)$ relative to target position z for the second example. The probabilities, cos(2θ) and 1 − cos(2θ), are shown for a positive angle θ.

Figure 2.7: Sample trajectory of missile, target location, and CM choices under Nash policy for $p_s = 0.6$. The solid line indicates the missile flight path $Y_n$, while the broken line indicates the y-axis position of the missile’s current target $Z_n$. The control choices (+1, 0, −1) of all three agents are depicted below the graph for each missile position. The shaded area represents the task group to be defended, with the three agents A1, A2, A3 at its vertices.

Sensitivity Analysis of Nash Equilibrium to Missile Dynamics

The missile characteristics may not be perfectly known by the controller, in which case one may wish to measure how the performance of a joint policy varies with modeling errors. In this section, we introduce an error between the true CM success probability ps , defined in (2.70) as the probability of successfully influencing the missile target, and the estimate of ps used by agents when computing their optimal CM policies, p̂s . We then analyze the performance of Nash equilibrium policies under various values of ps and p̂s .
In particular, Nash equilibrium policies were computed for values of p̂s = 0.05, 0.1, . . . , 1.0. These policies were then evaluated under models with true parameters ps = 0.05, 0.1, . . . , 1.0. The resulting probabilities of hit, for an initial distribution as before (missile uniformly distributed along the left edge of the state space, with initial target y = 6), are presented in Figure 2.8.

The first characteristic to note is that the costs associated with the policies change very little as the model parameter p̂s is varied slightly from the true parameter ps . That is, the system is not particularly sensitive to small modeling errors. The exception to this is that the system appears to be relatively more sensitive to differences in p̂s between 0.15 and 0.25, where a small modeling error such as p̂s = 0.25 while ps = 0.20 can result in a 10% difference between predicted and actual costs. In general though, small modeling errors result in negligible performance differences (i.e. the graph is flat in the direction of changing p̂s ).

More interesting is the fact that the costs decrease as both ps and p̂s increase, with the exception of the region where p̂s > 0.95. The reason for the decrease along the direction of ps is clear: as the agents are better able to influence the missile’s targeting system, the performance of the CM policies improves. The fact that costs decrease along the direction of p̂s as well is more surprising, indicating that the agents perform better when they are more confident about their actions. That is, if a ship computes a policy based on the assumption that it and its fellows have strongly effective CM, it will do better than if it computed its policy based on the true, weaker effectiveness. It is doubtful that this is a general phenomenon, and the appearance of it here is most likely due to the underlying independence of CM effects.
Nevertheless, the monotonicity suggests a method for computing improved Nash equilibrium policies in this example. Namely, the agents compute a Nash equilibrium policy assuming an inflated value of p̂s , then use this as a starting point for the computation of a Nash equilibrium based on the true value p̂s = ps . Since the iterated best response technique is monotonic, the resultant equilibrium policy will have a better performance than the original. Interestingly, when this procedure was carried out, it was found that policies based on inflated values of p̂s were already equilibrium policies for lesser values. It is suspected that this surprising fact is again due to the underlying independence of CM effects.

Figure 2.8: Performance of Nash equilibrium policies when ps , the CM success probability as defined in (2.70), and its estimate, p̂s , vary. As ps increases, CM becomes more effective and hence the probability of hit is reduced. As p̂s deviates from ps , the CM policy and hence the probability of hit varies, although the true missile dynamics remain unchanged. (Plot title: Sensitivity of Hit Probability to ECM Effectiveness. Axes: estimate of ps , true value of ps , probability of hit.)

2.7 Conclusion

Decentralized missile deflection policies, computed offline, give task groups the ability to effectively defend against attacks even without real-time communication. This ability allows a task group to capitalize on the benefits of NCW while minimizing its vulnerability to attack on the infostructure that lies at its heart. In this chapter, we formulated a decentralized missile deflection problem, including a discussion of missile and countermeasure dynamics and the development of appropriate defensive objectives.
This formulation leads to an interpretation of the problem as a stochastic shortest path game, for which Nash equilibrium solutions are known to exist. Algorithms to compute a Nash equilibrium solution via linear programming and iterated best response were proposed and shown to be effective when player objectives coincide. Finally, numerical studies and sensitivity analysis were presented to show how these results can be applied.

Nash equilibrium solutions will in general be less effective than Pareto optimal solutions. However, in our simulations the degradation is minimal. In particular, when player objectives coincide and there are no extra constraints, there exists a pure optimal policy [81]. It follows that this optimal policy is also a pure Nash equilibrium. Therefore, it is no surprise that Nash policies perform quite well in the team games considered here.

Game theory presents a promising avenue of approach for solving complex decision problems in a decentralized fashion. By distributing the problem among several decision makers, the decision space is significantly reduced, so that near-optimal solutions can be computed with far less complexity than that required for an optimal centralized solution.

The framework presented here applies in a very general setting. Provided that the missile dynamics are known and agreed upon, that each ship is known to have perfect observations of missile threats, and that ship capabilities and objectives are common knowledge, both globally optimal and Nash equilibrium solutions can in theory be computed for any threat/defense combination, for any set of player objectives.

Proof of Theorem 2.2.1

Proof As in [82], define a matrix $C$ with entries $c_{ij} = f_{tc}(\delta_i.\mu; j)$; the expected number of time periods that the Markov chain is in $j$ given that it starts in $i$.
Conditioning on the first transition out of $i$, we obtain

$$c_{ij} = I\{i = j\} + \sum_{k \neq \tau} p(i, k; \mu)\, c_{kj}, \tag{2.72}$$

where $I\{\cdot\}$ is the indicator function. Since the Markov chain induced by $P(\mu)$ is transient, this sum is finite. In matrix form we have

$$C = P(\mu)\, C + I, \tag{2.73}$$

$$C = (I - P(\mu))^{-1}. \tag{2.74}$$

Moreover, since

$$f_{tc}(\alpha.\mu; j) = \sum_{i \in S} f_{tc}(\delta_i.\mu; j)\, \alpha(i) \tag{2.75}$$

by linearity, we have, for $\alpha$ a row vector on $S$,

$$f_{tc}(\alpha.\mu; j) = \alpha C = \alpha (I - P(\mu))^{-1}. \tag{2.76}$$

Equation (2.4) and the definition of $g^l(\mu)$ directly yield the result for $v^l_\alpha(\mu)$. The result for $v^l(\mu)$ follows by setting $\alpha = \delta_i$ for each $i \in S$. Equation (2.9) follows directly by substituting the Taylor series expansion $(I - P(\mu))^{-1} = I + P(\mu) + (P(\mu))^2 + \ldots$ into (2.7).

Glossary

Glossary of terms for Chapter 2.

Variable                                          Interpretation
$M$                                               Set of incoming missile threats.
$L$                                               Set of countermeasure (CM)-equipped ships (platforms).
$S^m$                                             State space of missile $m$; {position × target, initial, terminal}.
$(Y_n^m, Z_n^m) \in S^m = \{B \times T, r, \tau\}$   Missile $m$ threat state; (position, target).
$G \subset T$; $G(l) \subset G$                   Set of true targets; set of true targets defended by platform $l$.
$p_y^m(i, j; z, u)$                               Missile position transition probability ($Y_n^m$), given target.
$p_z^m(i, j; y, u)$                               Missile target transition probability ($Z_n^m$), given position.
$X_n \in S$                                       Joint missile threat state at time $n$, $(Y_n^1, Z_n^1, \ldots, Y_n^M, Z_n^M)$.
$p(i, j; u)$                                      Transition probability of joint threat $X_n$, given joint CM $u$.
$U_n^l \in U^l$; $U_n \in U$                      CM action of Platform $l$ at time $n$; joint CM action.
$\mu^l(i, u)$                                     Randomized CM deployment policy of Platform $l$.
$\mu(i, u) \in \Delta(U^l(i))$                    Joint randomized CM deployment policy.
$g^l(i, u)$                                       Cost to platform $l$ when $X_n = i$ and $U_n = u$.
$v^l(i, \mu)$                                     Total expected cost starting from state $i$ under policy $\mu$.

Bibliography

[62] D. Alberts, J. Garstka, and F. Stein. Network centric warfare: developing and leveraging information superiority. CCRP Publications, second edition, 1999.
[63] E. Altman.
Constrained Markov decision processes. Chapman and Hall/CRC, 1999.
[64] D. Bertsekas. Dynamic Programming and Optimal Control, volume 1 and 2. Athena Scientific, second edition, 2001.
[65] S. Blackman and R. Popoli. Design and analysis of modern tracking systems. Artech House, 1999.
[66] J. Breakwell and A. Merz. Football as a differential game. Journal of Guidance, Control, and Dynamics, 1992.
[67] M. Breton. Algorithms for stochastic games. In Stochastic games and related topics, Series C; Game theory, mathematical programming and operations research, 1991.
[68] L. Van Brunt, editor. The glossary of electronic warfare. EW Engineering, 1984.
[69] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991.
[70] J. Cruz, M. Simaan, et al. Game-theoretic modeling and control of a military air operation. IEEE Transactions on Aerospace and Electronic Systems, 37(4), 2001.
[71] H. Eustace, editor. The International Countermeasures Handbook. EW Communications, sixth edition, 1980.
[72] J. Filar and K. Vrieze. Competitive Markov Decision Processes. Springer-Verlag, 1997.
[73] S. Gutman. On optimal guidance for homing missiles. Journal of Guidance and Control, 1979.
[74] P. Herings and R. Peeters. Stationary equilibria in stochastic games: structure, selection and computation. METEOR Working Paper No. RM/00/31, 2000.
[75] J. Filar, T. Schultz, F. Thuijsman, and O. Vrieze. Nonlinear programming and stationary equilibria in stochastic games. Mathematical Programming, 50, 1991.
[76] Y. Lipman and J. Shinar. Mixed-strategy guidance in future ship defense. Journal of Guidance, Control, and Dynamics, 1996.
[77] W.S. Lovejoy. Computationally feasible bounds for partially observed Markov decision processes. Operations Research, 39(1), 1991.
[78] W.S. Lovejoy. A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28(1), 1991.
[79] M. Maskery and V. Krishnamurthy.
Game-theoretic missile deflection in network centric warfare. Proceedings of the IEEE International Conference on Networking, Sensing and Control, 2005.
[80] Joint Chiefs of Staff, editor. Dictionary of Military and Associated Terms. Department of Defense, 1987.
[81] M. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.
[82] S. Ross. Introduction to probability models. Academic Press, sixth edition, 1997.
[83] A. Sen and D. Shanno. Computing Nash equilibria for stochastic games. Working Paper, 2004.
[84] L. Shapley. Stochastic games. Proceedings of the National Academy of Sciences USA, 39(10), 1953.
[85] H. Sherali, Y. Lee, and D. Boyer. Scheduling target illuminators in naval battle-group anti-air warfare. Naval Research Logistics, 42, 1995.

Chapter 3

Decentralized Force Protection Against Anti-Ship Missiles: Correlated Equilibrium Approach³

3.1 Introduction

Steady advances in communications technology, together with advances in the theory of modern warfare, have made high-bandwidth communications equipment widely available aboard military platforms. This equipment is highly advanced, capable of automatically and seamlessly synchronizing observations and actions among a large number of decision makers. The resulting potential for improved mission performance presents many challenges and opportunities, and the need for theoretical models for such endeavors is clear. Such models must consider the use of secure, wireless data networks to allow modern platforms to assist each other in the implementation of time-critical tasks, while respecting all the various goals and objectives present among members of the collective.

Nowhere is this situation more clear and immediate than in the problem of cooperatively and dynamically synchronizing countermeasures against anti-ship missiles among members of a naval task group.
In this scenario, several ships, communicating over a common data network, face an attack by a number of radar guided missiles, and must deploy electronic countermeasures (CM) to deflect or confuse the missiles, in an attempt to defend both themselves and other members of the task group. This situation is characterized by the need to make rapid, decentralized decisions guided by real-time communication, that simultaneously balance the requirements for self and group preservation.

In this chapter, we consider such a scenario, with missile threats modeled as a discrete Markov process. Missiles appear randomly on the perimeter of a fixed physical space, and move toward the ships under known targeting and guidance laws. Electronic countermeasure (CM) capabilities, which are either decoys or electromagnetic jamming signals, are modeled as ship actions that are deployed dynamically to reduce the threat. Each platform represents a decision maker who observes the missile positions and uses local CMs to maximize its own safety, while coordinating with other platforms doing the same.

³ A version of this chapter has been accepted for publication. Maskery, M. and Krishnamurthy, V. (2007) Network-Enabled Missile Deflection: Games and Correlated Equilibrium, IEEE Transactions on Aerospace and Electronic Systems 43:3.

The problem is formulated as a transient stochastic game with cost-minimizing players. The fact that players coordinate their actions through a network is reflected by considering the relevant solution to be a correlated equilibrium [88], [89], allowing players to coordinate actions in real time through some player-defined communication scheme, such that global and local objectives are simultaneously optimized. Such a strategy is referred to as a NEOps equilibrium strategy.

The use of a correlated equilibrium is the major innovation of this chapter, distinguishing it from the results of Chapter 2.
In that chapter, players deployed countermeasures to deflect anti-ship missiles, but did not coordinate. The two approaches are complementary; Chapter 2 provides a highly robust and efficient missile deflection solution, since coordination is not required, while the present chapter accounts for the possibility of coordination as it arises in the NEOps framework.

Such scenarios arise out of the general framework of Network Centric Warfare (NCW) [86], and similar terms such as Network Enabled Operations, which denote a perspective on military operations emphasizing communication and coordination between friendly forces, in order to gain and leverage information superiority over an enemy. NCW connects independent, intelligent decision makers, allowing them to make joint decisions that are competitively optimal for each interested party. The game theoretic approach, as outlined here, is able to model such a situation, allowing us to define the types of strategies that would be acceptable in an NCW setting through the correlated equilibrium concept.

Examples of previous and current technologies enabling NEOps in a naval setting are the Cooperative Engagement Capability, the Link 16 data network and the Single Integrated Air Picture (SIAP). Such technologies fuse and circulate common sensor data to all decision makers in a networked force, providing platforms with the capability to coordinate joint actions in real time. It is important to note that the results in this chapter are not dependent on any specific technologies, but there is a requirement for some enabling technology capable of synchronizing tactical data and coordinating efforts in real time. The technologies cited above are evidence that such a requirement is feasible.

The correlated equilibrium of a stochastic game describes a solution for playing a dynamic game in which players are able to communicate but are self-interested.
This is a key point to bear in mind; we view a correlated equilibrium as a descriptive object, illuminating the properties of the set of joint strategies inevitably reached by such decision makers. We take this view throughout the chapter, requiring that players' joint strategies take the form of a correlated equilibrium without always being compelled to specify the communication structure underlying such a strategy. Particular attention will be paid to the NEOps equilibrium strategy, as such an object describes how well a group of players can perform from a global perspective.

In addition to the motivation developed above, recent work in decentralized machine learning has identified the set of correlated equilibria as the limit set of many decentralized machine learning algorithms [107], [108], [97]. This learning approach, based on regret matching, requires minimal awareness from each player, and can hence be effectively employed in a decentralized setting. Players repeatedly simulate the game together and learn, in a completely decentralized fashion, a correlated equilibrium strategy from only their own observed costs and empirical distribution of past play. The existence of such a simple, convergent, distributed learning solution makes correlated equilibria worthy of study in their own right.

Communication in Network Enabled Operations is not unlimited, and a cooperative NEOps equilibrium strategy can be implemented only if channel capacity constraints can be satisfied. One of the key contributions of this chapter is to quantify the amount of communication necessary in the implementation of a NEOps equilibrium strategy for missile deflection. This is done through elementary concepts in information theory.
As a consequence, we show how NEOps equilibrium strategies can be computed in the presence of bandwidth constraints, and provide examples showing the trade-off between performance and communication.

Main Results

The main contributions of this chapter are listed as follows.

• The formulation of a decentralized missile deflection problem as a transient stochastic game with communication. In Sections 3.2 and 3.3, we consider a scenario with ships as self-interested decision makers able to coordinate countermeasure deployment against a fixed number of missiles with known dynamics in real time.

• A theoretical foundation for Network Enabled Operations via the definition of NEOps equilibrium strategy. Sections 3.1.1 and 3.1.2 define and relate the concepts of Network Enabled Operations and correlated equilibrium. The NEOps equilibrium strategy is defined as the relevant competitive equilibrium strategy for Network Centric Warfare, formulated as a stationary correlated equilibrium for the stochastic missile deflection game. The strategy is characterized as the solution to an optimization problem with bilinear constraints in Section 3.4.

• An information theoretic analysis of the communication overhead required for implementation of the NEOps equilibrium strategy. Communication overhead is shown to be related to the mutual information between the random variables describing player actions in Section 3.5. We also present several implementation architectures in that section.

• Numerical examples illustrating the NEOps equilibrium strategy for a moderate-scale missile deflection game are presented, and the trade-off between communication and performance is examined in Section 3.6.

3.1.1 Network Enabled Operations

Network Enabled Operations (NEOps) is an adaptation of the Network Centric Warfare concept, as put forth in [86].
In contrast with NCW, NEOps emphasizes the inclusion of operations other than war, and is not so explicitly bound to network technology. A candidate definition for Network Enabled Operations is given as follows [91]:

Definition 3.1.1 Network Enabled Operations (NEOps) represent an approach to the conduct of military operations characterized by common intent, decentralized empowerment, and shared information, enabled by appropriate culture, technology and practices.

A representative overview of the proposed operation of NEOps with regards to missile deflection is depicted in Figure 3.1. The enabling technology cited in Definition 3.1.1 is present in the form of a secure communication network capable of distributing sensor data and coordinating actions among a large number of military platforms in real time; each decision maker is provided with an identical, high quality picture of the battlespace by sharing sensor, decision, and engagement data between all interested parties. We claim that our work supports Definition 3.1.1 in the following manner:

• Common Intent: The NEOps equilibrium strategy is optimal with respect to a single, collective objective, representing the common intent of the networked force.

• Decentralized Empowerment: The equilibrium aspect of the NEOps equilibrium strategy empowers each decision maker since it restricts attention to the set of strategies which are independently optimal from each individual point of view. Implicit in this interpretation is the existence of disparate individual objectives in addition to the common objective mentioned above. Furthermore, decentralized empowerment implies multiple, independent decision makers, justifying the use of game theory.

• Shared Information: This is supported in two ways. First, we assume that each decision maker shares common observations of the missile threat through the use of a common sensor network. Second, we allow users to share information about their reactions to these threats in real time. In game-theoretic terms, the first assumption is characterized by the fact that we do not use the theory of Bayesian games, as would be required if decision makers had disparate tactical information. The second assumption is characterized by use of the correlated as opposed to the Nash equilibrium solution concept.

[Figure 3.1 diagram: Missiles A to F and other tracking sources; Ships 1, 2, ..., L, each with local tracking, countermeasure systems, decision/control, a track/decision processor and a data distribution system; central command and control.]

Figure 3.1: Decentralized missile deflection using Network Enabled Operations. Ships share sensor measurements and coordinate decisions through the network, with the possible assistance of a central command and control that issues acceptable recommendations.

3.1.2 The Case for Correlated Equilibrium in NEOps

The correlated equilibrium concept, introduced by Aumann in [88] and [89], is a generalization of the Nash equilibrium. For static noncooperative games, it describes the joint strategy of players who can communicate their intentions to each other (possibly through an intermediary), but are self-interested and are able to update conditional probabilities using Bayes' rule. Aumann showed that players can achieve a global performance improvement by implementing a correlated instead of a Nash equilibrium.
The correlated equilibrium requires that actions be coordinated, but guarantees that each player retains an element of choice. In its canonical form, a static correlated equilibrium consists of a game augmented by a correlation device, which sends "private recommendations" to each player. The players, knowing the equilibrium structure, find it rational to follow these recommendations because of the a posteriori action probabilities they imply about other players [88]. A Nash equilibrium is trivially a correlated equilibrium because it is a product distribution; players receiving recommendations from a Nash distribution learn nothing new about others' actions, but find it rational to follow the recommendation because of the classical "best response" rationale.

Although it is a useful tool, both in theory and practice, we do not justify the use of correlated equilibrium in this chapter based on the use of such an explicit correlation device. Instead, we reason, as in [89], that all policies adopted by independent, rational decision makers with the ability to coordinate will take the form of a correlated equilibrium. We then characterize the NEOps equilibrium strategy as a "globally optimal" correlated equilibrium, and examine the communication required for its implementation. A correlated equilibrium, in this last sense, is defined as follows.

Definition 3.1.2 Let $\Omega$ be a finite information set, and define partitions $\mathcal{P}^l$ on $\Omega$ for each game player $l \in L$. Define player strategies $\mu^l(\omega) : \omega \in \Omega$, such that $\mu^l(\omega) = \mu^l(\omega')$ whenever $\omega$ and $\omega'$ are elements of the same $P^l \in \mathcal{P}^l$. Also define a probability measure $p$ on $\Omega$. A correlated equilibrium is a set of strategies $\{\mu^l : l \in L\}$ such that, for every $\tau^l$ such that $\tau^l(\omega) = \tau^l(\omega')$ whenever $\omega$ and $\omega'$ are in the same $P^l \in \mathcal{P}^l$,

$$\sum_{\omega \in \Omega} p(\omega)\, u^l(\mu^l(\omega), \mu^{-l}(\omega)) \le \sum_{\omega \in \Omega} p(\omega)\, u^l(\tau^l(\omega), \mu^{-l}(\omega)). \tag{3.1}$$

Here, $u^l$ represents the cost to Player $l$, and $\mu^{-l}$ represents the collection of strategies of all players but $l$. At each decision point of the game, each player $l$ receives partial information about $\omega$ through the observation $\omega \in P$ for some $P \in \mathcal{P}^l$. This information represents the knowledge about the other players' decisions, along with any other relevant information, which is factored into Player $l$'s action decision via (3.1).

Elements of $\Omega$ describe the exact value of all parameters about which a player may be uncertain, including the exact actions taken by each player in the game. To simplify matters, we assume in this chapter that uncertainty in others' actions is the only source of uncertainty present. That is, we restrict our attention to the sets $\Omega_s = \times_{l \in L} U^l(s)$, where $L$ is the set of players, $s$ is a known parameter, and $U^l(s)$ is the set of actions available to player $l$ given parameter $s$.

Example If each $\mathcal{P}^l = \{\Omega, \emptyset\}$, then players do not communicate. In this case, the correlated equilibrium is equivalent to a Nash equilibrium. If each $\mathcal{P}^l$ is identical, the players have fully public knowledge of each others' actions.

It is well known that this last case, corresponding to public communication, is not always optimal. Players can improve their overall performance by restricting themselves to complex private communication schemes. For this reason, we allow the information partition to be a free variable in our formulation, so that we may later analyze its structure under optimal conditions. We do, however, pay special attention to the public communication case, as it is particularly easy to implement.

It is useful to make two remarks here, due to Aumann [89]. First, although a correlated equilibrium generally takes the form of a probability distribution over joint actions, its probabilistic nature reflects only the uncertainty of players about the action choices of others; players themselves do not explicitly randomize their actions. This is an important observation for those who would be uncomfortable entrusting outcomes to a machine that essentially performs coin flips. Second, every correlated equilibrium corresponds to at least one information system. That is, we are justified in characterizing the correlated equilibrium solution before specifying the information structure that enables such an equilibrium.

Although this view of a correlated equilibrium is philosophically sound, we acknowledge that there is a gap in current understanding between Aumann's descriptive view of correlated equilibrium, and the many prescriptive means of implementing it. The NEOps equilibrium strategy can be motivated as a reasonable description of actions, but the implementation method must remain unspecified. Conversely, we may specify an implementation method (see Section 3.5.2), but this affects the claim that the strategy has been freely negotiated. We solve this problem by supporting both the descriptive and prescriptive views separately, with the understanding that in practice either one view or the other will be useful. The information theoretic portion of the chapter is valid from both perspectives.

The extension of the correlated equilibrium to stochastic games originated with [103], in which the prototype concept of extensive form correlated equilibrium was defined. The natural progression to stochastic games is apparent in papers such as [113], which defines a stationary correlated equilibrium for a stochastic game, but considers only the special case of symmetric information, and [106], which uses correlated equilibria as the goal of a reinforcement learning procedure for adaptive game-theoretic agents.
More recently, [119] shows the existence of a (nonstationary) correlated equilibrium based on retaliation strategies.

In contrast to the above examples, this chapter focuses on a stochastic correlated equilibrium that is both stationary and allows for asymmetric information transfer between players. Such a definition is advantageous in that it avoids the performance limitations associated with symmetric information, as is the case in [113], and is not vulnerable to the perpetual performance degradation that may result from retaliation to errors in play in a nonstationary equilibrium of the type described in [119].

On the other hand, we restrict our attention to stochastic game models with finite state and action spaces. The main reason for this, besides ease of exposition, is that the stochastic version of the correlated equilibrium solution concept is only known to exist, in general, for this case. Notwithstanding such existence problems, extension of the concepts presented here to countable and continuous spaces is relatively straightforward.

For purposes of analysis, the calculation of an optimal correlated equilibrium for a stochastic game requires the solution of an optimization problem with linear objective and bilinear constraints. In order to compute a solution to such a problem using standard nonlinear programming methods, an initial feasible solution is needed. Since a Nash equilibrium is also a correlated equilibrium, it will be convenient to use a Nash equilibrium as a starting point for such a calculation, because Nash equilibria can often be computed using decentralized, iterative techniques such as a best response scheme as in Chapter 2. Such techniques are of relatively low dimension and are based on an intuitive approach, giving a simple starting point for correlated equilibrium calculations.
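The claim that a Nash equilibrium is a feasible point of the correlated equilibrium constraints can be checked numerically in a small static game. The following is only a minimal sketch under stated assumptions: it uses a two-player, two-action cost game of the "chicken" type with made-up cost values, not the stochastic missile deflection game of this chapter, and the function names are hypothetical. It tests the canonical-form condition that no player can lower its conditional expected cost by deviating from a recommended action.

```python
# Hypothetical static two-player cost game (not taken from the thesis).
# Rows: action of player 1, columns: action of player 2.
# Action 0 = "yield", action 1 = "dare"; both players minimize cost.
COST1 = [[1.0, 5.0],
         [0.0, 7.0]]
COST2 = [[1.0, 0.0],
         [5.0, 7.0]]


def is_correlated_equilibrium(pi, cost1, cost2, tol=1e-9):
    """Check the canonical correlated equilibrium inequalities for a
    joint action distribution pi[a][b] in a two-player cost game."""
    n1, n2 = len(cost1), len(cost1[0])
    # Player 1: following recommendation a must not cost more than deviating.
    for a in range(n1):
        for dev in range(n1):
            gain = sum(pi[a][b] * (cost1[a][b] - cost1[dev][b]) for b in range(n2))
            if gain > tol:
                return False
    # Player 2: same test on columns.
    for b in range(n2):
        for dev in range(n2):
            gain = sum(pi[a][b] * (cost2[a][b] - cost2[a][dev]) for a in range(n1))
            if gain > tol:
                return False
    return True


def expected_cost(pi, cost):
    """Expected cost of one player under the joint distribution pi."""
    return sum(pi[a][b] * cost[a][b] for a in range(len(cost)) for b in range(len(cost[0])))


# Mixed Nash equilibrium of this toy game: each player yields with probability 2/3,
# so the joint distribution is a product distribution.
p = 2.0 / 3.0
NASH = [[p * p, p * (1 - p)],
        [(1 - p) * p, (1 - p) * (1 - p)]]

# A correlated (non-product) distribution that never recommends (dare, dare).
CORRELATED = [[1.0 / 3.0, 1.0 / 3.0],
              [1.0 / 3.0, 0.0]]

# All mass on (dare, dare): each player would rather deviate.
BOTH_DARE = [[0.0, 0.0],
             [0.0, 1.0]]

print(is_correlated_equilibrium(NASH, COST1, COST2))
print(is_correlated_equilibrium(CORRELATED, COST1, COST2))
print(is_correlated_equilibrium(BOTH_DARE, COST1, COST2))
print(expected_cost(NASH, COST1), expected_cost(CORRELATED, COST1))
```

In this toy game the product (Nash) distribution satisfies the constraints, so it could seed a search over correlated equilibria, and the correlated distribution gives each player a lower expected cost than the mixed Nash equilibrium. The stochastic game version adds state dependence and the bilinear constraints discussed in the text, which this sketch does not attempt.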
In this light, one may regard a NEOps equilibrium strategy as an improvement over a Nash equilibrium resulting from the sudden discovery of coordination abilities; players start off by playing a Nash equilibrium, then gradually improve their situation through communication by jointly computing better and better correlated equilibria, driven by some mutually agreed upon global objective. The optimal correlated equilibrium is a steady state of such an improvement process. In a machine learning context, such a steady state might arise through fictitious play, using methods such as the regret matching schemes discussed in the introduction.

Our approach is game-theoretic in nature, which is meant to reflect the decentralized nature of command and facilitate distributed computation. However, other approaches are possible. A comparable approach is found in [118], which considers a target illumination problem very similar to our multiple-ship, multiple-missile deflection scenario. That work computes an optimal scheduling solution for defeating the missile threat, which is implemented through full coordination. The optimal scheduling depends heavily on centralized control, but appears quite effective overall. In addition, one can formulate a max-min fair approach [114] in which the objective is to minimize the maximum probability of hit over all players. However, the computational complexity involved is quite high, and the solution again does not reflect decentralized empowerment. Finally, one can formulate a constrained Markov decision problem [87] which directly minimizes the probability of hit, subject to the probability of hit of each player being below a given threshold. This approach has the advantage that only a linear program, as opposed to our bilinear program, must be solved. However, it represents a weaker form of decentralized empowerment, since players may only specify a lower performance bound, instead of insisting on the best available performance.
Moreover, it is not clear how the thresholds should be chosen, so an outer loop would also be needed.

3.2 Network Enabled Operations in Dynamic Missile Deflection

In this section, we describe our application domain, as originally formulated in Chapter 2, in a manner amenable to NEOps. We consider a missile threat evolving according to a discrete Markov process with dynamics jointly controlled by the countermeasure (CM) decisions of each ship in a task group. Fundamental to this model are the following:

• A set of incoming missiles $M$.
• A set of CM-equipped defensive agents $L$.
• A finite state space $S$, describing possible states of the multiple missile threat.
• A set of countermeasures $U^l(i)$ available to each agent $l \in L$, for each state $i \in S$.
• A discrete-time Markov process describing the evolution of the aggregate missile threat on $S$.

We slightly abuse notation by allowing $M$ and $L$ to represent both a set and its cardinality. We assume that the theater of operations is a discrete space, and that missiles move in discrete time toward the task group. We also consider the task group to be stationary, due to the relative speed difference between missiles and ships, the short engagement time, and the fact that ship maneuvers may be restricted.

A group of autonomous players defends the task group, each player having control over a subset of all countermeasures and charged with defending a given subset of ships. Given the aggregate missile state (positions, directions, etc.) at any time index, the task of each player is to deploy a sequence of CM so as to defend their charges, without exact knowledge of the CM decisions being made simultaneously by the other players, who in turn are carrying out their similar tasks.

A key contribution of Network Enabled Operations is to provide common, high quality track information to all decision makers.
Hence we make the assumption that all decision makers have complete observability over the threat state at all times. In practice, the missile process will be only partially observable, due to the presence of noisy measurements, uncertainty of missile type, etc. In this case, our assumption of complete observability is equivalent to the use of certainty equivalent control [94]. At any rate, the optimal control results of this chapter must first be understood before tackling the partially observed case.

The remainder of this section models the missile threat as a Markov process. The formulation of a single missile threat is considered first, which is then extended to multiple missile threats.

3.2.1 Single Missile Process

We begin with the evolution of a single missile, labelled missile $m$, indexed by discrete time $t$. The missile moves through a physical battlespace, guided by radar towards a given target that evolves with time. The state is given by

$$\begin{pmatrix} X_t^m \\ Z_t^m \end{pmatrix} = \begin{pmatrix} \text{Position of missile } m \text{ at time } t \\ \text{Target of missile } m \text{ at time } t \end{pmatrix}. \tag{3.2}$$

Define the battlespace $B$ to be the set of all possible positions of an active missile. We take this to be a two-dimensional grid, which is sufficient for sea-skimming missiles. For example, a 600 × 600 grid with each coordinate representing a missile's physical position to within 50 metres is sufficient to reflect the position of a missile with a range of 30 kilometres, a speed of 500 metres per second, and accurate position updates made 10 times per second.

Define $T \subset B$ to be the set of all possible targets of a missile. We assume that a missile moves more or less directly from its current position towards its current target, but that this target may change according to the missile's targeting decisions (which may be influenced through ship countermeasures).
Since missiles can be misled, not every element of $T$ corresponds to a ship location, and we define $\bar{T} \subset T$ to be the set of true target positions occupied by the task group.

An active missile has a physical position $X_t^m$ in $B$ and a current target $Z_t^m$ in $T$. To this we add two special states, a pre-engagement state $r$, corresponding to missile $m$ being a potential but as-yet untracked threat, and a terminal state $\Delta$, corresponding to a missile that has reached a target, terminated its flight, or acquired a trajectory that ensures it is no longer a threat. When one of these events occurs, the missile transitions to the unique $\Delta$ with probability one. The state space of the missile is denoted by

$$S^m = \{\{B \times T\}, \{r\}, \{\Delta\}\}. \tag{3.3}$$

For states $s \in B \times T$, we say $s = (s_x, s_z)$, where $s_x$ and $s_z$ denote position and target components, respectively. In this chapter, $(X_t^m, Z_t^m)$ is assumed to be a Markov process, controlled in part by CM actions, with transition probabilities defined as follows.

Definition 3.2.1 Let $u_t$ represent some joint CM action of all agents at time $t$. For $i, j \in B \times T$, the physical and targeting transition probabilities of missile $m$ are given by

$$p_x^m(i, j; z, u) = \Pr(X_{t+1}^m = j \mid X_t^m = i, Z_t^m = z, u_t = u), \tag{3.4}$$

$$p_z^m(i, j; x, u) = \Pr(Z_{t+1}^m = j \mid Z_t^m = i, X_t^m = x, u_t = u). \tag{3.5}$$

These probabilities are assumed to be conditionally independent, hence the overall transition probabilities are given by

$$p^m(i, j; u) = \Pr((X_{t+1}^m, Z_{t+1}^m) = (j_x, j_z) \mid (X_t^m, Z_t^m) = (i_x, i_z), u_t = u) \tag{3.6}$$

$$\phantom{p^m(i, j; u)} = p_x^m(i_x, j_x; i_z, u)\, p_z^m(i_z, j_z; i_x, u). \tag{3.7}$$

Furthermore, we define the following transition probabilities:

$$p^m(i, \Delta; u) = 1 \text{ whenever } i \in T \times T \quad \text{(missile terminates upon reaching a target)}, \tag{3.8}$$

$$p^m(\Delta, \Delta; u) = 1 \quad \text{(terminated missiles do not recover)}, \tag{3.9}$$

$$p^m(r, j; u) = \beta^m(j),\ j \in S^m \quad \text{(activation probabilities are independent of control)}, \tag{3.10}$$

where $\beta^m$ is a given probability distribution.
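To make the factorization in (3.6)-(3.7) concrete, the following minimal sketch builds the joint (position, target) transition probabilities of one missile from separate position and target kernels, for a single fixed joint CM action. Everything numerical here is an assumption for illustration only: the three-cell one-dimensional battlespace, the two candidate targets, and the probabilities 0.9 and 0.8 are hypothetical, and the special states $r$ and $\Delta$ are left out to keep the example short.

```python
from itertools import product

POSITIONS = [0, 1, 2]   # toy one-dimensional battlespace B (hypothetical)
TARGETS = [0, 2]        # toy set of possible targets T, a subset of B (hypothetical)


def position_row(x, z, q=0.9):
    """Toy p_x(x, . ; z, u): step one cell toward target z with probability q,
    otherwise stay in place. The value of q is an assumption."""
    row = {j: 0.0 for j in POSITIONS}
    step = x + max(-1, min(1, z - x))   # one cell toward z, or x itself if already there
    row[step] += q
    row[x] += 1.0 - q
    return row


def target_row(z, x, keep=0.8):
    """Toy p_z(z, . ; x, u): keep the current target with probability keep,
    otherwise switch to the other candidate. Position x is accepted to mirror
    the dependence in (3.5) but is not used by this toy rule."""
    other = [t for t in TARGETS if t != z][0]
    return {z: keep, other: 1.0 - keep}


def joint_kernel():
    """Joint kernel on B x T built as the product of the two kernels, as in (3.7)."""
    kernel = {}
    for ix, iz in product(POSITIONS, TARGETS):
        px = position_row(ix, iz)
        pz = target_row(iz, ix)
        kernel[(ix, iz)] = {
            (jx, jz): px[jx] * pz[jz] for jx, jz in product(POSITIONS, TARGETS)
        }
    return kernel


KERNEL = joint_kernel()

# Each row of the joint kernel is a probability distribution over B x T.
for state, row in KERNEL.items():
    print(state, round(sum(row.values()), 12))
```

Because the two component kernels are each normalized, every row of the product kernel sums to one without further normalization, which is the practical benefit of the conditional independence assumption.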
We account for two sources of randomness in the missile process: randomness in the physical guidance system, and randomness in the targeting mechanism. Other sources, such as turbulence and errors in the sensor network's estimates of the missile position, can be included as desired.

For the physical guidance system, we assume the missile operates according to proportional guidance as described in [95]. That is, the missile moves on an intercept course towards a target by predicting its position at the time of impact. However, because the missile may only make noisy position and velocity estimates of its target, occasional deviations from the proper course may occur. For example, suppose a missile is at point $(X_t^m, Z_t^m) = (a, z)$ at time $t$, and is on a straight-line intercept course that should take it next to point $b(a, z)$ on its way to target position $z$. We may account for random guidance errors by assuming that the missile moves to $b(a, z)$ with some high probability $q$, and otherwise moves to one of the two states adjacent to both $a$ and $b(a, z)$, each with probability $(1 - q)/2$.

The targeting transition probabilities depend on some programmed decision rules of the missile, but targets can also "drift" if the missile is viewed as a receiver operating in a noisy environment. Suppose that $Z_t^m = a$, and that the missile's radar receives information about the target position over a discrete memoryless channel [99]. The channel randomly reports the new position as $a'$ with probability $p_z^m(a, a'; x, u)$, and the missile updates $Z_{t+1}^m$ accordingly. The distortion may or may not vary with the missile position. An effective countermeasure $u$ might minimize $p_z^m(a, a; x, u)$ when $a \in T$, and maximize it otherwise.

3.2.2 Multiple Missile Process

Naval task groups must be prepared not only for isolated missile attacks, but also for the possibility of large-scale, multiple missile attacks.
In this section we extend the Markov missile process to the case of $M$ missiles, which evolves on a much larger, product state space. Define the aggregate missile trajectory indexed over time $t$ by

\[
Y_t = (X_t^1, Z_t^1, X_t^2, Z_t^2, \ldots, X_t^M, Z_t^M). \tag{3.11}
\]

The process evolves on the product state space $S$, defined as

\[
S = S^1 \times S^2 \times \cdots \times S^M. \tag{3.12}
\]

Since there is a unique absorbing state $\Delta$ for each missile, it is clear that $(\Delta, \Delta, \ldots, \Delta) \in S$ is the unique absorbing state for the aggregate process. For simplicity, we refer to this state simply as $\Delta$; whether this refers to a single or aggregate terminal state will be clear from the context. We also define $S_0$ to represent all states in $S$ except $\Delta$. A generic state $s \in S$ is written as $(s^1, s^2, \ldots, s^M)$.

We assume that each missile evolves independently (they do not communicate). In this case the transition probabilities of the $Y_t$ process may be written as

\[
p(i, j; u) = \prod_{k=1}^{M} p^k(i^k, j^k; u). \tag{3.13}
\]

All the dynamics of the single missile process carry over to the multiple missile case.

This formulation allows us to consider a variety of threat situations. Not only does each distinct value of $M$ represent a fundamentally different threat situation, but the potential for mixtures of different missile types leads to a combinatorially large number of situations for each $M$. In practice, only a small number of threat situations would be considered, chosen to be representative of the types of situations anticipated.

Countermeasures (CM) involve mechanisms such as physical decoys, chaff, and electromagnetic jamming [100]. The primary function of each CM type is to confuse the targeting system of the missile, possibly by providing a more attractive (and false) target. It is assumed that CM control decisions can be made dynamically, that is, a control decision is made at each time index.
This is a reasonable assumption for unconstrained countermeasures such as electromagnetic jamming. It is also straightforward to adjust the problem so as to consider more limited countermeasures, without affecting the theory here, as in Chapter 2.

Although we refrain from going into detail, we assume that each joint CM action $u$ affects the missile process by controlling the transition probabilities $p(i, j; u)$ in a known way. This approach allows one to specify varying CM effects for different missile states, and to explicitly characterize the uncertainty associated with each countermeasure attempt. For more details, and an extended model of the missile dynamics, with multiple missiles, decoys, noisy missile targeting, and CM constraints, the reader is referred to Chapter 2 and [111].

3.3 Decentralized Missile Deflection as a Transient Stochastic Game

Transient stochastic games [117], [101] consist of a Markov process evolving on a state space $S$, along with a set $L$ of players, who influence the transition probabilities of the process by choosing one of a finite number of actions in each state. Each player associates a cost with each state-action combination, and chooses actions with the objective of minimizing the total expected cost accumulated over the lifetime of the process. The transience property is satisfied by having a cost-free, absorbing state $\Delta \in S$, such that the probability of reaching $\Delta$ in a finite number of steps from any state $i \in S$ is nonzero regardless of actions. It is easy to see that the missile deflection problem is a transient stochastic game (with CM platforms taking the role of players), since a missile will reach its target, fly out of range, or run out of fuel in finite time.

The instantaneous cost incurred by player $l \in L$ when the missile is in state $i$ and CM action $u$ is taken is denoted by $g^l(i, u)$. Since state $\Delta$ is cost free, $g^l(\Delta, u) = 0$ for all $l \in L$ and all action choices $u$.
A particularly simple and appropriate cost function for the missile deflection problem is obtained as

\[
g^l(i, u) = I(i_x^k \in T_l \text{ for some } k). \tag{3.14}
\]

It can be shown, using (3.8), that the total expected cost under (3.14) is equivalent to the expected number of missiles that ever reach $T_l$ (Chapter 2, [116], [111]). In the case of a single threat, it is also the probability of a missile hit. (The cost function can be defined in other ways, for example to penalize a player whenever the missile reaches a new minimum distance [111].) Although (3.14) is independent of $u$, we maintain generality by including the dependence of $g^l$ on $u$ below.

In this chapter, we consider the game solution to be a stationary correlated equilibrium from the point of view of an outside observer. From such a perspective, players select joint CM actions according to a stationary probability distribution. This probability distribution is referred to here as a stationary randomized joint strategy. Let $\mathbf{U}^l$ (resp. $\mathbf{U}$) represent the vector of available individual (resp. joint) action sets over all states $i$. That is,

\[
\mathbf{U}^l = \big(U^l(1)^T, U^l(2)^T, \ldots, U^l(|S|)^T\big)^T, \tag{3.15}
\]
\[
\mathbf{U} = \big(U(1)^T, U(2)^T, \ldots, U(|S|)^T\big)^T, \tag{3.16}
\]

(where superscript $T$ indicates vector transpose). Furthermore, let $\Delta(\mathbf{U}^l)$ (resp. $\Delta(\mathbf{U})$) represent the set of probability distributions over the rows of $\mathbf{U}^l$ (resp. $\mathbf{U}$). Then we are led to the following definition.

Definition 3.3.1 A stationary randomized joint strategy $\mu$ is a mapping from $S$ to $\Delta(\mathbf{U})$. That is, for each state $i \in S$, $\mu_i(\cdot)$ is a probability distribution over actions in $U(i)$. The probability of players jointly selecting action $u$ in state $i$ is $\mu_i(u)$.

Each joint strategy $\mu$ induces a Markov chain on $S$, with transition probabilities

\[
p(i, j; \mu) = \sum_{u \in U(i)} p(i, j; u)\, \mu_i(u), \tag{3.17}
\]

and expected costs

\[
g^l(i, \mu) = \sum_{u \in U(i)} g^l(i, u)\, \mu_i(u), \tag{3.18}
\]

for each player $l \in L$.
Given a joint strategy, players associate a total expected cost $G^l(i, \mu)$ with each state, defined through the (recursive) set of equations

\[
G^l(i, \mu) = g^l(i, \mu) + \sum_{j \in S_0} p(i, j; \mu)\, G^l(j, \mu). \tag{3.19}
\]

This dynamic programming equation is well known for stochastic games [117], [101], [94]. The reader can easily verify that its solution is finite for all transient stochastic games.

The solution to a stochastic game is an equilibrium: a joint strategy which is simultaneously optimal for all players. The study of equilibria in stochastic games is usually performed by considering a set of auxiliary strategic games [117], [101], defined as follows. Consider a single state $i$, and assume that some strategy $\mu$ will be followed in every state following $i$. Then each player could, in principle, predict the expected total costs associated with every joint action $u$ in state $i$. This is done by adding the current cost of action $u$ to the expected cost incurred from the next state onward, assuming strategy $\mu$. These costs define an auxiliary game for $i$, with costs

\[
G^l(i, (u, \mu_{-i})) = g^l(i, u) + \sum_{j \in S_0} p(i, j; u)\, G^l(j, \mu), \tag{3.20}
\]

where $(u, \mu_{-i})$ corresponds to the strategy in which joint action $u$ is taken immediately in state $i$, and strategy $\mu$ is used in every subsequent state, including any return to $i$.

A stationary joint strategy $\mu$ that is simultaneously a Nash equilibrium for all auxiliary games is a Nash equilibrium for the stochastic game [101]. (That is, each player's strategy is a best response in terms of present and future payoffs in every state.) Similarly, a stationary joint strategy that is simultaneously a correlated equilibrium for all auxiliary games will be a correlated equilibrium for the stochastic game.
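As an illustration of (3.19) and (3.20), the following sketch evaluates the total expected cost of a fixed joint strategy by repeated substitution, and then forms the auxiliary-game costs in one state. The two-state game, its costs, and all names (`P`, `G_COST`, `MU`, `evaluate`, `auxiliary_costs`) are hypothetical; probability mass not listed in a transition row is assumed to go to the cost-free absorbing state.

```python
# Hypothetical two-state transient game seen by one player l.
# P[i][u] maps successor transient states to probabilities; any missing
# mass is assumed to flow to the absorbing state Delta.
P = {
    "A": {"jam": {"A": 0.5, "B": 0.25}, "idle": {"A": 0.5, "B": 0.5}},
    "B": {"jam": {}, "idle": {}},
}
G_COST = {  # instantaneous costs g^l(i, u), chosen arbitrarily
    "A": {"jam": 0.0, "idle": 0.0},
    "B": {"jam": 0.2, "idle": 1.0},
}
MU = {"A": {"jam": 0.5, "idle": 0.5}, "B": {"jam": 1.0}}  # joint strategy


def evaluate(p, g, mu, sweeps=200):
    """Approximate G^l(i, mu) in (3.19) by fixed-point iteration."""
    G = {i: 0.0 for i in p}
    for _ in range(sweeps):
        G = {
            i: sum(
                w * (g[i][u] + sum(pr * G[j] for j, pr in p[i][u].items()))
                for u, w in mu[i].items()
            )
            for i in p
        }
    return G


def auxiliary_costs(p, g, G, i):
    """Costs G^l(i, (u, mu_{-i})) of the auxiliary game at state i, per (3.20)."""
    return {
        u: g[i][u] + sum(pr * G[j] for j, pr in p[i][u].items())
        for u in p[i]
    }


G_TOTAL = evaluate(P, G_COST, MU)
AUX_A = auxiliary_costs(P, G_COST, G_TOTAL, "A")
```

In this toy game the iteration converges because each sweep loses probability mass to the absorbing state. Averaging `AUX_A` under `MU["A"]` recovers `G_TOTAL["A"]`, which is the consistency between (3.19) and (3.20).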
3.4 Characterization of the NEOps Equilibrium Strategy

In this section we formulate the NEOps equilibrium strategy as the solution to a constrained optimization problem, in which the constraints themselves are dependent on the solution. The correlated equilibrium we characterize is optimal with respect to some common linear objective function, subject to individual "equilibrium" constraints, and so characterizes the strategy form of ships connected through NEOps. This is in contrast to Chapter 2, where communication is unavailable and a Nash equilibrium strategy is the best achievable.

The NEOps equilibrium strategy can be computed as the solution to a Markov decision problem [87], in which the search space is confined to policies that are correlated equilibria. Similarly to the player costs (3.14), we may take the collective costs as

\[
g(i, u) = I(i_x^k \in T \text{ for some } k). \tag{3.21}
\]

Here, $T = \cup_{l \in L} T_l$, where each $T_l$ represents the set of target locations that player $l$ is charged with defending. The collective objective is thus to minimize the expected number of hits to any ship.

The linear programming approach to Markov decision problems found in [87] allows us to compute the NEOps equilibrium strategy via an optimization. The idea is that we may formulate the objective and policy in terms of an occupation measure $\pi$, which itself represents the decision variables in an optimization problem. We give an intuitive outline of the procedure below; the reader may refer to [87], [115] and [94] for a more rigorous treatment.

Define $\pi(i, u)$ to be the expected number of times state $i$ is reached and joint CM $u$ is deployed. Then clearly the expected total cost incurred during an attack satisfies

\[
E[\text{total cost}] = \sum_{i \in S_0} \sum_{u \in U(i)} g(i, u)\, \pi(i, u). \tag{3.22}
\]

Moreover, due to (3.21), the expected total cost is equal to the expected number of missile hits.
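To make the occupation measure concrete, the sketch below accumulates $\pi(i, u)$ for a fixed joint strategy by propagating the state distribution forward in time, and then evaluates the total cost (3.22). The two-state game, the uniform initial distribution over its transient states, and the names `occupation_measure`, `P`, `G_COST`, `MU` are assumptions for illustration; mass missing from a transition row is assumed to be absorbed.

```python
# Hypothetical two-state transient game; unlisted mass goes to Delta.
P = {
    "A": {"jam": {"A": 0.5, "B": 0.25}, "idle": {"A": 0.5, "B": 0.5}},
    "B": {"jam": {}, "idle": {}},
}
G_COST = {
    "A": {"jam": 0.0, "idle": 0.0},
    "B": {"jam": 0.2, "idle": 1.0},
}
MU = {"A": {"jam": 0.5, "idle": 0.5}, "B": {"jam": 1.0}}
INIT = {"A": 0.5, "B": 0.5}  # uniform initial distribution (assumed)


def occupation_measure(p, mu, init, steps=200):
    """Expected number of visits to each (state, joint action) pair."""
    pi = {(i, u): 0.0 for i in mu for u in mu[i]}
    dist = dict(init)
    for _ in range(steps):
        nxt = {i: 0.0 for i in mu}
        for i, mass in dist.items():
            for u, w in mu[i].items():
                pi[(i, u)] += mass * w           # visit (i, u) now
                for j, pr in p[i][u].items():    # push mass forward
                    nxt[j] += mass * w * pr
        dist = nxt
    return pi


PI = occupation_measure(P, MU, INIT)
TOTAL_COST = sum(G_COST[i][u] * x for (i, u), x in PI.items())
```

The horizon of 200 steps is a truncation that is harmless here because the surviving mass halves at every step; a general implementation would instead solve the linear balance equations directly.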
Assume that the initial distribution of the missile threat is uniform over $S_0$ (the initial distribution, it turns out, is largely irrelevant to the optimal policy). It can be shown (Chapter 8 of [87]), due to the nature of $\Delta$ as an absorbing state, that the set of occupation measures $\pi$ is exactly the set of $\pi$ satisfying

\[
\sum_{u \in U(i)} \pi(i, u) = \sum_{j \in S_0} \sum_{u \in U(j)} p(j, i; u)\, \pi(j, u) + \frac{1}{|S_0|}, \qquad \pi(i, u) \ge 0, \quad \text{for all } i, u.
\]

This can be viewed as a restriction to measures $\pi$ that satisfy the balance equations of the underlying Markov chain. Moreover, due to $\Delta$, we may implement any $\pi$ via a stationary policy defined as follows:

Definition 3.4.1 When the threat is in state $i$, choose joint CM $u$ with probability

\[
\mu_i^{(\pi)}(u) = \frac{\pi(i, u)}{\sum_{v \in U(i)} \pi(i, v)}. \tag{3.23}
\]

Determining the NEOps equilibrium strategy is thus equivalent to solving the following mathematical program:

\[
\begin{aligned}
\text{minimize} \quad & \sum_{i \in S_0} \sum_{u \in U(i)} g(i, u)\, \pi(i, u) \\
\text{subject to} \quad & (\forall i \in S_0,\ l \in L,\ u^l, d^l \in U^l(i)): \\
& \sum_{u \in U(i)} \pi(i, u) = \sum_{j \in S_0} \sum_{u \in U(j)} p(j, i; u)\, \pi(j, u) + \frac{1}{|S_0|}, \\
& \mu^{(\pi)} \text{ is a correlated equilibrium.}
\end{aligned}
\tag{3.24}
\]

We now formalize the last constraint of (3.24). In what follows we drop the superscript $(\pi)$ for notational simplicity.

Definition 3.4.2 A stationary randomized joint strategy $\mu$ is a stationary correlated equilibrium for a transient stochastic game if

\[
\sum_{u^{-l} \in U^{-l}(i)} \mu_i((u^l, u^{-l})) \Big[ G^l(i, ((u^l, u^{-l}), \mu_{-i})) - G^l(i, ((d^l, u^{-l}), \mu_{-i})) \Big] \le 0, \tag{3.25}
\]

for all $i \in S_0$, $l \in L$, $u^l, d^l \in U^l(i)$, where the $G^l(i, \cdot)$ are defined recursively through (3.19) and (3.20). (Recall that the notation $u^{-l}$ represents the actions of all players but $l$.)

The interpretation is that a correlated equilibrium is a probability distribution over joint actions which has the property that unilateral deviations are not profitable. (It is widely accepted that it is sufficient to consider one-shot deviations as above for stochastic games [109].)
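The inequality (3.25) can be checked mechanically for a single auxiliary game once its costs are known. The sketch below does this for a static two-player game in cost form; the function name, the data layout, and the numerical example (a "chicken"-type game with payoffs negated into costs) are illustrative assumptions and are not taken from the missile deflection model.

```python
def is_correlated_equilibrium(mu, cost, actions, tol=1e-12):
    """Check the deviation inequalities (3.25) for one game in cost form.

    mu      : dict mapping joint-action tuples to probabilities
    cost    : cost[l][u] is the cost to player l of joint action u
    actions : actions[l] is the list of actions available to player l
    """
    for l, acts in enumerate(actions):
        for u_l in acts:          # recommended action
            for d_l in acts:      # candidate deviation
                gain = 0.0
                for u, prob in mu.items():
                    if u[l] != u_l:
                        continue
                    deviated = u[:l] + (d_l,) + u[l + 1:]
                    gain += prob * (cost[l][u] - cost[l][deviated])
                if gain > tol:    # deviating would lower expected cost
                    return False
    return True


# Hypothetical example: costs are the negated payoffs of a chicken game.
ACTIONS = [["C", "D"], ["C", "D"]]
COST = [
    {("C", "C"): -6, ("C", "D"): -2, ("D", "C"): -7, ("D", "D"): 0},
    {("C", "C"): -6, ("C", "D"): -7, ("D", "C"): -2, ("D", "D"): 0},
]
MU_CE = {("C", "C"): 1 / 3, ("C", "D"): 1 / 3, ("D", "C"): 1 / 3}
MU_BAD = {("C", "C"): 1.0}
```

Here `MU_CE` satisfies every inequality, while `MU_BAD` fails because a player recommended `"C"` would lower its cost by switching to `"D"`. In the stochastic game, `cost` would be replaced state by state with the auxiliary costs $G^l(i, (u, \mu_{-i}))$ of (3.20).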
This definition is a straightforward interpretation based on both (3.20) and (3.1). If, at each decision point and after communication with others, player $l$ decides to take CM action $u^l$, then it must be because no other CM action $d^l$ would yield a lower probability of hit to $T_l$. Implicit in this definition is that the NEOps equilibrium strategy $\mu$ constitutes the common prior probability in (3.1) and is thus common knowledge.

Substituting (3.23) into (3.25), and multiplying through by the nonnegative normalizing sum $\sum_{v \in U(i)} \pi(i, v)$, yields the equivalent expression

\[
\sum_{u^{-l} \in U^{-l}(i)} \pi(i, (u^l, u^{-l})) \Big[ G^l(i, ((u^l, u^{-l}), \mu_{-i})) - G^l(i, ((d^l, u^{-l}), \mu_{-i})) \Big] \le 0. \tag{3.26}
\]

The NEOps equilibrium strategy is therefore given by the solution to the following mathematical program:

\[
\begin{aligned}
\text{minimize} \quad & \sum_{i \in S_0} \sum_{u \in U(i)} g(i, u)\, \pi(i, u) \\
\text{subject to} \quad & (\forall i \in S_0,\ l \in L,\ u^l, d^l \in U^l(i)): \\
& \sum_{u \in U(i)} \pi(i, u) = \sum_{j \in S_0} \sum_{u \in U(j)} p(j, i; u)\, \pi(j, u) + \frac{1}{|S_0|}, \\
& \sum_{u^{-l} \in U^{-l}(i)} \pi(i, (u^l, u^{-l})) \Big[ G^l(i, ((u^l, u^{-l}), \mu_{-i})) - G^l(i, ((d^l, u^{-l}), \mu_{-i})) \Big] \le 0, \\
& \pi(i, u) \ge 0 \quad \forall u \in U(i).
\end{aligned}
\tag{3.27}
\]

Using (3.20), (3.26) can be rewritten as

\[
\begin{aligned}
& \sum_{u^{-l} \in U^{-l}(i)} \pi(i, (u^l, u^{-l})) \Big[ g^l(i, (u^l, u^{-l})) - g^l(i, (d^l, u^{-l})) \Big] \le \\
& \qquad \sum_{u^{-l} \in U^{-l}(i)} \pi(i, (u^l, u^{-l})) \sum_{j \in S_0} \Big[ p(i, j; (d^l, u^{-l})) - p(i, j; (u^l, u^{-l})) \Big] G^l(j, \mu).
\end{aligned}
\tag{3.28}
\]

(3.28) clearly depicts the nature of the stationary correlated equilibrium. The left hand side is the immediate cost of following the correlated equilibrium minus the immediate cost of deviating, that is, the immediate saving offered by the deviation. The right hand side is the expected future cost after deviating minus the expected future cost after following, that is, the future loss caused by the deviation. The inequality can be satisfied in three ways. If both sides are positive, then the proposed deviation is immediately profitable, but the future losses are too high to justify it. If the left hand side is negative and the right hand side positive, then the proposed deviation is profitable neither in the present nor the future. Finally, if both sides are negative, then the proposed deviation is profitable in the future, but the present loss is too high to justify it.

We conclude this section by establishing the feasibility of the NEOps optimization problem (3.27). This requires the following result:

Lemma 3.4.1 A stationary Nash equilibrium is a stationary correlated equilibrium.

Proof Joint strategy $\mu$ is a stationary Nash equilibrium if and only if each $\mu_i$ is a Nash equilibrium for the auxiliary stage game associated with state $i$. If $\mu_i$ is a Nash equilibrium, it is also a correlated equilibrium for the auxiliary stage game $i$ [89]. This is precisely the definition of a stationary correlated equilibrium, so $\mu$ is also a stationary correlated equilibrium.

Lemma 3.4.1 immediately establishes the existence of stationary correlated equilibria. Since stationary Nash equilibria are known to exist for stochastic games of the type described here [101], stationary correlated equilibria must also exist. Thus, there is at least one feasible point of (3.27), so the problem is well-defined.

The global objective, defined through costs $g(i, u)$, defines the optimality criteria for the NEOps equilibrium strategy. In the missile deflection context, we have defined a strategy to be optimal if all ships are defended equally. However, optimality can be defined in any number of ways. Some examples (see [106], with the roles of cost and reward reversed) are given below.

• Utilitarian: $g(i, u) = \sum_{l \in L} g^l(i, u)$.
• Republican: $g(i, u) = \min_{l \in L} g^l(i, u)$.
• Egalitarian: $g(i, u) = \max_{l \in L} g^l(i, u)$.

In this chapter, the global objective (3.21) can be regarded as either utilitarian or egalitarian. Differing player priorities, corresponding to different ship classes for example, can be incorporated into the definitions above by introducing appropriate weighting parameters.
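The three aggregation rules can be written down directly. The sketch below is illustrative only: the function names and the optional weights are assumptions, and the example uses the hit indicators of (3.14) for three ships in a state where exactly one ship is hit.

```python
def utilitarian(costs, weights=None):
    """Weighted sum of player costs (unit weights by default)."""
    if weights is None:
        weights = [1.0] * len(costs)
    return sum(w * c for w, c in zip(weights, costs))


def republican(costs):
    """Cost of the best-off player."""
    return min(costs)


def egalitarian(costs):
    """Cost of the worst-off player."""
    return max(costs)


# Hit indicators g^l(i, u) for three ships when only the second is hit.
HITS = [0, 1, 0]
```

For this single-hit state the utilitarian and egalitarian values agree (both equal 1), while the republican value is 0. Weighting the second ship more heavily, for example `utilitarian(HITS, [1.0, 3.0, 1.0])`, is one way to encode differing ship priorities.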
3.4.1 Correlated Equilibrium as the Solution to a Bilinear Program

To solve (3.27), we must reformulate the problem to make the constraints independent of the solution. To this end, we introduce dummy decision variables $J^l(i)$, defined as the expected state-action costs satisfying (3.19). The optimization problem (3.27) can then be rewritten as

\[
\begin{aligned}
\text{minimize} \quad & \sum_{i \in S_0} \sum_{u \in U(i)} g(i, u)\, \pi(i, u) \\
\text{subject to} \quad & (\forall i \in S_0,\ l \in L,\ u^l, d^l \in U^l(i)): \\
& \sum_{u \in U(i)} \pi(i, u) = \sum_{j \in S_0} \sum_{u \in U(j)} p(j, i; u)\, \pi(j, u) + \frac{1}{|S_0|}, \\
& \sum_{z \in U(i)} \pi(i, z)\, J^l(i) = \sum_{u \in U(i)} g^l(i, u)\, \pi(i, u) + \sum_{j \in S_0} \sum_{u \in U(i)} p(i, j; u)\, \pi(i, u)\, J^l(j), \\
& \sum_{u^{-l} \in U^{-l}(i)} \pi(i, (u^l, u^{-l})) \Big[ g^l(i, (u^l, u^{-l})) - g^l(i, (d^l, u^{-l})) \Big] \le \\
& \qquad \sum_{u^{-l} \in U^{-l}(i)} \pi(i, (u^l, u^{-l})) \sum_{j \in S_0} \Big[ p(i, j; (d^l, u^{-l})) - p(i, j; (u^l, u^{-l})) \Big] J^l(j), \\
& \pi(i, u) \ge 0 \quad \forall u \in U(i).
\end{aligned}
\tag{3.29}
\]

(3.29) falls into the general class of quadratic programs, which have been studied in various forms. In fact, (3.29) has a more specific structure: it is an optimization problem with a linear objective and bilinear constraints.

Standard nonlinear solution techniques for (3.29) involve starting at an initial feasible point and searching for improved feasible solutions. This initial feasible point may be taken, in some cases, to be a Nash equilibrium. Indeed, it has been observed that any Nash equilibrium of a static game lies on the boundary of the space of correlated equilibria [112]. This result can be straightforwardly extended to stochastic games, as in the following proposition.

Proposition 3.4.1 In any finite stochastic game, the stationary Nash equilibria lie on the boundary of the space of stationary correlated equilibria.

Proof A stationary Nash equilibrium consists of a Nash equilibrium for each auxiliary state game, that is, for the set of static games with player costs given by $G^l(i, (u, \mu_{-i}))$.
According to Proposition 1 of [112], a Nash equilibrium either assigns zero probability to one or more joint actions, or renders every player indifferent among strategies. This implies that, for each $i \in S$, either $\mu_i(u) = 0$ for some $u$, or (3.25) is satisfied with equality. Equivalently, for each $i$, one of the two inequality constraints in (3.29) must be satisfied with equality. Since the constraints of (3.29) describe the space of all correlated equilibria, with equality corresponding to boundaries, every stationary Nash equilibrium lies on the boundary of the space.

If one is able to compute an equilibrium in the absence of communication, one may use it as a starting point for computing an improved equilibrium with communication. Of course, a Nash equilibrium may in general be much more difficult to obtain than a correlated equilibrium, but we point out its usefulness here, since it has been the subject of much study, even in the decentralized missile deflection game in Chapter 2, [111].

(3.29) can be written in a more compact form,

\[
\begin{aligned}
\text{minimize} \quad & g^T x \\
\text{subject to} \quad & Ax = a, \\
& x^T B_i y - x^T b_i = 0, \quad i = 1, \ldots, n_1, \\
& x^T C_j y - x^T c_j \ge 0, \quad j = 1, \ldots, n_2, \\
& x \ge 0.
\end{aligned}
\tag{3.30}
\]

Decision variables $\pi$ and $J^l$ have been replaced by $x$ and $y$, respectively. Because of the structure of the $B_i$, $y$ is determined uniquely by $x$. We also note that $n_1 = |S_0| \times |L|$, and $n_2 = |S_0| \times \sum_{l \in L} (|U^l|)^2$, so $n_2 > n_1$.

We remark here that the NEOps equilibrium strategy is Pareto optimal among the class of correlated equilibrium strategies, when the global objective is utilitarian. Since the global objective $g$ is (for any nonzero weighting) a sum of the individual objectives $g^l$, (3.30) corresponds to a scalarized version of the optimization problem:

\[
\begin{aligned}
\text{minimize} \quad & \big( (g^1)^T x, (g^2)^T x, \ldots, (g^L)^T x \big) \\
\text{subject to} \quad & Ax = a, \\
& x^T B_i y - x^T b_i = 0, \quad i = 1, \ldots, n_1, \\
& x^T C_j y - x^T c_j \ge 0, \quad j = 1, \ldots, n_2, \\
& x \ge 0.
\end{aligned}
\tag{3.31}
\]
It is well known that the solution to (3.30) is then a Pareto optimal solution to (3.31) [96].

In this chapter (3.30) is solved by standard nonlinear optimization (i.e. line search) methods. This approach requires an initial feasible point to work, which can be, as mentioned above, a Nash equilibrium. Computation of such a feasible point may be quite difficult, and so one might consider alternative approaches. Using the penalty method [96] to penalize solutions in which costs do not reflect the current strategy, and introducing an additional decision variable $z$ such that solutions are within $z$ of equilibrium, (3.30) may be rewritten as

\[
\begin{aligned}
\text{minimize} \quad & g^T x + \sum_{i=1}^{n_1} \rho_1 (x^T B_i y - x^T b_i)^2 + \rho_2 z \\
\text{subject to} \quad & Ax = a, \\
& x^T C_j y - x^T c_j \ge -z, \quad j = 1, \ldots, n_2, \\
& z \ge 0, \\
& x \ge 0.
\end{aligned}
\tag{3.32}
\]

For sufficiently large $\rho_1$ and $\rho_2$, (3.30) and (3.32) should yield equivalent solutions. One potential solution method is to apply a nearly block coordinate descent algorithm [93] to (3.32). In this approach, $(x, z)$ and $(y, z)$ are alternately held fixed, and (3.32) is solved for the remaining decision variable. Additionally, the parameters $\rho_1$ and $\rho_2$ are slowly increased with each iteration. There are two main advantages to such an approach:

1. When $\rho_1 = \rho_2 = 0$, the solution to (3.32) is simply the globally optimal strategy, which can be obtained via linear programming.
2. Each iteration corresponds to an optimization problem over a convex feasible region, so standard convex optimization methods [96] apply.

The solution algorithm therefore begins at a global optimum, and terminates when a solution is reached that is satisfactorily close to equilibrium. One must be careful with such an approach, however. If $\rho_1$ and $\rho_2$ are increased too quickly, the method may converge prematurely. It is also unclear whether a nondecreasing sequence of values for $\rho_1$ and $\rho_2$ can be found for every problem.
That is, the effectiveness of such an approach rests on one's ability to increase $\rho_1$ and $\rho_2$ at each iteration.

3.5 Information Theoretic Metrics in NEOps

One of the basic principles of Network Enabled Operations is the ability to share information, which in the present model includes the capability of coordinating actions. In this section the specifics of coordinated execution of a correlated equilibrium, the NEOps equilibrium strategy, are investigated in detail. Section 3.5.1 investigates the amount of bandwidth required for such coordination, while Section 3.5.2 discusses associated network architectures.

3.5.1 Optimizing Information Flow in NEOps

In the last section, the NEOps equilibrium strategy was characterized as an optimal correlated equilibrium, given by the solution to a nonlinear program, (3.29). In this section, we quantify the amount of communication involved in the implementation of such a strategy by considering the mutual information between the strategies of the various players.

Measuring the amount of necessary communication is important for implementation, since the amount of communication bandwidth required versus the amount available determines the feasibility of the NEOps solution. We must make a distinction between types of communication here. Specifically, the focus here is on online communication, i.e. communication necessary for the implementation of a defensive strategy, as opposed to offline communication, which would be used to share utilities and jointly compute the strategy in the first place. Also disregarded is communication that would send information that could have been perfectly anticipated by the receiver. In other words, we seek the absolute minimum communication requirements for the implementation of a NEOps equilibrium strategy.
In the context of games, mutual information measures the reduction in uncertainty about some players' decisions given knowledge of others. Its definition, for a bivariate probability distribution (see [99]), is given by

\[
I(X; Y) = \sum_{x, y} \mu(x, y) \log_2 \frac{\mu(x, y)}{\mu_X(x)\, \mu_Y(y)}, \tag{3.33}
\]

where random variables $X$, $Y$ have joint distribution $\mu$ and marginal distributions $\mu_X$ and $\mu_Y$. Such random variables are taken here as representing the decisions of independent players.

Shannon's mutual information measure, above, is a "distance" measure; specifically, it is the relative entropy between the joint strategy probability distribution and the product of the marginal distributions. The marginal distributions represent an approximation to the NEOps equilibrium strategy in which each player acts completely independently, without any coordination required. The mutual information "distance" is then representative of how much coordination is required to get from this independently executable (non-equilibrium) strategy to the NEOps equilibrium strategy itself.

To illustrate the connection between mutual information and the communication necessary for implementing the NEOps equilibrium strategy, consider a three player (repeated) game, with Players One, Two and Three making decisions given by random variables $X$, $Y$, and $Z$ with corresponding joint distribution $f$ and marginals $f_X$, $f_Y$, and $f_Z$. If play is according to a Nash equilibrium (in either pure or mixed strategies), then $f$ is a product distribution and clearly $I(X; Y, Z) = 0$, $I(Y; Z) = 0$, etc. That is, the mutual information between players is zero exactly when communication between players is not required, i.e. when they play a Nash equilibrium. Similarly, when the strategy being played is such that each player's action is determined completely by the other two actions, the mutual information is maximized (e.g. anti-coordination games).
The requirement for coordination increases with the mutual information in the system. Although there are many quantities of interest, we focus on the mutual information between each player and their collective opponents. Let $Z_i^l$ be the random variable representing the action of Player $l$ in state $i$, and similarly let $Z_i^{-l}$ represent the actions of all players but $l$. These quantities have marginal distributions

\[
\mu_i^l(\cdot) = \sum_{u^{-l} \in U^{-l}(i)} \mu_i(\cdot, u^{-l}), \tag{3.34}
\]

for $Z_i^l$, and

\[
\mu_i^{-l}(\cdot) = \sum_{u^l \in U^l(i)} \mu_i(u^l, \cdot), \tag{3.35}
\]

for $Z_i^{-l}$. The quantities of interest may now be defined as

\[
I(Z_i^l; Z_i^{-l}) = \sum_{u \in U(i)} \mu_i(u) \log_2 \frac{\mu_i(u)}{\mu_i^l(u^l)\, \mu_i^{-l}(u^{-l})}. \tag{3.36}
\]

For convenience, we also define the maximal mutual information by

\[
I(Z_i) = \max_{l \in L} I(Z_i^l; Z_i^{-l}). \tag{3.37}
\]

The mutual information defined in (3.36) is closely linked to the amount of communication required by Player $l$ to implement the NEOps equilibrium strategy $\mu$ in state $i$. Suppose, as an external observer, we are able to observe $Z_i^l$, the actions of Player $l$ in each state $i$. We may think of this player as a noisy receiver, and the observations of his actions as received signals about the values of $Z_i^{-l}$, since the player chooses actions according to his information about $Z_i^{-l}$. The quantity $I(Z_i^l; Z_i^{-l})$ measures the reduction in uncertainty about $Z_i^{-l}$ due to our knowledge of $Z_i^l$, and thus indirectly measures the amount of communication between Player $l$ and the rest of the system.

There is a notion of conservation of information here; using Player $l$ as a noisy receiver, an external observer can directly measure a reduction in uncertainty of the joint actions of other players and thus the amount of information communicated through Player $l$. Since there is no other source of information creation or loss, the inference is that the same amount of information is communicated to Player $l$.

The mutual information measure can be used as a design consideration. Under the assumption that channel capacity is relatively uniform within a NEOps network, that is, each ship $l$ has channel capacity $C$, we have the design requirement $I(Z_i) \le C$ for all $i \in S$. A correlated equilibrium is executable if and only if sufficient communication capability exists between ships, that is, if its maximal mutual information does not exceed $C$. If this requirement is not satisfied, then a constrained correlated equilibrium must be implemented. In the worst case, with $C = 0$, the task group implements a Nash equilibrium.

More generally, given bounds on the mutual information, a constrained correlated equilibrium can be computed by including an additional constraint in (3.29). For example, given bounds $C^l$ on the mutual information of Player $l$, we may characterize the optimal correlated equilibrium as the solution to the following program.

\[
\begin{aligned}
\text{minimize} \quad & \sum_{i \in S_0} \sum_{u \in U(i)} g(i, u)\, \pi(i, u) - \sum_{l \in L} \sum_{i \in S_0} \beta(i, l)\, J^l(i) \\
\text{subject to} \quad & (\forall i \in S_0,\ l \in L,\ u^l, d^l \in U^l(i)): \\
& \sum_{u \in U(i)} \pi(i, u) = \sum_{j \in S_0} \sum_{u \in U(j)} p(j, i; u)\, \pi(j, u) + \frac{1}{|S_0|}, \\
& \sum_{z \in U(i)} \pi(i, z)\, J^l(i) = \sum_{u \in U(i)} g^l(i, u)\, \pi(i, u) + \sum_{j \in S_0} \sum_{u \in U(i)} p(i, j; u)\, \pi(i, u)\, J^l(j), \\
& \sum_{u^l \in U^l(i)} \sum_{u^{-l} \in U^{-l}(i)} \pi(i, (u^l, u^{-l})) \cdot \log_2 \frac{\pi(i, (u^l, u^{-l})) \sum_{z \in U(i)} \pi(i, z)}{\sum_{z \in U^{-l}(i)} \pi(i, (u^l, z)) \sum_{z \in U^l(i)} \pi(i, (z, u^{-l}))} \le C^l \sum_{z \in U(i)} \pi(i, z), \\
& \sum_{u^{-l} \in U^{-l}(i)} \pi(i, (u^l, u^{-l})) \Big[ g^l(i, (u^l, u^{-l})) - g^l(i, (d^l, u^{-l})) \Big] \le \\
& \qquad \sum_{u^{-l} \in U^{-l}(i)} \pi(i, (u^l, u^{-l})) \sum_{j \in S_0} \Big[ p(i, j; (d^l, u^{-l})) - p(i, j; (u^l, u^{-l})) \Big] J^l(j), \\
& \pi(i, u) \ge 0 \quad \forall u \in U(i).
\end{aligned}
\tag{3.38}
\]

(3.38) is quite complex and we do not discuss it further here. Rather, we remark that when each $C^l = 0$, the program represents a characterization of a stationary Nash equilibrium as the solution to an optimization problem, which is a rarity compared to the more common fixed point characterization. See [101] for examples of both. (3.38) is solved for a simple two player NEOps missile deflection game in the next section.
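The quantities (3.36) and (3.37) are straightforward to compute for a given joint distribution in one state. The following is a minimal sketch; the function names and the two example distributions are assumptions for illustration, with joint actions represented as tuples indexed by player.

```python
from math import log2


def mutual_information(mu, l):
    """I(Z^l; Z^{-l}) in bits, as in (3.36), for one state's distribution mu.

    mu maps joint-action tuples to probabilities; l is a player index.
    """
    marg_l, marg_rest = {}, {}
    for u, prob in mu.items():
        rest = u[:l] + u[l + 1:]
        marg_l[u[l]] = marg_l.get(u[l], 0.0) + prob        # (3.34)
        marg_rest[rest] = marg_rest.get(rest, 0.0) + prob  # (3.35)
    total = 0.0
    for u, prob in mu.items():
        if prob > 0.0:  # terms with zero probability contribute nothing
            rest = u[:l] + u[l + 1:]
            total += prob * log2(prob / (marg_l[u[l]] * marg_rest[rest]))
    return total


def max_mutual_information(mu, n_players):
    """Maximal mutual information over players, as in (3.37)."""
    return max(mutual_information(mu, l) for l in range(n_players))


# Independent play: a product distribution needs no coordination.
MU_PRODUCT = {(a, b): 0.25 for a in "xy" for b in "xy"}
# Anti-coordination: each action is fully determined by the other.
MU_ANTI = {("x", "y"): 0.5, ("y", "x"): 0.5}
```

`MU_PRODUCT` gives zero mutual information, matching the Nash equilibrium case in the text, while `MU_ANTI` gives one bit per decision, the maximum for two binary actions.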
3.5.2 Implementation Architectures for NEOps Equilibrium Strategies

Up to this point, we have deliberately avoided any specific scheme for executing a stationary correlated equilibrium strategy. Indeed, the main objective of this chapter is to characterize these strategies without being bound to a specific implementation. This is due to our interpretation of a correlated equilibrium, not as a prescription for optimal play, but as a description of how players will eventually make decisions under NEOps, an interpretation inspired by [89], where randomization is considered to be an artifact of the uncertainty about actions held by an outside observer.

In this section, this philosophy is temporarily abandoned, as we explore the question of how one might implement a given NEOps equilibrium strategy by explicitly simulating the randomness inherent in the strategy. We include this section because all currently known implementation methods for correlated equilibrium, as surveyed here, are concerned with implementing a prescribed correlated equilibrium, not with describing situations in which correlated equilibria spontaneously arise. For simplicity, we restrict our attention to static correlated equilibrium, with the understanding that a stationary (dynamic) correlated equilibrium can be implemented by considering an appropriate series of static (auxiliary) games.

Two approaches to implementing a correlated equilibrium can be found in the literature: those which require a central command and control station (Figure 3.2(a)-(c)), and those which do not (Figure 3.2(d)). Both will be discussed below.

The first approach includes the original, canonical approach as outlined in [88] and depicted in Figure 3.2(a). Here, a central control station randomly draws joint actions from a known correlated equilibrium, and sends a private message to each player recommending only their component of the joint action.
The players, aware of the distribution, find it rational to follow the recommendations. This method requires $L$ separate channels, one for each player, and appears to be the most efficient approach, although it resembles a centralized implementation so closely that it may be of limited use in practice.

Figure 3.2: Communication architectures for implementing a NEOps equilibrium strategy. The aircraft represents a central command and control station. Private, public, or a combination of channels can be used for implementation, although private channels are required for optimal performance (a),(b),(d). [Panels (a)-(d); legend: Private Channel, Public Broadcast.]

An interesting alternative but equivalent approach is to implement an acceptance-rejection method, in which players send proposed actions to the control station for authorization. Suppose that $L$ players, or platforms, plus a central control station, wish to implement a known correlated equilibrium, defined by a joint probability distribution $\mu$, where the probability of jointly selecting action $(y_1, y_2, \ldots, y_L)$ is $\mu(y_1, y_2, \ldots, y_L)$. Then the following acceptance-rejection algorithm (see, for example, [116]) generates actions according to distribution $\mu$.

Algorithm 3.5.1
• Each player $l$ randomly proposes an action $Y_l$ according to the marginal distribution $\mu^l$.
• Central control randomly accepts the joint action with probability
$$
\frac{\mu(Y_1, Y_2, \ldots, Y_L)}{c\,\mu^1(Y_1)\,\mu^2(Y_2)\cdots\mu^L(Y_L)},
\qquad\text{where}\qquad
c = \max_{y_1,\ldots,y_L} \left( \frac{\mu(y_1, y_2, \ldots, y_L)}{\mu^1(y_1)\,\mu^2(y_2)\cdots\mu^L(y_L)} \right).
$$
• If the proposed joint action is accepted, the players implement it. Otherwise, they propose a new action.
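The acceptance-rejection scheme of Algorithm 3.5.1 can be sketched in a few lines. This is an illustrative simulation under stated assumptions, not an implementation from the thesis: the joint distribution is stored as a dictionary over action tuples, a single random number generator stands in for both the players and the control station, and all function names are hypothetical.

```python
import random

def marginals(mu, n_actions):
    """Marginal distribution of each player from the joint distribution `mu`
    (a dict mapping joint action tuples to probabilities)."""
    margs = [[0.0] * n for n in n_actions]
    for y, p in mu.items():
        for l, a in enumerate(y):
            margs[l][a] += p
    return margs

def product_prob(margs, y):
    """Probability of joint action y when players draw independently."""
    q = 1.0
    for l, a in enumerate(y):
        q *= margs[l][a]
    return q

def acceptance_constant(mu, margs):
    """The constant c: largest ratio of joint to product-of-marginals probability."""
    return max(p / product_prob(margs, y) for y, p in mu.items() if p > 0.0)

def draw_joint_action(mu, margs, c, rng):
    """Run proposal rounds until central control accepts.
    Returns the accepted joint action and the number of proposals used."""
    proposals = 0
    while True:
        proposals += 1
        # Each player proposes independently from its own marginal.
        y = tuple(rng.choices(range(len(m)), weights=m)[0] for m in margs)
        # Central control accepts with probability mu(y) / (c * product of marginals).
        if rng.random() < mu.get(y, 0.0) / (c * product_prob(margs, y)):
            return y, proposals

# Hypothetical two-player example: joint action (1, 1) is never played.
mu = {(0, 0): 1/3, (0, 1): 1/3, (1, 0): 1/3}
margs = marginals(mu, [2, 2])
c = acceptance_constant(mu, margs)
action, rounds = draw_joint_action(mu, margs, c, random.Random(1))
```

In this example the overall acceptance probability per round is $1/c$, so the number of rounds is geometrically distributed with mean $c$.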
It is well known that the average number of proposals needed in this scheme is $c$, which gives a rough measure of the amount of communication required to implement the joint strategy. The use of marginal distributions is justified since the product of marginal distributions is closest, in terms of Kullback-Leibler distance, to the true distribution $\mu$ while still being independently generatable by the players (Lemma 13.8.1 of [99]). Since the mutual information is defined as the Kullback-Leibler distance between $\mu$ and the product of its marginals, the number of proposals increases with the mutual information between players, reinforcing the idea of Section 3.5.1 that mutual information is a good indicator of the communication overhead required to implement a correlated equilibrium.

The acceptance-rejection method requires two-way communication, and uses $L + 1$ channels, since the control station can simply broadcast its accept/reject message to all players (see Figure 3.2(b)). This is more bandwidth-intensive than Aumann’s original scheme, but shares the computational burden throughout the system.

Interestingly, there is another central control station-dependent approach corresponding to the two-way private/public communication scheme of Figure 3.2(b). In [110], it was shown how such an approach can be used to implement any correlated equilibrium. First, each player sends a private message to the control station, then the control station broadcasts a public message that is a deterministic function of these private messages. Each player then selects an action as a function of the private message sent and the public message received. Again, $L + 1$ separate channels are needed for implementation. (However, only one round of communication is needed.)

In many cases, it is desirable to restrict one’s attention to efficiently implementable correlated equilibria, for example when communication is highly restricted.
The public broadcast implementations above and Aumann’s original one-way private message implementation together suggest the efficiently implementable class of public correlated equilibria [104]. In a public correlated equilibrium, players observe a commonly broadcast, as opposed to a private, random variable, and choose a random action based on the outcome. However, the players themselves do not communicate with the control station (see Figure 3.2(c)). It is well known [104] that, using this approach, players can implement any correlated equilibrium whose payoffs lie within the convex hull of Nash equilibrium payoffs, but none whose payoffs lie outside it. This is useful for implementing a joint strategy that is fair between all players, and is attractive when the optimal correlated equilibrium does not yield a significant global improvement over the set of Nash equilibria. The major benefit of using public correlated equilibria is that communication is only in one direction, and only a single broadcast channel is required, instead of $L + 1$ channels, as above.

We illustrate the implementation of a public correlated equilibrium for the case of two players, noting that the concept carries over to more players at the expense of more complex notation. Consider a two player game in which Player One’s action is given by random variable $X$, and Player Two’s action is given by $Y$. Let $P$ be the probability distribution matrix of joint actions $(X, Y)$, and assume $P$ is a correlated equilibrium distribution. Suppose $P$ is decomposable as:

$$ P = P_x^T D P_y, \tag{3.39} $$

where $P_x \mathbf{1} = \mathbf{1}$, $P_y \mathbf{1} = \mathbf{1}$ (i.e. $\mathbf{1}$, the column vector with components all equal to one, is an eigenvector with eigenvalue 1) and $D$ is a diagonal probability distribution matrix. It is straightforward to show that

$$ P = P_x^T D P_y = \sum_{b=1}^{N_b} D(b,b)\, P_x^T(b,\cdot)\, P_y(b,\cdot), \tag{3.40} $$

where $N_b$ is the number of signals.
That is, $P$ in this case is decomposable into a convex combination of product distributions. In this case the correlated equilibrium can be implemented as follows. First, a control station broadcasts a random variable $B$ such that

$$ \mathbb{P}(B = b) = D(b,b). \tag{3.41} $$

Players One and Two then choose actions randomly according to the probability distribution given by row $B$ of $P_x$ and $P_y$, respectively. (We say that actions $X$ and $Y$ are conditionally independent, given $B$.)

The set of such decomposable correlated equilibria is nonempty; any convex combination of Nash equilibria qualifies. That is, if the rows of $P_x$ and $P_y$ are Nash equilibrium distributions, then $P$ is a correlated equilibrium for any diagonal probability distribution matrix $D$. It is interesting to note that certain convex combinations of non-Nash equilibria may also yield a correlated equilibrium, but in this case it is not clear that one can induce the players to play such an equilibrium in the same manner, since there will be a temptation for each player to deviate once the public signal is observed.

Restricting our attention to convex combinations of Nash equilibria, we can compute the optimal public correlated equilibrium for a given game. For example, the optimal egalitarian, cost minimizing public correlated equilibrium can be computed as follows:

1. Compute all $n$ Nash equilibria and set the resulting strategies as the rows of $P_x$ and $P_y$.
2. Define cost vectors $C^1$ and $C^2$ corresponding to each Nash equilibrium. If $U^1$ and $U^2$ are the cost matrices of each player, $C^i$ can be obtained as the diagonal of $P_x U^i P_y^T$. Define $C = \max(C^1, C^2)$, with the maximum taken componentwise.
3. Solve the optimization problem

$$ \text{minimize } C^T x \quad \text{subject to: } \mathbf{1}^T x = 1,\ x \ge 0. \tag{3.42} $$

4. Set $D(i,i) = x(i)$ for $i = 1, 2, \ldots, n$.

Here $\mathbf{1}$ is the column vector with all components equal to 1. This is a particularly simple convex optimization problem, and the resulting equilibrium is easily implementable, as noted above.
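The four steps above, together with the broadcast implementation of (3.39)-(3.41), can be sketched as follows. This is a minimal illustration under stated assumptions: the Nash equilibria are supplied by the caller rather than computed, the cost matrices are hypothetical, and problem (3.42) is solved by inspection, using the fact that a linear objective over the probability simplex attains its minimum at a vertex. The function names are not from the thesis.

```python
import random

def expected_cost(px_row, U, py_row):
    """Expected cost under independent mixed strategies px_row and py_row."""
    return sum(px_row[a] * U[a][b] * py_row[b]
               for a in range(len(px_row)) for b in range(len(py_row)))

def egalitarian_public_ce(Px, Py, U1, U2):
    """Steps 2-4: per-equilibrium cost C = max(C1, C2), then minimize C^T x
    over the simplex by placing all weight on a smallest entry of C."""
    C = [max(expected_cost(px, U1, py), expected_cost(px, U2, py))
         for px, py in zip(Px, Py)]
    best = min(range(len(C)), key=C.__getitem__)
    x = [1.0 if b == best else 0.0 for b in range(len(C))]
    return x, C

def joint_distribution(Px, Py, d):
    """P = sum_b d[b] * Px[b]^T Py[b], as in (3.40); d is the diagonal of D."""
    P = [[0.0] * len(Py[0]) for _ in range(len(Px[0]))]
    for b, w in enumerate(d):
        for i, pi in enumerate(Px[b]):
            for j, pj in enumerate(Py[b]):
                P[i][j] += w * pi * pj
    return P

def play(Px, Py, d, rng):
    """Broadcast signal B with P(B = b) = d[b]; each player then samples
    independently from row B of its own strategy matrix."""
    b = rng.choices(range(len(d)), weights=d)[0]
    x = rng.choices(range(len(Px[b])), weights=Px[b])[0]
    y = rng.choices(range(len(Py[b])), weights=Py[b])[0]
    return b, x, y

# Hypothetical 2x2 cost game with two pure Nash equilibria, (0,0) and (1,1).
U1 = [[1, 5], [5, 3]]   # costs to Player One
U2 = [[2, 5], [5, 3]]   # costs to Player Two
Px = [[1.0, 0.0], [0.0, 1.0]]
Py = [[1.0, 0.0], [0.0, 1.0]]
x, C = egalitarian_public_ce(Px, Py, U1, U2)
signal, a1, a2 = play(Px, Py, x, random.Random(0))
```

Any other diagonal distribution, for example an even mix of the two equilibria, can be passed to `joint_distribution` and `play` in the same way.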
For the NEOps missile deflection game, an aircraft or command ship (e.g. AWACS aircraft, aircraft carrier) can be used as a control station to broadcast random values to each ship, facilitating a fair public correlated equilibrium.

In fact, a control station is not strictly necessary for implementation of a public correlated equilibrium. If players are equipped with a list of common random numbers, then, using a synchronized clock, they may simulate any common random variable themselves. In this case, the correlation between players is implicitly provided by the common observation of the time at which the game is to be played. For this reason, public correlated equilibria are often identified with sunspot equilibria. (Since we are dealing with a stochastic game, we could in fact be even more clever than this, by harnessing the randomness inherent in state transitions as public correlation signals! However, this would only yield suboptimal or sub-fair equilibria in all but a very special class of games.)

This observation of control station-independence leads us to consider the second approach to implementing a correlated equilibrium. This approach allows players to implement an equilibrium without the need for a control station, by communicating only amongst themselves (Figure 3.2(d)). The literature here is extensive, as many different algorithms have been proposed for implementing a correlated equilibrium without a control station, for example in [90, 92, 102, 105, 120, 121]. Such methods generally require extensive message passing, referred to as cheap talk, to ensure that players are honestly and reliably playing the correlated equilibrium. Players must transmit random variables to each other in such a way that no player has full information about the joint action to be played, and such that deviations from the protocol can be detected and punished.
Understandably, this message passing phase is often extensive, and we conclude that these methods are too inefficient for our purposes at present.

Regardless of the particular implementation, each correlated equilibrium requires a certain minimal amount of communication for implementation. The purpose of this chapter is to examine the trade-off between communication and performance, not to identify optimal communication schemes for the execution of a correlated equilibrium. Nonetheless, a survey of implementation methods allows us to grasp the richness of the class of candidate communication schemes.

3.6 Numerical Example of NEOps

In this section we derive and analyze NEOps equilibrium strategies for a multiple-ship, multiple-missile CM scenario, and highlight some of the fundamental aspects of the decentralized missile deflection problem. A NEOps equilibrium strategy is computed using standard nonlinear programming techniques, and the resulting communication requirements, as measured by the maximal mutual information quantity $I(Z_i)$ of (3.37), are computed and analyzed.

The missile deflection scenario is as follows. $M$ missiles initially threaten three ships. Each missile has two modes of operation: a target acquisition mode and an attack mode. In the target acquisition mode, the missile attempts to select one of the three ships for attack. In the attack mode, it homes in on this ship using radar. The ships are equipped with various electronic countermeasures, which can be deployed against only one missile at a time. In the target acquisition mode, these countermeasures are simple; they reduce the ship’s radar visibility to the selected missile. In the attack mode, the countermeasures are designed to confuse the missile’s targeting system, by providing false target information.

To decrease complexity, we model the scenario by an inner and an outer game.
In the inner game, we consider a single missile in attack mode, defended against by a subset of the three players. In the outer game, we consider the problem of selecting which missiles to engage. The equilibrium strategy outcomes of the inner game are used to define the costs in the outer game. This approach is similar to that of a multiple time-scale Markov decision process [98].

After describing the inner game, the NEOps equilibrium strategy for $M = 3$ missiles is derived and analyzed. To illustrate how the model can be adapted to high-dimensional problems with low computational complexity, we also compute a myopic NEOps equilibrium strategy for $M = 10$ missiles. The NEOps equilibrium strategy is compared to the optimal and Nash equilibrium strategies.

3.6.1 Inner Game (Single Missile in Attack Mode)

In this section, we use the modeling notation of Section 3.2. In the inner game, a single missile moves from left to right across an 11 × 11 battlespace $B$, towards one of eleven target positions at $T = \{(x, y) \in B : x = 11\}$. The initial target of the missile is at coordinates $T_1 = \{(11, 6)\}$ and belongs to Player One. The other two ships are at coordinates $T_2 = \{(11, 1)\}$ and $T_3 = \{(11, 11)\}$ and belong to Players Two and Three, respectively. The true target set is $T = T_1 \cup T_2 \cup T_3$. A specified subset of ships deploy countermeasures to defeat the missile’s targeting system.

The missile’s position and target evolve according to a Markov chain $\{X_t^m, Z_t^m,\ t = 1, 2, \ldots\}$ as in Section 3.2.1, where $X_t^m$ evolves on $B$ and $Z_t^m$ evolves on $T$. At each time index, the missile moves one unit in the positive $x$ direction, and moves one unit randomly in the $y$ direction with probability proportional to the angle induced between its current $(x, y)$-position and the $(x, y)$-position of its current target location. That is, $X_{t+1}^m$ depends on the angle between $X_t^m$ and $Z_t^m$, as shown in Figure 3.3 via (3.4).
A small amount of noise is added to simulate random deviations in the missile guidance system.

Figure 3.3: Physical missile transition probabilities as a function of target angle for the NEOps example scenario. [Figure labels: direction of target; angle $x$; probabilities $\cos(2x)$ and $1 - \cos(2x)$.]

The missile’s current target location evolves according to a complicated transition probability matrix, controlled by the current joint CM deployed. Player One may deploy CM L or R, intended to cause the target position to drift to the left or right. Players Two and Three may each deploy CM A or I, to increase their own attractiveness as a target, or to increase the general amount of radar interference, respectively. All players also have a null action N available. For each of the 27 resulting joint CM actions, the missile dynamics are outlined below.

Note that the dynamics are different depending on whether $Z_t^m \le (11, 2)$, $(11, 3) \le Z_t^m \le (11, 9)$, or $Z_t^m \ge (11, 10)$. For each case, if the joint CM fails, $Z_{t+1}^m$ will revert to $T_2$, $T_1$, or $T_3$, respectively. That is, the missile tends to stick with its nearest target.

When $Z_t^m \le (11, 2)$, three types of joint CM action have a high likelihood of success: (1.) Player Two plays A, Player One plays L, and/or Player Three does not play A. (2.) Player One plays L or R, and all others play N. (3.) One of Player Two or Three plays I and Player One does not shift the target in that player’s direction. The situation is similar for $Z_t^m \ge (11, 10)$.

When $(11, 3) \le Z_t^m \le (11, 9)$, joint CM is effective in two situations: (1.) One of Player Two or Three plays A, and Player One attempts to shift the target in that direction. (2.) One of Player Two or Three plays A while the other plays N, and Player One does not shift the target in that player’s direction. When CM is effective, the target $Z_{t+1}^m$ tends to shift in the desired direction by a random amount (the actual distribution is a modeling detail).
Otherwise $Z_{t+1}^m$ moves to the nearest available true target.

Each possible subset of participants defines a particular inner game, for which we compute a NEOps equilibrium strategy. Objectives are as in (3.14) and (3.21). The complex interdependencies in the controlled missile dynamics are chosen so that each inner game has at least one effective strategy which must be achieved through cooperation. On the other hand, since each participating player hopes to direct the missile to the spaces between ships while avoiding danger, cooperating while satisfying the equilibrium conditions is not trivial.

Table 3.1: Probability of hit for each player P1, P2, P3 in each inner game (attack mode) under the NEOps equilibrium strategy.

Participation   P1 Hit Prob.   P2 Hit Prob.   P3 Hit Prob.
P1,P2,P3        .0108          .0166          .0182
P1,P2           .0141          .0198          .0353
P1,P3           .0141          .0353          .0198
P2,P3           .0563          .0183          .0179
P1              .1553          .0292          .0292
P2              .0996          .0251          .0437
P3              .0996          .0437          .0251
None            .5543          .1383          .1383

The game is actually of the finite horizon type; it ends when the missile reaches the right edge of the state space or a target. This makes computation of the correlated equilibrium particularly simple because (3.29) can be solved by backwards induction. That is, instead of solving 1331 games simultaneously, they may be solved one at a time to compute the NEOps equilibrium; the costs in any game depend only on the policy in future games.

The NEOps equilibrium strategy is reasonable, although too complicated to describe in detail here. In general, players jointly select CM actions so as to drive the missile away from the flotilla, with each player particularly concerned about their own welfare. For example, when all players participate and $(X_t^m, Z_t^m) \in \{((x, 6), T_1),\ 1 \le x \le 11\}$ (the missile is heading directly for Player One), the players choose $u_t = (R, I, A)$ with probability one to shift the target to the area between Players One and Three.
This occurs for $x \ne 8$. For $x = 8$, the ships enforce a compromise in which $E[Z_{t+1}^m] = 6$, but with large variation. This is the best compromise, since at $x = 8$ all ships are highly vulnerable and must maximize their own safety. Note also that the communication peaks at this point (Figure 3.4). The NEOps strategy in other states follows similar logic; players succeed in coordinating, but at no time is there self sacrifice.

Achieving the NEOps equilibrium strategy requires communication, since players must often avoid joint actions that would cancel each other out, leaving at least one player vulnerable. The communication overhead, as reflected by the mutual information between player actions, is indicated in Figure 3.4.

Figure 3.4: Maximum mutual information for inner game with all players participating. [Eleven panels, Target=1 through Target=11, with the ship positions marked; axes: x pos., y pos., Info. (0 to 1).]

The NEOps equilibrium strategy was computed for eight inner games, corresponding to each possible subset of players. The probabilities of hit are shown in Table 3.1. These values assume a uniform distribution over initial states (i.e. $X_1^m = (1, y)$), and are used as parameters in the outer game below.

3.6.2 Outer Game (Selection from Three Missiles)

In the first outer game, $M = 3$ missiles threaten the three ships.
Each missile state is given by its mode (target acquisition, attack mode against Player $l$, terminal state). The three ships play a stochastic game over 125 states, corresponding to all possible combinations of states of the three missiles. In each state, each of the three ships must choose a missile to engage. The associated state-action costs and transition probabilities are determined by considering the effect of these decisions on each missile separately. The total state-action costs are then obtained by adding the cost contribution of each missile, while the transition probabilities are obtained by treating the missiles independently on the product space as in Section 3.2.2. The single-missile costs and transition probabilities differ depending on the missile mode and the set of players deploying CM against it. The values are outlined below for each case.

Missile in Target Acquisition Mode

The transition probabilities of a missile in target acquisition mode are given according to the following rules:
(1.) If no ship selects the missile, it enters attack mode against any player with equal probability 0.2, and otherwise returns to acquisition mode with probability 0.4.
(2.) If one ship selects the missile, that ship is targeted with probability 0.1, while the other two are targeted with probability 0.25 each. The missile terminates or returns to acquisition mode with equal probability 0.2.
(3.) If two ships select the missile, the termination probability increases to 0.3. Each of the participating ships is targeted with probability 0.25, while the non-participating ship is targeted with reduced probability 0.1.
(4.) If all three ships select the missile, each ship is targeted with probability 0.25, and the missile terminates with probability 0.25.
In each case, the immediate cost contribution is zero, since the missile cannot directly reach a ship in acquisition mode.
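The acquisition-mode rules above can be collected into a single transition function, which also makes it easy to check that each case defines a probability distribution. This is a sketch under one explicit assumption: the probability of remaining in acquisition mode is taken as the remainder after the stated attack and termination probabilities. That remainder matches the values stated for rules (1.), (2.) and (4.); for rule (3.) the text does not state it, so the value 0.1 below is inferred. The names are illustrative.

```python
SHIPS = (1, 2, 3)
MODES_PER_MISSILE = 5                   # acquisition, attack on ship 1/2/3, terminal
NUM_STATES = MODES_PER_MISSILE ** 3     # joint states of three missiles

def acquisition_transition(selectors):
    """Next-mode distribution for one missile in target acquisition mode.
    `selectors` is the set of ships deploying CM against this missile."""
    k = len(selectors)
    if k == 0:
        attack = {s: 0.2 for s in SHIPS}
        terminal = 0.0
    elif k == 1:
        attack = {s: (0.1 if s in selectors else 0.25) for s in SHIPS}
        terminal = 0.2
    elif k == 2:
        attack = {s: (0.25 if s in selectors else 0.1) for s in SHIPS}
        terminal = 0.3
    else:
        attack = {s: 0.25 for s in SHIPS}
        terminal = 0.25
    # Assumption: whatever mass is left returns the missile to acquisition mode.
    acquisition = round(1.0 - terminal - sum(attack.values()), 12)
    dist = {"acquisition": acquisition, "terminal": terminal}
    dist.update({f"attack_{s}": p for s, p in attack.items()})
    return dist

for sel in (set(), {1}, {1, 2}, {1, 2, 3}):
    print(sorted(sel), acquisition_transition(sel))
```

The constant `NUM_STATES` reproduces the 125 joint states quoted for the three-missile game, assuming five modes per missile.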
Missile in Attack Mode

When a missile enters attack mode, and a subset of players deploy CM against it, one of the inner games is played. By convention, the missile always attacks Player One in the inner game, so a relabeling of players may be needed. The ship being targeted becomes Player One, while among the other two ships, the lower numbered ship becomes Player Two, and the remaining ship becomes Player Three for the duration of the inner game. The outcome is then determined by the appropriate row in Table 3.1. Specifically, the cost contribution to each player is the probability of a missile hit to that player, and the probability of transitioning to the terminal state is the sum of these hit probabilities. That is, the players act to minimize the expected probability of hit from that missile. The remaining transition probability $q$ is divided as follows.
(1.) If a ship engages a missile alone, it will not be targeted again in the next stage. Instead, each of its sister ships will be targeted with probability $q/6$, and the target acquisition mode will be reached with probability $2q/3$.
(2.) If two ships engage a missile, they are each targeted again in the next stage with probability $q/12$, while the remaining ship is targeted with probability $q/6$, and the target acquisition mode is entered with probability $2q/3$.
(3.) If all three ships engage a missile, they are each targeted again in the next stage with probability $q/16$. The remaining mass $13q/16$ is the probability of re-entering the target acquisition mode.

Comparison of Strategies

We compute and compare the Nash equilibrium, NEOps equilibrium and optimal strategies for the outer game. The total expected number of missile hits for these strategies was found to be 1.10, 0.96, and 0.72, respectively. In other words, the NEOps equilibrium strategy offers a 37.5% relative improvement over the Nash equilibrium.
Comparing the strategies in detail reveals some fundamental differences. Most strikingly, when a missile $m$ threatens player $l$, it turns out to be optimal for $l$ to ignore $m$, while a player $k \ne l$ targets $m$ for CM. However, in the Nash and NEOps strategies, $l$ will always deploy CM against $m$. On the other hand, the NEOps strategy is more cooperative than the Nash, since ships assist each other much more often in the former. In addition, the NEOps strategy eliminates obvious miscoordination present in the Nash case, in which ships may occasionally select a missile that is already terminated.

Comparing the strategies line for line, we discovered the NEOps equilibrium strategy was much closer to the optimal strategy than the Nash. The degree of similarity, measured as the common area under two probability distributions, is 0.26 for the NEOps and optimal strategy, and only 0.09 for the Nash and optimal strategy. In fact, since the optimal strategy computed is a maximum entropy version (it randomizes between all alternatives when it is indifferent), versions of the optimal strategy surely exist with much higher degrees of similarity (to both the Nash and NEOps strategies). We also found that the NEOps strategy was closer in terms of Kolmogorov-Smirnov distance, and in terms of Kullback-Leibler distance on support restricted to the optimal strategy.

The average communication overhead for the NEOps equilibrium strategy is 0.3644 bits per state. Since outer game actions are made much less frequently than inner game ones, this overhead can be accommodated using a very small, but nonzero bandwidth.

Table 3.2: Amended cost contribution for a missile in target acquisition mode.

Participation   Cost to P1   Cost to P2   Cost to P3
P1,P2,P3        .012x        .012x        .012x
P1,P2           .02x         .02x         .005x
P1,P3           .02x         .005x        .02x
P2,P3           .005x        .02x         .02x
P1              0            .011x        .011x
P2              .011x        0            .011x
P3              .011x        .011x        0
None            .01x         .01x         .01x
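The three similarity measures used in the strategy comparison of Section 3.6.2 can be sketched for a single pair of discrete distributions. This is an illustrative sketch only: the text does not specify how the measures are aggregated over states, the direction of the Kullback-Leibler distance (reference distribution first, summed over its support) is an assumption, and the function names and example distributions are hypothetical.

```python
from math import log2, inf

def overlap(p, q):
    """Common area under two discrete distributions: sum of pointwise minima.
    Equals 1 for identical distributions and 0 for disjoint supports."""
    return sum(min(a, b) for a, b in zip(p, q))

def ks_distance(p, q):
    """Kolmogorov-Smirnov distance: largest gap between the two cumulative
    distributions, for outcomes listed in a common fixed order."""
    gap, cp, cq = 0.0, 0.0, 0.0
    for a, b in zip(p, q):
        cp += a
        cq += b
        gap = max(gap, abs(cp - cq))
    return gap

def kl_on_support(ref, other):
    """Kullback-Leibler distance (bits) from `ref` to `other`, summed only over
    outcomes where `ref` is positive. Assumed reading of the restricted support."""
    total = 0.0
    for a, b in zip(ref, other):
        if a > 0.0:
            if b == 0.0:
                return inf      # `other` excludes an outcome that `ref` uses
            total += a * log2(a / b)
    return total

# Hypothetical action distributions in one state.
reference = [0.5, 0.5, 0.0]
candidate = [0.25, 0.25, 0.5]
print(overlap(reference, candidate),
      ks_distance(reference, candidate),
      kl_on_support(reference, candidate))
```

Restricting the sum to the support of the reference keeps the Kullback-Leibler term finite when the candidate uses actions that the reference never plays.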
3.6.3 Outer Game (Myopic Selection from Ten Missiles)

In this section we illustrate the applicability of our results to a much larger game model. The situation is the same as in Section 3.6.2, but now we consider an attack by $M = 10$ missiles simultaneously. Since it is unrealistic to perform dynamic programming on such a large state space ($5^{10}$ states), we restrict our attention to myopic policies. That is, we assume that the expected future cost in each state is zero, so that ships choose actions based only on their current state-action costs. We then compare the resultant myopic NEOps equilibrium strategy with the myopic optimal strategy for several states of interest.

It should be noted that the myopic strategies are generally worse than the true optimal strategies, but may perform reasonably well. In the three missile example above, the myopic NEOps equilibrium strategy yields a total expected number of hits of 0.98 instead of 0.96, whereas the myopic optimal strategy gives 1.03 instead of 0.72 (see footnote 4). Other benefits of considering myopic policies are that no model of the missile dynamics (between inner games) is required, and the computational complexity is drastically reduced.

A feature of the myopic policies is that, since no cost is associated with missiles in target acquisition mode, ships will tend to disregard those threats. We amend this by adjusting the cost contributions in acquisition mode to those of Table 3.2. The model parameter $x$ is initially set to one. The total cost contribution is minimized when exactly one player targets such a missile. However, the effect is small enough (for $x \approx 1$) that if other threats are present, they take priority.

We proceed to analyze the myopic strategies for a few key states, and then investigate how these strategies vary as $x$ is increased. We label states as 10-tuples, with entry m
We label states as 10−tuples, with entry m 4 It is unusual that the optimal strategy performs worse than equilibrium, but possible considering the strategies are measured by a different criteria than that which they are optimizing with respect to. 111 Chapter 3. Decentralized Force Protection: Correlated Equilibrium corresponding to the state of the mth missile. State value 0 corresponds to target acquisition mode, values 1 ≤ i ≤ 3 correspond to the missile targeting Player i, and value 4 corresponds to the terminal state. First, for states with all entries equal to either 0 or 4, the myopic NEOps and optimal strategies are identical; each missile in acquisition mode is targeted by exactly one ship for CM. This requires some coordination, which decreases as the number of missiles increases. For other states, the myopic NEOps equilibrium strategy behaves according to the following basic threshold rule: When a ship is threatened by k ≤ 1 missiles, its first priority is to target any missile that is threatening a different ship; when a ship is threatened by k ≥ 2 missiles, its first priority is to target one of the missiles threatening itself. For x ≈ 1, missiles in state 0 are only targeted after these priorities are satisfied. This type of behaviour makes sense from inspection of Table 3.1. This is in contrast to the myopic optimal policy, in which a ship always prefers to target the missiles threatening different ships, regardless of the number of threats against itself. These policies require some degree of information exchange between the players, usually less than 1 bit per play. For example, in State (1, 2, 3, 0, . . . , 0) the equilibrium and optimal action is for each player to randomly select a missile other than the one threatening it for CM. However, in State (1, 2, 3, 3, 0, . . . , 0) the equilibrium deviates from the optimal action in that Player 3 always selects Missile 3 or 4 for CM. This is in keeping with the equilibrium conditions. 
When $x$ is increased beyond a certain threshold, missiles in State 0 become more threatening and are hence taken more seriously. In general this threshold depends on the missile state; if it is exceeded, the ships tend to reorganize their behaviour so that, with positive probability, exactly one ship deals with each threat in State 0. So, for example, in State (1, 2, 0, . . . , 0), if $x < 3.5$, Ship 3 randomly selects either Missile 1 or 2 for CM, whereas as $x$ increases beyond 3.5, Ship 3 begins to select from missiles 3 to 10 with increasing probability. The optimal policy also displays this effect, but at a larger threshold. Therefore, for each state, there is a range of $x$ in which the equilibrium policy is suboptimal.

Overall, the myopic strategies appear reasonable and display a coherent structure, even if they cannot be expected to perform as well as the optimal strategies. By varying the parameters, we have shown that the myopic NEOps equilibrium strategy does not always coincide with the optimal strategy. Finally, analysis of the strategies shows that the amount of communication overhead is nonzero, but does not appear to require significant bandwidth in practice. The information exchange as measured by (3.37) varies by state, but is always smaller than 1 bit per state.

3.7 Conclusion

We have shown how the problem of decentralized deployment of electronic countermeasures against radar guided anti-ship missiles can be formulated as a transient stochastic game. When the decision makers in this game are linked through a communication network, the joint deflection strategy is given by an optimal stationary correlated equilibrium, referred to as a NEOps equilibrium strategy. Under this strategy, ships are able to simultaneously minimize their own probability of a missile hit as well as the probability of a missile hit to the entire task group.
The NEOps equilibrium strategy, which represents the collectively best strategy acceptable to all players, can be computed via the nonlinear optimization problem (3.29). Independent, rational decision makers with well defined local and global objectives, such as those in the missile deflection game presented here, would find it optimal to follow this strategy given that they are able to coordinate their actions through communication channels.

In general, the strategy appears to be randomized, which is a natural property of the solution to an optimization problem with such a large number of constraints, but this randomization can also be viewed as a consequence of collective uncertainty associated with any equilibrium solution. Players are uncertain about how others will act, how others think they will act, and so on. This appears to be a fundamental consequence of the fact that players must simultaneously execute a single joint action while working to achieve differing objectives, and cannot be overcome unless either the simultaneity of action or the multiple objective condition is dropped. Communication can improve, but not dispel, this state of affairs.

In a more general setting, self-interested systems capable of synchronizing observations and coordinating efforts are quite common. Manufacturing supply chains, biological systems, and data networks are just a few examples of such systems. It is hoped that the treatment presented here will serve as an example for similar study in such diverse areas.

We have also been able to describe a key performance measure, namely the communication overhead required for implementing a NEOps equilibrium strategy, via elementary concepts in information theory.
Specifically, we have shown that the mutual information between a player’s probability distribution over actions and the probability distribution over his opponent’s actions is intimately tied to the amount of communication required by that player in the implementation of the NEOps equilibrium strategy. This relationship provides a measurable quantity for determining whether a given strategy can be feasibly implemented by a group of players with certain bandwidth limitations, or whether a lesser correlated or even Nash equilibrium must be implemented instead.

The formulation of communication requirements for implementing a NEOps equilibrium strategy leads to a trade-off between communication and performance at equilibrium. The intuitive notion that cooperation and communication can improve performance among groups of linked, dynamic decision makers is central and important in modern systems. We have given a concrete example that illustrates that the link between communication and performance exists, but is nontrivial, and have demonstrated how the analytical tools of information theory and game theory can be used to analyze optimal, cooperative strategies.

Bibliography

[86] D. Alberts, J. Garstka, and F. Stein. Network centric warfare: developing and leveraging information superiority. CCRP Publications, second edition, 1999.
[87] E. Altman. Constrained Markov decision processes. Chapman and Hall/CRC, 1999.
[88] R. Aumann. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, 1:67–96, 1974.
[89] R. Aumann. Correlated equilibrium as an expression of Bayesian rationality. Econometrica, 55(1):1–18, 1987.
[90] R.J. Aumann and S. Hart. Long cheap talk. The Hebrew University of Jerusalem, Center for Rationality, 2002.
[91] S. Babcock. Canadian network enabled operations initiatives.
Proceedings of the 9th International Command and Control Research and Technology Symposium, 2004.
[92] Elchanan Ben-Porath. Correlation without mediation: Expanding the set of equilibrium outcomes by “cheap” pre-play procedures. Journal of Economic Theory, 80(1):108–122, 1998.
[93] D. Bertsekas. Nonlinear Programming. Athena Scientific, second edition, 1999.
[94] D. Bertsekas. Dynamic Programming and Optimal Control, volume 1 and 2. Athena Scientific, second edition, 2001.
[95] S. Blackman and R. Popoli. Design and analysis of modern tracking systems. Artech House, 1999.
[96] S. Boyd. Convex Optimization. Cambridge University Press, 2004.
[97] A. Cahn. General procedures leading to correlated equilibria. International Journal of Game Theory, 33(1):21–40, 2004.
[98] H. Chang, P. Fard, S. Marcus, and M. Shayman. Multitime scale Markov decision processes, 2002.
[99] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991.
[100] H. Eustace, editor. The International Countermeasures Handbook. EW Communications, sixth edition, 1980.
[101] J. Filar and K. Vrieze. Competitive Markov Decision Processes. Springer-Verlag, 1997.
[102] Francoise Forges. Universal mechanisms. Econometrica, 58(6):1341–64, 1990.
[103] Francoise M Forges. An approach to communication equilibria. Econometrica, 54(6):1375–85, 1986.
[104] D. Fudenberg and J. Tirole. Game theory. MIT Press, 1991.
[105] Olivier Gossner. Secure protocols or how communication generates correlation. Journal of Economic Theory, 83(1):69–89, 1998.
[106] A. Greenwald and K. Hall. Correlated-Q learning. Proceedings of the Twentieth International Conference on Machine Learning, 2003.
[107] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
[108] S. Hart and A. Mas-Colell. A reinforcement procedure leading to correlated equilibrium, in Economic Essays, pages 181–200. Springer, 2001.
[109] P. Herings and R. Peeters.
Stationary equilibria in stochastic games: structure, selection and computation. METEOR Working Paper No. RM/00/31, 2000.
[110] Ehud Lehrer and Sylvain Sorin. One-shot public mediated talk. Games and Economic Behavior, 20(2):131–148, 1997.
[111] M. Maskery and V. Krishnamurthy. Game-theoretic missile deflection in network centric warfare. Proceedings of the IEEE International Conference on Networking, Sensing and Control, 2005.
[112] R. Nau, S. Gomez Canovas, and P. Hansen. On the geometry of Nash equilibria and correlated equilibria. International Journal of Game Theory, 32(4):443–453, 2004.
[113] A. Nowak and T. Raghavan. Existence of stationary correlated equilibria with symmetric information for discounted stochastic games. Math. Oper. Res., 17(3):519–526, 1992.
[114] Michal Pioro and Deepankar Medhi. Routing, Flow, and Capacity Design in Communication and Computer Networks. Morgan Kaufmann Publishers, 2004.
[115] M. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.
[116] S. Ross. Introduction to probability models. Academic Press, sixth edition, 1997.
[117] L. Shapley. Stochastic games. Proceedings of the National Academy of Sciences USA, 39(10), 1953.
[118] H. Sherali, Y. Lee, and D. Boyer. Scheduling target illuminators in naval battle-group anti-air warfare. Naval Research Logistics, 42, 1995.
[119] E. Solan and N. Vieille. Correlated equilibrium in stochastic games. Games and Economic Behavior, 38(2):362–399, 2002.
[120] V. Teague. Selecting correlated random actions. Proceedings of Financial Cryptography, 2004.
[121] Y. Dodis, S. Halevi, and T. Rabin. A cryptographic solution to a game theoretic problem. Proceedings of CRYPTO, 2000.
Chapter 4

Decentralized Activation in a ZigBee-enabled Unattended Ground Sensor Network 5

5 A version of this chapter has been submitted for publication. Maskery, M. and Krishnamurthy, V. Decentralized Activation in a ZigBee-Enabled Unattended Ground Sensor Network, IEEE Transactions on Signal Processing.

4.1 Introduction

Tasks such as detailed environmental monitoring and intruder detection depend on the ability to reliably observe highly localized events within a large coverage area. In the absence of highly sensitive equipment, researchers have proposed deployment of unattended ground sensor networks (UGSN) to provide this capability, by densely deploying short-range detection equipment throughout the region in question [122, 129, 137, 144, 145, 149, 150]. The challenge then is to coordinate the activity of a large number of devices “in the field” as efficiently as possible, so as to satisfy tight power and bandwidth constraints.

Since sensor coordination is a costly and complex process, we take a decentralized approach by modeling each sensor as an independent, nonlinear adaptive filter that optimizes sensor performance through a utility function trading off the energy cost of acquiring data against its value. The adaptive filters control local functionality only, with minimal message passing, thereby reducing inter-sensor communication and associated energy costs. This decentralized approach is attractive because it facilitates self-configuration of sensors, which allows for fast deployment, as well as network robustness and scalability. Self-configuration is achieved by the standard approach of improving filter performance based on time-averaged feedback.

The main idea of this paper is the following: Suppose each adaptive filter in the UGSN greedily optimizes its utility according to a simple adaptive rule (described below). In this case, what can one say about the global network performance of the UGSN? Our main results, Theorems 4.3.1 and 4.3.2, show that the sensor network’s performance converges to the set of correlated equilibria of the underlying noncooperative game, which can be tracked as the system changes due to moving targets 6. The set of correlated equilibria (see [125]) is arguably the most natural attractive set for decentralized adaptive algorithms such as the one considered here, and describes a condition of competitive optimality between sensors. Indeed, Hart and Mas-Colell observe, while describing such adaptive algorithms in [134], that:

“At this point a question may arise: Can one actually guarantee that the smaller set of Nash equilibria is always reached? The answer is definitely “no.” On the one hand, in our procedure, as in most others, there is a natural coordination device: the common history, observed by all players. It is thus reasonable to expect that, at the end, independence among players will not obtain [be realized]. On the other hand, the set of Nash equilibria is a mathematically complex set (a set of fixed-points; by comparison, the set of correlated equilibria is a convex polytope), and simple adaptive procedures cannot be expected to guarantee the global convergence to such a set.”

Moreover, the coordination allowed by a correlated equilibrium can lead to higher performance than if each sensor were required to act in isolation. Our main results in this paper are as follows:

• A game theoretic model for sensor activation is presented, based on energy and performance characteristics of the ZigBee communications protocol. The model allows individual sensors to trade off their contribution to end-user awareness with the energy cost of activation.

• A computationally efficient (see Section 4.2.3) decentralized, adaptive filtering rule is given for activation control, which converges to the correlated equilibrium set of the underlying game.
A constant stepsize allows for tracking of the equilibrium as the environment changes (e.g. due to moving targets or sensor outage). In stochastic optimization problems, the usual tool for proving convergence of a stochastic approximation algorithm to an optimum is the ordinary differential equation approach [140]. However, in the game theoretic setting of this paper, the network performance converges to the set of correlated equilibria. It is necessary to use differential inclusions (rather than ordinary differential equations) to analyze the convergence and tracking properties of the algorithm.

6 Robert J. Aumann was awarded the 2005 Nobel prize in economics. The following extract is from the Nobel Prize press release in October 2005: “Aumann also introduced a new equilibrium concept, correlated equilibrium, which is more general than Nash equilibrium, the solution concept developed by John Nash, an economics laureate in 1994. Correlated equilibrium can explain why it may be advantageous for negotiating parties to allow an impartial mediator to speak to the parties either jointly or separately, and in some instances give them different information.”

• An important aspect of the decentralized stochastic approximation algorithms used in this paper is that they require an estimate of how many neighboring sensors are active. We present an algorithm that exploits the CSMA structure of ZigBee/IEEE 802.15.4 for computing a maximum likelihood estimate of the number of nearby active sensors, to be taken as input by each adaptive filter. This is critical for estimating sensor interaction for the decentralized adaptive filtering rule.

• A timer-based pricing mechanism for load-sharing between sensors is presented, which prolongs effective network lifetime by reducing energy consumption. Fairness is enforced by forcing active sensors into a standby mode after a sufficient period.
• We implement the decentralized algorithms on a ZigBee test-bed in Matlab and illustrate their performance. The ZigBee compatible sensor network architecture is described in Appendix A.

We favor a game theoretic approach due to the interdependence of individual nodes. The channel quality, required packet transmission energy and usefulness of a sensor’s information all depend on the activity of other sensors in the network. Thus each adaptive filter must observe and react to other sensors’ behaviour in an optimal fashion. This sets up a dynamic system in which adaptive filters and their environment (other filters/sensors) adapt to each other, instead of regarding the environment as a non-adaptive entity.

The simple decentralized adaptive filtering rule is based on the recent economics literature on regret-based algorithms, initiated in [133, 134]. At successive times, each filter/sensor chooses its actions based on its performance history. By suitably randomizing amongst the set of “best” responses to this history, play converges to the set of correlated equilibria, guaranteeing local optimality for each sensor node. A useful approach is to formulate this rule as a decentralized stochastic approximation deployed by the UGSN, which uncovers relations to similar problems in reinforcement learning for stochastic control [128], as well as optimization and parametric identification problems (e.g., recursive maximum likelihood and recursive expectation maximization algorithms; see [139]); see [127, 140] for excellent expositions of this area.

Game theoretic analyses of decentralized problems in UGSNs are becoming increasingly common; see [131, 142] for a general overview. Other instances include multihop routing [138], random access over a common channel [143], transmission power control [151], and topology control [141].
The common approach in these papers is to define a utility function that each system component selfishly maximizes, and then analyze system performance at the Nash equilibrium point of the resulting game. What has not been studied are games in which sensor nodes must decide how to sense their environment, as opposed to how to transmit their information. In addition, research so far has used only the Nash equilibrium concept; no attention is paid to correlated equilibria. We attempt to fill these gaps in our game theoretic sensor node activation approach.

The paper is organized as follows. Section 4.2 presents the game theoretic approach to the sensor activation control problem. Sensor awareness, utility, and an adaptive procedure for increasing individual sensor utility and the resulting overall utility are discussed. Section 4.3 provides a detailed analysis and convergence of the adaptive filtering procedure to the set of “correlated equilibrium” operating points, especially for the case of moving targets in the UGSN. Numerical results are presented in Section 4.4, and conclusions follow in Section 4.5. Appendix A summarizes the physical characteristics of the UGSN, including the ZigBee non-beacon-enabled communication protocol.

4.2 Decentralized Energy-Aware Node Activation

This section describes our decentralized approach to the sensor activation problem, which seeks to activate only the most well-positioned sensors, while the rest sleep. Section 4.2.1 formulates the model by which sensors obtain target measurements, Section 4.2.2 formulates a utility function to be maximized by each sensor node, Section 4.2.3 details the simple game theoretic adaptive algorithm that each sensor deploys to greedily optimize its utility, and its performance is analyzed in Section 4.2.4.

The task of the UGSN is to perform effective target detection and localization, using as few sensor nodes as possible so as to maximize lifetime. The problem of each sensor l = 1, 2, . . .
, L is to select an optimal action from the set:

\[ S_l = \{0 \text{ (sleep)},\ 1 \text{ (activate)}\}. \tag{4.1} \]

We define the set $S$ to represent the joint actions of all sensors. We consider a decentralized solution since it provides us with a simple and scalable algorithm for sensor node activation (there is no central controller or lengthy negotiation), and since sleeping nodes do not have to periodically turn on a receiver to obtain instructions, leading to greater energy efficiency.

4.2.1 Measurement Model

Sensor nodes use passive components to detect and relay information on targets with a specified sensor signature within their detection range. Let the vector $r_n$ denote the set of target positions at discrete time $n$. Assuming the target signal intensity falls off as the inverse square of its range, the target signal level received by sensor node $l$ is modelled as:

\[ D^l(r_n) = \min\left\{ \sum_{m=1}^{M} \frac{A_m}{(\Delta(l, r_{m,n}))^2} + \sigma_D(n),\ K_D \right\}, \tag{4.2} \]

where there are $M$ targets, $A_m$ represents the intensity of target $m = 1, 2, \ldots, M$ at one metre, $\Delta(l, r_{m,n})$ is the distance from sensor node $l$ to target $m$ at discrete time $n$, $K_D$ represents a signal saturation level, and $\sigma_D(n)$ is zero-mean, finite-variance and (possibly) spatially correlated noise at time $n$. Since we are concerned with sensor activation control, the measurement model is abstracted to a utility function in the next section. Because of this, the details of the measurement model are not important, and as a result we do not account for details such as the structure of the noise process.

4.2.2 Activation Control Utility Function

To implement decentralized activation control, each sensor $l$ is equipped with an adaptive filter, which evaluates alternative actions $x^l \in \{0, 1\}$, denoting sleep and activate decisions for $l$, respectively. The evaluation criterion is a utility function $u^l(x^l, x^{-l})$, where $x^{-l}$ represents the joint actions of the other sensors.
The trade-off between target monitoring and energy usage is captured by setting:

\[ u^l(x, r) = \begin{cases} 0, & x^l = 0, \\ C^l(x, r) - \left( E_{\mathrm{packet}}(l, \eta^l(x)) + \Delta E_{\mathrm{sens}} \right), & x^l = 1. \end{cases} \tag{4.3} \]

Here, $C^l(x, r)$ is a reward proportional to the expected contribution to an end user’s utility associated with the information $l$ can provide. The last two terms represent the energy cost associated with sensor activation and transmission efforts, and the details of their formulation can be found in Appendix A. The key feature that characterizes our study as a game is that these quantities are functions of $x^{-l}$, through the number of other active nodes.

[Figure 4.1: Decentralized adaptive filtering interpretation of sensor activation control. Sensors adapt to local target observations and inter-sensor competition through the ZigBee MAC layer.]

To formulate the trade-off more carefully, let $\eta^l(x)$ and $\zeta^l(x)$ be given by:

\[ \eta^l(x) = 1 + \sum_{k \in \mathrm{Cl}(l)} x^k, \quad \text{and} \quad \zeta^l(x) = 1 + \sum_{k \in \mathrm{Nb}(l)} x^k, \tag{4.4} \]

where $\mathrm{Cl}(l)$ is the set of sensor nodes belonging to the same cluster as $l$, and $\mathrm{Nb}(l)$ is the set of nodes “near” $l$ (i.e. in neighboring clusters). $\eta$ affects the success of transmission attempts, as in Section A.2, while $\zeta$ affects the contribution of $l$’s information, since other nearby active nodes are likely seeing the same target as $l$. Since the MAC layer performance is accounted for in the activation decision, the activation layer interacts with the ZigBee/IEEE 802.15.4 layer as in Figure 4.1. The energy cost of transmission, $E_{\mathrm{packet}}$, is specified in (A.4) in Appendix A as a function of $l$ to allow transmission power to vary by sensor node.
The additional energy cost of activating, not taken into account in (A.4), is given by $\Delta E_{\mathrm{sens}}$, computed from Table A.2 as follows:

\[ \Delta E_{\mathrm{sens}} = \frac{1}{F} \left[ (1 - k_m)(P_{CPU} - P_{CPU0} + P_S - P_{S0}) - k_c P_{RX} (T_{CCA} + T_{RXwu}) \right], \tag{4.5} \]

where $F$ is the frequency of sensor node activation decisions, and $k_m$, $k_c$ are the (small) proportions of time spent sensing the environment and channel, respectively, in sleep mode. (It is assumed that sleeping nodes still monitor their environment at a low duty cycle.)

It remains to specify the expected reward function $C^l(\cdot)$. Assuming that information increases with total signal strength, consider the well-known concave utility case for global (end-user) utility:

\[ U_{\mathrm{glob}}((y^l, y^{-l}), r) = 1 - e^{-k_2 \min_m \left( \sum_j y^j(m) \right)}, \tag{4.6} \]

where $k_2$ reflects the concavity of information and information value with signal strength, and $y^j(m)$ is the signal reading due to target $m$ received by sensor $j$. If we assume that one target dominates the reading at any given sensor, we may write $y^j(m) = D^j s(\eta) x^j$, and $l$ can approximate its marginal contribution for activating as:

\[ C^l(x, r) = K_1 \left( U_{\mathrm{glob}}((y^l, y^{-l}), r) - U_{\mathrm{glob}}((0, y^{-l}), r) \right) \approx K_1 \left( e^{-k_2 D(r) s(\eta) \zeta} \right) \left( 1 - e^{-k_2 D^l(r) s(\eta)} \right), \tag{4.7} \]

where $K_1$ is a pricing parameter, and $D(r)$ is the mean target signal intensity observed by other nearby sensors, which can be computed from the average sensor density and target characteristics. Values $s(\eta)$, $\eta$, and $\zeta$ are defined in Appendix A. To simplify matters, we have imposed here that each sensor assumes it is monitoring the most poorly observed target $m$, thus believing its contribution to $U_{\mathrm{glob}}$ is positive when it may in fact be zero. This avoids costly coordination of sensors monitoring targets in different regions, in keeping with our decentralized approach.
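The measurement model (4.2) and the utility (4.3) with the approximate reward (4.7) can be sketched in a few lines of Python. This is a minimal illustration only: the quantities `s_eta`, `e_packet` and `delta_e_sens` are defined in Appendix A, which is not reproduced here, so they are passed in as plain numbers, and all function and parameter names are hypothetical.

```python
import math


def received_signal(sensor_pos, target_positions, intensities, k_d, noise=0.0):
    """Saturated inverse-square signal level of (4.2) at one sensor.

    `noise` stands in for one sample of sigma_D(n); it defaults to zero.
    """
    total = noise
    for (tx, ty), a_m in zip(target_positions, intensities):
        dist_sq = (sensor_pos[0] - tx) ** 2 + (sensor_pos[1] - ty) ** 2
        if dist_sq == 0.0:
            return k_d  # a target on top of the sensor saturates the reading
        total += a_m / dist_sq
    return min(total, k_d)


def marginal_reward(d_own, d_mean, s_eta, zeta, k1, k2):
    """Approximate marginal contribution C^l of (4.7)."""
    others = math.exp(-k2 * d_mean * s_eta * zeta)   # what nearby nodes already provide
    own = 1.0 - math.exp(-k2 * d_own * s_eta)        # what this node would add
    return k1 * others * own


def utility(action, d_own, d_mean, s_eta, zeta, k1, k2, e_packet, delta_e_sens):
    """Activation utility (4.3): zero when asleep, reward minus energy when active."""
    if action == 0:
        return 0.0
    reward = marginal_reward(d_own, d_mean, s_eta, zeta, k1, k2)
    return reward - (e_packet + delta_e_sens)
```

With these definitions the reward shrinks as `zeta` grows, which is the crowding effect that makes activation a game rather than an independent decision for each node.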
4.2.3 Nonlinear Adaptive Filtering for Sensor Activation

Having outlined the objectives of the UGSN, we now turn to our decentralized adaptive filtering approach to the sensor node activation. Our approach is based on the regret matching procedure of [133], formulated as a distributed stochastic approximation algorithm. This allows us to specify an adaptive variant of the original procedure, called “Regret Tracking,” which uses a constant stepsize to dynamically adapt to time varying conditions, so that sensor nodes may function in a changing environment. In the next section, we will show that the algorithm converges to the set of correlated equilibria of the decentralized sensor activation problem, posed in game theoretic terms.

The regret-tracking algorithm itself requires on-line evaluation of the utility (4.3), which in turn requires each sensor node to estimate $\hat{\eta}^l$ and $\hat{\zeta}^l$. In a decentralized system, this poses a problem, since the only information received by nodes is the success or failure of their previous channel access attempts. (See Appendix A for the ZigBee-based channel access architecture; sensors share channels through a “listen before talk” Carrier Sense Multiple Access scheme, wherein a series of “Clear Channel Assessment” (CCA) attempts are performed by listening to channels in order to discover a usable one.) Fortunately, the model of Section A.2 relates the proportion of successful CCA attempts to the traffic level through (A.1), which we can use to obtain maximum likelihood traffic estimates. Suppose sensor activity $x$ remains relatively constant over the time it takes for a node to perform $n$ CCA attempts. Then each CCA attempt represents a Bernoulli trial with success probability (probability of a clear channel) given by (A.1).
It is straightforward to check that the maximum likelihood estimate of the traffic level $\eta^l(x)$ is given by the solution to:

\[ \hat{\eta}^l(x) = \arg\max_{\eta} \left( \binom{n}{s} p(\eta)^s (1 - p(\eta))^{n-s} \right) = 1 + \log_{1-q}\left( \frac{s}{n} \right), \tag{4.8} \]

where $s$ is the number of successful CCA attempts in the last $n$ attempts, and the second equality is obtained via (A.1). A similar procedure can be used for estimating $\zeta^l(x)$. By tracking the proportion of successful CCA attempts, we may thus estimate activity through the following algorithm:

Algorithm 4.2.1 Estimating Active Sensors: Select $0 < \varepsilon_{act} < 1$. Select a history length $n > 0$. Repeat the following:

1. Each time a CCA is performed, update $s$ as the number of successful CCAs of the last $n$ attempts.

2. Compute $\hat{\eta}$ according to (4.8).

3. Periodically perform CCA attempts on the channel frequencies $\{f_1, \ldots, f_k\}$ of any nearby clusters. Update $s_k$ as the number of successful CCAs of the last $n$ attempts.

4. Compute $\hat{\zeta} = \sum_{i=1}^{k} \hat{\zeta}_i$, where each $\hat{\zeta}_i$ is computed according to (4.8) with argument $s_k$.

Remark Since CCA successes are i.i.d., the estimates provided by Algorithm 4.2.1 are unbiased. To ensure accurate estimates, we must ensure that CCAs are done with some minimum frequency, so that traffic levels do not change too frequently. Hence extra CCA activity must be carried out on top of those used for transmission. This is especially important when the sensor node is asleep for long periods. In sleep mode, this minimum frequency is set according to the quantity $k_c$ in (4.5). An advantage of this method is that very little signal processing needs to be done to determine whether a channel is busy or idle, which results in significant energy savings. This can be carried even further; if sensors can directly sample the energy on the RF band, as suggested in [136], then the number of sensors can be estimated at a high duty cycle with very little energy expenditure.
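Steps 1 and 2 of Algorithm 4.2.1 can be sketched as follows. The sketch assumes that the clear-channel probability in (A.1) has the form $p(\eta) = (1-q)^{\eta-1}$, which is what the second equality in (4.8) implies; (A.1) itself and the parameter $q$ are given in Appendix A, so `q` is simply an input here. The class and function names are hypothetical.

```python
import math
from collections import deque


def estimate_traffic(successes, attempts, q):
    """Maximum likelihood traffic estimate of (4.8): 1 + log_{1-q}(s / n)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    if attempts <= 0 or not 0 <= successes <= attempts:
        raise ValueError("need 0 <= successes <= attempts and attempts > 0")
    if successes == 0:
        return math.inf  # no clear channel seen: the estimate is unbounded
    return 1.0 + math.log(successes / attempts) / math.log(1.0 - q)


class CcaWindow:
    """Sliding window over the outcomes of the last n CCA attempts."""

    def __init__(self, n):
        self.outcomes = deque(maxlen=n)  # oldest outcome drops out automatically

    def record(self, clear):
        """Store one CCA outcome (True if the channel was clear)."""
        self.outcomes.append(bool(clear))

    def estimate(self, q):
        """Current traffic estimate from the outcomes held in the window."""
        return estimate_traffic(sum(self.outcomes), len(self.outcomes), q)
```

The same estimator would be applied per neighboring channel frequency in steps 3 and 4, with one window per frequency and the resulting estimates summed.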
The decentralized adaptive filtering, regret tracking algorithm itself is described as follows. At each decision period $n$, sensor nodes take joint action $X_n \in S$, with Sensor $l$ taking action $X_n^l \in S_l$. To implement the algorithm, each filter/sensor $l$ uses the potential utilities associated with past joint actions $\{X_n : n = 1, 2, \ldots\}$ to derive regret values $\{\theta^l_{n,jk} : j, k \in S_l\}$, according to:

\[ \theta^l_{n,jk} = \sum_{\tau \le n : X^l_\tau = j} \varepsilon_\tau \left( \prod_{\sigma=\tau+1}^{n} (1 - \varepsilon_\sigma) \right) \left( u^l((k, X^{-l}_\tau), r_\tau) - u^l((j, X^{-l}_\tau), r_\tau) \right). \tag{4.9} \]

If $\varepsilon_\tau = 1/(\tau + 1)$, this is simply the average:

\[ \theta^l_{n,jk} = \frac{1}{n} \sum_{\tau \le n : X^l_\tau = j} \left( u^l((k, X^{-l}_\tau), r_\tau) - u^l((j, X^{-l}_\tau), r_\tau) \right). \]

If $\varepsilon_\tau = \varepsilon$, it is the exponentially weighted moving average:

\[ \theta^l_{n,jk} = \sum_{\tau \le n : X^l_\tau = j} \varepsilon (1 - \varepsilon)^{n-\tau} \left( u^l((k, X^{-l}_\tau), r_\tau) - u^l((j, X^{-l}_\tau), r_\tau) \right). \]

These values reflect the (weighted) average potential difference in utility had $l$ used action $k$ instead of action $j$ in the past. Node $l$ then switches from its current action $j$ to $k$ with probability proportional to $\theta^l_{n,jk}$ (if it is non-negative). The complete, decentralized procedure is summarized below:

Algorithm 4.2.2 Nonlinear Adaptive Filter for ZigBee Node Activation: The regret-based algorithm for sensor node activation has parameters $(u^l, \{\varepsilon_n : n = 1, 2, \ldots\}, \theta^l_0, X^l_0)$, where $u^l$ are the utilities, $\{\varepsilon_n\}$ is a small stepsize, and $\theta^l_0$, $X^l_0$ are arbitrary initial regrets and actions. Define the $S^l \times S^l$ matrix with entries:

\[ H^l_{jk}(X_n, r_n) = I\{X^l_n = j\} \left( u^l((k, X^{-l}_n), r_n) - u^l((j, X^{-l}_n), r_n) \right). \tag{4.10} \]

Execute the following:

1. Initialization: Set $n = 0$, and take action $X^l_0$. Initialize regret $\theta^l_0 = H^l(X_0, r_0)$.

2. Repeat for $n = 0, 1, 2, \ldots$, the following steps:

Action Update: Choose $X^l_{n+1} = k$ with probability

\[ \Pr(X^l_{n+1} = k \mid X^l_n = j, \theta^l_n = \theta^l) = \begin{cases} \max\{\theta^l_{jk}, 0\}/\mu, & k \ne j, \\ 1 - \sum_{i \ne j} \max\{\theta^l_{ji}, 0\}/\mu, & k = j, \end{cases} \tag{4.11} \]

where $\mu > \sum_{k \ne j} \max\{\theta^l_{jk}, 0\}$.

Regret Value Update: Estimate $(\hat{\eta}_{n+1}, \hat{\zeta}_{n+1})$, and compute utility $u^l(0, X^{-l}_{n+1})$ and $u^l(1, X^{-l}_{n+1})$, hence $H^l(X_{n+1})$. Update $\theta_{n+1}$ using the stochastic approximation (SA):

\[ \theta^l_{n+1} = \theta^l_n + \varepsilon_n \left( H^l(X_{n+1}, r_{n+1}) - \theta^l_n \right). \tag{4.12} \]

Note that $\theta^l_n$ is a moving average of the updates $\{H^l(X_k, r_k) : k = 1, 2, \ldots, n\}$. Because of this, Algorithm 4.2.2 can be viewed as a stochastic approximation algorithm with a constant stepsize $\varepsilon > 0$; actions are chosen with probability proportional to their moving average potential performance in the past. For the original regret matching algorithm of [133], one would instead use $\varepsilon_n = 1/(n + 1)$. We also point out the generality of Algorithm 4.2.2, by noting that it can be easily transformed into the well-known fictitious play algorithm (see [130]) by choosing $X^l_{n+1} = \arg\max_k \theta^l_{jk}$ deterministically, where $X^l_n = j$, and the extremely simple best response algorithm by further specifying $\varepsilon = 1$. However, in simulation, these two variants do not perform as well as the regret tracking scheme (see Section 4.4). Finally, we use a variable normalization $\mu$ as in [126], as opposed to the static $\mu$ of [133]. This greatly improves the speed of convergence to equilibrium.

The target position $r_n$ is regarded as a slowly varying parameter of the utility function in Algorithm 4.2.2. Since the utility varies (due to moving targets), a constant stepsize in Algorithm 4.2.2 is needed to keep sensor nodes responsive to the changes. At each iteration, Algorithm 4.2.2 requires estimates of the activity of other nodes to compute $u^l(1, X^{-l}_n)$. This is carried out through Algorithm 4.2.1. Other than this, the computational complexity of Algorithm 4.2.2 is small, and therefore suitable for implementation in a UGSN. At each iteration, each filter/sensor needs to perform one table lookup to compute the utility given the current readings of $(X^{-l}_n, r_n)$.
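One iteration of Algorithm 4.2.2 for a single sensor can be sketched as below. This is an illustrative sketch, not the thesis test-bed code: the utilities are supplied by the caller as a list indexed by action, the normalization `mu` is passed in rather than adapted as in [126] (whose rule is not given in this section), and all function names are hypothetical.

```python
import random


def regret_increment(action, utilities):
    """Matrix H^l of (4.10) for one sensor.

    utilities[k] is the utility action k would have earned against the
    other sensors' current actions. Only the row of the action actually
    played is non-zero.
    """
    n = len(utilities)
    return [
        [(utilities[k] - utilities[j]) if j == action else 0.0 for k in range(n)]
        for j in range(n)
    ]


def update_regret(theta, h, eps):
    """Stochastic approximation step (4.12): theta + eps * (H - theta)."""
    return [
        [t + eps * (x - t) for t, x in zip(t_row, h_row)]
        for t_row, h_row in zip(theta, h)
    ]


def switch_probabilities(theta, current, mu):
    """Action probabilities of (4.11), given the action currently played."""
    positive = [
        max(theta[current][k], 0.0) if k != current else 0.0
        for k in range(len(theta))
    ]
    if mu <= sum(positive):
        raise ValueError("mu must exceed the total positive regret")
    probs = [p / mu for p in positive]
    probs[current] = 1.0 - sum(probs)  # remaining mass: keep the current action
    return probs


def next_action(theta, current, mu, rng=random):
    """Draw the next action according to (4.11)."""
    probs = switch_probabilities(theta, current, mu)
    return rng.choices(range(len(probs)), weights=probs)[0]
```

Passing a constant `eps` gives the regret tracking variant, while `eps = 1 / (n + 1)` at iteration `n` corresponds to the original regret matching stepsize mentioned in the text.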
Two additions and two multiplications update the regret value, and one random number, one multiplication and one comparison suffice to compute the next action. In total, assuming random numbers are stored in a table, this yields two table lookups, one comparison, and five arithmetic operations per iteration.

4.2.4 Performance of Regret Tracking

Our main result is that if each adaptive filter operates according to Algorithm 4.2.2, the system converges to the set of correlated equilibria. This provides a performance guarantee for the sensor network: each sensor converges to its best action given its limited information and coordination capabilities. We will investigate the meaning and details of this convergence in the next section. Concerning the overall performance of the correlated equilibrium, the following two results are established:

Theorem 4.2.1

1. If the correlated equilibrium is deterministic, and the number of sensors in each cluster is less than the critical point of (A.4), then the resulting sensor deployment is a Pareto optimal Nash equilibrium.

2. If the correlated equilibrium is deterministic, and the overall system objective is to maximize:

\[ K_1 U_{\mathrm{glob}}(y, r) - \sum_{l} x^l \left( E_{\mathrm{packet}}(l, \eta^l(x)) + \Delta E_{\mathrm{sens}} \right) \tag{4.13} \]

(i.e. trade off the end-user utility with the total energy consumed, see (4.5), (4.6), (A.4)), then the equilibrium deployment is optimal in the sense that any unilateral deviation will reduce (4.13).

The condition in Part 1 of Theorem 4.2.1 essentially guarantees that the utility (4.3) is concave in the number of active sensors. This implies that no group of sensors can improve their situation cooperatively if they cannot do so independently. A formal proof follows.

Proof For the first part, we wish to show that there is no other deployment that makes at least one sensor better off and no sensor worse off (a deterministic correlated equilibrium is a Nash equilibrium by definition). Let the equilibrium point be $X$ and define $\tilde{X}^l = 0$ when $X^l = 1$ and vice-versa. Consider a sensor $l$ such that $X^l = 1$. By the equivalence of negative regret and the correlated equilibrium set (see Section 4.3), we have $\theta^l_{n,10} = u^l((0, X^{-l}_n), r_n) - u^l((1, X^{-l}_n), r_n) \le 0$, i.e. $u^l((1, X^{-l}_n), r_n) > 0$, which implies that $l$ is worse off if it chooses action 0 (and gets zero reward), regardless of others’ actions. For a sensor $k$ with $X^k = 0$, we have $\theta^k_{n,01} = u^k((1, X^{-k}_n), r_n) - u^k((0, X^{-k}_n), r_n) \le 0$, i.e. $u^k((1, X^{-k}), r_n) < 0$, so $k$ cannot gain by unilaterally choosing $X^k = 1$. Moreover, if multiple sleeping sensors simultaneously activate, their loss will be even worse, since $C^l(x, r)$ is decreasing in $x$, and $E_{\mathrm{packet}}(x)$ is increasing by hypothesis. Hence there is no globally beneficial deviation and Pareto optimality holds.

For the second part, it is enough to note that $u^l((1, X^{-l}_n), r) \ge 0$ iff $K_1 U_{\mathrm{glob}}((y^l, y^{-l}), r) - (E_{\mathrm{packet}}(l, \eta^l(x)) + \Delta E_{\mathrm{sens}}) \ge K_1 U_{\mathrm{glob}}((0, y^{-l}), r) - 0$. Since each sensor activates if and only if its utility is positive, a consideration of (4.13) shows the result holds.

The analysis above reveals a connection between optimality and the set of correlated equilibria, which results from our choice of utility function. This suggests that our approach is competitive with similar approaches to sensor deployment, either centralized or decentralized.

4.3 Correlated Equilibrium and Convergence

This section deals with the convergence and tracking properties of Algorithm 4.2.2. Section 4.3.1 defines the correlated equilibrium set, and discusses its interest in decentralized systems. Sections 4.3.2 and 4.3.3 present technical convergence and tracking results for Algorithm 4.2.2 via a differential inclusion approach. Section 4.3.4 presents improvements for Algorithm 4.2.2 through pricing and outer-level stochastic optimization.
The set of correlated equilibria to which Algorithm 4.2.2 converges (shown below) is a convex set of strategies representing competitively optimal behaviour between sensors. Theoretically, finding a correlated equilibrium in a game is equivalent to solving a linear feasibility problem, which is a linear program with a null objective function. Algorithm 4.2.2 is essentially an iterative procedure for solving this problem, which is interesting because of its decentralized nature.

4.3.1 Definition of Correlated Equilibrium

For a game with $L$ players, the problem is for each player $l = 1, 2, \ldots, L$ to devise a rule for selecting actions $x^l$ from a set $\mathcal{S}_l$ (with size $S^l$), to maximize the expected value of their utility function $u^l(x^1, x^2, \ldots, x^L)$. (To maintain generality, we suppress the dependence of $u^l$ on $r$ in this section.) Since each player only controls one of $L$ variables, this requires careful consideration of the expected actions of other players. The standard solution to this problem is an equilibrium, which identifies stable operating points of the game. The most common such equilibrium is due to Nash [146]. However, we focus on an important generalization of the Nash equilibrium, proposed in [124, 125], known as the correlated equilibrium, defined as follows:

Definition 4.3.1 Define a joint strategy $\pi$ to be a probability distribution on the product space $\mathcal{S} = \mathcal{S}_1 \times \mathcal{S}_2 \times \cdots \times \mathcal{S}_L$. Any $\pi$ may be decomposed into marginals $(\pi^l, \pi^{-l})$ for any $l$, where $\pi^l$ is the marginal probability distribution (strategy) of $l$, and $\pi^{-l}$ is the marginal distribution of all players but $l$. Now, $\pi$ is a correlated equilibrium if, for every player $l$ and every pair of actions $j, k \in \mathcal{S}_l$,

$$\sum_{x^{-l} \in \mathcal{S}_{-l}} \pi(j, x^{-l}) \left[ u^l(k, x^{-l}) - u^l(j, x^{-l}) \right] \le 0. \qquad (4.14)$$

One intuitive interpretation is that $\pi$ provides each player $l$ with a private “recommendation” $a^l \in \mathcal{S}_l$.
Based on this, and knowing $\pi$, a player could compute an a posteriori probability distribution for the actions of other players, and hence an expected utility for each action. The equilibrium condition states that there is no deviation rule that would award $l$ a better expected utility. Assuming $\mathcal{S}_l$ is the same for all $l$, condition (4.14) can be written in matrix form as

$$A\pi \le 0, \qquad \mathbf{1}\pi = 1,$$

where $\mathbf{1}$ represents a row vector of ones, and $A$ is an $L(S^l)^2 \times (S^l)^L$ matrix with entries

$$A_{ljk,(i,m)} = I\{i = j\}\left( u^l(k, m) - u^l(j, m) \right), \qquad (4.15)$$

where $l, j, k$ correspond to a unique row of $A$, and $(i, m)$ correspond to a unique column. There is a distinct similarity between (4.15) and (4.10); if we let $\pi(X_n)$ denote the representation of joint action $X_n$ in the space of probability distributions on $\mathcal{S}$, then the regret update equation (4.12) in Algorithm 4.2.2 can be rewritten (simultaneously for all players) as

$$\theta_{n+1} = \theta_n + \varepsilon_n \left( A\,\pi(X_{n+1}) - \theta_n \right). \qquad (4.16)$$

From (4.15), it is clear that the correlated equilibria comprise a convex set, and we denote by $C_e$ the set of $\pi \in \Delta\mathcal{S}$ satisfying (4.14) for all $l, j, k$.

There are several benefits to considering a correlated equilibrium over a Nash equilibrium. The correlated equilibrium concept permits coordination between players, and can lead to improved performance over a Nash equilibrium [125]. The set (4.14) is also structurally simpler than the set of Nash equilibria; it is a convex set, whereas the Nash equilibria are isolated points at the extrema of this set [147]. Since the set of correlated equilibria is convex, fairness between players can also be addressed in this domain. Finally, decentralized, online adaptive procedures naturally converge to (4.14), whereas the same is not true for Nash equilibria (the so-called law of conservation of coordination [135]).
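The inequalities (4.14) are easy to check numerically for a small game. The following is a minimal sketch, assuming a two-player "game of chicken" whose payoffs are chosen purely for illustration (they are not the sensor utilities of this chapter); the function name `max_ce_violation` and the dictionary layout are likewise hypothetical. It evaluates the left-hand side of (4.14) for every player and every pair of actions and returns the largest value, so a distribution is a correlated equilibrium exactly when the result is non-positive.

```python
from itertools import product


def max_ce_violation(utils, pi):
    """Largest left-hand side of (4.14) over all players l and actions j != k.

    utils[l] maps a joint action tuple to the utility of player l.
    pi maps every joint action tuple to its probability (zeros included).
    """
    num_players = len(utils)
    joint_actions = list(pi.keys())
    # Action set of each player, read off the joint action space.
    actions = [sorted({x[l] for x in joint_actions}) for l in range(num_players)]
    worst = float("-inf")
    for l in range(num_players):
        for j, k in product(actions[l], repeat=2):
            if j == k:
                continue
            total = 0.0
            for x in joint_actions:
                if x[l] != j:
                    continue  # only joint actions where l is recommended j
                deviation = x[:l] + (k,) + x[l + 1:]
                total += pi[x] * (utils[l][deviation] - utils[l][x])
            worst = max(worst, total)
    return worst


# Illustrative payoffs (assumption): action 0 = yield, action 1 = dare.
U1 = {(0, 0): 6, (0, 1): 2, (1, 0): 7, (1, 1): 0}
U2 = {(0, 0): 6, (0, 1): 7, (1, 0): 2, (1, 1): 0}

# Equal weight on the three outcomes other than (dare, dare).
PI_CORRELATED = {(0, 0): 1 / 3, (0, 1): 1 / 3, (1, 0): 1 / 3, (1, 1): 0.0}
# Uniform play over all four outcomes.
PI_UNIFORM = {x: 0.25 for x in U1}

if __name__ == "__main__":
    print(max_ce_violation([U1, U2], PI_CORRELATED))  # non-positive
    print(max_ce_violation([U1, U2], PI_UNIFORM))     # positive
```

In this illustration the first distribution satisfies every inequality, while under uniform play a player told to dare would rather yield, so one inequality is violated. Since each constraint is linear in $\pi$, the same loop can be read as one row-by-row evaluation of $A\pi$ in (4.15).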
4.3.2 Convergence Analysis of Algorithm 4.2.2

Our goal is to show that when each filter/sensor deploys Algorithm 4.2.2, the global network performance of the UGSN (that is, the empirical distribution of play defined below) converges to the set of correlated equilibria (4.14).

Definition 4.3.2 Define the global network performance (empirical distribution of play) up to time $n$ by

$$z_n = \sum_{\tau \le n} \varepsilon_\tau \left( \prod_{\sigma = \tau + 1}^{n} (1 - \varepsilon_\sigma) \right) e_{X_\tau}, \qquad (4.17)$$

where $e_x = [0, 0, \ldots, 1, 0, \ldots, 0]$ with the one in the $x$th position. Here $\varepsilon_n$ is a weighting factor. This yields

$$z_n = \frac{1}{n} \sum_{\tau} e_{X_\tau}, \quad \text{or} \quad z_n = \varepsilon \sum_{\tau} (1 - \varepsilon)^{n - \tau} e_{X_\tau}, \qquad (4.18)$$

depending on whether $\varepsilon_n = 1/(n + 1)$ or $\varepsilon_n \equiv \varepsilon > 0$, respectively. Note that in both cases $z_n$ is an empirical frequency, since $\sum_i z_n(i) = 1$.

We are interested here, as in [133], in convergence to the set of correlated equilibria ($C_e$). Recall that $z_n \to C_e$ as $n \to \infty$ if

$$\mathrm{dist}(z_n, C_e) = \inf_{y \in C_e} |z_n - y| \to 0 \quad \text{as } n \to \infty. \qquad (4.19)$$

Because of the set theoretic nature of the results, analysis is carried out via a differential inclusion, the set theoretic extension of a differential equation. In order to present the convergence of $z_n$ to the set of correlated equilibria, it is convenient to represent the global network performance $z_n$ in (4.17) by the following recursion:

$$z_{n+1} = z_n + \varepsilon_n \left( e_{X_{n+1}} - z_n \right), \qquad (4.20)$$

where $X_{n+1}$ is constructed according to Step (2) of Algorithm 4.2.2 and $\{\varepsilon_n\}$ is a sequence of stepsizes. We will use both the decreasing stepsize $\varepsilon_n = 1/(n + 1)$ and a small constant stepsize $\varepsilon_n = \varepsilon > 0$. As is widely done in stochastic approximation proofs [140], we also introduce the following continuous time interpolated processes.

Definition 4.3.3 For a constant stepsize $\varepsilon > 0$, define the interpolated global network performance

$$z^\varepsilon(t) = z_n \quad \text{for } t \in [\varepsilon n, \varepsilon(n + 1)). \qquad (4.21)$$

When a decreasing stepsize sequence $\varepsilon_n = 1/(n + 1)$ is used, we define

$$t_n = \sum_{j=0}^{n-1} \varepsilon_j, \qquad m(t) = \max\{n : t_n \le t\}.$$

Define a sequence of piecewise constant interpolations as $z^0(t) = z_n$ for $t \in [t_n, t_{n+1})$ and define the shift sequence by $z^n(t) = z^0(t + t_n)$.

We first present a lemma whose proof can be found in [140, Chapters 6 and 8]. This lemma (a generic differential inclusion result for any $x_n$ process) will be used to obtain the convergence results. Recall that a generic differential inclusion is of the form $dz/dt \in G(z)$, i.e., it specifies a family of trajectories. It is a generalization of a differential equation $dz/dt = g(z)$, which comprises a single trajectory.

Lemma 4.3.1 Consider a generic stochastic approximation algorithm $x_{n+1} = x_n + \varepsilon_n g(x_n, \zeta_n)$, where $\varepsilon_n$ is either a constant stepsize $\varepsilon_n = \varepsilon$ or $\varepsilon_n = 1/(n + 1)$. Assume that for each $\zeta$, $g(\cdot, \zeta)$ is continuous, and for each $x$, $g(x, \cdot)$ is bounded. Moreover, assume that for each fixed $x$, there is a set $G(x)$ such that

$$\mathrm{dist}\left( \frac{1}{n} \sum_{j=m}^{n+m-1} g(x, \zeta_j),\; G(x) \right) \to 0$$

in probability when $\varepsilon_n = \varepsilon$, and almost surely when $\varepsilon_n = 1/(n + 1)$. Define the interpolated processes $x^n(\cdot)$ and $x^\varepsilon(\cdot)$ as in Definition 4.3.3. Then $x^n(\cdot)$ and $x^\varepsilon(\cdot)$ converge to $x(\cdot)$ almost surely (for the decreasing stepsize case) and weakly (for the constant stepsize case), where the limit $x(\cdot)$ is a solution of $\dot{x} \in G(x)$.

Define the following differential inclusion as in [126]:

$$\frac{dz}{dt} \in \nu(z) \times \Delta\mathcal{S}^{-l} - z, \qquad (4.22)$$

where $\nu(z)$ is the set of probability distributions over $l$'s action set $\mathcal{S}_l$ such that

$$\sum_{k} \nu_k \left[ H^l_{kj}(z) \right]^+ = \nu_j \sum_{k} \left[ H^l_{jk}(z) \right]^+$$

for all $j \in \mathcal{S}_l$. Here $\Delta\mathcal{S}^{-l}$ is the set of all probability distributions over joint actions of the remaining $L - 1$ sensor nodes, and

$$H^l_{jk}(z) = \sum_{x_i \in \mathcal{S}} H^l_{jk}(x_i)\, z(i). \qquad (4.23)$$

Convergence of (4.22) to the set of correlated equilibria (4.14) is established in Section 5 of [126] by a Liapunov argument.
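The recursion (4.20) for the empirical distribution of play can be illustrated with a short sketch. The code below is an illustration only: the function names, the three-action example and the uniform initial value are assumptions, and the action sequence is fixed by hand rather than generated by Algorithm 4.2.2. With the decreasing stepsize $1/(n+1)$ the recursion reproduces the plain running average in (4.18); with a constant stepsize it forms an exponentially weighted average that favours recent actions.

```python
def unit_vector(index, size):
    """e_x: a vector of zeros with a one in position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def update_empirical(z, action_index, step):
    """One step of z_{n+1} = z_n + step * (e_X - z_n), as in (4.20)."""
    e = unit_vector(action_index, len(z))
    return [zi + step * (ei - zi) for zi, ei in zip(z, e)]


def empirical_distribution(actions, size, constant_step=None):
    """Run the recursion over a sequence of joint-action indices.

    constant_step=None selects the decreasing stepsize 1/(n+1);
    otherwise the given constant stepsize is used at every iteration.
    The uniform initial value is an assumption made for illustration.
    """
    z = [1.0 / size] * size
    for n, x in enumerate(actions):
        step = constant_step if constant_step is not None else 1.0 / (n + 1)
        z = update_empirical(z, x, step)
    return z


# Hypothetical sequence of joint-action indices in {0, 1, 2}.
ACTIONS = [0, 1, 1, 2, 1, 0]

if __name__ == "__main__":
    print(empirical_distribution(ACTIONS, 3))                      # running average
    print(empirical_distribution(ACTIONS, 3, constant_step=0.5))   # recent actions weigh more
```

The constant-stepsize variant forgets old play geometrically, which is what allows the tracking results of Section 4.3.3 when the underlying game changes over time.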
Let us now present the convergence of the algorithm.

Theorem 4.3.1 Consider the interpolated processes defined in Definition 4.3.3, generated by Algorithm 4.2.2. The following results hold:
1. Under a decreasing stepsize $\varepsilon_n = 1/(n + 1)$ in Algorithm 4.2.2, the trajectory of the interpolated SA $z^n(\cdot)$ for the global network performance converges almost surely to a trajectory $z(\cdot)$ whose dynamics are given by (4.22).
2. For the decreasing stepsize algorithm, as $n \to \infty$, for any $q_n \to \infty$ as $n \to \infty$, $z^n(\cdot + q_n)$ converges almost surely to the set of correlated equilibria.
3. Under a constant stepsize $\varepsilon_n = \varepsilon$ in Algorithm 4.2.2, $z^\varepsilon(\cdot)$ converges weakly as $\varepsilon \to 0$ to a trajectory $z(\cdot)$ that is given by (4.22).
4. For the constant stepsize algorithm, as $\varepsilon \to 0$, for any $q_\varepsilon \to \infty$ as $\varepsilon \to 0$, $z^\varepsilon(\cdot + q_\varepsilon)$ converges weakly to the set of correlated equilibria.

Outline of Proof: First, note that by the argument of [126, see pp. 334-335], it can be shown that

$$\mathrm{dist}\left( \frac{1}{n} \sum_{j=m}^{n+m-1} e_{X_j},\; C_e \right) \to 0 \qquad (4.24)$$

as $n \to \infty$ for any $m \ge 0$, where the convergence takes place almost surely in the decreasing stepsize case, and in probability in the constant stepsize case. With (4.24) holding, Lemma 4.3.1 implies that the interpolated processes defined in Definition 4.3.3 converge (either almost surely or weakly) to a limit $z(\cdot)$ that satisfies (4.22). The rest of the proof follows from the standard argument as in [140]. The details are omitted.

For static game conditions, and stepsize $\varepsilon_n = 1/(n + 1)$, Algorithm 4.2.2 is exactly the regret matching algorithm introduced in [133], and convergence to the set of correlated equilibria is proven there. For dynamic game conditions, Theorem 4.3.1 states that if the game varies slowly enough, then the equilibrium set is still tracked through time by the algorithm.
This is important because it guarantees that the pricing modifications in Section 4.3.4 are valid, and that slowly moving targets and sensor failures are supported.

4.3.3 Tracking Performance of Algorithm 4.2.2

The above analysis assumes that the target does not move. The purpose of this section is to analyze the tracking performance of Algorithm 4.2.2 when the target process $\{r_n : n = 0, 1, 2, \ldots\}$ evolves slowly according to a finite state Markov chain. Given that the measurement process at sensors occurs at a much faster time scale compared to the movement of the targets, formulating the target dynamics as a slow Markov chain is a reasonable assumption. We show that Algorithm 4.2.2 can satisfactorily track the set of correlated equilibria of the network. It is important to keep in mind that the Markovian assumption on the target is only used in our analysis below. The implementation of Algorithm 4.2.2 does not require the dynamics of the target to be known explicitly.

Suppose, for the purpose of analysis, that $r_n$, the target position parameterizing (4.10), evolves according to a slow Markov chain with a finite state space $\mathcal{M}_r = \{r_1, \ldots, r_m\}$ and a transition matrix of the form $P = P^\rho = I + \rho Q$, where $I$ is the identity matrix, $\rho > 0$ is a small parameter, and $Q$ is a generator of a continuous-time Markov chain (i.e., $Q = (q_{ij})$ satisfying $q_{ij} \ge 0$ for $i \ne j$ and $\sum_j q_{ij} = 0$ for each $i$). In what follows, we work with $\rho = O(\varepsilon)$ or $\rho = O(\varepsilon^2)$. Clearly if $\rho = O(\varepsilon^2)$, then the target evolves on a slower time scale compared to the adaptation speed of the stochastic approximation Algorithm 4.2.2, which has step size $\varepsilon$. On the other hand, if $\rho = O(\varepsilon)$, then the target evolves on the same time scale as the adaptation speed of Algorithm 4.2.2. Depending on the time scale on which the target evolves, the following theorem analyses the tracking performance of Algorithm 4.2.2.
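To make the slow Markov chain model concrete, the following minimal sketch builds $P = I + \rho Q$ from a generator and simulates the resulting target state sequence. The three-state generator, the value of $\rho$ and all function names are assumptions chosen for illustration; they are not taken from the simulations of Section 4.4. With every off-diagonal row mass equal to one, the chain jumps with probability $\rho$ per iteration, so it holds each state for about $1/\rho$ iterations on average.

```python
import random


def slow_transition_matrix(Q, rho):
    """Return P = I + rho * Q, checking that P is a valid stochastic matrix."""
    m = len(Q)
    P = [[(1.0 if i == j else 0.0) + rho * Q[i][j] for j in range(m)]
         for i in range(m)]
    for row in P:
        if min(row) < 0.0 or abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("rho is too large or Q is not a generator")
    return P


def simulate_chain(P, start, steps, rng):
    """Sample a state path of length steps + 1 from transition matrix P."""
    state = start
    path = [state]
    for _ in range(steps):
        u = rng.random()
        cumulative = 0.0
        next_state = len(P) - 1  # fallback guards against rounding error
        for j, p in enumerate(P[state]):
            cumulative += p
            if u < cumulative:
                next_state = j
                break
        state = next_state
        path.append(state)
    return path


def count_jumps(path):
    """Number of iterations at which the state changes."""
    return sum(1 for a, b in zip(path, path[1:]) if a != b)


# Hypothetical generator: leave each state at rate 1, split evenly.
Q_EXAMPLE = [[-1.0, 0.5, 0.5],
             [0.5, -1.0, 0.5],
             [0.5, 0.5, -1.0]]

if __name__ == "__main__":
    P = slow_transition_matrix(Q_EXAMPLE, rho=0.01)
    path = simulate_chain(P, start=0, steps=20000, rng=random.Random(0))
    print(count_jumps(path))  # on the order of rho * steps
```

Comparing the jump count with the stepsize of the learning recursion gives a quick feel for the two regimes considered below: $\rho$ of order $\varepsilon^2$ means many adaptation steps between jumps, while $\rho$ of order $\varepsilon$ means jumps arrive on the adaptation time scale.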
Define the interpolated process $z^\varepsilon(\cdot)$ as in the previous section and define $r^\varepsilon(t) = r_n$ for $t \in [n\varepsilon, n\varepsilon + \varepsilon)$.

Theorem 4.3.2 Using a constant stepsize $\varepsilon_n = \varepsilon$ in Algorithm 4.2.2, the following hold:
1. Mean Square Tracking Error for Slow Target: If $\rho = O(\varepsilon^2)$, then for each $y \in C_e$ and for $n$ sufficiently large, $E|z_n - y|^2 = O(\varepsilon)$.
2. Differential Inclusion Limit for Slow Target: If $\rho = O(\varepsilon^2)$, then as $\varepsilon \to 0$, the pair of processes $(z^\varepsilon(\cdot), r^\varepsilon(\cdot))$ converges weakly to $(z(\cdot), r(\cdot))$, where $(z(\cdot), r(\cdot))$ satisfy the differential inclusion

$$\frac{dz}{dt} \in \nu(z) \times \Delta\mathcal{S}^{-l} - z, \qquad (4.25)$$

with

$$\sum_{k} \nu_k \left[ H^l_{kj}(z) \right]^+ = \nu_j \sum_{k} \left[ H^l_{jk}(z) \right]^+.$$

Here $r_n$ is the actual state of the target process, $z^\varepsilon(\cdot)$ is defined in (4.21), and $H(\cdot)$ is defined analogously to (4.23), but with explicit dependence on $r_n$.
3. Differential Inclusion Limit for Fast Target: If $\rho = O(\varepsilon)$, then as $\varepsilon \to 0$, $(z^\varepsilon(\cdot), r^\varepsilon(\cdot))$ converges weakly to $(z(\cdot), r(\cdot))$, where $(z(\cdot), r(\cdot))$ satisfy the Markovian switched differential inclusion (4.25) with

$$\sum_{k} \nu_k \left[ H^l_{kj}(z, r(t)) \right]^+ = \nu_j \sum_{k} \left[ H^l_{jk}(z, r(t)) \right]^+.$$

Here $r(t)$ is a continuous-time Markov process with generator $Q$.

Remark Due to limited space, we provide an outline of the proof. The proof is a combination of the convergence analysis provided in the last section and the two-time-scale formulation for Markov chains (see [153]). By a slow Markov chain, we mean that the chain changes its state relatively infrequently. Thus the time-varying parameter stays in a state (a constant) most of the time, and jumps to a new location at random instants. Note that the infrequent changes make it possible for us to track the time-varying parameter process. If it changes too fast, then it is virtually impossible to keep track of the time-varying parameter. This is known as trackability. In the traditional analysis, one often works with a time-varying parameter whose trajectories change slowly but continuously.
Here, we allow the parameter to undergo jump changes with possibly large jump sizes. However, we require that the jumps do not happen too frequently. The results depend on how “slowly” the Markov chain changes. If the dynamics of the Markov chain, namely its transitions, change much more slowly than the adaptation rate, then the effect of this variation can be asymptotically “ignored.” The detailed analysis for this claim follows from the analysis given in [152]. On the other hand, if the Markov chain dynamics evolve at the same order of magnitude as the adaptation speed, then a Markov regime-switched mean dynamic system is obtained.

For the proof of part 1, using a perturbed Liapunov function argument [140] (see [152] for details), for small $\varepsilon$ and $\rho$, we obtain for $n$ sufficiently large, $E|z_n - y|^2 = O(\rho + \varepsilon + \varepsilon^2/\rho)$. Note that to obtain such a result, we have not used any relative order of $\varepsilon$ and $\rho$ up to this point. Next, using $\rho = \varepsilon^2$, we establish the desired result. As a consequence, $E\left[\mathrm{dist}(z_n, C_e)^2\right] = O(\varepsilon)$. The rest of the proof is a modification of two-time-scale algorithms for tracking parameter variations studied in previous work [152, 154] in conjunction with the differential inclusion for correlated equilibria.

In what follows, we only indicate the main steps of the proof. In fact, we will only comment on the last assertion. Recall that in this case, $\rho = O(\varepsilon)$. For convenience, we simply put $\rho = \varepsilon$ in what follows. This simplification is adopted mainly because this part of the proof is more difficult than the first two assertions. Denote by $p_n^\varepsilon = (P(r_n^\varepsilon = r_1), \ldots, P(r_n^\varepsilon = r_m))' \in \mathbb{R}^m$ the probability distribution vector. Then it can be shown that for each $T > 0$ and $0 \le n \le T/\varepsilon$, $p_n^\varepsilon = p(t) + O(\varepsilon + \exp(-\kappa_0 t/\varepsilon))$, with $p(t)$ satisfying $\dot{p}(t) = Q' p(t)$. In the above, for $X \in \mathbb{R}^{\iota_1 \times \iota_2}$ with $\iota_1, \iota_2 \ge 1$, $X'$ denotes its transpose.
Similarly, we obtain a related convergence result for the corresponding transition probability matrix. Next, we can show that $r^\varepsilon(\cdot)$ converges weakly to a Markov chain whose generator is given by $Q$; see [153] (also [154]) for details. To proceed, consider the pair $(z^\varepsilon(\cdot), r^\varepsilon(\cdot))$. We use martingale averaging ideas to obtain the desired limit. To do so, we show that the two-component process $(z^\varepsilon(\cdot), r^\varepsilon(\cdot))$ is tight in the space of functions that are right continuous with left limits endowed with the Skorohod topology (see [140]). Then we can extract a weakly convergent subsequence. For simplicity, we still index the subsequence by $\varepsilon$, with limit denoted by $(z(\cdot), r(\cdot))$. We then use the averaging techniques to finish the proof (see [154]).

4.3.4 Algorithm Improvements and Pricing

Under Algorithm 4.2.2, if target activity persists in a given area for a long period, the adaptive filters will tend to converge to a single action, tending to activate some nearby sensors to monitor the target indefinitely. This may waste resources, prematurely burning out sensors instead of sharing the load. Another drawback is that regret averaging results in a lag between target motion and an appropriate response. To remedy these problems, we introduce two improvements: a two-level pricing dynamic, which changes sensor incentives if they are steadily active for a specified number of iterations, and a reset mechanism for the regret, which activates in the event that no targets are nearby. The two improvements can be implemented via the following algorithm:

Algorithm 4.3.1 Timer-Based Pricing/Event-Based Reset: Each sensor is set with timeout parameter $\phi$ and pricing levels $K_1(\text{Normal}) > K_1(\text{Backoff})$, and performs the following:
1. Initialization: Initialize an action counter cnt = 0, and define a “Normal” and a “Backoff” state. Initially set “State” = “Normal.” Also set a threshold parameter $D_{\min}$ for the sensed target distance.
2. Repeat for iteration $n = 1, 2, \ldots$: if $X_n^l = X_{n-1}^l$, i.e. if the action from the last iteration is repeated, increment counter cnt.
3. If cnt = $\phi$, toggle the state between “Normal” and “Backoff,” and reset cnt = 0.
4. Set $K_1 = K_1(\text{state})$ in (4.7) for utility calculations in Algorithm 4.2.2.
5. Concurrently, reset the regret values according to sensor readings in either of the following events:
(a) If $X_n^l = 1$ and $D^l < D_{\min}$, reset the filter regret $\theta_n^l(1, 0) = 1/\varepsilon$.
(b) If $X_n^l = 0$ and $D^l < D_{\min}$, reset $\theta_n^l(0, 1) = -\varepsilon$.

Algorithm 4.3.1 introduces a two-level pricing dynamic which tends to periodically switch the sensor on or off (if $K_1(\text{Backoff})$ is low enough), allowing other nearby sensors to rest or take over as appropriate. It also mitigates the adverse effects of using averaging to track the regret values in two ways. First, if an active sensor suddenly loses track of its target(s), it will be endowed with a high regret for remaining active and thus quickly switch to sleep mode. Second, if an inactive sensor suddenly observes a target, it will be prepared to switch quickly to active mode should a target appear since its average regret is only slightly negative. Provided that Algorithm 4.2.2 utilizes a constant stepsize, the changes initiated by Algorithm 4.3.1 can be tracked by the sensors, and the result can significantly improve sensor network operation. Timer parameters must be chosen carefully; faster timeouts distribute effort more evenly, but also impact performance since the best sensors are often in backoff. We study the trade-off through simulation in Section 4.4.

The pricing levels (and similarly the timer parameter $\phi$) can be set by further stochastic approximation on a slow time scale relative to that of Algorithm 4.2.2. To motivate, suppose the end user divides time into long decision periods labeled $m = 1, 2, \ldots$.
At the end of each period $m$, the unpriced global utility,

$$U(\bar{E}) = 1 - e^{-k_2 \sum_j y_j}, \qquad (4.26)$$

(see (4.6)) received in period $m$ can be computed, along with the total energy $\bar{E}$ spent by all sensors. Assuming the end user discounts future utilities by $\alpha < 1$, and that the present value of $U(\bar{E})$ is indicative of future values, the end user should set the relative sensor prices (parameter $K = K_1(\text{Normal})/K_1(\text{Backoff})$ in Algorithm 4.3.1) to maximize

$$U_{\mathrm{total}} = \sum_{m=1}^{N(\bar{E})} U(\bar{E})\,\alpha^m = U(\bar{E})\,\frac{\alpha}{1 - \alpha}\left( 1 - \alpha^{N(\bar{E})} \right), \qquad (4.27)$$

where $N(\bar{E}) \approx E_{\mathrm{net}}/\bar{E}$ is the network lifetime and $E_{\mathrm{net}}$ is the total energy remaining in the network. Thus, the sensor utilities are periodically varied by a broadcast transmission from the base station, which may use a stochastic gradient algorithm (e.g. Kiefer-Wolfowitz) to select the current relative price $K$. The base station estimates $U(\bar{E})$ and $N(\bar{E})$ based on the number and content of reports received and evaluates (4.27). Then the derivative $dU_{\mathrm{total}}/dK$ is estimated using finite difference or simultaneous perturbation methods as in [148], and a new pricing parameter $K$ is broadcast to be used by each sensor in interval $m$, satisfying

$$K(m + 1) = K(m) + \beta_m \frac{dU_{\mathrm{total}}}{dK}(K(m)), \qquad (4.28)$$

for stepsize $\beta_m$. The following is a standard result from stochastic optimization (see [140, 148]):

Theorem 4.3.3 The sequence $K(m)$, $m = 1, 2, \ldots$, computed in (4.28), with unbiased estimates for $dU_{\mathrm{total}}(K(m))/dK$, converges almost surely to a locally optimal price, provided that $\beta_m > 0$, $\beta_m \to 0$, $\sum_{m=1}^{\infty} \beta_m = \infty$, and $\sum_{m=1}^{\infty} \beta_m^2 < \infty$.

The slow time scale stochastic pricing operations described above are important factors in the long-term performance of the UGSN. There are many other parameters that could benefit from this approach, including the stepsize in Algorithm 4.2.2, the various utility parameters, and even the ZigBee operational characteristics.
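The slow time scale price update can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the base station procedure itself: `discounted_total_utility` evaluates the closed form in (4.27), and `kiefer_wolfowitz` runs the recursion (4.28) with a central finite-difference estimate of the derivative. The callable `measure`, the stepsize $\beta_m = 1/m$, the perturbation width $c_m = m^{-1/3}$ and the concave test objective are all hypothetical choices; in the setting above, `measure(K)` would stand for an estimate of $U_{\mathrm{total}}$ obtained from one decision period run at price $K$.

```python
def discounted_total_utility(utility, alpha, lifetime):
    """Closed form of sum_{m=1}^{N} U * alpha**m, as in (4.27)."""
    return utility * alpha / (1.0 - alpha) * (1.0 - alpha ** lifetime)


def kiefer_wolfowitz(measure, k0, iterations):
    """Iterate K(m+1) = K(m) + beta_m * (finite-difference slope), cf. (4.28).

    measure(K) returns a (possibly noisy) evaluation of the objective at K.
    beta_m = 1/m satisfies the stepsize conditions of Theorem 4.3.3; the
    perturbation width c_m = m**(-1/3) is an assumed, commonly used choice.
    """
    k = k0
    for m in range(1, iterations + 1):
        beta = 1.0 / m
        c = m ** (-1.0 / 3.0)
        slope = (measure(k + c) - measure(k - c)) / (2.0 * c)
        k = k + beta * slope
    return k


def example_objective(k):
    """Hypothetical noise-free concave objective with its maximum at K = 2."""
    return -(k - 2.0) ** 2


if __name__ == "__main__":
    # Explicit sum versus closed form for illustrative values.
    explicit = sum(0.6 * 0.9 ** m for m in range(1, 26))
    print(explicit, discounted_total_utility(0.6, 0.9, 25))
    # Price iterate approaches the maximizer of the test objective.
    print(kiefer_wolfowitz(example_objective, k0=3.0, iterations=50))
```

Each evaluation of `measure` corresponds to a full decision period, so the iteration count here is the number of periods, which is why this loop sits on a much slower time scale than Algorithm 4.2.2.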
Providing a unified framework for treating all these parameters as prices to be optimized would be an interesting exercise, but beyond the scope of this paper.

4.4 Test Bed and Numerical Results for ZigBee-enabled UGSN

In this section we present simulation results illustrating the algorithms in Section 4.2.3 for representative scenarios. For this purpose, a simulation test bed for ZigBee sensor nodes was developed in Matlab, using the TrueTime simulation package [123]. This package provides a ready-made event-based simulator for ZigBee compatible UGSNs, which can be run in parallel with other Matlab code, allowing us to simultaneously simulate both target dynamics and node activity. The node activity in turn comprises Algorithms A.2.1, 4.2.1, and 4.2.2. Note that Algorithm A.2.1 is fully specified by the TrueTime package.

Table 4.1: Simulation Parameters

Parameter   Value       Description
K_1         20          Pricing parameter, Eq. (4.7)
k_2         3           Utility concavity, Eq. (4.7)
D           0.1         Mean signal strength, Eq. (4.7)
F           20 Hz       Decision frequency, q, Eq. (4.5)
D           100 bytes   Data block size, q, Eq. (A.4)
k_m         0.01        Target monitoring activity (sleeping), Eq. (4.5)
k_c         0.01        Channel monitoring activity (sleeping), Eq. (4.5)
A_m         1           Target signal intensity, Eq. (4.2)
K_D         1           Sensor saturation, Eq. (4.2)
ε_act       0.2         Channel averaging memory, Algorithm 4.2.1
ε           0.05        Regret matching memory, Algorithm 4.2.2

The adaptive filtering Algorithm 4.2.2 specifies that sensor nodes should activate only when they are close enough to a target, and far enough from other active nodes. In this section we discuss the performance of the regret tracking algorithm for the following scenario:
• The UGSN comprises L = 500 sensor nodes deployed uniformly over a 100 × 100 metre grid, and 10 hubs.
• We consider M = 1 and M = 10 targets, each of which radiates an acoustic signature with mean intensity 1 and variance 0.05, at one metre.
We consider a bandwidth-intensive application, in which activation decisions are made every 20 ms, and 100 bytes of data are transmitted each decision period by active sensor nodes. The parameter values for the simulation are specified in Table 4.1. Physical parameters are k_m = k_c = 0.01, K_D = 4, meaning that sleeping sensors only wake up 1% of the time for monitoring, and that the sensor itself is saturated when its target is closer than 0.5 metre. The value D = 0.1 is computed by considering the average distance from a target to a sensor, given 500 sensors in a 100 × 100 metre grid. The adaptive filtering Algorithm 4.2.2 was embedded into the simulated ZigBee test bed, and the outcomes averaged over 30 independent simulations were used to obtain the results below.

[Plot: Performance Analysis (Utility) for Changing Targets. Vertical axis: mean global utility. Horizontal axis: target innovation rate ε_tar (10 targets), logarithmic scale. An annotation marks the point where the target innovation rate matches the tracking parameter ε.]

Figure 4.2: Utility is only high when target locations change sufficiently slowly. The solid line represents the case where the regret $\theta_n^l$ is reset as in Section 4.3.4. The broken line represents performance without resetting the regret. The target innovation rate is plotted on a logarithmic scale.

4.4.1 Tracking Dynamic Targets: Comparison of Performance

We first investigate the performance of Algorithm 4.2.2 when the target vector $r_n$ evolves according to a slow process; at each iteration we allow each of the M = 10 targets to randomly jump to a new location with probability ρ. The overall utility (4.6) is shown in Figure 4.2 for various ρ, and the proportion of time spent by sensors in non-equilibrium states is shown in Figure 4.3. We also show the effect of the second part of Algorithm 4.3.1 (solid line), which enhances performance by periodically resetting the regret parameters.
As shown, utility decreases as targets become more mobile, since sensors require time to adapt to the constantly changing conditions, and hence may lose track of fast-moving targets. This is also shown by the time sensors spend out of equilibrium while adapting, which is a measure of the time taken to adapt to target changes. As predicted, the algorithm performs well for slowly moving targets, but performance degrades for faster targets, since dynamics are not accounted for.

[Plot: Performance Analysis (Time in Equilibrium) for Changing Targets. Vertical axis: % time spent out of equilibrium. Horizontal axis: target innovation rate ε_tar (10 targets), logarithmic scale. An annotation marks the point where the target innovation rate matches the tracking parameter ε.]

Figure 4.3: As target motion increases, the proportion of time spent in suboptimal states increases. The solid line represents the case where the regret $\theta_n^l$ is reset as in Section 4.3.4. The broken line represents performance without resetting the regret. The target innovation rate is plotted on a logarithmic scale.

Table 4.2: The Regret-based approach outperforms a simple greedy scheme. Performance is shown for optimal stepsize ε, and is defined as the average end-user utility per joule of energy spent in an iteration, per target.

Target Behaviour         Performance (Greedy)   Performance (Regret Tracking)
N_tar = 1, ρ = 0.01      82.9                   96.4
N_tar = 1, ρ = 0.05      74.5                   89.5
N_tar = 10, ρ = 0.005    107.2                  112.6

To show the benefit of the game theoretic approach, we also compare sensor activation according to Algorithm 4.2.2 with a simple greedy scheme, in which sensors activate deterministically if their average utility for activating in previous periods was positive.
Table 4.2 compares performance in terms of the global utility per joule of energy spent for three scenarios: first, with a single, slow-moving target with ρ = 0.01, then for a faster target with ρ = 0.05, and finally for ten targets, each with ρ = 0.005. Table 4.2 shows that the regret-based scheme outperforms the simple greedy scheme in all cases. The key to this result is the randomization of actions in Algorithm 4.2.2, which allows sensors to properly coordinate actions, and is critical to the convergence properties of the scheme.

4.4.2 Parameter Selection and Sensor Estimation

A key factor dictating sensor behaviour is the weight of the trade-off between a sensor's reward and the energy cost for activation. Parameter K_1 in (4.7) fulfills this role, and one may think of it as a price paid to a sensor per unit of information delivered. Naturally, increasing K_1 provides a greater incentive for sensors to activate, even when their signal is weak, thereby increasing both the amount of energy consumed and the information provided. However, since the user's utility is concave, there is a diminishing return for setting K_1 too high, which suggests that a trade-off exists between performance and energy expended. We illustrate this in Figure 4.4, which shows the effect of varying the sensors' utility (pricing) parameter K_1 on both the end-user utility and energy consumed in the network.

Another factor influencing sensor behaviour is the accuracy of their awareness of nearby sensor activity. In particular, one may ask whether Algorithm 4.2.1, which delivers a maximum likelihood estimate of sensor activity, is effective. To address this, Figure 4.4 shows sensor network behaviour with the estimate provided by Algorithm 4.2.1, as well as the theoretical optimal performance if the true number of active sensors were known. As shown, the degradation due to estimation is small; the general effect of using Algorithm 4.2.1 is that marginally fewer sensors are active on average (which actually reduces energy consumption).

[Plot: Performance Analysis of Pricing Parameter K1. Left vertical axis: mean global utility. Right vertical axis: mean energy per sensor (µJ). Horizontal axis: parameter K1, from 0 to 30. Utility and energy curves are shown both using the actual sensor count and using the estimate (labelled “Using Algorithm 3.1” in the plot).]

Figure 4.4: End user utility and energy increase as the incentive parameter for activating, K_1, is increased. Overall performance is only degraded slightly when sensors estimate contention levels via Algorithm 4.2.1.

The final innovation we illustrate is the timer-based load-sharing mechanism of Algorithm 4.3.1. This is intended to provide an additional layer of activation control to “smooth out” sensor activity so as to extend network lifetime, with minimal impact on performance. Figure 4.5 shows both sensor network performance and energy consumption for a range of timeout parameters. Recall that when a sensor is active or inactive for a number of consecutive decision periods equal to the timeout parameter, it switches its pricing parameter K_1 to a smaller or larger value (respectively), effectively sending the sensor in and out of a backoff state. While large timeout parameters correspond to this feature being disabled, smaller timeout lengths cause the most well-situated sensors to back off after a period of time. This negatively impacts performance, since less well-positioned sensors are forced to operate, but the impact is limited, relative to the impact of parameter K_1 above. However, as shown, the mean energy spent and the energy standard deviation are also decreased for smaller timeout parameters.
As before, this implies a trade-off between performance and energy usage which must be achieved by setting the timeout parameters appropriately.

[Plot: Performance Analysis of Timer Parameter φ. Left vertical axis: global utility. Right vertical axis: energy per sensor (µJ). Horizontal axis: timer parameter φ, from 10 to 10000 on a logarithmic scale. Curves: utility, energy (mean), and energy (std. dev.).]

Figure 4.5: Utility is higher when sensors are not required to enter a backoff state, due to timeout. Energy consumption is reduced and more fairly distributed if sensors are forced to share duties due to the timeout/backoff mechanism.

4.5 Discussion

Energy efficient design for unattended ground sensor networks (UGSN) requires a highly integrated approach. Increasing the autonomy and intelligence of sensor nodes is critical to reducing costly communication overhead associated with coordinating sensor nodes, but this must be done carefully, as it is complicated for a large number of nodes to adapt simultaneously. To address this, we have presented a computationally efficient adaptive filtering algorithm for effective node activation without costly overhead, and demonstrated the usefulness of tools such as correlated equilibria of non-cooperative games and stochastic approximation. Some of the critical issues considered are a mechanism for sensor awareness, and the tracking capabilities of the regret-based algorithm for sensor deployment. The stochastic approximation algorithms used here are extremely flexible, and can be applied to a large number of decentralized adaptive processing problems. Application of these techniques is an area for further study. We conclude with a discussion of two useful extensions of the methods in this paper.

1. Stochastic Sensor Scheduling: Typically, a sensor node has a suite of sensors available for measurements, but can only deploy one or a few of these at a time.
It is possible to extend our methods to the stochastic sensor scheduling problem, which deals with how these on-board sensors should be dynamically scheduled to optimize a utility function comprising the Bayesian estimate of the target and the usage cost of the sensors. Such problems are partially observed stochastic control problems [137]. We refer the reader to [150] for the partially observed Markov decision process (POMDP) sensor scheduling problem.

2. Analysis of stochastic approximation algorithms: With constant stepsize, Algorithm 4.2.2 tracks a correlated equilibrium set that varies due to a slowly moving target. Suppose we were to assume that the target moves according to a slow Markov chain with transition probability matrix $I + \varepsilon Q$ (where $\varepsilon > 0$ is a small parameter and $Q$ is a generator matrix with each row summing to zero). With this assumption, how can one analyze the tracking performance of Algorithm 4.2.2 with (the same) step size $\varepsilon$? In our recent work [154], we have shown that the limiting behaviour (see [127]) of similar systems is captured by a Markovian switched ordinary differential equation, and we conjecture that a similar result applies to Algorithm 4.2.2, using a differential inclusion. This analysis requires the use of yet another extremely powerful tool in stochastic analysis, namely the so-called "martingale problem" of Stroock and Varadhan; see [132] for a comprehensive treatment of this area.

Bibliography

[122] I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci. Wireless sensor networks: A survey. Computer Networks, 38(4):393–422, 2002.
[123] M. Andersson, D. Henriksson, A. Cervin, and K.-E. Årzén. Simulation of wireless networked control systems. In Proceedings of the 44th IEEE Conference on Decision and Control and European Control Conference, 2005.
[124] R. Aumann. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, 1:67–96, 1974.
[125] R. Aumann. Correlated equilibrium as an expression of Bayesian rationality. Econometrica, 55(1):1–18, 1987.
[126] M. Benaim, J. Hofbauer, and S. Sorin. Stochastic approximations and differential inclusions II: Applications. Levine's Bibliography, UCLA Department of Economics, May 2005.
[127] A. Benveniste, M. Metivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations, volume 22 of Applications of Mathematics. Springer-Verlag, 1990.
[128] D. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
[129] N. Bulusu, D. Estrin, L. Girod, and J. Heidemann. Scalable coordination for wireless sensor networks: Self-configuring localization systems. In Proceedings of the Sixth International Symposium on Communication Theory and Applications (ISCTA 01), July 2001.
[130] D. Fudenberg and D. Levine. The Theory of Learning in Games. MIT Press, 1999.
[131] A. J. Goldsmith and S. B. Wicker. Design challenges for energy-constrained ad hoc wireless networks. IEEE Wireless Comm., 9(4):8–27, 2002.
[132] H. Kushner. Approximation and Weak Convergence Methods for Random Processes, with Applications to Stochastic Systems Theory. MIT Press, Cambridge, MA, 1984.
[133] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
[134] S. Hart and A. Mas-Colell. A reinforcement procedure leading to correlated equilibrium. In Economic Essays, pages 181–200. Springer, 2001.
[135] S. Hart and A. Mas-Colell. Uncoupled dynamics do not lead to Nash equilibrium. American Economic Review, 93(5):1830–1836, December 2003.
[136] J. Hill and D. Culler. Mica: A wireless platform for deeply embedded networks. IEEE Micro, 22(6):12–24, 2002.
[137] J. Baras and A. Bensoussan. Optimal sensor scheduling in nonlinear filtering of diffusion processes. SIAM Journal on Control and Optimization, 27(4):786–813, 1989.
[138] R. Kannan, S. Sarangi, S. S. Iyengar, and L. Ray. Sensor-centric quality of routing in sensor networks. In IEEE INFOCOM, 2003.
[139] V. Krishnamurthy and G. Yin. Recursive algorithms for estimation of hidden Markov models and autoregressive models with Markov regime. 48(2):458–476, February 2002.
[140] H. J. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer-Verlag, New York, NY, second edition, 2003.
[141] N. Li and J. Hou. Localized fault-tolerant topology control in wireless ad hoc networks. IEEE Transactions on Parallel and Distributed Computing, 17(4):307–320, 2006.
[142] A. B. MacKenzie and S. B. Wicker. Game theory and the design of self-configuring, adaptive wireless networks. IEEE Communications Magazine, pages 126–131, November 2001.
[143] A. B. MacKenzie and S. B. Wicker. Stability of multipacket slotted ALOHA with selfish users and perfect information. In Proceedings of IEEE INFOCOM, 2003.
[144] H. Marandola, J. Mollo, and P. Walter. REMBASS-II: The status and evolution of the army's unattended ground sensor system. Proceedings of SPIE, 4743:99–107, 2002.
[145] W. M. Merrill, F. Newberg, K. Sohrabi, W. Kaiser, and G. Pottie. Collaborative networking requirements for unattended ground sensor systems. In Proc. IEEE Aerospace Conference, 2003.
[146] J. Nash. Non-cooperative games. Annals of Mathematics, 54(2):286–295, 1951.
[147] R. Nau, S. Gomez Canovas, and P. Hansen. On the geometry of Nash equilibria and correlated equilibria. International Journal of Game Theory, 32(4):443–453, 2004.
[148] J. Spall. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. Wiley Press, 2003.
[149] N. Srour. Unattended ground sensors: A prospective for operational needs and requirements. Tech. rep., Army Research Laboratory, October 1999.
[150] V. Krishnamurthy. Algorithms for optimal scheduling and management of hidden Markov model sensors. IEEE Trans. Signal Proc., 50(6):1382–1397, 2002.
[151] Y. Xing and R. Chandramouli. Distributed discrete power control for bursty transmissions over wireless data networks. In Proc. CISS, 2004.
[152] G. Yin and V. Krishnamurthy. LMS algorithms for tracking slow Markov chains with applications to hidden Markov estimation and adaptive multiuser detection. IEEE Trans. Information Theory, 51:2475–2490, 2005.
[153] G. Yin and Q. Zhang. Discrete-time Markov Chains: Two-time-scale Methods and Applications. Springer, New York, NY, 2005.
[154] G. Yin, V. Krishnamurthy, and C. Ion. Regime switching stochastic approximation algorithms with application to adaptive discrete stochastic optimization. SIAM Journal on Optimization, 14(4):1187–1215, 2004.

Chapter 5

Game Theoretic Dynamic Spectrum Access for Cognitive Radios⁷

5.1 Introduction

Technologies such as mobile computing and cellular telephony are increasingly striving to deliver an "always connected" user experience. As these technologies become more ubiquitous, it becomes critical to make efficient use of limited radio resources to reliably deliver this experience to as wide a market as possible. This in turn requires active management of spectral resources, a challenge considering the decentralized structure of the radio system. To address this, researchers in cognitive radio [166, 168] propose RF devices that actively monitor and adjust to their radio environment to efficiently communicate in a crowded spectrum. This dynamic spectrum access functionality, in which cognitive radios compete for resources while respecting legacy (licensed) users, is the subject of this chapter.

We explore the spectrum overlay approach (also referred to as opportunistic spectrum access) to dynamic spectrum access, using a game theoretic framework to highlight issues of cooperation and competition between multiple radios. In this model, cognitive radios share portions of the RF spectrum (channels) that are temporarily unoccupied by licensed users.
Each radio dynamically selects several available channels so as to balance its own demand (competition) against system-imposed sharing incentives (cooperation). Selections are made independently by each radio, based only on its own performance history.

We focus on applications where primary users' spectrum access activities either vary slowly with time (see [171, 172]), or where their spectrum access activities vary quickly, but average behaviour varies slowly. Example applications include the reuse of certain TV-bands that are not used for TV broadcast in a particular region.

⁷ A version of this chapter has been submitted for publication. Maskery, M. and Krishnamurthy, V. Decentralized Dynamic Spectrum Access for Cognitive Radios: Cooperative Design of a Non-cooperative Game, IEEE Transactions on Communications.

Since optimal resource allocation in a decentralized, competitive environment is not straightforward, we propose to operate radios according to a game theoretic algorithm which slowly adapts resource allocation over time. We show that this algorithm tracks the time-varying set of correlated equilibrium actions of the game, so that each radio learns to respond optimally to its environment. For appropriate radio utilities, this equilibrium leads to efficient global use of resources.

There are several reasons for using a game theoretic approach. First, since game theory explicitly recognizes the interdependence between radios, it can be used as a synthesis tool to provide decentralized algorithms for adaptive resource allocation. Second, the game theoretic concept of equilibrium provides a useful analysis tool: if we specify a simple algorithm that converges to equilibrium, then we can characterize the long-run behaviour of the system, which may be measured against a global, system-wide objective.

In this chapter, we assume that detection of primary user activity is sufficiently accurate that interference to primary users is below the required level. To resolve contention among cognitive radios, we use CSMA (carrier sense multiple access) to randomly allocate channel times among competing cognitive radios based on a reservation system. This simple mechanism allows us to capture fundamental issues such as uncertainty of the activity of others and nonlinear channel degradation due to crowding effects.

Related Work

Dynamic spectrum access presents technical challenges across the entire networking protocol stack. An overview of challenges and recent developments in dynamic spectrum access can be found in [176]. In the context of spectrum overlay, basic design components include spectrum opportunity identification and spectrum opportunity exploitation. The opportunity identification module is responsible for accurately identifying and intelligently tracking idle frequency bands that may be dynamic in both time and space. The opportunity exploitation module takes input from the opportunity identification module and decides whether and how a transmission should take place.

Spectrum opportunity detection can be reduced to a classic signal processing problem: detecting the presence of primary users' signals. Based on the cognitive radios' knowledge of the signal characteristics of primary users, three traditional signal detection techniques can be employed: matched filter, energy detector (radiometer), and cyclostationary feature detector [158]. While classic signal detection techniques exist in the literature, detecting primary transmitters in a dynamic wireless environment with noise uncertainty, shadowing, and fading is a challenging problem that has attracted much research attention [159, 162, 174].
When the activities of primary users are fast varying, spectrum opportunity tracking becomes a critical issue. This problem is addressed within the framework of Partially Observable Markov Decision Processes (POMDP) in [177]. Once spectrum opportunities are detected, cognitive radios need to decide whether and how to exploit them. In the design of the spectrum exploitation module, specific issues include whether to transmit given that opportunity detectors may make mistakes, what modulation and transmission power to use, and how to share opportunities among secondary users to achieve a network-level objective. The optimal design of spectrum access strategies in the presence of spectrum sensing errors has been addressed in [160, 161]. Specifically, the interaction between the spectrum access protocols at the MAC layer and the operating characteristics of the spectrum opportunity detector at the physical layer is quantitatively characterized, and the optimal joint design of opportunity detectors, access strategies, and opportunity tracking strategies is obtained. Orthogonal frequency division multiplexing (OFDM) has been considered as an attractive candidate for modulation in spectrum overlay networks as discussed in [157, 173]. Power control for cognitive radios needs to take into account the detection range of the opportunity detector, the maximum allowable interference level, and the transmission power of primary users [175]. This complex networking issue remains largely open. Spectrum opportunity sharing among cognitive radios, which is the focus of this chapter, has been addressed in the literature. The problem of noncooperative radio resource allocation is considered in [171, 172] and related work from a non-game theoretic perspective, and in [169] from a game theoretic one, using a similar approach to the one presented here. 
However, [169] and similar references deal with a CDMA scenario which gives rise to a highly structured supermodular game, whereas the problem considered here lacks such convenient structure.

Organization of Chapter

There are several underlying themes in this chapter. Based on the model of Section 5.2, we investigate how radios can estimate information about their environment in Section 5.3. The issue of performance evaluation based on this information is dealt with in Section 5.4, which includes consideration of whether radios are designed for selfish or cooperative behaviour. Section 5.5 discusses our main decentralized adaptive algorithm for spectrum access, along with some related variants. These are compared through simulation in Section 5.6, and the main procedure is shown to be superior. Concluding remarks are made in Section 5.7.

5.2 Opportunistic Spectrum Access Model

We consider a network of fully connected cognitive radios communicating by exploiting channels unused by primary users. We divide time into equal slots of length $\Lambda$, and label slots $n = 1, 2, \ldots$ (we also refer to a slot as a decision period). At the beginning of the $n$th time slot, each cognitive radio $l = 1, 2, \ldots, L$ knows the following:

1. $C$, the number of channels in the radio system.
2. $C_n \in \mathbb{R}^C$, the channel quality vector (bits per time slot for each channel) at time $n$.
3. $Y_n \in \{0, 1\}^C$, the current channel usage pattern of primary users; channel $i$ is in use if $Y_n(i) = 1$.
4. $d_n^l \in \mathbb{R}$, the current demand level of cognitive radio $l$ (in bits per time slot).
5. $m^l \in \mathbb{Z}_+$, the maximum number of channels that $l$ may use simultaneously.

All these quantities are static or vary slowly in time. The time-dependence of $C_n$ allows us to consider sharing channels with fast primary users, whose statistics are slowly varying.
A fast primary user on channel $i$ periodically preempts cognitive radio activity, reducing the effective quality to $C_n(i)$. We will refer to these specifically as fast primary users throughout the chapter, while the term "primary user" will be reserved for the slowly varying type.

Next, each radio $l$ chooses a channel allocation $X_n^l$ (see Section 5.5) from the state space:

\[ S_n^l = \Big\{ x \in \{0, 1\}^C : x \cdot Y_n = 0,\ \sum_i x(i) \le m^l \Big\}. \tag{5.1} \]

That is, each radio can select up to $m^l$ of the channels that are not in use by primary users. We denote the joint action of all cognitive radios by $X_n \in S_n = S_n^1 \times \cdots \times S_n^L$.

Cognitive radios share channels using a simple carrier sense multiple access (CSMA) scheme, as follows. Divide each decision period $n$ into $K$ equal subslots, labeled $n_1, n_2, \ldots, n_K$ (so each subslot has length $\Lambda/K$). For every subslot $n_k$ and channel $i$ such that $X_n^l(i) = 1$, radio $l$ executes the following:

1. Generate a backoff time $\tau_{n_k}^l(i)$ according to a uniform distribution on the interval $(0, \tau_{\max})$ for some fixed parameter $\tau_{\max}$.
2. Upon expiry of the backoff timer, monitor channel $i$ and transmit data only if the channel is sensed clear.

Exactly one radio will transmit successfully on channel $i$ in subslot $n_k$, provided that its backoff time is sufficiently less than the next smallest time (allowing time to sense the channel clear and switch from receive to transmit mode). Otherwise there will be a collision on $i$ for that subslot. For each $n$, each $i$ with $X_n^l(i) = 1$, and $k = 1, 2, \ldots, K$, denote success in subslot $n_k$ by:

\[ \gamma_{n_k}^l(i) = I\{\text{channel } i \text{ captured by } l \text{ in subslot } n_k\}, \tag{5.2} \]

where $I\{\cdot\}$ is the usual indicator function. At the end of decision period $n$ (of length $\Lambda$), each radio $l$ will have collected the following information:

\[ (\gamma_n^l(i), \tau_n^l(i)) = \big\{ (\gamma_{n_k}^l(i), \tau_{n_k}^l(i)) : X_n^l(i) = 1,\ k = 1, 2, \ldots, K \big\}. \tag{5.3} \]

This information will be used for adaptive decision making in subsequent sections.
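
The CSMA mechanism above can be illustrated with a minimal simulation sketch. It is not part of the original scheme's specification: the function names `csma_subslot` and `decision_period` are hypothetical, a single channel is considered, and a radio with no competitors is assumed to capture the channel in every subslot.

```python
import random


def csma_subslot(radios, tau_max, delta, rng):
    """Simulate one CSMA subslot on a single channel.

    radios: ids of the radios that selected this channel.
    Returns (winner, backoffs); winner is None if the subslot ends in a collision.
    """
    # Each contending radio draws a uniform backoff time on (0, tau_max).
    backoffs = {l: rng.uniform(0.0, tau_max) for l in radios}
    ordered = sorted(backoffs, key=backoffs.get)
    if not ordered:
        return None, backoffs
    if len(ordered) == 1:
        return ordered[0], backoffs  # assumption: a lone radio always captures
    first, second = ordered[0], ordered[1]
    # Capture requires a margin of at least delta over the next-smallest backoff.
    if backoffs[first] + delta < backoffs[second]:
        return first, backoffs
    return None, backoffs


def decision_period(radios, K, tau_max, delta, rng):
    """Collect the feedback of (5.3): K pairs (gamma, tau) for every radio."""
    feedback = {l: [] for l in radios}
    for _ in range(K):
        winner, backoffs = csma_subslot(radios, tau_max, delta, rng)
        for l in radios:
            feedback[l].append((1 if l == winner else 0, backoffs[l]))
    return feedback


rng = random.Random(0)
feedback = decision_period(radios=[0, 1, 2], K=10, tau_max=1.0, delta=0.05, rng=rng)
# Number of captured subslots per radio in this decision period.
print({l: sum(gamma for gamma, _ in pairs) for l, pairs in feedback.items()})
```

Each radio only ever uses its own list of (success, backoff) pairs, which is what makes the scheme decentralized; the other radios' backoff times appear here only because the simulation needs them to decide the outcome.
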
A block diagram of our proposed system is given in Figure 5.1. There are three time scales in our problem formulation. The slowest time scale corresponds to the variation of primary user activity and demand levels. Second, and much faster, is the decision time scale (intervals of length $\Lambda$) of the cognitive radios themselves, and third, the fastest time scale (intervals of length $\Lambda/K$), are the CSMA channel access attempts.

Figure 5.1: Block diagram of decentralized learning system for cognitive radio opportunistic spectrum access (slow primary users and logical channels; per-radio CSMA feedback, channel contention MLE, and adaptive learning channel allocation). Cognitive radios select channels that are unoccupied by primary users so as to maximize their utility (formulated in Section 5.4). Feedback depends on the channel quality and the contention for resources due to other users.

5.3 Estimating Channel Characteristics

It is clearly undesirable that cognitive radios compete for the same channel while other channels lie idle. It is thus crucial that a cognitive radio recognize and avoid crowded channels. To address this, we discuss how any radio user $l$ may use the CSMA feedback $(\gamma_n^l, \tau_n^l)$ to estimate competition for channels, and calculate the expected throughput and number of collisions experienced by a radio, which will be used to adjust channel allocation decisions.

5.3.1 Channel Contention Estimate

To estimate $N_n^l(i)$, the number of users competing with $l$ for channel $i$ during decision slot $n$, consider a single CSMA channel access attempt on channel $i$. Suppose there are $\phi(i)$ other users, each choosing a random backoff time $\tau^m(i)$ uniformly on $(0, \tau_{\max})$. If $l$ chooses $\tau^l(i) = t$, it captures the channel if $t + \delta < \tau^m(i)$ for all $m \neq l$, where $\delta$ is the time required to sense the channel clear and switch from receive to transmit mode. Let $\tau^{(1)}_{\phi(i)}$ denote the smallest of the $\phi(i)$ backoff times (the first order statistic). It is well known that:

\[ \Pr(l \text{ captures channel}) = \Pr\big(\tau^{(1)}_{\phi(i)} > t + \delta\big) = \begin{cases} \left(1 - \dfrac{t + \delta}{\tau_{\max}}\right)^{\phi(i)}, & t \le \tau_{\max} - \delta, \\ 0, & t > \tau_{\max} - \delta. \end{cases} \tag{5.4} \]

Since $\phi(i)$ is fixed, and CSMA attempts are independent between subslots, radio $l$ can compute the likelihood of contention level $\phi(i)$ as:

\[ L(\phi(i)) = \prod_{k:\gamma_{n_k}^l(i)=1} \Pr\big(\tau^{(1)}_{\phi(i)} > \tau_{n_k}^l(i) + \delta\big) \cdot \prod_{k:\gamma_{n_k}^l(i)=0} \Pr\big(\tau^{(1)}_{\phi(i)} < \tau_{n_k}^l(i) + \delta\big) \]
\[ \phantom{L(\phi(i))} = \prod_{k:\gamma_{n_k}^l(i)=1} \left(1 - \frac{\tau_{n_k}^l(i) + \delta}{\tau_{\max}}\right)^{\phi(i)} \cdot \prod_{k:\gamma_{n_k}^l(i)=0} \left[1 - \left(1 - \frac{\tau_{n_k}^l(i) + \delta}{\tau_{\max}}\right)^{\phi(i)}\right]. \tag{5.5} \]

The maximum likelihood estimate $\hat{N}_n^l(i)$ of the number of competing users is obtained by maximizing (5.5), i.e. solving:

\[ \sum_{k:\gamma_{n_k}^l(i)=0} \frac{a_k^{\phi(i)} \log(a_k)}{1 - a_k^{\phi(i)}} = \sum_{k:\gamma_{n_k}^l(i)=1} \log(a_k), \tag{5.6} \]

where $a_k = 1 - (\tau_{n_k}^l(i) + \delta)/\tau_{\max}$.

Eq. (5.6) is difficult to solve, but we characterize the average behaviour numerically in Figure 5.2, which shows that $\hat{N}_n^l(i)$ increases with the number of channel access failures, decreases when the maximum successful backoff time among the $K$ subslots increases, and increases when the minimum unsuccessful backoff time among the $K$ subslots decreases.

By replacing the $a_k$ terms on the left-hand side of (5.6) by their average, it is possible to approximate $\hat{N}_n^l(i)$ by:

\[ \hat{N}_n^l(i) \approx -\log\left(1 + \frac{|I_0| \log(\bar{a}_0)}{\sum_{k:\gamma_{n_k}^l(i)=1} \log(a_k)}\right) \Big/ \log(\bar{a}_0), \tag{5.7} \]

where $|I_0| = \sum_{k=1}^{K} (1 - \gamma_{n_k}^l(i))$ and $\bar{a}_0 = \big(\sum_{k:\gamma_{n_k}^l(i)=0} a_k\big)/|I_0|$. Numerical studies show that (5.7) is quite accurate on average, but may have large error when either $|I_0|$ or the successful backoff times are large. In this case, we recommend using (5.7) to generate an initial guess, which may be refined by the Newton-Raphson method.

Figure 5.2: Behaviour of the contention MLE ($\log_{10}$ of the average MLE $\hat{N}_n^l(i)$, plotted against the maximum successful backoff time and against the minimum failed backoff time, for $z = 1, 3, 5, 7, 9$). Average result of the contention estimate $\hat{N}_n^l(i)$. $z$ denotes the number of failed CSMA attempts out of $K = 10$, and the maximum backoff time is $\tau_{\max} = 1$. Each data point represents an average of 5000 randomly generated observations with the specified maximum successful backoff time and minimum failed backoff time.
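
As a rough illustration of the estimator, the sketch below evaluates the log of the likelihood (5.5), maximizes it over integer contention levels by brute force, and computes the closed-form approximation (5.7). The function names, the integer grid search, the clamping of $a_k$ into $(0, 1)$, and the handling of the all-success and all-failure cases are assumptions made for this example; the text itself suggests Newton-Raphson refinement rather than a grid search.

```python
import math


def _a(tau, tau_max, delta):
    # a_k = 1 - (tau + delta) / tau_max, clamped into (0, 1) for numerical safety.
    return min(max(1.0 - (tau + delta) / tau_max, 1e-12), 1.0 - 1e-12)


def log_likelihood(phi, gammas, taus, tau_max, delta):
    """Logarithm of (5.5) for a candidate contention level phi >= 0."""
    total = 0.0
    for gamma, tau in zip(gammas, taus):
        p_capture = _a(tau, tau_max, delta) ** phi
        p = p_capture if gamma == 1 else 1.0 - p_capture
        if p <= 0.0:
            return -math.inf  # observation impossible under this phi
        total += math.log(p)
    return total


def contention_mle(gammas, taus, tau_max, delta, phi_max=50):
    """Integer maximizer of (5.5) over 0..phi_max (brute-force search)."""
    return max(range(phi_max + 1),
               key=lambda phi: log_likelihood(phi, gammas, taus, tau_max, delta))


def contention_approx(gammas, taus, tau_max, delta):
    """Closed-form approximation (5.7)."""
    a = [_a(tau, tau_max, delta) for tau in taus]
    failed = [ak for gamma, ak in zip(gammas, a) if gamma == 0]
    s1 = sum(math.log(ak) for gamma, ak in zip(gammas, a) if gamma == 1)
    if not failed:
        return 0.0        # assumption: no failures observed, report no contention
    if s1 == 0.0:
        return math.inf   # assumption: no successes observed, estimate is unbounded
    a0 = sum(failed) / len(failed)
    return -math.log(1.0 + len(failed) * math.log(a0) / s1) / math.log(a0)


gammas = [1, 1, 0, 0, 0]            # two captures, three failures
taus = [0.1, 0.2, 0.3, 0.3, 0.3]    # backoff times drawn in each subslot
print(contention_mle(gammas, taus, tau_max=1.0, delta=0.1))
print(contention_approx(gammas, taus, tau_max=1.0, delta=0.1))
```

When all failed attempts share the same backoff time, as in this usage example, the averaging step behind (5.7) is exact, so the approximation coincides with the real-valued solution of (5.6).
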
5.3.2 Other Channel Characteristics

Using contention estimates $\hat{N}_n^l(i)$, we can estimate the throughput and number of collisions on a channel (which are more relevant measures of a radio's performance), as follows. Using (5.4) and the uniform distribution of backoff times, we may compute the unconditional (with respect to backoff time) probability of channel capture in a subslot as:

\[ R_n^l(i) = x^l(i) \int_0^{\tau_{\max} - \delta} \frac{1}{\tau_{\max}} \left(1 - \frac{t + \delta}{\tau_{\max}}\right)^{\hat{N}_n^l(i)} dt = \frac{x^l(i)}{1 + \hat{N}_n^l(i)} \left(1 - \frac{\delta}{\tau_{\max}}\right)^{1 + \hat{N}_n^l(i)}. \tag{5.8} \]

(This formula holds for $\hat{N}_n^l(i) > 0$; otherwise $R_n^l(i) = 1$.)

Next, observe that $l$ is involved in a channel collision in a subslot if either (a) it has the lowest backoff time, but by a margin less than $\delta$, or (b) it does not have the lowest backoff time, but is within $\delta$ of the lowest. It follows that, for $\hat{N}_n^l(i) > 0$, the probability of $l$ being involved in a collision on channel $i$ in subslot $n_k$, using backoff time $\tau_{n_k}^l = t$, is given by:

\[ Q_n^l(i, t) = \Pr\big(t - \delta < \tau^{(1)}_{\hat{N}_n^l(i)} < t \ \text{ or } \ t < \tau^{(1)}_{\hat{N}_n^l(i)} < t + \delta\big) = \Pr\big(\tau^{(1)}_{\hat{N}_n^l(i)} < t + \delta\big) - \Pr\big(\tau^{(1)}_{\hat{N}_n^l(i)} < t - \delta\big) \]
\[ \phantom{Q_n^l(i, t)} = x^l(i) \left[ \left(1 - \frac{\max\{t - \delta, 0\}}{\tau_{\max}}\right)^{\hat{N}_n^l(i)} - \left(1 - \frac{\min\{t + \delta, \tau_{\max}\}}{\tau_{\max}}\right)^{\hat{N}_n^l(i)} \right]. \tag{5.9} \]

Again, integrating out the backoff time gives the unconditional probability of collision:

\[ Q_n^l(i) = \frac{x^l(i)\,\delta}{\tau_{\max}} + \frac{x^l(i)}{1 + \hat{N}_n^l(i)} \left[ 1 - \left(\frac{\delta}{\tau_{\max}}\right)^{1 + \hat{N}_n^l(i)} - \left(1 - \frac{\delta}{\tau_{\max}}\right)^{1 + \hat{N}_n^l(i)} \right]. \tag{5.10} \]

Since channel access attempts are i.i.d., $R_n^l(i)$ is also the expected proportion of successful CSMA attempts in a given decision period $n$, and $Q_n^l(i)$ is the expected proportion of CSMA attempts that result in collisions during period $n$. These quantities will be very useful for measuring radio performance in the next section.

5.4 System Performance and Radio Utility

The goal of this chapter is to achieve a global objective (efficient allocation of radio resources) using a decentralized scheme (local adaptation by individual radios).
Consequently, we must demonstrate a connection between a global utility and the local utility function that will guide the allocation decisions of each radio in Section 5.5. This connection is presented below through the derivation of global (Section 5.4.1) and local (Section 5.4.2) performance measures.

5.4.1 Global System Utility

When each cognitive radio has roughly the same priority and bandwidth requirements, a reasonable global objective is for channel resources to be allocated fairly, such that the proportion of resources captured by the worst-off cognitive radio (relative to its demand) is maximized (max-min fairness). That is, the global system utility in decision period $n$ is:

\[ U(x) = \min_{l = 1, \ldots, L} \left( \min\left( \frac{C_n^T R_n^l}{d^l},\ 1 \right) \right), \tag{5.11} \]

where each term $C_n^T R_n^l / d^l$ represents the "satisfaction level" of radio $l$. This is the total amount of resource captured ($C_n^T R_n^l$) divided by the demand level $d^l$. Substituting (5.8) into (5.11) yields an alternative formulation for the global utility:

\[ U(x) = \min_{l = 1, \ldots, L} \left( \min\left( \frac{1}{d^l} \sum_i C_n(i) \frac{x^l(i)}{N_n^l(i) + 1} \left(1 - \frac{\delta}{\tau_{\max}}\right)^{1 + N_n^l(i)},\ 1 \right) \right), \tag{5.12} \]

for given CSMA parameters $(\delta, \tau_{\max})$. The system objective is to maximize $U(x)$ over all $x \in S$, thus maximizing the satisfaction level of the worst-off user. However, for a decentralized implementation, we cannot choose an action from $S$ in a coordinated fashion; instead, each radio $l$ must choose its own action $x^l \in S^l$. Moreover, as we will see in the next section, the global utility cannot be easily evaluated by any one user, so an appropriate substitute must be found.

5.4.2 Local Radio Utility

If each cognitive radio had a reliable estimate of $U(x)$, then decentralized operation would be straightforward; each radio would simply act to maximize the global utility directly. Unfortunately, this would require $l$ knowing $d^m$ and $x^m$ for all $m \neq l$, which is unrealistic.
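
To make the global utility concrete, the following sketch evaluates the capture probability (5.8) and the max-min utility (5.11) for a small example. It takes the idealized, centralized view that the text has just called unrealistic for an individual radio: all allocations and demands are known, and the true number of competitors is used in place of an estimate. The function names are hypothetical, and the convention $R = 1$ for an uncontested channel follows the remark after (5.8).

```python
def capture_probability(n_other, tau_max, delta):
    """Eq. (5.8) for a selected channel (x = 1) with n_other competitors."""
    if n_other == 0:
        return 1.0  # convention stated after (5.8) for an uncontested channel
    return (1.0 - delta / tau_max) ** (1 + n_other) / (1 + n_other)


def global_utility(X, C, d, tau_max, delta):
    """Max-min utility (5.11).

    X[l][i] is 1 if radio l selects channel i, C[i] is the channel quality
    (bits per slot), and d[l] is the demand of radio l.
    """
    occupancy = [sum(x[i] for x in X) for i in range(len(C))]
    satisfaction = []
    for x, demand in zip(X, d):
        rate = sum(C[i] * capture_probability(occupancy[i] - 1, tau_max, delta)
                   for i in range(len(C)) if x[i] == 1)
        satisfaction.append(min(rate / demand, 1.0))
    return min(satisfaction)


C = [10.0, 10.0]   # two channels of equal quality
d = [5.0, 5.0]     # two radios with equal demand
# Radios on separate channels versus both radios crowding channel 0.
print(global_utility([[1, 0], [0, 1]], C, d, tau_max=1.0, delta=0.1))
print(global_utility([[1, 0], [1, 0]], C, d, tau_max=1.0, delta=0.1))
```

In this example both demands are fully satisfied when the radios use separate channels, whereas sharing a single channel lowers the worst-off satisfaction level to about 0.81, which is the crowding effect that the local utilities below are designed to discourage.
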
We therefore construct an alternative utility function $u^l(x^l)$ which has the property that, if each radio $l$ obtains a high value of $u^l(x^l)$, then the global utility $U(x)$ will also be high. The first portion of $l$'s local utility reflects the self-interested component of (5.12):

\[ u^l[0](x^l) = \min\left( \frac{1}{d^l} \sum_i C_n(i) \frac{x^l(i)}{x^l(i) + \hat{N}_n^l(i)} \left(1 - \frac{\delta}{\tau_{\max}}\right)^{x^l(i) + \hat{N}_n^l(i)},\ 1 \right). \tag{5.13} \]

Maximizing (5.13) directly maximizes $l$'s part of the global utility $U(x)$. (Note we have used $N^m(i) = N^l(i)$ if $x^m(i) = x^l(i)$, and $N^m(i) = N^l(i) - 1$ if $x^m(i) = 1$, $x^l(i) = 0$.) A game with (5.13) as the only component of the utility function would resemble a classic congestion game, which might be readily solved in closed form. However, (5.13) neglects a good portion of the global objective; $l$ should maximize others' satisfaction of demand as well as its own. Since we assume that $d^m$ and $x^m$ are unknown for $m \neq l$, we induce such cooperation through the following two principles:

• Radio $l$'s realized rate should not exceed its demand $d^l$, as this leaves fewer resources for other users.
• Radio $l$ should minimize the number of CSMA collisions it causes, as this impacts the performance of other users.

These principles can be justified by noting that the components of (5.12) belonging to $m \neq l$ are decreasing in $x^l(i)$. To satisfy the first principle, we introduce a penalty for achieving excess rate:

\[ u^l[1](x^l) = -\frac{1}{d^l} \left( C_n^T R_n^l - (d^l + \beta) \right)^+, \tag{5.14} \]

where $(y)^+$ denotes the operation $\max\{y, 0\}$, and parameter $\beta$ represents the size of a "grace" region, where excess rate is not penalized. (Since $l$'s realized rate is observed in noise, it may "accidentally" satisfy more than its demand, so small excesses are not penalized.) The second principle, to minimize collisions, is satisfied by considering $Q_n^l(i)$ in (5.10).
Neglecting collisions involving three or more users, $Q_n^l(i)$ can be interpreted as the degradation of channel $i$ (proportional to $C_n(i)$) caused by $l$. If this is spread evenly among all users, we have that the average performance degradation seen by radio $m$, caused by activity of radio $l$ on channel $i$, is $D_n^l(i) = Q_n^l(i)/\hat{N}_n^l(i)$. Our penalty is a weighted sum of the channel degradation:

\[ u^l[2](x^l) = -\frac{1}{\sum_k C_n(k)} \sum_{i:\hat{N}_n^l(i) > 0} C_n(i) D_n^l(i). \tag{5.15} \]

The final utility function for cognitive radio $l$ is given by:

\[ u^l(x^l) = \max\left\{ u^l[0](x^l) + \alpha_1 u^l[1](x^l) + \alpha_2 u^l[2](x^l),\ 0 \right\}. \tag{5.16} \]

Eq. (5.16), with user-defined parameters $(\alpha_1, \alpha_2)$, is used to guide channel allocations by allowing each radio to take action $X_n^l$, observe utility $u_n^l$, and generate a new action that increases its expected utility. The method for choosing actions is the subject of Section 5.5.

5.5 Decentralized Adaptive Channel Allocation

In this section we describe our decentralized learning approach to opportunistic spectrum access. Our approach is based on the regret matching procedure of [163], formulated as a distributed stochastic approximation algorithm. This allows us to specify an adaptive variant of the original procedure, called "Regret Tracking," which uses a constant stepsize to dynamically adapt to time varying conditions. The procedure is game theoretic in nature; it converges even when multiple cognitive radio users are simultaneously adapting their behaviour. This is a critical observation, since naive, single-agent procedures do not account for the presence of multiple users, and hence may not converge.

In the regret tracking procedure, each radio takes a sequence of actions $\{X_n^l \in S_n^l : n = 0, 1, 2, \ldots\}$ and observes a sequence of rewards $\{u_n^l \in \mathbb{R} : n = 0, 1, 2, \ldots\}$. The action at time $n + 1$ is a random function of this history of actions and rewards. Before proceeding, we clarify a point of notation.
In (5.16), a radio's utility $u^l$ is written only as a function of its own action. However, it is also implicitly a function of the channel contention $\hat{N}_n^l(i)$, which depends on the actions of other players. In game theory, this is made explicit by writing $u^l(X_n^l, X_n^{-l})$, and we follow this convention below.

Recall that the channel allocation of radio $l$ at decision period $n$ is denoted $X_n^l \in S^l$, with the joint allocation of all radios denoted $X_n \in S$. Based on the history of allocations and utilities, radio $l$ computes average regret values $\theta^l_{n,jk}$, $j, k \in S^l$, according to:

\[ \theta^l_{n,jk} = \sum_{\tau \le n : X_\tau^l = j} \varepsilon_{\tau - 1} \left( \prod_{\sigma = \tau}^{n - 1} (1 - \varepsilon_\sigma) \right) \left( u^l(k, X_\tau^{-l}) - u^l(j, X_\tau^{-l}) \right). \tag{5.17} \]

Depending on whether $\varepsilon_\tau = 1/(\tau + 1)$ or $\varepsilon_\tau = \varepsilon$, this is a simple or moving average:

\[ \theta^l_{n,jk} = \frac{1}{n} \sum_{\tau \le n : X_\tau^l = j} \left( u^l(k, X_\tau^{-l}) - u^l(j, X_\tau^{-l}) \right), \qquad \theta^l_{n,jk} = \sum_{\tau \le n : X_\tau^l = j} \varepsilon (1 - \varepsilon)^{n - \tau} \left( u^l(k, X_\tau^{-l}) - u^l(j, X_\tau^{-l}) \right). \tag{5.18} \]

The regret values measure the average gain that $l$ would have received had it chosen allocation $k$ in the past instead of $j$. If the gain is positive, then $l$ is more likely to switch to $k$ in the future. Note, however, that (5.17) requires that $l$ knows what utility it would have received for each action, even if that action was not taken. We address this difficulty either by assuming that extra resources are available to monitor unused channels to reconstruct potential utilities, or by using the modified regret tracking algorithm of Section 5.5.2.

The exact channel allocation algorithm is summarized in Algorithm 5.5.1. This is a regret tracking procedure, executed independently by each radio.

Algorithm 5.5.1 Adaptive Learning for Channel Allocation: Define parameters $(u^l, \mu, \{\varepsilon_n : n = 1, 2, \ldots\}, \theta_0^l, X_0^l)$, where $u^l$ are the radio utilities, $\mu$ satisfies

\[ \mu > (|S^l| - 1)(u^l_{\max} - u^l_{\min}), \quad l = 1, 2, \ldots, L, \tag{5.19} \]

($(u^l_{\max}, u^l_{\min})$ are obtained from (5.16)), $\{\varepsilon_n\}$ are small stepsizes, and $\theta_0^l$, $X_0^l$ are arbitrary initial regrets and actions. Also define the $|S^l| \times |S^l|$ instantaneous regret matrix with entries:

\[ H^l_{jk}(X_n) = I\{X_n^l = j\} \left( u^l(k, X_n^{-l}) - u^l(j, X_n^{-l}) \right). \tag{5.20} \]

Each radio executes the following steps:

1. Initialization: Set $n = 0$, take action $X_0^l$, and initialize $\theta_0^l = H^l(X_0)$. Set $n = 1$, take action $X_1^l = \arg\max_k H^l_{jk}(X_0)$, where $j = X_0^l$, and set $\theta_1^l = H^l(X_1)$.

2. Repeat for $n = 2, 3, \ldots$:

(a) Action Update: Choose $X_{n+1}^l = k$ with probability

\[ \Pr(X_{n+1}^l = k \mid X_n^l = j, \theta_n^l = \theta^l) = \begin{cases} \max\{\theta^l_{jk}, 0\}/\mu, & k \neq j, \\ 1 - \sum_{i \neq j} \max\{\theta^l_{ji}, 0\}/\mu, & k = j. \end{cases} \tag{5.21} \]

(b) Average Regret Update: Given $H^l(X_{n+1})$, update $\theta^l_{n+1}$ according to the stochastic approximation (SA):

\[ \theta^l_{n+1} = \theta^l_n + \varepsilon_n \left( H^l(X_{n+1}) - \theta^l_n \right). \tag{5.22} \]

5.5.1 Convergence of Adaptive Channel Allocation

Because Algorithm 5.5.1 tends to switch to actions with higher rewards at each decision period, channel allocations converge such that no allocation with positive regret is played. Since the set of correlated equilibrium strategies [155] is, by definition, equivalent to the set of strategies with this no-regret condition, it follows that allocation decisions converge to the set of correlated equilibria under Algorithm 5.5.1.

For a decreasing stepsize $\varepsilon_n = 1/(n + 1)$, almost sure convergence of Algorithm 5.5.1 to the set of correlated equilibria as $n \to \infty$ is proven in [156, 163]. For a constant stepsize $\varepsilon$, this is replaced by weak convergence (i.e. convergence in distribution), as $\varepsilon \to 0$. This weak convergence result is a general property of constant stepsize stochastic approximation (SA) algorithms when decreasing stepsize SA convergence holds [167]. Intuitively then, our adaptive version of Algorithm 5.5.1 should also track the set of correlated equilibria, with changes to the utility functions handled smoothly by the constant stepsize.
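
The action and regret updates (5.21) and (5.22) can be sketched as follows for a single radio. This is an illustrative sketch, not the thesis implementation: the class and method names are hypothetical, the regrets are initialized to zero (the algorithm allows arbitrary initial regrets), and the counterfactual utilities $u^l(k, X^{-l})$ are assumed to be available for every action $k$, which corresponds to the first of the two options discussed above. The two-channel game used to exercise it is likewise invented for the example.

```python
import random


class RegretTracker:
    """State of one radio in a regret-tracking sketch of Algorithm 5.5.1."""

    def __init__(self, num_actions, mu, step, initial_action=0):
        self.num_actions = num_actions
        self.mu = mu        # normalization constant, must satisfy (5.19)
        self.step = step    # callable n -> eps_n (constant for regret tracking)
        self.theta = [[0.0] * num_actions for _ in range(num_actions)]
        self.action = initial_action
        self.n = 0

    def switch_probabilities(self):
        """Action distribution (5.21), conditioned on the current action j."""
        j = self.action
        probs = [max(self.theta[j][k], 0.0) / self.mu for k in range(self.num_actions)]
        probs[j] = 0.0
        probs[j] = 1.0 - sum(probs)  # remaining mass: stay with action j
        return probs

    def choose(self, rng):
        probs = self.switch_probabilities()
        self.action = rng.choices(range(self.num_actions), weights=probs)[0]
        return self.action

    def update(self, utilities):
        """Regret update (5.22); utilities[k] = u(k, X^{-l}) for this round."""
        j = self.action
        eps = self.step(self.n)
        for r in range(self.num_actions):
            for k in range(self.num_actions):
                # Entry (r, k) of the instantaneous regret matrix (5.20).
                h = utilities[k] - utilities[j] if r == j else 0.0
                self.theta[r][k] += eps * (h - self.theta[r][k])
        self.n += 1


def utility(own, other):
    # Hypothetical two-channel game: a channel is worth 1 alone, 0.3 when shared.
    return 1.0 if own != other else 0.3


rng = random.Random(0)
# mu = 1.0 exceeds (|S| - 1)(u_max - u_min) = 0.7, as required by (5.19).
radios = [RegretTracker(2, mu=1.0, step=lambda n: 0.05) for _ in range(2)]
for _ in range(2000):
    a0, a1 = radios[0].choose(rng), radios[1].choose(rng)
    radios[0].update([utility(k, a1) for k in range(2)])
    radios[1].update([utility(k, a0) for k in range(2)])
print(radios[0].theta, radios[1].theta)
```

Passing `step=lambda n: 1.0 / (n + 1)` instead of a constant would give the decreasing-stepsize (regret matching) variant, at the cost of losing the ability to track a slowly changing game.
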
To formally state the convergence result, we introduce the empirical distribution of play, which can be viewed as a diagnostic that monitors the performance of the entire cognitive radio network, as follows:

Definition 5.5.1 Let $e_x = [0, 0, \ldots, 1, 0, \ldots, 0]$ with a one in the $x$th position. The empirical distribution of play to time $n$ is:

\[
\bar{z}_n = \sum_{\tau \le n} \varepsilon_{\tau-1} \left( \prod_{\sigma=\tau}^{n-1} (1 - \varepsilon_\sigma) \right) e_{X_\tau}. \tag{5.23}
\]

Here $\varepsilon_n$ is the weighting factor given in Algorithm 5.5.1. Similarly to (5.18), $\bar{z}_n$ can be viewed as an average or moving average frequency of play, with $\sum_i \bar{z}_n(i) = 1$. Moreover, as was the case for $\theta_n$, $\bar{z}_n$ may be represented by the recursion:

\[
\bar{z}_{n+1} = \bar{z}_n + \varepsilon_n (e_{X_{n+1}} - \bar{z}_n), \tag{5.24}
\]

where $X_{n+1}$ is given by Step (2a) of Algorithm 5.5.1. We also need the notion of convergence to the set of correlated equilibria:

Definition 5.5.2 $\bar{z}_n$ converges to the set of correlated equilibria, $C_e$, if, for any $\varepsilon > 0$, there exists $N_0(\varepsilon)$ such that for all $n > N_0$, we can find $\psi \in C_e$ at a distance less than $\varepsilon$ from $\bar{z}_n$.

The convergence proof follows a differential inclusion approach, as in [156], suitably modified for a constant stepsize through the methods of [167]. The differential inclusion approach proceeds by introducing the following continuous time interpolated process for any $\varepsilon > 0$:

\[
\bar{z}^\varepsilon(t) = \bar{z}_n \quad \text{for } t \in [\varepsilon n, \varepsilon(n + 1)), \tag{5.25}
\]

for $n = 0, 1, 2, \ldots$. Next, define the following differential inclusion, as in [156]:

\[
\frac{dz}{dt} \in \nu(z) \times \Delta_{S^{-l}} - z, \tag{5.26}
\]

where $\nu(z)$ is the set of probability distributions over $S^l$, such that:

\[
\sum_k \nu_k [\theta^l_{kj}(z)]^+ = \nu_j \sum_k [\theta^l_{jk}(z)]^+
\]

for all $j, k \in S^l$. Here $\Delta_{S^{-l}}$ is the set of all probability distributions over joint actions of the $L - 1$ competing radios, and $\theta^l_{jk}(z)$ is the average regret corresponding to play history $z$. The main result is stated as follows:

Theorem 5.5.1 Consider the interpolated process (5.25) generated by Algorithm 5.5.1.
If all radios operate according to this algorithm, the following results hold:

1. All solutions $z(t)$ to (5.26) converge to the set of correlated equilibria as $t \to \infty$.

2. For a decreasing stepsize $\varepsilon_n = 1/(n+1)$, the trajectory of the interpolated process $\bar{z}^\varepsilon(t)$ converges almost surely to a trajectory $z(t)$ satisfying (5.26).

3. Under a constant stepsize $\varepsilon_n = \varepsilon$ in Algorithm 5.5.1, the trajectory of the interpolated process $\bar{z}^\varepsilon(t)$ converges weakly as $\varepsilon \to 0$ to a trajectory $z(t)$ satisfying (5.26).

4. As $t \to \infty$, since $z(t)$ converges to the set of correlated equilibria, the trajectory $\bar{z}^\varepsilon(t)$ also converges to this set.

The proof of Theorem 5.5.1, parts (1), (2) and (4), is given in [156]. Part (3) follows from standard arguments as in [167].

5.5.2 Other Adaptive Strategies

For reference, we briefly review three alternative methods for decentralized decision making that can be applied to the opportunistic spectrum access problem: best response, fictitious play and modified regret tracking. We relate these to our regret tracking approach, and in later sections compare them with regret tracking through simulation.

Best Response

In the simplest approach, each cognitive radio chooses the channel allocation that maximizes its utility, assuming that the actions of other cognitive radios will not change. In Algorithm 5.5.1, this corresponds to setting $\varepsilon_n = 1$ and replacing Step (2a) with $X^l_{n+1} = \arg\max_k (H^l_{jk}(X_n))$ where $j = X^l_n$. In best response, the average history $\theta^l_n$ does not need to be tracked at all, since actions are based solely on feedback from the previous step. Nevertheless, it still provides a useful performance measure for the system since it indicates how close the system is to equilibrium.

Fictitious Play

Best response suffers from the defect that actions of other users are assumed to be constant between iterations.
This was realized very early on, and [170] proposed that users should respond not just to the last observation, but to the average history. Fictitious play has since been widely studied and has been shown to converge in many, but not all, games. In [165] it was shown that fictitious play is a special case of regret-based algorithms. Specifically (assuming a decreasing stepsize $\varepsilon_n$), it corresponds to replacing Step (2a) in Algorithm 5.5.1 with $X^l_{n+1} = \arg\max_k (\theta^l_{n,jk}(X_n))$ where $j = X^l_n$. An adaptive version of fictitious play, with constant stepsize $\varepsilon$, can also be generated in this manner.

Modified Regret Tracking

The preceding strategies rely on each cognitive radio knowing the value of all actions, not just those actually taken. This means radios must monitor all possible channels in order to determine $\hat{N}^l_n(i)$ for all $i$ and hence $u^l(k, \cdot)$ for all possible $k$. This can be achieved straightforwardly if extra receivers are used to scan channels not currently in use by the cognitive radio. An alternative procedure, proposed in [164], replaces the explicit utility of actions not taken with an estimate, and proceeds as follows. Label the probability distribution used to choose action $X^l_n$ in Step (2) of Algorithm 5.5.1 by $p^l_n$. We first replace (5.20) by:

\[
H^l_{jk}(X_n) = I\{X^l_n = k\} \frac{p^l_n(j)}{p^l_n(k)} u^l(k, X^{-l}_n) - I\{X^l_n = j\} u^l(j, X^{-l}_n). \tag{5.27}
\]

This avoids the need to evaluate $u^l$ for actions not taken. However, we must compensate for this by allowing radios to explore alternative actions. We therefore replace Step (2) of Algorithm 5.5.1 by:

2′) Generate a uniform random variable $U$. If $U < \delta$ (small), choose $X^l_{n+1}$ from a uniform distribution over $S^l$. Otherwise, choose $X^l_{n+1} = k$ with probability as in (5.21).

It is proven in [164] that the non-tracking version of this algorithm converges almost surely to the set of correlated equilibria.
However, in our tracking simulations, convergence is much slower, due to the required periodic random exploration.

5.6 Numerical and Simulation Results

We now provide a numerical (Matlab) comparison of the methods of Section 5.5. For demonstration purposes, we specify a relatively small number of users (six) and channels (ten), and focus on moderately congested systems, with total capacity of channels roughly equal to total user demand. Similar results hold for other cases. We take $K = 20$ CSMA attempts per decision period on selected channels, and assume unused channels are scanned randomly such that information is obtained as if $K \approx 10$ CSMA attempts were made.

5.6.1 Effect of Utility Pricing on Performance of Regret Tracking

In this section we investigate the impact of $(\alpha_1, \alpha_2)$ on system performance. (Recall that $(\alpha_1, \alpha_2)$ parameterize the radio utility $u^l(x^l)$, and affect $l$’s level of cooperation with system objectives.) The problem of selecting optimal parameters can be viewed as a pricing problem (in this case carried out offline). For each parameter choice $(\alpha_1, \alpha_2)$, 20 scenarios were generated with channel quality $C_n$ and user demands $d^l_n$ selected from uniform distributions on $\{1, 2, 3\}$ and $\{1, 2, 3, 4\}$, respectively. Hence, on average there were 16 units of channel capacity and 18 units of demand, indicating mild system congestion. We held $m^l = 2$ for each radio, to limit the size of the decision space. For ideal operation, we initially assume that primary user activity is fixed to two specific channels. We will address varying channel occupancy in Section 5.6.3.

For each choice $(\alpha_1, \alpha_2)$, system performance (5.11) was averaged over 1000 iterations of Algorithm 5.5.1 and the 20 randomly selected scenarios. The result is shown in Figure 5.3. System performance was highest for $\alpha_1 \approx 0.3$ and $\alpha_2 > 1$. Hence completely selfish behaviour ($\alpha_1 = \alpha_2 = 0$) is not optimal. To simulate good cooperation, we chose $(\alpha_1, \alpha_2) = (0.2, 1.8)$ for subsequent analysis.
[Figure 5.3 plot: “Effect of Utility Parameters on Spectrum Utilization for Regret Tracking”; axes: Parameter $\alpha_1$, Parameter $\alpha_2$, and Proportion of Demand Satisfied (Worst Player).]

Figure 5.3: The average system performance for a fixed scenario type (mild congestion) depends on the parameter choices $(\alpha_1, \alpha_2)$ of the radio utility (5.16). The parameters can be thought of as unit prices, imposed by a system-wide authority to discourage different types of interference. The selfish (anarchy) case, $\alpha_1 = \alpha_2 = 0$, does not yield the best system performance, but left to their own devices radios may gravitate towards this case.

It is interesting to ask whether a player can benefit by deviating from these cooperative parameters (set by design or by a system authority). The answer appears to be “yes”; if one player takes $(\alpha_1, \alpha_2) = (0, 0)$, while the others use $(\alpha_1, \alpha_2) = (0.2, 1.8)$, the selfish player achieves much higher satisfaction of demand (96.5% after 3000 iterations), at the expense of the system (the worst off player satisfies 76% of his demand). This points to the ultimate benefit of cooperative design; even as radios seek to maximize their own utility, this should reflect system, not individual, performance. It also suggests that the parameter-tuning “meta-game” is of the prisoner’s dilemma type, with Nash equilibrium $(\alpha_1, \alpha_2) = (0, 0)$.

5.6.2 Comparison of Algorithms in a Static Environment

We now compare the long-run performance of Algorithm 5.5.1 to other possible approaches, assuming fixed cognitive radio demand $d^l_n$ and primary user activity $Y_n$. Simulations of Algorithm 5.5.1 and the three alternatives in Section 5.5.2 were performed as follows. 100 scenarios were randomly generated as in Section 5.6.1.
Each cognitive radio ran the same algorithm over 3000 iterations (CSMA access, channel contention estimation, and adaptive channel allocation). The results are shown in Figures 5.4 and 5.5.

As shown, the regret tracking algorithm achieves the best performance, and most closely approaches the set of correlated equilibria of the spectrum access game (it has the lowest regret). High utility was designed to correspond with good system performance, so it is not surprising that locally optimal (i.e. equilibrium) behaviour performs well.

Modified regret tracking performed significantly worse, since radio awareness is severely impaired in this procedure (radios do not know the contention on a channel until they use it). Analysis reveals that the random exploration required by the modified procedure is chiefly to blame for its poor performance. (Exploration is done indifferently between good and bad actions, and may drastically impact the performance of other radios.)

The two remaining procedures, fictitious play and best response, performed surprisingly poorly, not even approaching 50% spectrum utilization. The problem here can be traced to the fact that fictitious play and best response choose the best action, whereas regret tracking chooses randomly from among all better actions. This greedy behaviour performs poorly under noisy utility measurements, as is the case here, since users tend to overreact to bad information. This hypothesis was validated by running simulations with perfect utility information. In this case, all algorithms performed equally well and reached equilibrium almost immediately.

[Figure 5.4 plot: “Performance of Channel Allocation Techniques (Static Environment)”; horizontal axis: Decision Period Index; vertical axis: Proportion of Demand Satisfied (Worst Player); curves: Regret Tracking, Modified Regret Tracking, Best Response, Fictitious Play.]

Figure 5.4: Performance comparison of the channel allocation techniques of Section 5.5, according to global design objective (5.11). The regret-based algorithms outperform the classical “greedy” algorithms due to their tolerance to noisy feedback. The optimal performance based on the selected scenarios is approximately 92.5%.

[Figure 5.5 plot: “Equilibrium Comparison of Channel Allocation Techniques (Static Environment)”; horizontal axis: Decision Period Index; vertical axis: Maximum Regret Value (Worst Player); curves: Best Response (≈ 1), Fictitious Play, Regret Tracking.]

Figure 5.5: Evolution of regret for the channel allocation techniques described in Section 5.5. The regret indicates how close the system is to correlated equilibrium (“competitively optimal”) behaviour. The modified regret tracking algorithm has large, un-normalized regrets and is not shown.

5.6.3 Performance of Regret Tracking in a Dynamic Environment

Finally, we analyze the performance of Algorithm 5.5.1 when the primary user activity and radio demands vary in time. We first considered $C = 10$ channels and $L = 6$ users, with uniformly distributed channel qualities and demand levels, and two primary users. At each decision period, each of the demand levels $d^l$ and primary user channel occupations was allowed to change with probability $\rho$. If a demand level changed, it was recomputed from the uniform distribution; if a primary user changed channels, the new channel was chosen uniformly from among the eight currently unoccupied channels.
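The slowly varying environment just described can be sketched in a few lines of Python. This is only an illustrative reading of the procedure, not the simulation code used for the results: the function name `evolve_environment`, its arguments, and the demand levels $\{1, 2, 3, 4\}$ (taken from Section 5.6.1) are assumptions.

```python
import random


def evolve_environment(demands, primary, rho, rng,
                       num_channels=10, demand_levels=(1, 2, 3, 4)):
    """Advance demands and primary user channels by one decision period.

    Each demand level and each primary user changes independently with
    probability rho (hypothetical reading of the procedure described above).
    """
    # A demand that changes is redrawn from the uniform distribution.
    new_demands = [rng.choice(demand_levels) if rng.random() < rho else d
                   for d in demands]
    new_primary = list(primary)
    for idx in range(len(new_primary)):
        if rng.random() < rho:
            # Move to a channel not currently held by any primary user.
            free = [c for c in range(num_channels) if c not in new_primary]
            new_primary[idx] = rng.choice(free)
    return new_demands, new_primary
```

With ten channels and two primary users, `free` always contains eight channels, matching the description above. Calling the function once per decision period, between iterations of the channel allocation algorithm, gives a time-varying scenario.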
Next, we analyzed regret tracking in the presence of fast primary users, by allowing each channel quality $C_n$ to be recomputed with probability $\rho$ (slowly varying statistics), with $C_n(i)$ allowed to fluctuate randomly up to ±10% in each decision period (fast varying activity). Performance of Algorithm 5.5.1 for various values of $\rho$ is shown in Figure 5.6. For $w$ slowly time-varying attributes, the expected time the system remains constant is:

\[
T(\rho) = \left(1 - (1 - \rho)^w\right)^{-1}. \tag{5.28}
\]

We have $w = 16$ with fast primary users, and $w = 8$ without. As expected, average utilization is higher for slower changes, since radios have more time to adapt to their environment. The regret tracking algorithm also outperforms the other algorithms in this case, even when the rate of innovation in the environment is high.

5.7 Conclusions

We have presented a game-theoretic approach for cognitive radio dynamic spectrum access which seeks out and shares temporarily vacant radio spectrum between multiple users. Our approach is completely decentralized in terms of both radio awareness and activity; radios estimate spectral conditions based on their own experience, and adapt by choosing spectral allocations which yield them the greatest utility. Iterated over time, this process converges so that each radio’s performance is an optimal response to others’ activity. Moreover, we are able to use this apparently selfish scheme to deliver system-wide performance by a judicious choice of utility function.

[Figure 5.6 plot: “Performance of Channel Allocation Techniques (Dynamic Environment)”; horizontal axis: Mean Innovation Time $T(\rho)$; vertical axis: Proportion of Demand Satisfied (Worst Player); curves: Regret Tracking (fast primary users), Regret Tracking, Modified Regret Tracking, Fictitious Play, Best Response.]

Figure 5.6: Long-run average spectrum utilization in a dynamic environment for the channel allocation techniques of Section 5.5.
$T(\rho)$ as in (5.28) is the mean time between innovations in the system (changes in primary user activity, fast primary user statistics or cognitive radio demands).

Bibliography

[155] R. Aumann. Correlated equilibrium as an expression of Bayesian rationality. Econometrica, 55(1):1–18, 1987.

[156] M. Benaim, J. Hofbauer, and S. Sorin. Stochastic approximations and differential inclusions II: Applications. Levine’s bibliography, UCLA Department of Economics, May 2005.

[157] U. Berthold and F. Jondral. Guidelines for designing OFDM overlay systems. In Proc. of the first IEEE Symposium on New Frontiers in Dynamic Spectrum Access Networks, November 2005.

[158] D. Cabric, S. Mishra, and R. Brodersen. Implementation issues in spectrum sensing for cognitive radios. In Proc. of the 38th Asilomar Conference on Signals, Systems, and Computers, pages 772–776, 2004.

[159] K. Challapali, S. Mangold, and Z. Zhong. Spectrum agile radio: detecting spectrum opportunities. In International Symposium on Advanced Radio Technologies, 2004.

[160] Y. Chen, Q. Zhao, and A. Swami. Joint design and separation principle for opportunistic spectrum access. In IEEE Asilomar Conference on Signals, Systems, and Computers, 2006.

[161] Y. Chen, Q. Zhao, and A. Swami. Joint PHY/MAC design of opportunistic spectrum access in the presence of sensing errors. Submitted to IEEE Transactions on Information Theory, February 2007.

[162] A. Ghasemi and E. Sousa. Collaborative spectrum sensing for opportunistic access in fading environments. In Proc. of IEEE Symposium on New Frontiers in Dynamic Spectrum Access Networks, November 2005.

[163] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.

[164] S. Hart and A. Mas-Colell. A reinforcement procedure leading to correlated equilibrium. In Economic Essays, pages 181–200. Springer, 2001.

[165] S. Hart and A. Mas-Colell. Uncoupled dynamics do not lead to Nash equilibrium. American Economic Review, 93(5):1830–1836, December 2003.

[166] S. Haykin. Cognitive radio: Brain-empowered wireless communications. IEEE Journal on Selected Areas in Communications, 23(2):201–220, 2005.

[167] H.J. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer-Verlag, New York, NY, second edition, 2003.

[168] J. Mitola. Cognitive radio for flexible mobile multimedia communications. Mob. Netw. Appl., 6(5):435–441, 2001.

[169] N. Nie and C. Comaniciu. Adaptive channel allocation spectrum etiquette for cognitive radio networks. In Proc. IEEE DySPAN, pages 269–278, 2005.

[170] J. Robinson. An iterative method of solving a game. Annals of Mathematics, 54:298–301, 1951.

[171] S. Sankaranarayanan, P. Papadimitratos, A. Mishra, and S. Hershey. A bandwidth sharing approach to improve licensed spectrum utilization. In Proc. IEEE DySPAN, pages 279–288, 2005.

[172] W. Wang and X. Liu. List-coloring based channel allocation for open-spectrum wireless networks. In Proc. of IEEE VTC, 2005.

[173] T. Weiss and F. Jondral. Spectrum pooling: an innovative strategy for enhancement of spectrum efficiency. IEEE Communications Magazine, pages 8–14, March 2004.

[174] B. Wild and K. Ramchandran. Detecting primary receivers for cognitive radio applications. November 2005.

[175] Q. Zhao. Spectrum opportunity and interference constraint in opportunistic spectrum access. In Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing, April 2007.

[176] Q. Zhao and B. Sadler. Dynamic spectrum access: signal processing, networking, and regulatory policy. IEEE Signal Processing Magazine, May 2007.

[177] Q. Zhao, L. Tong, A. Swami, and Y. Chen. Decentralized cognitive MAC for opportunistic spectrum access in ad hoc networks: a POMDP framework. IEEE Journal on Selected Areas in Communications: Special Issue on Adaptive, Spectrum Agile and Cognitive Wireless Networks, April 2007.
Chapter 6

Self-Configuration in Dense Sensor Networks using a Global Game Approach⁸

⁸ A version of this chapter has been submitted for publication. Maskery, M. and Krishnamurthy, V. Self-Configuration in Dense Sensor Networks using a Global Game Approach, IEEE Transactions on Signal Processing.

6.1 Introduction

Wireless sensor networks, as an emerging technology, promise to achieve high-resolution monitoring in difficult environments by placing sensors up close to the phenomena of interest. To be practical, such networks must be flexible, convenient, energy efficient and as self-regulating as possible. To achieve these goals, it is natural to respond to the decentralized awareness of a sensor network with decentralized information processing for proper operation. The idea is that if each sensor or small group of sensors can appropriately adapt their behaviour to locally observed conditions, they can quickly self-organize into a functioning network, eliminating the need for difficult and costly central control.

This chapter explores the possibility of using the theory of global games to achieve effective decentralized sensor network operation with minimal communication. Specifically, we devise a global game framework in which each sensor acts as an independent entity to optimize its operation based on environmental signals and predictions about other sensors’ behaviour. We consider sensors which can operate in either an energy-efficient “low-resolution” mode, or a more expensive “high-resolution” mode, and seek to configure sensors’ strategies for mode selection. This allows us to specify an event-driven framework, adapted to the particular sensor network environment, in which sensors operate at low resolution until a target signal of sufficient intensity is received, at which point high-resolution monitoring and data transmission are initiated. Our main results are:

• Using the theory of global games, we motivate and establish a scenario and conditions under which a simple threshold mode selection strategy provides competitively optimal (Nash equilibrium) behaviour for each sensor. This result holds even when each sensor receives its information in noise. Unlike previous work on global games, we consider the case where both observation noise and utility functions may vary between sensors.

• Given a prior probability distribution for sensor information, which is observed in either Uniform or Gaussian noise, we formulate the Nash equilibrium mode selection thresholds as the solution to a set of integral equations. The structure of the solution is examined for limiting low noise conditions, which are particularly important when sensors accumulate and average information over time.

• Equilibrium tests for the threshold strategy are given for both the Uniform and Gaussian noise cases. We also formulate a novel “near equilibrium” condition which provides a simple test for the threshold strategy under low noise conditions.

• A procedure for decentralized computation of the Nash equilibrium threshold mode selection strategy is also given, based on an iterative best strategy response between sensor classes.

The analysis presented is for a one-shot game; that is, the evolution of the sensor environment is not accounted for explicitly, but each sensor faces a static decision problem. These decision problems can, of course, be repeated in time, and we account for this repetition by considering the behaviour of the decision problem as measurements are accumulated, that is, as noise is reduced.

The theory of global games was first introduced in [179] as a tool for refining equilibria in economic game theory; see [194] for an excellent exposition.
The term “global” refers to the fact that players may actually at any time be playing any game selected from a subclass of all games, which adds an extra dimension to standard gameplay (wherein players act to maximize their own utility in a fixed interactive environment). In a global game, each player observes a noisy signal indicating which of several possible games is being played, given that other players receive similar noisy signals. Since a player’s utility depends on others’ actions, the first step in evaluating alternatives is to predict (conditional on their signal) the actions of others.

Global games are attractive, not only because their explicit uncertainty accurately models many real-life situations, but also because their (Nash equilibrium) solution is often easy to characterize and implement. From a game theoretic perspective, our contribution is to extend the work of [187, 194] to the case where players observe their signals under differing, as opposed to identical, noise conditions. While this complicates the analysis, we are still able to characterize threshold equilibrium strategies where they exist, although players of different classes now have different thresholds. We also find that the equilibrium conditions of [187] can be relaxed without much loss, as long as the amount of noise is small. Another contribution is the use of the well-known Banach fixed point theorem to establish conditions for convergence of the best response dynamic for decentralized computation of the (unique) equilibrium threshold strategy.

Given the tight energy budget of current sensor networks, self-organization and self-configuration are particularly important for efficient operation [178, 181, 193], and have been applied to such diverse problems as routing, topology control, power control, and sensor scheduling.
For mode selection, self-configuration allows sensor networks to efficiently extract information, by adjusting individual sensor behaviour to “form to” their environment according to local conditions. Since environments evolve over time, and since centralized organization is costly in terms of communication and energy, self-configuration of sensors is the most feasible method for adapting sensors for this purpose. This leads to robust, scalable and efficient operation since a central authority is not required.

Game theory is a natural tool for describing self-configuration of sensors, since it models each sensor as a self-driving decision maker. A good overview of methods and challenges in this area is given in [183, 190] and the references therein. Despite the practicality and insight provided by global game models, and despite the popularity of game theoretic approaches in the field of sensor networks, we are not aware of any other work applying global games in the realm of sensor networks.

Another feature of our approach is that it is event-driven, since it relies on environmental signals for mode selection. This is common for unattended sensor networks for target detection and tracking [182, 186, 189, 191, 192], allowing sensors to act in a decentralized fashion, based on the occurrence of a target event. Moreover, this approach allows us to account for spatial correlation of events and possible contention in the MAC layer to efficiently and reliably deliver information to an end user. The event-driven approach here is closely related to that of [186, 191], in that energy is conserved by realizing that not all sensors need to transmit data to obtain an adequate picture of events.

The rest of this chapter deals with self-configuration and equilibrium operation of sensors through a global game theoretic model for mode selection. Section 6.2 presents the mode selection decision problem as a game.
Section 6.3 then motivates and defines a case in which threshold decision rules are in Nash equilibrium (i.e. competitively optimal for each sensor), and gives precise formulations and conditions for threshold equilibrium when noise in the global game is either uniform or Gaussian. Section 6.4 details a convergent, decentralized iterative procedure for self-configuration of thresholds, and considers the limiting case where sensors reduce noise through cumulative averaging of information. Finally, numerical examples and conclusions follow in Sections 6.5 and 6.6.

6.2 A Global Game for Sensor Mode Selection

As mentioned above, global games study the interaction of players who independently choose their actions based on noisy observations of a common signal. This is appropriate for sensor networks, which rely on imperfect information, and do not have unlimited coordination capabilities. In this section we formulate a decision model for sensor mode selection as a global game, with multiple sensor classes. The goal of each sensor is to determine a decision rule for mode selection based on a noisy environmental signal. Since sensor observations are correlated, this signal is also used to predict the actions of other sensors, which is relevant to the decision problem since utilities are interdependent.

6.2.1 Characteristics of Global Games

We begin with a brief discussion of global games. A global games analysis allows us to account for complex sensor behaviour, based on translating noisy environmental observations into predictions of other sensors’ behaviour and then into actions. Sensors, imperfectly aware of their environment, choose their best action, assuming that other sensors have information similar to their own. Moreover, each sensor is aware that others are predicting its own behaviour, that they are aware that it is aware, and so on.
The eductive reasoning process [180, 184] by which sensors iteratively (and without interaction) hypothesize policies and predict reactions to arrive at an optimum is the justification for the Nash equilibrium conditions considered in this chapter. (It is theoretical only; we favor shortcuts to this iterative process.) The resulting equilibrium has interesting artifacts, such as a bias or clustering effect, in which each sensor (rationally) assumes that its observations are close to the true value, and hence predicts that other sensors will act similarly to itself, even though this may not be the case.

Another commonly studied aspect of global games is strategic supermodularity (complementarity) or submodularity (substitutability) of actions. This is the property that each sensor’s incentive to switch to a higher action is increasing (resp. decreasing) in the average action of other sensors. Either property leads to simple, threshold forms for the Nash equilibrium strategy, and we follow [187] in showing how these conditions can be somewhat relaxed while still retaining the threshold property. This is critical to practical implementation, since simple sensors cannot be expected to execute highly complex policies in reality.

6.2.2 Model Description

In this section we describe the sensor mode selection model in detail, as a global game problem, and characterize the optimal strategy to be implemented by each sensor. We consider a large number $N$ of sensors deployed to watch for events of interest in the field. Our subsequent analysis of the Nash equilibrium is asymptotic as the number of sensors $N \to \infty$.

Sensor Measurement Model

Each sensor periodically gathers information through either low-resolution or high-resolution measurements, which are periodically transmitted to a base station at a low or high rate, depending on the sensor mode, which is one of two states as chosen by each sensor.
The measurements in each mode include determination of an “environmental energy” level $X$, which indicates the activity level in the local environment, and which is measured in noise. The value of measuring in low- or high-resolution mode is dependent on the proportion of other sensors in either mode.

Sensor Types

To allow for sensor diversity, we consider multiple classes of sensors. We assume that each of the $N$ sensors can be classified into one of $I$ possible classes, where $I$ denotes a positive integer. Let $\mathcal{I} = \{1, 2, \ldots, I\}$ denote the set of possible sensor classes, and let $N_J$ be the number of sensors of class $J \in \mathcal{I}$, such that $\sum_{J=1}^{I} N_J = N$. Furthermore, let $r_J = N_J/N$ denote the proportion of sensors of class $J \in \mathcal{I}$, so $\sum_{J \in \mathcal{I}} r_J = 1$. All sensors of a given class $J \in \mathcal{I}$ will be functionally identical, having the same measurement noise distribution and utility function. Since the noise distribution and reward parameters will never be known precisely, it is reasonable to perform this type of classification; it is simply a rough division of sensors based on how well they are positioned. Physically, a sensor’s class depends on factors such as the variance of its (low-resolution) measurements, the quality of its high-resolution measurements, and the energy required for it to transmit data (a function of its location in the network).

To fully define the model, we first introduce the following random variables:

• $X$, the current (true) environmental energy.

• $Y^{(i)}$, Sensor $i$’s noisy measurement of $X$.

• $\alpha \in (0, 1)$, the proportion of sensors who respond to their measurement of $X$ by switching to high-resolution mode.

Details of these quantities are given below.

Environmental Energy $X$: Suppose each sensor continuously monitors for events of interest (e.g. magnetic or mechanical signals with a given signature).
For a given space/time window, the actual arrival rate of events is given by a random variable $D$, with mean $m$ and variance $v^2$. If events occur uniformly in space and time, the average arrival rate over $M$ windows is (by the central limit theorem) approximately a Gaussian random variable with mean $m$ and variance $s^2 = v^2/M$. It is this value that we denote by the environmental energy $X$. Note that $X$ is a nonlocal variable, representing an average rate of actual events. We assume that each sensor can obtain an unbiased measurement of $X$, although the quality of these measurements may vary between sensors due to variable noise conditions.

Noisy Measurement $Y^{(i)}$: Sensor $i$’s estimate of $X$ can be written as:

\[ Y^{(i)} = X + W^{(i)}, \tag{6.1} \]

where $W^{(i)}$ is a random variable representing noise (independent between players). We assume each sensor has a common prior probability distribution for $X$, and a distribution for $W^{(i)}$, which may vary between sensor classes but not among sensors of a given class. That is, for any class $J \in \mathcal{I}$ and any sensor $i$ of class $J$, $W^{(i)}$ will have some known distribution $\nu_J$. Details for specific distributions for $W^{(i)}$ (uniform or Gaussian) are given below.

Proportion of “Active” Sensors $\alpha$: Let $\alpha_J$ represent the proportion of sensors of class $J \in \mathcal{I}$ that are active, that is, in high-resolution mode, at a given time. That is,

\[ \alpha_J = \frac{\#\ \text{of active sensors of class } J}{N_J}. \tag{6.2} \]

We then have $\alpha = \sum_{J=1}^{I} r_J \alpha_J$. Letting $N_J \to \infty$ for all $J \in \mathcal{I}$ yields a real value for $\alpha_J$ and $\alpha$. (This is equivalent to assuming a continuum of sensors.)

Reward Function and Policy of each Sensor

The task of the sensor network is to ensure that its end users are sufficiently informed of the environment, without expending more effort than required. By considering an autonomous decision model, we are requiring that each sensor supports this goal by independently deciding whether to report its observations or not, based only on $Y^{(i)}$.
This reduces costly and complicated active message passing for coordination between sensors. We emphasize that sensors are never required to estimate the proportion of active sensors $\alpha$; instead they predict this value purely from their observations $Y^{(i)}$. This is a realistic constraint, since estimating $\alpha$ would require multiple rounds of gameplay to enable each sensor to estimate the actions of its counterparts. However, the autonomous model does not simply imply that a sensor should transmit information whenever its measurement $Y^{(i)}$ is sufficiently large, since it may be better to exploit signal correlation by remaining idle and relying on others to act instead. Indeed, too many active sensors may cause inefficiency due to increased packet collisions (since high-resolution sensors transmit more often), whereas too few may result in insufficient information delivered to the base station. In contrast to the supermodular cases of [180, 194], this gives rise to a game in which the utility is not monotone in $\alpha$. If $\alpha$ is large, there is a congestion effect which causes the utility to decrease as more sensors switch to high-resolution mode. Fortunately, if the amount of congestion is small, a threshold Nash equilibrium still exists.

To address this, we specify that the task of each sensor $i$ is to choose an action $u \in S = \{\text{“High Res”}, \text{“Low Res”}\}$ (i.e. operate in high-resolution or low-resolution mode; only transmit data when in high-resolution mode) so as to maximize the expected value of a local reward:

\[ C^{(i)}(X, \alpha, u) = \begin{cases} h^{(i)}(X, \alpha), & \text{if } u = \text{“High Res”}, \\ 0, & \text{if } u = \text{“Low Res”}. \end{cases} \tag{6.3} \]

The reward function $h^{(i)}(\cdot)$ is class dependent; for any class $J \in \mathcal{I}$ and any sensor $i$ of class $J$, $h^{(i)}(\cdot)$ will equal some known function $h_J(\cdot)$ almost everywhere. The reward $h^{(i)}(\cdot)$ is chosen to extract desirable system behaviour from optimal individual behaviour.
In the next section we will specify an exact form, but the essential intuition is that $h^{(i)} = h^{(i)}(\text{“High Res”}) - h^{(i)}(\text{“Low Res”})$, where $h^{(i)}(\cdot)$ represents the marginal contribution to some system-wide objective function, minus an energy cost in the appropriate mode. The goal of sensor $i$ then is to execute a (possibly randomized) strategy $\mu^{(i)} : Y^{(i)} \to S$ which maximizes $E[C^{(i)}(X, \alpha, \mu)]$, where $\mu = \{\mu^{(i)} : i = 1, 2, \ldots\}$ is a collection of strategies of all sensors. Such a strategy is characterized as follows:

Lemma 6.2.1 $\mu^{(i)}$ is an optimal strategy for Sensor $i$ if

\[ \mu^{(i)}(y^{(i)}) = \begin{cases} \text{“High Res”} & \text{if } E[h^{(i)}(X, \alpha) \mid Y^{(i)} = y^{(i)}] > 0, \\ \text{“Low Res”} & \text{if } E[h^{(i)}(X, \alpha) \mid Y^{(i)} = y^{(i)}] < 0. \end{cases} \tag{6.4} \]

Moreover, $\mu^{(i)}$ is an optimal threshold mode selection strategy if there exists $y^{*(i)}$ such that:

\[ E[h^{(i)}(X, \alpha) \mid Y^{(i)} = y^{(i)}] > 0 \;\Leftrightarrow\; y^{(i)} > y^{*(i)}. \tag{6.5} \]

(Proofs are given in Appendix B.) To clarify what is meant by a threshold mode selection strategy (optimal or otherwise), we supply the following definition:

Definition 6.2.1 A threshold strategy $\mu^{(i)}$ is a strategy for sensor $i$ characterized by:

\[ \mu^{(i)}(y^{(i)}) = \text{“High Res”} \;\Leftrightarrow\; y^{(i)} > y^{*(i)}, \tag{6.6} \]

where $y^{(i)}$ is a realization of the random observation $Y^{(i)}$, and $y^{*(i)}$ is a specified value characterizing the threshold strategy, referred to as the threshold or switching point. Furthermore, a symmetric threshold strategy is a threshold strategy with the property that for any class $J \in \mathcal{I}$ and any sensor $i$ of class $J$, $y^{*(i)} = y_J^*$.

A block diagram depicting sensor operation is given in Figure 6.1. In (6.4), a sensor’s optimal strategy depends on the conditional expectation of $\alpha$, also written $\alpha(\mu)$ to emphasize that the proportion of sensors in “High Res” mode depends on the sensors’ strategies. This interdependence leads us naturally to seek Nash equilibrium solutions, which is the main goal of the chapter.
6.2.3 Predicting the Proportion of Active Sensors

Implicit in the above formulation is that each sensor needs to predict the proportion of other sensors selecting each mode, based on its own noisy observations $Y^{(i)}$ of the environment. Assuming that sensors of the same class use the same threshold strategy for mode selection (symmetric strategies, see Definition 6.2.1), it is possible to write $\alpha$ as a function of $X$, thus simplifying the threshold computation considerably. As above, let $\alpha_J$ be the proportion of sensors of class $J \in \mathcal{I}$ selecting action “High Res” (relative to the total number of sensors of that class).

[Figure 6.1: Block diagram of sensor mode selection problem. Each sensor observes its environment in noise, then selects its mode based on estimates of the environment and the predicted actions of other sensors.]

Since there are $N_J$ total sensors of class $J$, for symmetric threshold strategy $\mu_J(y_J^*)$, the proportion of sensors that choose “High Res” is the proportion who receive a signal larger than $y_J^*$. That is,

\[ \alpha_J = \frac{1}{N_J} \sum_{i=1}^{N_J} I\{y^{(i)} > y_J^*\}. \tag{6.7} \]

Taking limits as $N_J \to \infty$, one obtains $\alpha_J = E[I\{y^{(i)} > y_J^*\}] = \Pr(y^{(i)} > y_J^* \mid X)$. That is:

\[ \alpha_J = \alpha_J(X, \mu_J(y_J^*)) \approx \int_{y_J^*}^{\infty} p_{Y^{(i)}|X,J}(y \mid X)\,dy, \tag{6.8} \]

and hence,

\[ \alpha = \alpha(X, \mu(y^*)) \approx \sum_{J=1}^{I} r_J \int_{y_J^*}^{\infty} p_{Y^{(i)}|X,J}(y \mid X)\,dy. \tag{6.9} \]

Given $X = x$, the probability of $y^{(i)} > y_J^*$ for any sensor $i$ of class $J$ is $\int_{y_J^*}^{\infty} p_{Y^{(i)}|X,J}(y \mid x)\,dy$. By the strong law of large numbers this is also (almost surely) the proportion who receive signal $y^{(i)} > y_J^*$. In the limit, $\alpha$ is thus a deterministic function of $X$, and for finite numbers of sensors, (6.9) represents an expected proportion of active sensors.
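To see how closely the limiting expression (6.9) describes a finite network, the following minimal Python sketch compares it with a simulated proportion as in (6.7). It assumes Gaussian observation noise (as in Section 6.3.2), so that each class integral in (6.9) reduces to a Gaussian tail probability. The thresholds, noise levels, class sizes and the value of $X$ are hypothetical and chosen only for illustration.

```python
import math
import random


def q_function(x):
    """Standard Gaussian tail probability Q(x) = P(Z > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def alpha_limit(x, thresholds, sigmas, proportions):
    """Limiting proportion of active sensors, as in (6.9), for Gaussian noise."""
    return sum(r * q_function((y_star - x) / s)
               for y_star, s, r in zip(thresholds, sigmas, proportions))


def alpha_empirical(x, thresholds, sigmas, counts, rng):
    """Finite-N proportion of active sensors, as in (6.7), by simulation."""
    active = 0
    for y_star, s, n in zip(thresholds, sigmas, counts):
        # Each sensor observes Y = X + W and is active when Y exceeds y_star.
        active += sum(1 for _ in range(n) if x + rng.gauss(0.0, s) > y_star)
    return active / sum(counts)


# Hypothetical two-class deployment (illustrative values only).
rng = random.Random(0)
thresholds = [-1.0, 0.5]   # y_J^* per class
sigmas = [1.0, 0.5]        # noise standard deviation per class
counts = [6000, 4000]      # N_J per class
proportions = [n / sum(counts) for n in counts]
x = 0.2                    # true environmental energy

limit = alpha_limit(x, thresholds, sigmas, proportions)
empirical = alpha_empirical(x, thresholds, sigmas, counts, rng)
print(f"limit: {limit:.4f}  empirical: {empirical:.4f}")
```

With ten thousand simulated sensors the two values typically agree to within about one percent, which is the sense in which $\alpha$ can be treated as a function of $X$ alone.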
In this case, our analysis will be correct under the common notion of “certainty equivalence,” that is, treating the (random) value $\alpha$ as if it were equal to its expected value. This ability to predict the proportion of active sensors, based only on a noisy signal and some assumptions as to the behaviour of other sensors, is central to the global game approach. It implies that sensors need only measure $Y^{(i)}$, not $\alpha$, in order to make a mode selection decision. This in turn allows sensors to act simultaneously, without a costly coordination and consultation phase. The conditional prediction of $\alpha$ above will become critical to the threshold levels and Nash equilibrium results in the next section.

6.3 Nash Equilibrium Threshold Strategies

In (6.4), a sensor’s optimal strategy depends on the conditional expectation of $\alpha$, also written $\alpha(\mu)$ to emphasize that the proportion of sensors in “High Res” mode depends on the sensors’ strategies. This interdependence leads us naturally to seek Nash equilibrium solutions, defined as follows:

Definition 6.3.1 A collection of strategies $\mu$ is a Nash equilibrium if each $\mu^{(i)}$ is optimal for Sensor $i$ given $\alpha(\mu)$.

The rest of this chapter will explore Nash equilibrium strategies in detail, focusing on criteria for the existence of threshold Nash equilibria. Such strategies are straightforward to characterize and simple to implement, since sensors need only follow a simple threshold rule for mode selection. The optimal strategy $\mu^{(i)}$ depends on the expected behaviour of the other sensors through $\alpha$, the total proportion of sensors choosing action “High Res.” This in turn depends on the strategy of the other sensors. We are therefore interested in determining a collection of strategies for each sensor that are simultaneously optimal, i.e. in Nash equilibrium. For simplicity, we concentrate on collections of strategies $\mu$ composed of symmetric threshold strategies, as in Definition 6.2.1.
Such strategies are attractive because they are practical for implementation in simple, inexpensive sensors, since only a single parameter $y_J^*$ need be computed or specified for each sensor class $J \in \mathcal{I}$. We are therefore interested in conditions under which satisfactory performance (i.e. Nash equilibrium) can be obtained by such collections of symmetric threshold strategies. In this section we formulate potential Nash equilibrium threshold strategies for sensors under differing noise conditions (uniform or Gaussian), and give conditions under which the proposed threshold strategies are in Nash equilibrium.

It is straightforward to show that the conditional expectation in Lemma 6.2.1 is continuous in the signal $y^{(i)}$. Hence, it is reasonable to propose as a set of mode selection thresholds the solution to:

\[ \left\{ y_J^* : E[h_J(X, \alpha) \mid Y^{(i)} = y_J^*] = 0 : J \in \mathcal{I} \right\}. \tag{6.10} \]

The corresponding threshold strategy is then optimal for any sensor $i$ of class $J$ if:

\[ E[h_J(X, \alpha) \mid Y^{(i)} = y^{(i)}] > 0 \;\Leftrightarrow\; y^{(i)} > y_J^*. \tag{6.11} \]

If the collection (6.10) is optimal for every sensor, then a symmetric threshold Nash equilibrium is obtained, that is, the collection of symmetric threshold strategies as in Definition 6.2.1 is in Nash equilibrium. We will explore specific conditions under which this holds in Section 6.3.

To motivate the existence of Nash equilibrium threshold strategies for the mode selection problem, suppose the system (global) utility of information gathered when proportion $\alpha$ of sensors are in mode “High Res”, given $X$, is:

\[ H(X, \alpha) = X\alpha + G_s(\alpha). \tag{6.12} \]

The implication here is that higher environmental energy should elicit more high-resolution monitoring from the network, but this trend is perturbed by secondary nonlinear effects, given by $G_s(\alpha)$. Our goal is to maximize the global utility subject to energy constraints in a decentralized manner. Thus, each sensor is equipped with its own utility as follows.
Let the additional energy required by Sensor $i$ in high-resolution mode be proportional to $G_e^{(i)}(\alpha)$, and let $c^{(i)}$ represent the quality of information obtained by $i$. Then if we take $i$’s reward $h^{(i)}$ to be its marginal contribution to the system utility, less its energy cost, we may write:

\[ h^{(i)}(X, \alpha) = c^{(i)} \frac{\partial H}{\partial \alpha}(X, \alpha) - G_e^{(i)}(\alpha) = c^{(i)} X + f^{(i)}(\alpha), \tag{6.13} \]

where

\[ f^{(i)}(\alpha) = c^{(i)} G_s'(\alpha) - G_e^{(i)}(\alpha). \tag{6.14} \]

A practical implementation of such a utility function is given in [191] in the context of Zigbee-enabled unattended ground sensor networks. If the impact of $f^{(i)}(\alpha)$ is small enough, it is reasonable to expect that a threshold strategy will be optimal, since for high values of $X$, (6.13) will be positive, and hence “High Res” will be the optimal action. Based on this intuition, we focus on threshold strategies for the remainder of the chapter.

Using (6.13) and (6.9), we may now quantify the conditional expectation of reward $h_J$ as:

\[ E[h_J(X, \alpha) \mid Y^{(i)} = y^{(i)}] = c_J\, E[X \mid Y^{(i)} = y^{(i)}] + \int_{-\infty}^{\infty} f_J\!\left( \sum_{k \in \mathcal{I}} r_k \int_{y_k^*}^{\infty} p_{Y^{(i)}|X,k}(y \mid x)\,dy \right) p_{X|Y^{(i)},J}(x \mid y^{(i)})\,dx. \tag{6.15} \]

Note there is a cluster effect here: sensors with a higher signal $Y^{(i)}$ tend to have a higher expectation of $X$ and hence predict a higher proportion $\alpha$ of active sensors. This subtle bias must be accounted for in the Nash equilibrium.

In the remainder of this section, we characterize the threshold strategies and give conditions under which the resulting strategy is a Nash equilibrium. We consider two different cases for $X$ and $W^{(i)}$. In Section 6.3.1 we study the case when the prior on $X$ is diffuse and $W^{(i)}$ is uniformly distributed; in Section 6.3.2 we study the case when both $X$ and $W^{(i)}$ have Gaussian distributions.
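The conditional expectation (6.15) can be evaluated numerically once the distributions are fixed. The sketch below is one possible way to do so, assuming the Gaussian prior and noise of Section 6.3.2, for which $X$ given $Y^{(i)} = y$ is Gaussian and each class integral is a Gaussian tail probability. The function names, the trapezoidal quadrature and the interaction term `f_example` are assumptions made for illustration, not part of the model.

```python
import math


def q_function(x):
    """Standard Gaussian tail probability Q(x) = P(Z > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def predicted_alpha(x, thresholds, sigmas, proportions):
    """Proportion of active sensors given X = x, as in (6.9), Gaussian noise."""
    return sum(r * q_function((t - x) / s)
               for t, s, r in zip(thresholds, sigmas, proportions))


def expected_reward(y, c, f, sigma, m, s2, thresholds, sigmas, proportions,
                    points=2001, span=8.0):
    """Approximate (6.15): c E[X | y] + E[f(alpha(X)) | y] by the trapezoid rule."""
    sigma2 = sigma ** 2
    post_mean = (sigma2 * m + s2 * y) / (s2 + sigma2)      # mean of X given Y = y
    post_std = math.sqrt(sigma2 * s2 / (s2 + sigma2))      # std of X given Y = y
    step = 2.0 * span / (points - 1)
    total = 0.0
    for i in range(points):
        z = -span + i * step
        weight = step * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        if i == 0 or i == points - 1:
            weight *= 0.5                                   # trapezoid end points
        x = post_mean + post_std * z
        total += weight * f(predicted_alpha(x, thresholds, sigmas, proportions))
    return c * post_mean + total


def f_example(alpha):
    """Hypothetical interaction term with mild congestion (illustrative only)."""
    return 0.5 * alpha - 0.4 * alpha ** 2 - 0.03


# Hypothetical single-class setting: every other sensor uses threshold -2.
params = dict(c=0.05, f=f_example, sigma=1.0, m=2.0, s2=16.0,
              thresholds=[-2.0], sigmas=[1.0], proportions=[1.0])
rewards = {y: expected_reward(y, **params) for y in (-4.0, -2.0, 0.0)}
for y, value in rewards.items():
    print(f"y = {y:+.1f}: expected reward {value:+.4f}")
```

A sensor applying Lemma 6.2.1 would select “High Res” wherever this quantity is positive; a threshold strategy is optimal when the quantity changes sign only once as $y$ increases.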
6.3.1 Nash Equilibrium Threshold Strategies for Uniform Observation Noise

This section develops the Nash equilibrium threshold strategy for the case where there is no prior information on $X$, which is observed in uniform noise. If sensors are deployed in an unknown environment, there may be no prior knowledge of the distribution of environmental energy $X$, and the distribution of observation noise may be unknown. In this case, one may take a conservative approach, by supposing that:

\[ X \sim \mathrm{Uniform}(-\infty, \infty), \qquad W^{(i)} \sim \mathrm{Uniform}(-\varepsilon_J, \varepsilon_J), \]

where Sensor $i$ is of class $J$. This approach is taken in [187], in the context of players who must decide whether to go to a nightclub (given the quality of music and having decreased utility if the club is either too empty or too crowded). However, that work only allows for one sensor class; we extend the results to multiple classes here. The diffuse nature of the prior distribution does not cause any technical difficulties here, since we are only interested in the posterior distribution on $X$ given $Y^{(i)}$, which is well-defined [185].

The value and equilibrium conditions for threshold strategies in this case are fairly simple, since the uniform distribution has finite support. The thresholds are given as the solution to a system of integral equations, which represents an interesting extension to the known single-class case for global games with congestion. The main result, giving the general form of the threshold and equilibrium conditions, is as follows:

Theorem 6.3.1

1. For uniform noise and diffuse prior, the set of switching points of the symmetric threshold strategy (Definition 6.2.1) in the global game must satisfy the system:

\[ \left\{ y_J^* = -\frac{1}{2\varepsilon_J c_J} \int_{y_J^* - \varepsilon_J}^{y_J^* + \varepsilon_J} f_J\!\left( \sum_{k \in \mathcal{I}} r_k \left[ \frac{x - y_k^* + \varepsilon_k}{2\varepsilon_k} \right]_0^1 \right) dx \;:\; J \in \mathcal{I} \right\}. \tag{6.16} \]

That is, in Nash equilibrium, the threshold strategy must switch at $y_J^*$; if signal $y^{(i)} < y_J^*$ for a sensor of class $J$, then the sensor should select “Low Res” mode, otherwise it should select “High Res” mode.

2. Furthermore, the collection of all threshold strategies is a Nash equilibrium if:

\[ f_J\!\left( \sum_{k \in \mathcal{I}} r_k \left[ \frac{y^{(i)} - \varepsilon_J - y_k^* + \varepsilon_k}{2\varepsilon_k} \right]_0^1 \right) - f_J\!\left( \sum_{k \in \mathcal{I}} r_k \left[ \frac{y^{(i)} + \varepsilon_J - y_k^* + \varepsilon_k}{2\varepsilon_k} \right]_0^1 \right) \le 2\varepsilon_J c_J \quad \forall\, y^{(i)} \tag{6.17} \]

for all $J \in \mathcal{I}$, and the collection of all threshold strategies is not a Nash equilibrium if:

\[ f_J\!\left( \sum_{k \in \mathcal{I}} r_k \left[ \frac{y_J^* - \varepsilon_J - y_k^* + \varepsilon_k}{2\varepsilon_k} \right]_0^1 \right) - f_J\!\left( \sum_{k \in \mathcal{I}} r_k \left[ \frac{y_J^* + \varepsilon_J - y_k^* + \varepsilon_k}{2\varepsilon_k} \right]_0^1 \right) > 2\varepsilon_J c_J \tag{6.18} \]

for any $J \in \mathcal{I}$. The notation $[x]_0^1$ denotes the operation $\min\{1, \max\{0, x\}\}$.

The threshold strategy switching point above is essentially an indifference point; if a sensor receives a signal $Y^{(i)} = y_J^*$, then it will be indifferent between its two modes, since its expected reward is zero either way. The equilibrium condition is really that this point be the only zero-crossing of the expected reward $E[h(X, \alpha) \mid Y^{(i)} = y^{(i)}]$ as a function of $y^{(i)}$. Solutions to this nonlinear system may be easily obtained numerically, e.g. by Newton’s method, or by online adaptive computation, as in the discussion on self-configuration in Section 6.4.1. Note that if $\varepsilon_J$ is large, then (6.17) is always satisfied if $f_J$ is bounded.

To gain some intuition and connection with other global game results, suppose the index set $\mathcal{I}$ is rearranged so that $y_1^* < y_2^* < \ldots$, and consider the special “disjoint switching” case where $y_J^* + \varepsilon_J \le y_k^* - \varepsilon_k$ for all $J < k$. This occurs when the observation noise $\varepsilon_k$ becomes small for all classes $k \in \mathcal{I}$. This leads to the following corollary:

Corollary 6.3.1 Define $R_J = \sum_{k \le J} r_k$ (where $r_k$ is the proportion of all sensors of class $k \in \mathcal{I}$) and assume the switching points of the threshold strategies satisfy $y_1^* < y_2^* < \ldots$ and $y_J^* + \varepsilon_J \le y_k^* - \varepsilon_k$ for all $J < k$.
We obtain for (6.16):

\[ y_J^* = -\frac{1}{c_J r_J} \int_{R_{J-1}}^{R_J} f_J(\alpha)\,d\alpha = -\frac{1}{c_J} \int_{R_{J-1}/r_J}^{R_J/r_J} f_J(r_J \tau)\,d\tau. \tag{6.19} \]

Corollary 6.3.1 reveals a novel result: the “multi-class Laplacian” condition for mode selection. This is analogous to the single-class version, detailed in [187, 194], which can be recovered by considering just one class $J$ (and absorbing $c_J$ into $f_J$), yielding $y^* = -\int_0^1 f(\alpha)\,d\alpha$. This implies that the mode selection threshold could be equivalently obtained by simply assuming that the proportion $\alpha$ of high-resolution sensors is uniformly distributed over the interval $[0, 1]$. In the case of multiple classes (in the disjoint switching case), the assumption is that the proportion $\alpha_J$ of high-resolution sensors of one’s own class is uniformly distributed, while sensors of other classes are either all active ($k < J$) or all inactive ($k > J$).

Corollary 6.3.2 For a sensor network consisting of a single sensor class (i.e. $\mathcal{I} = \{1\}$), (6.17) reduces to:

\[ f\!\left( \left[ \frac{y^{(i)} - y^*}{2\varepsilon} \right]_0^1 \right) - f\!\left( \left[ \frac{y^{(i)} - y^*}{2\varepsilon} + 1 \right]_0^1 \right) \le 2\varepsilon \quad \forall\, y^{(i)}, \tag{6.20} \]

or equivalently,

\[ f(0) - \min_{\alpha \in [0,1]} f(\alpha) \le 2\varepsilon, \quad \text{and} \quad \max_{\alpha \in [0,1]} f(\alpha) - f(1) \le 2\varepsilon, \tag{6.21} \]

and (6.18) becomes:

\[ f(0) - f(1) > 2\varepsilon. \tag{6.22} \]

These are very similar to the conditions in Remarks 1 and 2 of [187], but more general, as we do not require $f(0) = 0$ or quasi-concavity. The interpretation there (and here) is that $f$ cannot fall too much from its beginning to its minimum value, or from its maximum to its ending value. In the words of [187], there cannot be too much “congestion” in the system, relative to the noise. Conditions (6.17) and (6.18) essentially require the same property, but the overlap of classes makes the role of congestion more subtle.

6.3.2 Threshold Strategies for Gaussian Priors and Noise

This section develops the Nash equilibrium threshold strategy for the case where the prior information on $X$ and the observation noise are both characterized by Gaussian distributions.
In many applications, more prior information about $X$ and $W^{(i)}$ will be available, which sensors may incorporate into their computations. In this section, we repeat the analysis of Section 6.3.1 for Gaussian conditions, which are commonly encountered in practice. For sensor $i$ of class $J$, assume:

\[ X \sim N(m, s^2), \qquad W^{(i)} \sim N(0, \sigma_J^2). \]

Then by well-known properties of the Gaussian distribution, we obtain:

\[ Y^{(i)} \sim N(m, s^2 + \sigma_J^2), \qquad Y^{(i)} \mid (X = x) \sim N(x, \sigma_J^2), \qquad X \mid (Y^{(i)} = y^{(i)}) \sim N\!\left(\nu_J^{(i)}, \omega_J^2\right), \]

\[ \text{where } \nu_J^{(i)} = \nu_J(y^{(i)}) = \frac{\sigma_J^2}{s^2 + \sigma_J^2}\, m + \frac{s^2}{s^2 + \sigma_J^2}\, y^{(i)}, \qquad \omega_J^2 = \frac{\sigma_J^2 s^2}{s^2 + \sigma_J^2}. \tag{6.23} \]

The threshold values and equilibrium conditions for this case are given by the following theorem:

Theorem 6.3.2 Let

\[ \Psi_J(y, z) = \sum_{k \in \mathcal{I}} r_k\, Q\!\left( \frac{y_k^* - \omega_J z - \nu_J(y)}{\sigma_k} \right). \tag{6.24} \]

Then,

1. For Gaussian noise and prior, the set of switching points of the threshold strategy (Definition 6.2.1) in the global game must satisfy the system:

\[ \left\{ y_J^* = -\frac{\sigma_J^2 m}{s^2} - \frac{(s^2 + \sigma_J^2)}{c_J s^2} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f_J\!\left(\Psi_J(y_J^*, z)\right) e^{-z^2/2}\,dz \;:\; J \in \mathcal{I} \right\}. \tag{6.25} \]

2. Furthermore, the collection of all threshold strategies is a Nash equilibrium if:

\[ -\frac{1}{c_J} \sum_{k \in \mathcal{I}} \frac{r_k}{\sqrt{2\pi}\,\sigma_k} \int_{-\infty}^{\infty} f_J'\!\left(\Psi_J(y, z)\right) e^{-\left( \frac{y_k^* - \omega_J z - \nu_J(y)}{\sigma_k} \right)^2 / 2}\, e^{-z^2/2}\,dz < 1, \tag{6.26} \]

for all $y$ and all $J \in \mathcal{I}$, and the collection of all threshold strategies is not a Nash equilibrium if:

\[ -\frac{1}{c_J} \sum_{k \in \mathcal{I}} \frac{r_k}{\sqrt{2\pi}\,\sigma_k} \int_{-\infty}^{\infty} f_J'\!\left(\Psi_J(y_J^*, z)\right) e^{-\left( \frac{y_k^* - \omega_J z - \nu_J(y_J^*)}{\sigma_k} \right)^2 / 2}\, e^{-z^2/2}\,dz > 1, \tag{6.27} \]

for any $J \in \mathcal{I}$.

Corollary 6.3.3 As in the uniform case, suppose the index set $\mathcal{I}$ of sensor classes is rearranged so that $y_1^* < y_2^* < \ldots$, and consider the limiting case in which $\sigma_k^2 \to 0$ for all $k \in \mathcal{I}$, and such that $\sigma_J/\sigma_k < \infty$ for all $J, k \in \mathcal{I}$. Then for each $J \in \mathcal{I}$, the threshold point satisfies:

\[ y_J^* \to -\frac{1}{c_J \sqrt{2\pi}} \int_{-\infty}^{\infty} f_J\!\left(R_{J-1} + r_J Q(-z)\right) e^{-z^2/2}\,dz = -\frac{1}{c_J r_J} \int_{R_{J-1}}^{R_J} f_J(\alpha)\,d\alpha, \tag{6.28} \]

where $R_J = \sum_{k \le J} r_k$.
This is exactly the same “multi-class Laplacian” condition as in the uniform case [187, 194], which is intuitive since the two probability models converge as noise is decreased. Unfortunately, the derivative condition, especially (6.26), does not easily yield a simple closed-form expression here as in the uniform case. Clearly, however, if $f_J' > 0$ everywhere, then the threshold strategy for mode selection is a Nash equilibrium. Moreover, if $\sigma_k$ is large, then (6.26) is always satisfied. The equilibrium condition (6.27) for the small-noise case is given as follows:

Corollary 6.3.4 If $\sigma_k$ is small for all $k$, then $\mu(y^*)$ is not an equilibrium if:

\[ \sigma_J < -\frac{1}{c_J} \int_{R_{J-1}}^{R_J} f_J'(\alpha)\,d\alpha, \quad \text{for any } J \in \mathcal{I}. \tag{6.29} \]

We will derive more tractable conditions in Section 6.4.2 for the case when $\sigma_k$ is small.

6.4 Further Analysis of Mode Selection Thresholds

In this section, we consider some practicalities involved with implementing threshold strategies for a wireless sensor network. First, the problem of decentralized self-configuration of the threshold strategies is addressed in Section 6.4.1 by an iterative algorithm for threshold computation. Next, hypothesizing that sensors may reduce their noise by averaging their accumulated history of measurements, we examine the robustness of the equilibrium property of the threshold strategy. This results in an interesting “near” equilibrium condition which can be easily checked to determine whether threshold operation is appropriate for a given scenario.

6.4.1 Adaptive Self-Configuration of Mode Selection Thresholds

For a self-configuring sensor network, it is necessary that each sensor determines its strategy without supervision. Since we are concerned with symmetric threshold strategies in this chapter, we outline in this section an iterative method for computing a set of thresholds for mode selection that are a candidate Nash equilibrium.
The threshold depends only on the sensor’s class, and the proportion of other classes (locally) present. Since classes change infrequently (they are the parameters of the system), it is reasonable to determine thresholds iteratively through repeated interaction. We propose the following iterative procedure for computing the set of thresholds $\{y_J^* : J \in \mathcal{I}\}$:

Algorithm 6.4.1 Dynamic Threshold Configuration:

1. Each sensor of class $J$ (or group of sensors) proposes an initial threshold $y_J^*(0)$.
2. For $n = 1, 2, \ldots$ until $y_J^*(n) \approx y_J^*(n-1)$, repeat:

\[ y_J^*(n) = z : E\!\left[h_J\!\left(X, \alpha(X, \mu(y^*(n-1)))\right) \mid Y^{(i)} = z\right] = 0. \tag{6.30} \]

Referring to (6.16) and (6.25), this procedure can be written as the iteration of a map $y^*(n) = T(y^*(n-1))$, where $T$ has components:

\[ T_J(y^*) = -\frac{1}{2\varepsilon_J c_J} \int_{y_J^* - \varepsilon_J}^{y_J^* + \varepsilon_J} f_J\!\left( \sum_{k \in \mathcal{I}} r_k \left[ \frac{x - y_k^* + \varepsilon_k}{2\varepsilon_k} \right]_0^1 \right) dx \tag{6.31} \]

for the uniform case, and

\[ T_J(y^*) = -\frac{\sigma_J^2 m}{s^2} - \frac{(s^2 + \sigma_J^2)}{c_J s^2} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f_J\!\left( \sum_{k \in \mathcal{I}} r_k\, Q\!\left( \frac{y_k^* - \omega_J z - \nu_J(y_J^*)}{\sigma_k} \right) \right) e^{-z^2/2}\,dz \tag{6.32} \]

for the Gaussian case. By Banach’s fixed point theorem [188], if $T$ is a contraction, then the iterative computation will converge. Moreover, the limit point is unique; it is the only point satisfying $y^* = T(y^*)$. The fixed point theorem in this case is stated as follows:

Theorem 6.4.1 Suppose $T$ is a contraction, that is, there exists a $q < 1$ such that (using the sup-norm):

\[ \sup_{J \in \mathcal{I}} |T_J(y) - T_J(w)| \le q \sup_{J \in \mathcal{I}} |y_J - w_J| \quad \text{for all } y, w \in \mathbb{R}^I. \]

Then Algorithm 6.4.1 converges to a unique threshold $y^*$. Furthermore, convergence is exponentially fast, in the sense that:

\[ \|T(y^*) - T(y^*(n))\| \le \frac{q^n}{1 - q} \|T(y^*(1)) - T(y^*(0))\|. \tag{6.33} \]

Theorem 6.4.1 is useful because it significantly relaxes the conditions for the iterative procedure; proposals do not have to be universally circulated, or even communicated at all. Small groups or individual sensors can compute the thresholds so as to avoid excessive communication costs.
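The iteration of Algorithm 6.4.1 can be sketched in a few lines for the uniform-noise map (6.31). The following Python sketch is illustrative only: the integral is approximated with a midpoint rule, and the two interaction terms, noise widths, class proportions and weights are hypothetical values chosen so that the map is a contraction.

```python
def clip01(v):
    """The operation [v]_0^1 = min{1, max{0, v}}."""
    return min(1.0, max(0.0, v))


def alpha_uniform(x, thresholds, eps, proportions):
    """Proportion of active sensors given X = x under uniform noise."""
    return sum(r * clip01((x - t + e) / (2.0 * e))
               for t, e, r in zip(thresholds, eps, proportions))


def threshold_map(thresholds, eps, proportions, c, f, points=2000):
    """One application of the map T in (6.31), using a midpoint rule."""
    updated = []
    for J, (t, e) in enumerate(zip(thresholds, eps)):
        step = 2.0 * e / points
        integral = step * sum(
            f[J](alpha_uniform(t - e + (i + 0.5) * step,
                               thresholds, eps, proportions))
            for i in range(points))
        updated.append(-integral / (2.0 * e * c[J]))
    return updated


def configure_thresholds(initial, eps, proportions, c, f,
                         tol=1e-9, max_iter=500):
    """Iterate y(n) = T(y(n-1)) until successive thresholds agree within tol."""
    y = list(initial)
    for _ in range(max_iter):
        y_next = threshold_map(y, eps, proportions, c, f)
        if max(abs(a - b) for a, b in zip(y, y_next)) < tol:
            return y_next
        y = y_next
    raise RuntimeError("iteration did not settle; T may not be a contraction")


# Hypothetical two-class example (all values are illustrative assumptions).
f = [lambda a: 0.5 * a - 0.4 * a ** 2,   # class 1: mild congestion
     lambda a: 0.3 * a - 0.1]            # class 2: no congestion
eps = [1.0, 1.0]
proportions = [0.5, 0.5]
c = [1.0, 2.0]

y_star = configure_thresholds([0.0, 0.0], eps, proportions, c, f)
print(y_star)
```

Because each sensor only needs its class parameters and the class proportions, the same loop could be run locally by a single sensor or a small group, in line with the remark above about avoiding communication costs.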
In practice (see Section 6.5.3), we found that $T$ was a contraction for a wide range of examples.

6.4.2 Robustness of Threshold under Measurement Averaging

As measurements are accumulated by sensors, they may be averaged to produce a better estimate for $X$. It is therefore natural that there will be less uncertainty as time goes on, as represented by reduced noise $\sigma_J^2$ or $\varepsilon_J$. The purpose of this section is to determine when the threshold strategy $\mu(y^*)$ remains a Nash equilibrium as sensors gain more information about their environment. In the Gaussian case, it is easy to see how noise in $W^{(i)}$ may decrease in time, by simply taking a time average of past measurements. That is, take

\[ Y^{(i)} = X + \frac{1}{N} \sum_{n=1}^{N} W_n^{(i)} = X + \bar{W}^{(i)}. \tag{6.34} \]

(If $X$ evolves according to known dynamics, this can be generalized using a Kalman filtering approach.) Assuming the noise samples are independent over time, the variance of $\bar{W}^{(i)}$ is reduced to $\sigma_J^2/N$. In the uniform case, averaged measurements also lead to an approximately Gaussian distribution for $\bar{W}^{(i)}$ by the central limit theorem.

For low noise conditions, there is a simple necessary and sufficient condition for $\mu(y^*)$ to be “nearly” a Nash equilibrium. That is, as long as the noise value is less than a given maximum value, and as long as some simple conditions are satisfied, it is optimal for Sensor $i$ of class $J$ to follow threshold strategy $\mu_J(y_J^*)$ except for signal $y^{(i)}$ on a very narrow interval. Moreover, $y_J^*$ is particularly simple to compute in this case, as we saw in (6.19) and (6.28). Our result is stated as follows (for completeness we consider both uniform and Gaussian noise, although only Gaussian is required in the motivation above):

Theorem 6.4.2 For each sensor $i$ of class $J \in \mathcal{I}$, suppose $\varepsilon_J$ (uniform noise, Section 6.3.1) or $\sigma_J$ (Gaussian noise, Section 6.3.2) are small, and $\sigma_J/\sigma_k$ is near unity for all $J, k \in \mathcal{I}$.
Then, the collection of threshold strategies of all sensors, denoted $\mu(y^*)$, is a Nash equilibrium everywhere except possibly on the small region where any Sensor $i$ receives a signal $y^{(i)}$ “close to” $y_k^*$ for some $k \in \mathcal{I}$, if and only if the following conditions on the measurement at each sensor $i$ hold: For all $J \in \mathcal{I}$,

\[ \begin{aligned} y^{(i)} > y_J^* \;&\Rightarrow\; y^{(i)} > -\frac{1}{c_J} f_J\!\left( \sum_{k \in \mathcal{I}} r_k I\{y^{(i)} > y_k^*\} \right), \text{ and,} \\ y^{(i)} < y_J^* \;&\Rightarrow\; y^{(i)} < -\frac{1}{c_J} f_J\!\left( \sum_{k \in \mathcal{I}} r_k I\{y^{(i)} > y_k^*\} \right), \end{aligned} \tag{6.35} \]

where $c_J$, $f_J$ are utility parameters as in (6.13), (6.14), and $r_J$ is the proportion of sensors of class $J$, for $J \in \mathcal{I}$. The small region where following the threshold strategy may not be optimal is given by:

\[ y^{(i)} : |y^{(i)} - y_k^*| < \varepsilon_J + \varepsilon_k \text{ for some } k \in \mathcal{I} \quad \text{(uniform), or,} \tag{6.36} \]

\[ y^{(i)} : |y^{(i)} - y_k^*| < 3(\sigma_J + \sigma_k) \text{ for some } k \in \mathcal{I} \quad \text{(Gaussian),} \tag{6.37} \]

for sensor $i$ of class $J$.

The advantage here is that the right-hand side of (6.35) is piecewise constant, rendering the expected utility computation particularly simple. This allows us to specify the necessary and sufficient condition for the threshold strategy $\mu(y^*)$ to be “nearly” an equilibrium. When only a single sensor class is present (i.e. $\mathcal{I} = \{1\}$), it is straightforward to show Theorem 6.4.2 is equivalent to the following:

Corollary 6.4.1 Suppose there is one class of sensor, with small $\varepsilon$ or $\sigma$. Then the threshold strategy $\mu(y^*)$ for mode selection is a Nash equilibrium except possibly for:

\[ y^{(i)} : |y^{(i)} - y^*| < 2\varepsilon, \quad \text{or,} \quad y^{(i)} : |y^{(i)} - y^*| < 6\sigma, \tag{6.38} \]

if and only if:

\[ f(0) < \int_0^1 f(\alpha)\,d\alpha < f(1). \tag{6.39} \]

Theorem 6.4.2 and Corollary 6.4.1 give simple conditions under which the threshold strategy remains a near-Nash equilibrium for small noise conditions. In design, this is useful for determining system parameters for which the threshold strategy is a good choice under any noise conditions.
If the conditions are not satisfied, a closer look at the anticipated noise conditions should reveal whether or not such a strategy is appropriate.

This result should be contrasted with those of [187, 194], which suggest that optimality of the Nash equilibrium is not preserved in the limit. Strictly speaking, the same is true in our case; for example, as $\varepsilon_J \to 0$, the equilibrium conditions (6.17) and (6.18) converge to the condition:

\[ \mu_J(y_J^*) \text{ is in equilibrium} \;\Leftrightarrow\; f_J(R_{J-1}) < f_J(R_J). \tag{6.40} \]

That is, $f_J$ must be increasing at discrete points; if we allow arbitrary $r_J$, $f_J$ must be increasing everywhere. Thus, any amount of congestion (negative slope) will eventually lead to suboptimality of the threshold strategy. However, our point is that for low noise conditions, this should not matter, since the signal $y^{(i)}$ will rarely fall in the small regions where the threshold strategy is suboptimal. Moreover, in the limit as the noise goes to zero, these regions vanish altogether.

If the equilibrium condition of Theorem 6.4.2 fails, we can still use Corollary 6.3.4 to say exactly when the threshold fails. Specifically, when $N$ measurements have been taken such that:

\[ \frac{\sigma_J}{\sqrt{N}} < -\frac{1}{c_J} \int_{R_{J-1}}^{R_J} f_J'(\alpha)\,d\alpha, \quad \text{for any } J \in \mathcal{I}, \tag{6.41} \]

then the threshold strategy $\mu(y^*)$ can no longer be a Nash equilibrium. This is again a consequence of high congestion, since (6.41) is more quickly satisfied when $f_J$ is decreasing quickly on average over the given region.

We conclude with a few words about non-threshold Nash equilibria. If the conditions above are not met, Nash equilibria for the global game still exist, but are highly complex. In [187] it was suggested that each individual player could have a different strategy in equilibrium, consisting of disjoint signal regions over which the optimal action alternates between “Low Res” and “High Res.” These strategies are deterministic (by Lemma 6.2.1), but inherently unpredictable.
Such erratic behaviour is not desirable in sensor networks, nor are the equilibria easy to compute, and we suggest that it should prompt further design measures to alleviate congestion in the game.

6.4.3 Robustness of Threshold under Variance-based Measurement Quality

In this section we suppose that the environmental energy $X$ is the key characteristic to be probed by the sensor network. Instead of averaging multiple measurements as in Section 6.4.2, one can use previous measurements to reduce uncertainty in the prior distribution of $X$. In addition, the quality-of-information weighting $c^{(i)}$ in (6.13) can be tied (in the Gaussian case) to the ratio between observation noise and uncertainty. The latter is reflected by taking:

\[ c^{(i)} = d^{(i)} \frac{s^2}{\sigma^{(i)2}}, \tag{6.42} \]

for some utility parameter $d^{(i)}$. This affects the utility by which sensors take decisions:

\[ h^{(i)}(X, \alpha) = d^{(i)} \frac{s^2}{\sigma^{(i)2}} X + d^{(i)} \frac{s^2}{\sigma^{(i)2}} G_s'(\alpha) - G_e^{(i)}(\alpha). \tag{6.43} \]

In this section we examine behaviour of the threshold and Nash equilibrium under this reformulation, as the variance $s^2$ on the prior distribution of $X$ evolves under successive observations. In general, if the “environmental energy” $X$ evolves according to a linear system, taking value $X_n$ at discrete time $n$, Sensor $i$ can use a Kalman filter to deterministically update the variance $s_n^{(i)2}$ of $X_n$ over time. For the static case, we have by (6.23) that the sequence $s_1^{(i)2}, s_2^{(i)2}, \ldots$ satisfies:

\[ s_n^{(i)2} = \frac{s_{n-1}^{(i)2}\, \sigma^{(i)2}}{s_{n-1}^{(i)2} + \sigma^{(i)2}}. \tag{6.44} \]

Clearly then $\lim_{n \to \infty} s_n^{(i)2} = 0$. Under this reformulation, the following result holds:

Lemma 6.4.1 The threshold points $y_J^* \to \infty$. Furthermore, the threshold strategy cannot be an equilibrium if $G_e'(\alpha) > 0$ everywhere.

The proof of Lemma 6.4.1 follows straightforwardly from substitution of (6.43) and letting $s^{(i)2} \to 0$. Lemma 6.4.1 holds even in the limit, as $\sigma_k^2 \to 0$ for all $k$, and such that $\sigma_J/\sigma_k < \infty$ everywhere.
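The decay of the prior variance under (6.44) is easy to check numerically. The short Python sketch below iterates the recursion and compares it with the closed form $s_n^2 = s_0^2 \sigma^2 / (\sigma^2 + n s_0^2)$, which follows from (6.44) because the reciprocal variance grows by $1/\sigma^2$ at each step. The function names and the starting values are assumptions for illustration.

```python
def prior_variance_sequence(s2_initial, sigma2, steps):
    """Iterate the static-case recursion (6.44) for the prior variance of X."""
    sequence = [s2_initial]
    for _ in range(steps):
        s2 = sequence[-1]
        sequence.append(s2 * sigma2 / (s2 + sigma2))
    return sequence


def prior_variance_closed_form(s2_initial, sigma2, n):
    """Closed form of (6.44): precisions add, 1/s_n^2 = 1/s_0^2 + n/sigma^2."""
    return s2_initial * sigma2 / (sigma2 + n * s2_initial)


# Hypothetical values: prior variance 16, observation noise variance 1.
variances = prior_variance_sequence(16.0, 1.0, 50)
for n in (0, 1, 3, 10, 50):
    print(n, round(variances[n], 6), round(prior_variance_closed_form(16.0, 1.0, n), 6))
```

Since the weighting $c^{(i)}$ in (6.42) is proportional to $s^2$, it shrinks at the same rate of roughly $\sigma^2/n$, which is what drives the threshold upward in Lemma 6.4.1.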
Note that $G_e'(\alpha) > 0$ is a more or less standard assumption, since greater congestion is likely to lead to more energy required for transmission.

Lemma 6.4.1 shows that, if sensors price their information according to the ratio between prior uncertainty and observation uncertainty, then initially the threshold may be low, but as prior uncertainty decreases, the threshold will increase until all sensors remain in their low-power, low-resolution mode. At this point there is simply no incentive to make further costly measurements, since $X$ is already known with near certainty.

6.5 Numerical Examples

We have shown that for large noise conditions, $\mu(y^*)$ is a Nash equilibrium, and for small noise conditions, a simple test can be performed to check whether this is the case. To gain some intuition as to when the equilibrium holds at intermediate values, we turn to numerical examples. We begin by analyzing a sensor network comprising a single class of sensor with weak congestion effects. This is then extended to multiple sensor classes and to stronger congestion, which eventually destroys the equilibrium property of the threshold strategy for mode selection. We conclude by investigating the convergence properties of the iterative procedure for threshold computation in Section 6.4.1.

6.5.1 Single Utility, Weak Congestion

To begin, we assume each sensor $i$ has identical utility parameters: $c^{(i)} = 0.05$, $f^{(i)}(\alpha) = -0.4106\alpha^2 + 0.5517\alpha + 0.0301$; see Figure 6.2. In this case, $f^{(i)}$ decreases when the system becomes crowded (large $\alpha$), but this congestion is never sufficient to switch any sensor back to its low-resolution mode. In fact, calculating $f(0) = -0.03$, $\int_0^1 f(\alpha)\,d\alpha = 0.1089$, and $f(1) = 0.111$, we see that Corollary 6.4.1 is satisfied, so the threshold strategy $\mu(y^*)$ will always be a Nash equilibrium.
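The three quantities used in this check can be reproduced with a short Python sketch. One caveat, stated as an observation rather than a correction: evaluating the polynomial with the constant term $+0.0301$ as printed gives $f(0) = 0.0301$, an integral of about $0.1691$ and $f(1) = 0.1712$, whereas the quoted values ($-0.03$, $0.1089$, $0.111$) are reproduced when the constant term is taken as $-0.0301$. The sketch therefore takes the constant as a parameter so that both readings can be compared. The helper names are hypothetical, and the ratio $-\frac{1}{c}\int_0^1 f(\alpha)\,d\alpha$ is included only because, under the negative constant, it matches the threshold $-2.178$ quoted for this example; treating it as the small-noise threshold formula is an assumption.

```python
C_WEIGHT = 0.05  # c^(i) from the first example


def f(alpha, constant):
    """Quadratic interactive utility; the constant term is left as a parameter."""
    return -0.4106 * alpha**2 + 0.5517 * alpha + constant


def integral_0_1(constant):
    """Exact integral of f over [0, 1], term by term."""
    return -0.4106 / 3.0 + 0.5517 / 2.0 + constant


def summary(constant):
    """Collect f(0), the integral, f(1), and the ratio -integral / c."""
    integral = integral_0_1(constant)
    return {
        "f0": f(0.0, constant),
        "integral": integral,
        "f1": f(1.0, constant),
        "ratio": -integral / C_WEIGHT,
    }


if __name__ == "__main__":
    for constant in (0.0301, -0.0301):
        print(constant, summary(constant))
```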
The threshold value is constant for the uniform noise case at $y^* = -2.178$, which is also the limiting threshold value for small noise conditions in the Gaussian case.

Figure 6.3 shows the single crossing property of $E[h^{(i)}(X, \alpha) \mid Y^{(i)} = y^{(i)}]$, which ensures the threshold strategy is an equilibrium. That is, it is optimal for each sensor to select “High Res” mode if $y^{(i)} > y^*$, since the expected reward is positive in this case, and for each sensor to select “Low Res” mode if $y^{(i)} < y^*$, where the expected reward is negative. This property is preserved even when the noise becomes small. The figure is shown for a Gaussian prior on $X$ with mean $m = 2$ and variance $s^2 = 16$, and several Gaussian noise distributions. Note how the threshold $y^*$ drifts toward the limiting value as $\sigma^2$ decreases, according to Theorem 6.3.2. The uniform noise case behaves similarly, except that the threshold remains constant.

[Figure 6.2 plot, titled "Sensor Utility for Weak Congestion". Horizontal axis: Proportion of Active Sensors $\alpha$. Vertical axis: Utility Value. Curves: $c^{(i)} G_s'(\alpha)$, $G_e^{(i)}(\alpha)$, $f^{(i)}(\alpha)$.]

Figure 6.2: Sensor utility function for first example. The interactive component $f^{(i)}(\alpha)$ is the difference between the expected contribution to the end-user’s information, $c^{(i)} G_s'(\alpha)$, and the energy cost, $G_e^{(i)}(\alpha)$, as in (6.14).

[Figure 6.3 plot, titled "Evolution of Equilibrium of Threshold for Single Type (Weak Congestion)". Horizontal axis: Observed Signal $y^{(i)}$. Vertical axis: $E[h(X, \alpha) \mid Y^{(i)} = y^{(i)}]$. Curves: $\sigma = 1$, $\sigma = 0.5$, $\sigma = 0.1$.]

Figure 6.3: As the observation noise $\sigma$ decreases, the threshold $y^*$ increases, and more fluctuation is observed in the expected utility. However, the utility function is such that there is a single crossing as $\sigma \to 0$, so the threshold strategy is a Nash equilibrium.
[Figure 6.4 plot, titled "Equilibrium of Threshold, Weak Congestion & Multiple Noise Types". Horizontal axis: Observed Signal $y^{(i)}$. Vertical axis: $E[h(X, \alpha) \mid Y^{(i)} = y^{(i)}]$. Legend: $\sigma_1 = 0.05$, $\sigma_2 = 1$, $\sigma_3 = 2$.]

Figure 6.4: Example of the expected utility and thresholds (zero-crossing points) for multiple sensor classes. Lower noise corresponds to more fluctuation in expected utility, but does not necessarily result in a lower threshold.

A similar result holds when sensors have identical utility parameters, but are separated into multiple classes based on their noise variances $\sigma^{(i)}$. In Figure 6.4, we show the values of $E[h^{(i)}(X, \alpha) \mid Y^{(i)} = y^{(i)}]$ for three noise classes: $\sigma_1 = 0.05$, $\sigma_2 = 0.5$, $\sigma_3 = 2$, with $r_1 = 0.4$, $r_2 = r_3 = 0.3$. These result in slightly different thresholds $\{y_1^*, y_2^*, y_3^*\}$, but again the equilibrium property is preserved. Interestingly, we have $y_3^* < y_1^* < y_2^*$, which is not reflected in the noise conditions $\sigma_1 < \sigma_2 < \sigma_3$. This implies that the interaction between sensors is complex, which may not be appreciated by a more naive approach.

6.5.2 Multiple Utilities, Mixed Congestion

The previous utility function represents a limiting case, since $\int_0^1 f(\alpha)\,d\alpha \approx f(1)$. If the amount of congestion is increased even slightly, e.g. by lowering $c^{(i)}$ or by increasing the energy cost $G_e(\alpha)$, the threshold strategy $\mu(y^*)$ falls out of equilibrium in low noise conditions.

[Figure 6.5 plot, titled "Sensor Utilities for Mixed Congestion". Horizontal axis: Proportion of Active Sensors $\alpha$. Vertical axis: Utility Value. Curves: $f_1(\alpha)$, $f_2(\alpha)$, $f_3(\alpha)$.]

Figure 6.5: Sensor utility functions for second example. The degree of concavity in $f_J(\alpha)$, for classes $J = 1, 2, 3$, corresponds to each class’s sensitivity to congestion; for sufficient concavity and low noise the threshold strategy fails to constitute a Nash equilibrium.
We define three sensor classes, as follows:

$$\begin{array}{lll}
r_1 = 0.4 & c_1 = 0.04 & f_1(\alpha) = -0.3685\alpha^2 + 0.4414\alpha + 0.0141 \\
r_2 = 0.3 & c_2 = 0.05 & f_2(\alpha) = -0.4106\alpha^2 + 0.5517\alpha + 0.0301 \\
r_3 = 0.3 & c_3 = 0.06 & f_3(\alpha) = -0.5527\alpha^2 + 0.662\alpha + 0.0462
\end{array} \qquad (6.45)$$

The first two classes have $G_s$ and $G_e$ as before; the third class finds it more costly to transmit when $\alpha$ is high. Since Type 3 has a higher contribution $c_3$, we expect that it should have a lower threshold, but its higher cost of transmission suggests that the threshold will be optimal less often. Utilities for all three classes are shown in Figure 6.5.

For this scenario, we can compute the thresholds for small signals via (6.25) to yield $y_1^* = -2.8889$, $y_2^* = -2.7822$, $y_3^* = -2.8893$. However, Theorem 6.4.2 does not hold in this case, since, for example, if $y^{(i)} = -2.5$, then $y^{(i)} > y_1^*$ but

$$y^{(i)} < -c_1^{-1} f_1\Big(\sum_{k \in I} r_k\, I\{y^{(i)} > y_k^*\}\Big) = -c_1^{-1} f_1(1) = -2.175. \qquad (6.46)$$

Figure 6.6: Illustration of the conditions for equilibrium in Theorem 6.3.2 for the second example. If the noise value $(\sigma_1, \sigma_2, \sigma_3)$ lies above the surface in Figure 6.6a, Theorem 6.3.2 guarantees the threshold strategy to be a Nash equilibrium. If it lies below the surface in Figure 6.6b, it cannot be a Nash equilibrium. The circles represent test points which lie on the actual equilibrium boundary, between the two surfaces.

To establish for which noise values $\sigma_1, \sigma_2, \sigma_3$ equilibrium holds, we plot the bounds given in Theorem 6.3.2 in Figure 6.6 for a large number of test cases. The surface in Figure 6.6a corresponds to condition (6.26); if the noise triple lies above this surface, the threshold strategy is guaranteed to be a Nash equilibrium. Likewise, the surface in Figure 6.6b corresponds to condition (6.27); if the noise triple lies below this surface, the threshold strategy cannot be a Nash equilibrium. We also show a number of boundary points (shown as circles in Figure 6.6) at which equilibrium just holds.
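The counterexample in (6.46) can be checked numerically with the class parameters of (6.45) and the thresholds quoted above. The Python sketch below is a minimal illustration under two stated assumptions: the payoff has the form $h = c_J X + f_J(\alpha)$, which is (6.43) with $c$ as in (6.42) and $f = c\,G_s' - G_e$ as in the caption of Figure 6.2, and in the zero-noise limit each sensor observes $X$ exactly, so $X = y$. The dictionary layout and function names are hypothetical.

```python
# Class parameters from (6.45) and the small-noise thresholds quoted in the text.
CLASSES = {
    1: {"r": 0.4, "c": 0.04, "coef": (-0.3685, 0.4414, 0.0141), "y_star": -2.8889},
    2: {"r": 0.3, "c": 0.05, "coef": (-0.4106, 0.5517, 0.0301), "y_star": -2.7822},
    3: {"r": 0.3, "c": 0.06, "coef": (-0.5527, 0.6620, 0.0462), "y_star": -2.8893},
}


def f(j, alpha):
    """Interactive utility f_J(alpha) of class j."""
    a, b, d = CLASSES[j]["coef"]
    return a * alpha**2 + b * alpha + d


def active_fraction(y):
    """Proportion of sensors whose threshold lies below the common signal y."""
    return sum(p["r"] for p in CLASSES.values() if y > p["y_star"])


def cutoff(j, alpha):
    """Signal level below which class j prefers Low Res: -f_J(alpha) / c_J."""
    return -f(j, alpha) / CLASSES[j]["c"]


def zero_noise_payoff(j, y):
    """Assumed payoff c_J * y + f_J(alpha) when every sensor observes X = y."""
    return CLASSES[j]["c"] * y + f(j, active_fraction(y))


if __name__ == "__main__":
    y = -2.5
    print("active fraction:", active_fraction(y))
    for j in CLASSES:
        print(j, "cutoff:", round(cutoff(j, 1.0), 4),
              "payoff:", round(zero_noise_payoff(j, y), 4))
```

Under these assumptions, at $y^{(i)} = -2.5$ all three classes are above their thresholds, yet only class 1 has a negative payoff for “High Res”, which is the violation that (6.46) describes.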
For this scenario, the two conditions in Theorem 6.3.2 are close; the first surface lies just above the second. Moreover, the boundary points we computed lie just above the surface in Figure 6.6b. This indicates that (6.27) is a reliable condition for predicting Nash equilibrium of the threshold strategy for mode selection $\mu(y^*)$. This is fortunate since (6.27) is much easier to compute than (6.26); the function $f_J'$ need be evaluated only at $I$ points, instead of on a continuum.

6.5.3 Convergence of Algorithm 6.4.1

The iterative procedure contained in Algorithm 6.4.1 for threshold computation need not always converge, since the map $T(y^*)$ may not be a contraction for all scenarios. In this section we test the algorithm on a large number of random cases to get a sense of when it will converge. We again focus on the Gaussian case, since it is more interesting, and consider $N = 20{,}000$ scenarios with parameters randomly chosen according to a uniform distribution as follows:

• The prior distribution for $X$ was Gaussian with mean $m$ between $-10$ and $10$, and variance $s^2$ between $0$ and $100$.
• Three classes $J = 1, 2, 3$ were considered, with utility $f_J(\alpha) = a_J \alpha^3 + b_J \alpha^2 + c_J \alpha + d_J$, with each of $a_J, b_J, c_J, d_J$ between $-1.5$ and $1.5$.
• The noise distribution for $W^{(i)}$ was Gaussian with zero mean and variance $\sigma_J^2$ between $0$ and $100$.
• The proportion of sensors of each class, $r_1, r_2, r_3$, was chosen uniformly.

A histogram showing the proportion of cases in which the threshold computation takes a given number of iterations to converge is shown in Figure 6.7. The figure shows that the proportion of cases drops rapidly as the number of iterations grows, which is to be expected given the exponential convergence rate that Banach’s fixed point theorem guarantees for contraction maps. Such empirical results lend a high degree of confidence to the iterative procedure, even though theoretically, convergence depends on parameter selection.
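The random scenario generation described in the list above can be sketched in Python as follows. This is only an illustration of the sampling step; the threshold iteration itself (Algorithm 6.4.1) is not reproduced here. The function and key names are hypothetical, and drawing $(r_1, r_2, r_3)$ uniformly from the probability simplex by sorting two uniform cut points is an assumption about what "chosen uniformly" means.

```python
import random


def sample_scenario(rng):
    """Draw one random test scenario following the parameter ranges in the text."""
    # Assumed interpretation: proportions uniform on the simplex via two cut points.
    cuts = sorted(rng.random() for _ in range(2))
    proportions = [cuts[0], cuts[1] - cuts[0], 1.0 - cuts[1]]
    return {
        "prior_mean": rng.uniform(-10.0, 10.0),   # m
        "prior_var": rng.uniform(0.0, 100.0),     # s^2
        # One row (a_J, b_J, c_J, d_J) of cubic coefficients per class J = 1, 2, 3.
        "coeffs": [[rng.uniform(-1.5, 1.5) for _ in range(4)] for _ in range(3)],
        "noise_var": [rng.uniform(0.0, 100.0) for _ in range(3)],
        "proportions": proportions,               # r_1, r_2, r_3
    }


def utility(coeffs_j, alpha):
    """Cubic class utility f_J(alpha) = a*alpha^3 + b*alpha^2 + c*alpha + d."""
    a, b, c, d = coeffs_j
    return a * alpha**3 + b * alpha**2 + c * alpha + d


if __name__ == "__main__":
    rng = random.Random(2007)  # fixed seed so the draw is repeatable
    scenario = sample_scenario(rng)
    print(scenario["proportions"], utility(scenario["coeffs"][0], 0.5))
```

A full replication would wrap a loop of 20,000 such draws around the threshold iteration and record, for each draw, the iteration count or a failure after 100 iterations.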
Note also that, although Algorithm 6.4.1 may produce a threshold, the equilibrium properties must still be checked.

[Figure 6.7 plot, titled "Length of Iterative Threshold Computation". Horizontal axis: Number of Iterations. Vertical axis: % of Cases. Annotation: 0.63% Failed to Converge.]

Figure 6.7: Number of iterations required for the threshold computation procedure of Algorithm 6.4.1 for 20,000 randomly selected scenarios. Only 0.63% of cases failed to converge within 100 iterations.

6.6 Conclusions

Large sensor populations and tight energy constraints have created a need for extremely light coordination protocols in many wireless sensor network applications; sensors must operate with a high degree of independence, based only on noisy information which varies from sensor to sensor. The theory of global games is a natural tool for providing insight into such cases, as it considers sensors which act in isolation, with imperfect information and without complete knowledge of the information acquired by others. The application of global games developed here provides a fresh perspective on the problem of decentralized, self-configuring sensor network design, allowing sensors to act independently while accounting for such issues as bias, algorithm complexity, and noise.

In general, it is impossible for sensors acting under these constraints to perform perfectly, since they must predict instead of react to their counterparts’ actions, based on imperfect information. However, our analysis reveals that there are conditions under which self-configuration and decentralized operation are straightforward, leading to performance guarantees in the form of a Nash equilibrium for simple threshold behaviour. Moreover, by providing the proper motivation to each sensor, this Nash equilibrium can be made to reflect the overall design goals of the sensor network.
The result is a truly decentralized solution, in which sensors use local information to independently contribute to a global goal.

Our results indicate that a straightforward, threshold Nash equilibrium for mode selection exists when congestion is low or noise is high, and we give several simple tests for checking for such equilibria, especially in the limiting, low-noise case resulting from aggregated information. This extends the known existence of Nash equilibrium threshold policies from supermodular global games to global games with congestion and multiple player classes. The ability to relax the supermodularity conditions for a global game suggests there may be a weaker general condition for the existence of threshold strategies, much as the single-crossing condition is a weakening of supermodularity for traditional game theory. In addition, we show in Section 6.4.2 that, even when an equilibrium threshold does not exist, it may still provide near equilibrium behaviour and thus represents a viable solution. These findings point to the general applicability of global game theory to sensor network problems.

Bibliography

[178] P. Biswas and S. Phoha. Self-organizing sensor networks for integrated target surveillance. IEEE Transactions on Computers, 55(8):1033–1047, 2006.
[179] H. Carlsson and E. van Damme. Global games and equilibrium selection. Econometrica, 61(5):989–1018, 1993.
[180] C. Chamley. Rational herds: economic models of social learning. Cambridge, 2004.
[181] L. Clare, G. Pottie, and J.R. Agre. Self-organizing distributed sensor networks. Proceedings of SPIE, pages 229–237, 1999.
[182] A. Arora et al. A line in the sand: a wireless sensor network for target detection, classification, and tracking. Computer Networks, 46(5):605–634, 2004.
[183] A.J. Goldsmith and S.B. Wicker. Design challenges for energy-constrained ad hoc wireless networks. IEEE Wireless Comm., 9(4):8–27, 2002.
[184] R. Guesnerie. An exploration of the eductive justification of the rational expectations hypothesis. American Economic Review, 82:1254–1278, 1992.
[185] J. Hartigan. Bayes Theory. Springer-Verlag, 1983.
[186] K. Jamieson, H. Balakrishnan, and Y.C. Tay. Sift: a MAC protocol for event-driven wireless sensor networks. In Third European Workshop on Wireless Sensor Networks (EWSN), Zurich, Switzerland, February 2006.
[187] L. Karp, I. Lee, and R. Mason. A global game with strategic substitutes and complements. Department of Agricultural and Resource Economics, UCB. CUDARE Working Paper 940, 2004.
[188] W. Kirk and B. Sims. Handbook of metric fixed point theory. Kluwer Academic, 2001.
[189] J. Liu, J. Reich, and F. Zhao. Collaborative in-network processing for target tracking. Journal on Applied Signal Processing, pages 378–391, 2003.
[190] A.B. MacKenzie and S.B. Wicker. Game theory and the design of self-configuring, adaptive wireless networks. IEEE Communications Magazine, pages 126–131, November 2001.
[191] M. Maskery and V. Krishnamurthy. Decentralized activation in a ZigBee-enabled unattended ground sensor network: a correlated equilibrium game theoretic analysis. Submitted to IEEE Transactions on Signal Processing, 2006.
[192] D. McErlean and S. Narayanan. Distributed detection and tracking in sensor networks. Asilomar Conference on Signals, Systems and Computers, pages 1174–1178, 2002.
[193] J. Mirkovic, G. Venkataramani, S. Lu, and L. Zhang. A self-organizing approach to data forwarding in large scale sensor networks. Proceedings of IEEE ICC, pages 1357–1361, 2001.
[194] S. Morris and H.S. Shin. Global games: Theory and applications. Cowles Foundation Discussion Papers, Cowles Foundation, Yale University, September 2000.

Chapter 7

Conclusion

The previous chapters have discussed, at length, several game theoretic approaches to design and analysis related to decentralized systems of wireless-enabled devices.
This chapter completes the study by summarizing the approaches of Chapters 2 to 6. Common features of the game theoretic approaches are highlighted in Section 7.1, and the major results and lessons are reviewed in Section 7.2. A discussion of potential future work is given in Section 7.3.

7.1 Common Approach to Game Theoretic Design and Analysis

The game theoretic approaches to design and analysis of large-scale, decentralized wireless-enabled systems discussed in the preceding chapters all have several features in common. These features stem from the general commonality that each system considered comprises physically separate components which are nevertheless connected in several ways. To provide a deeper understanding, the commonalities of the game theoretic design and analysis process are highlighted below.

7.1.1 System Objectives and Capabilities

The first step, assessing overall system objectives and taking stock of available resources, is standard in almost all engineering design or analysis problems, decentralized or not, and the approach here is no different. A mathematical model is developed which adequately represents the fundamental problems to be solved, providing a logical and sound basis for proposed solutions. The components of our models for decentralized missile deflection, monitoring and communication systems are briefly discussed below.

Fundamental Considerations: The fundamental decentralized systems models developed here are similar to those encountered in many centralized engineering/decision problems, consisting of a global objective, specification of capabilities, and a description of the operating environment, as listed
Game theoretic methods for networked sensors and dynamic spectrum access Maskery, Michael 2007
Item Metadata
Title | Game theoretic methods for networked sensors and dynamic spectrum access |
Creator |
Maskery, Michael |
Publisher | University of British Columbia |
Date Issued | 2007 |
Description | Automated devices enabled by wireless communications are deployed for a variety of purposes. As they become more ubiquitous, their interaction becomes increasingly important for coexistence when sharing a scarce resource, and for leveraging potential cooperation to achieve larger design goals. This thesis investigates the use of game theory as a tool for design and analysis of networked systems of automated devices in the areas of naval defence, wireless environmental monitoring through sensor networks, and cognitive radio wireless communications. In the first part, decentralized operation of naval platforms deploying electronic countermeasures against missile threats is studied. The problem is formulated as a stochastic game in which platforms independently plan and execute dynamic strategies to defeat threats in two situations: where coordination is impossible due to lack of communications, and where platforms hold different objectives but can coordinate, according to the military doctrine of Network Enabled Operations. The result is a flexible, robust model for missile deflection for advanced naval groups. Next, the problem of cooperative environmental monitoring and communication in energy-constrained wireless sensor networks is considered from a game-theoretic perspective. This leads to novel protocols in which sensors cooperatively trade off performance with energy consumption with low communication and complexity overhead. Two key results are an on-line adaptive learning algorithm for tracking the correlated equilibrium set of a slowly varying sensor deployment game, and an analysis of the equilibrium properties of threshold policies in a game with noisy, correlated measurements. Finally, the problem of dynamic spectrum access for systems of cognitive radios is considered. A game theoretic formulation leads to a scheme for competitive bandwidth allocation which respects radios' individual interests while enforcing fairness between users. 
An on-line adaptive learning scheme is again proposed for negotiating fair, equilibrium resource allocations, while dynamically adjusting to changing conditions. |
Extent | 1723389 bytes |
Subject |
game theory sensor networks |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2007-12-06 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0066179 |
URI | http://hdl.handle.net/2429/210 |
Degree |
Doctor of Philosophy - PhD |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2008-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |