Reinforcement Learning in Non-Stationary Games
by
Omid Namvar Gharehshiran
M.A.Sc., The University of British Columbia, 2010

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
in
The Faculty of Graduate and Postdoctoral Studies
(Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)
January 2015
© Omid Namvar Gharehshiran 2015

Abstract

The unifying theme of this thesis is the design and analysis of adaptive procedures aimed at learning the optimal decision in the presence of uncertainty.

The first part is devoted to strategic decision making involving multiple individuals with conflicting interests. This is the subject of non-cooperative game theory. The proliferation of social networks has led to new ways of sharing information. Individuals subscribe to social groups, in which their experiences are shared. These new information patterns facilitate the resolution of uncertainties. We present an adaptive learning algorithm that exploits these new patterns. Despite its deceptive simplicity, if followed by all individuals, the emergent global behavior resembles that obtained from fully rational considerations, namely, correlated equilibrium. Further, it responds to the random unpredictable changes in the environment by properly tracking the evolving correlated equilibria set. Numerical evaluations verify that these new information patterns can lead to improved adaptability of individuals and, hence, faster convergence to correlated equilibrium. Motivated by the self-configuration feature of the game-theoretic design and the prevalence of wireless-enabled electronics, the proposed adaptive learning procedure is then employed to devise an energy-aware activation mechanism for wireless-enabled sensors which are assigned a parameter estimation task.
The proposed game-theoretic model trades off sensors’ contribution to the estimation task against the associated energy costs.

The second part considers the problem of a single decision maker who seeks the optimal choice in the presence of uncertainty. This problem is mathematically formulated as a discrete stochastic optimization. In many real-life systems, due to the unexplained randomness and complexity involved, there typically exists no explicit relation between the performance measure of interest and the decision variables. In such cases, computer simulations are used as models of real systems to evaluate output responses. We present two simulation-based adaptive search schemes and show that, by following these schemes, the global optimum can be properly tracked as it undergoes random unpredictable jumps over time. Further, most of the simulation effort is exhausted on the global optimizer. Numerical evaluations verify faster convergence and improved efficiency as compared with existing random search, simulated annealing, and upper confidence bound methods.

Preface

“Someone told me that each equation I included in the book would halve the sales.”
- S. W. Hawking, A Brief History of Time, 1988

The work presented in this thesis is based on the research conducted in the Statistical Signal Processing Laboratory at the University of British Columbia (Vancouver). The research program was designed jointly based on ongoing discussion between the author and Prof. Vikram Krishnamurthy. All major research work, including detailed problem specification, theoretical developments, performance analyses and identification of results, was performed by the author, with assistance from Prof. Vikram Krishnamurthy. The weak convergence results throughout the thesis are due in part to Prof. George Yin. The numerical examples, simulations, and data analyses were performed by the author. The author was responsible for the composition of all manuscripts with frequent suggestions for improvement from Prof.
Vikram Krishnamurthy. Prof. George Yin also provided technical and editorial feedback during the preparation of manuscripts. In what follows, a detailed list of publications associated with different chapters of this thesis is provided.

The material in Chapter 2 has appeared in several publications as listed below:

Book:
• V. Krishnamurthy, O. N. Gharehshiran, and M. Hamdi. Interactive Sensing and Decision Making in Social Networks. Now Publishers Inc., Hanover, MA, 2014.
The author was responsible for the composition of Chapter 5, Appendix B, parts of Chapters 1 and 6, and editorial suggestions and improvements throughout the book.

Journal Article:
• O. N. Gharehshiran, V. Krishnamurthy, and G. Yin. Distributed tracking of correlated equilibria in regime switching noncooperative games. IEEE Transactions on Automatic Control, 58(10):2435–2450, Oct. 2013.

Conference Proceedings:
• V. Krishnamurthy, O. N. Gharehshiran, and A. Danak. Emergence of rationality amongst simple nodes performing adaptive filtering. In Proceedings of IEEE ICASSP, pages 5792–5795, May 2011.
(Amir Danak was involved in the early stages of concept formation, and contributed to the numerical evaluations and manuscript edits.)
• O. N. Gharehshiran and V. Krishnamurthy. Tracking correlated equilibria in clustered multi-agent networks via adaptive filtering algorithms. In Proceedings of IEEE ICASSP, pages 2041–2044, Mar. 2012.
• O. N. Gharehshiran and V. Krishnamurthy. Learning correlated equilibria in noncooperative games with cluster structure. Game Theory for Networks, volume 105 of Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, pages 115–124, Springer, May 2012.
• O. N. Gharehshiran, V. Krishnamurthy, and G. Yin. Tracking equilibria with Markovian evolution. In Proceedings of IEEE CDC, pages 7139–7144, Dec. 2012.

Although not presented in this thesis, the ideas and results that have appeared in the following journal article and conference proceeding were inspired by the work presented in Chapter 2:

Journal Article:
• O. N. Gharehshiran, W. Hoiles, and V. Krishnamurthy. Reinforcement learning and non-parametric detection of game-theoretic equilibrium play in social networks. Submitted to a peer reviewed journal, Dec. 2014.

Conference Proceeding:
• O. N. Gharehshiran and V. Krishnamurthy. Diffusion based collaborative decision making in noncooperative social network games. In Proceedings of IEEE GlobalSIP, pages 567–570, Dec. 2013.

A version of Chapter 3 has appeared in the following journal article and conference proceeding:

Journal Article:
• O. N. Gharehshiran, V. Krishnamurthy, and G. Yin. Distributed energy-aware diffusion least mean squares: Game-theoretic learning. IEEE Journal of Selected Topics in Signal Processing, 7(5):1932–4553, Oct. 2013.

Conference Proceeding:
• O. N. Gharehshiran, V. Krishnamurthy, and G. Yin. Game-theoretic learning for activation of diffusion least mean squares. In Proceedings of EUSIPCO, pages 1–5, Sep. 2013.

The work presented in Chapter 4 has been submitted for possible publication in a peer reviewed journal and has appeared in a conference proceeding as listed below:

Journal Article:
• O. N. Gharehshiran, V. Krishnamurthy, and G. Yin. Adaptive search algorithms for discrete stochastic optimization: A smooth best-response approach. First round of revisions submitted to a peer reviewed journal, Sep. 2014.

Conference Proceeding:
• O. N. Gharehshiran, V. Krishnamurthy, and G. Yin. Best-response search algorithms for non-stationary discrete stochastic optimization. In Proceedings of IEEE CDC, Dec. 2014.

The material in Chapter 5 is based on the following manuscript that has been submitted for possible journal publication:

Journal Article:
• O. N. Gharehshiran, V. Krishnamurthy, and G. Yin.
Regret-based adaptive search algorithms for non-stationary discrete stochastic optimization. Submitted to a peer reviewed journal, Oct. 2014.

Other research work that is not presented in this thesis, but was heavily influenced by its ideas and performed during the author’s time as a PhD student, is summarized below:

Book Chapter:
• O. N. Gharehshiran, A. Attar, and V. Krishnamurthy. Distributed Coalition Formation and Resource Allocation in Cognitive LTE Femto-cells. In M.-L. Ku and J.-C. Lin, editors, Cognitive Radio and Interference Management: Technology and Strategy, pages 127–161, IGI Global, 2013.
The chapter proposal was prepared by Dr. Alireza Attar. The author was responsible for the chapter composition, which was based on the collaborative work with Dr. Alireza Attar summarized in the following publications:

Journal and Magazine Articles:
• O. N. Gharehshiran, A. Attar, and V. Krishnamurthy. Coalition formation for bearings-only localization in sensor networks – A cooperative game approach. IEEE Transactions on Communications, 61(1):325–334, Jan. 2013.
The author was the lead investigator, responsible for all major areas of concept formation, theoretical development, and performance analyses and simulations, as well as the majority of manuscript composition. Dr. Alireza Attar contributed to the problem formulation, simulation test-bed, as well as composition of parts of Sections 1 and 2. He also contributed to manuscript edits and revisions. Prof. Vikram Krishnamurthy also helped with the manuscript edits and improvements.
• A. Attar, V. Krishnamurthy, and O. N. Gharehshiran. Interference management using cognitive base-stations for UMTS LTE. IEEE Communications Magazine, 49(8):152–159, Aug. 2011.
Dr. Alireza Attar was the lead investigator and contributor. The last section is due in part to the author’s collaborative work with him.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
1.1 Overview
1.1.1 Interactive Decision Making
1.1.2 Stochastic Optimization With Discrete Decision Variables
1.2 Main Contributions
1.2.1 Interactive Decision Making in Changing Environments
1.2.2 Energy-Aware Sensing Over Adaptive Networks
1.2.3 Discrete Stochastic Optimization in Changing Environments
1.3 Related Work
1.3.1 Game-Theoretic Learning
1.3.2 Adaptive Networks
1.3.3 Simulation-Based Discrete Optimization
1.4 Thesis Outline
Part I Adaptation and Learning in Non-Cooperative Games
2 Adaptive Learning in Regime Switching Non-Cooperative Games
2.1 Introduction
2.1.1 Learning in Non-Cooperative Games
2.1.2 Chapter Goal
2.1.3 Main Results
2.1.4 Chapter Organization
2.2 Non-Cooperative Game Theory
2.2.1 Strategic Form Non-Cooperative Games
2.2.2 Solution Concepts: Correlated vs. Nash Equilibrium
2.2.3 Regime Switching Non-Cooperative Games
2.3 Regret-Based Local Adaptation and Learning
2.3.1 Regret-Matching With Neighborhood Monitoring
2.3.2 Discussion and Intuition
2.4 Emergence of Rational Global Behavior
2.4.1 Global Behavior
2.4.2 Hyper-Model for Regime Switching in the Game Model
2.4.3 Asymptotic Local and Global Behavior
2.5 Analysis of the Adaptive Learning Algorithm
2.5.1 Characterization of the Limit Individual Behavior
2.5.2 Stability Analysis of the Limit Dynamical System
2.6 The Case of Slow Regime Switching
2.7 Numerical Study
2.8 Closing Remarks
2.9 Proofs of Results
3 Energy-Aware Sensing Over Adaptive Networks
3.1 Introduction
3.1.1 Distributed Parameter Estimation Over Adaptive Networks
3.1.2 Chapter Goal
3.1.3 Main Results
3.1.4 Chapter Organization
3.2 Diffusion Least Mean Squares
3.2.1 Centralized Parameter Estimation Problem
3.2.2 Standard Diffusion LMS
3.2.3 Revised Diffusion LMS
3.2.4 Convergence Analysis via ODE Method
3.3 Energy-Aware Diffusion LMS
3.3.1 Overview of the Approach
3.3.2 Activation Control Game
3.3.3 Game-Theoretic Activation Control Mechanism
3.4 Global Network Performance
3.4.1 Global Behavior
3.4.2 Main Result: From Individual to Global Behavior
3.5 Convergence Analysis of the Energy-Aware Diffusion LMS
3.5.1 Limit System Characterization
3.5.2 Stability Analysis of the Limit System
3.6 Numerical Study
3.6.1 Why Is Derivation of Analytical Convergence Rate Impossible?
3.6.2 Semi-Analytic Approach to Numerical Study
3.6.3 Numerical Example
3.7 Closing Remarks
Part II Adaptive Randomized Search for Discrete Stochastic Optimization
4 Smooth Best-response Adaptive Search for Discrete Stochastic Optimization
4.1 Introduction
4.1.1 Simulation-Based Optimization
4.1.2 Randomized Search Methods
4.1.3 Chapter Goals
4.1.4 Main Results
4.1.5 Chapter Organization
4.2 Discrete Stochastic Optimization Problem
4.3 Non-stationary Discrete Stochastic Optimization Problem
4.4 Smooth Best-Response Adaptive Randomized Search
4.4.1 Smooth Best-Response Sampling Strategy
4.4.2 Adaptive Search Algorithm
4.5 Tracking the Non-Stationary Global Optima
4.5.1 Performance Diagnostic
4.5.2 Hyper-Model for Underlying Time Variations
4.5.3 Main Result: Tracking Performance
4.6 Analysis of the Smooth Best-Response Adaptive Search
4.6.1 Characterization of the Mean Switching ODE
4.6.2 Stability Analysis of the Mean Switching ODE
4.6.3 Asymptotic Stability of the Interpolated Processes
4.7 The Case of Slow Random Time Variations
4.8 Numerical Study
4.8.1 Example 1: Static Discrete Stochastic Optimization
4.8.2 Example 2: Regime Switching Discrete Stochastic Optimization
4.9 Concluding Remarks
4.10 Proofs of Results
5 Regret-Based Adaptive Search for Discrete Stochastic Optimization
5.1 Introduction
5.1.1 Main Results
5.1.2 Chapter Organization
5.2 Non-stationary Discrete Stochastic Optimization
5.3 Regret-Based Adaptive Randomized Search
5.3.1 Potential-Based Sampling Strategy
5.3.2 Adaptive Search Algorithm
5.4 Main Result: Tracking Non-Stationary Global Optima
5.5 Analysis of the Regret-Based Adaptive Search Algorithm
5.5.1 Limit System Characterization as a Switching ODE
5.5.2 Stability Analysis of the Limit System
5.5.3 Asymptotic Stability of the Interpolated Processes
5.6 The Case of Slow Random Time Variations
5.7 Numerical Study
5.7.1 Finding Demand with Maximum Probability
5.7.2 (s, S) Inventory Policy
5.8 Closing Remarks
5.9 Proofs of Results
6 Conclusions
6.1 Summary of Findings in Part I
6.2 Summary of Findings in Part II
6.3 Future Research Directions
Bibliography
Appendix A Sensor Network Model

List of Tables

2.1 Agents’ payoffs in normal form.
4.1 Convergence rates in static discrete stochastic optimization for the smooth best-response adaptive search scheme.
5.1 Convergence rates in static discrete stochastic optimization for the regret-based adaptive search scheme.
5.2 Convergence and efficiency evaluations in the (s, S) inventory problem.
A.1 Power consumption for sensing, processing, and transceiver components.
A.2 IEEE 802.15.4 standard values.

List of Figures

2.1 Equilibrium notions in non-cooperative games.
2.2 Regime switching correlated equilibria set.
2.3 Regret-based local adaptation and learning.
2.4 Game-theoretic learning with neighborhood monitoring.
2.5 Regret-matching adaptive learning: Local and global behavior of agents.
2.6 Distance to correlated equilibrium in a static game.
2.7 Average overall payoff in a slowly switching game.
2.8 Proportion of iterations spent out of the correlated equilibria set in a slow switching game.
2.9 Proportion of iterations spent out of the correlated equilibria set versus the switching speed.
3.1 Energy-aware diffusion LMS: Local and global behavior.
3.2 Network topology and nodes’ regressors statistics.
3.3 Trade-off between energy expenditure and estimation accuracy.
3.4 Network excess mean square error.
4.1 Proportion of simulations expended on non-optimal solutions in Example 1.
4.2 Sensitivity of convergence rate and efficiency to exploration parameter γ.
4.3 Sample path of the estimate of the global optimum in Example 2.
4.4 Proportion of simulations on non-optimal feasible solutions in Example 2.
4.5 Proportion of time that the global optimum estimate does not match the true global optimum versus the switching speed of global optimizers in Example 2.
5.1 True objective function values.
5.2 Proportion of simulations performed on the global optimizer set.
5.3 Sample path of the global optimum estimate.
5.4 Proportion of time that global optimum estimate does not match the true global optimum versus the switching speed of global optimizers.
5.5 Classical (s, S) inventory restocking policy.
5.6 Steady-state expected cost per period.

Acknowledgements

First and foremost, I would like to express my sincere gratitude to my academic advisor, Prof. Vikram Krishnamurthy, for his exceptional guidance, constant inspiration and support, and insightful suggestions that made this quest a success. Through the frequent fruitful discussions, I learned so many things from him that I would have otherwise been clueless about. His enthusiasm for tackling new ideas was always contagious and a source of motivation.

I am also grateful to Prof. George Yin for his time and effort, and for numerous helpful discussions via e-mail during our collaboration in the past four years. His comments, suggestions and attention to detail have resulted in much improvement, and were crucial for the completion of this work.

Heartfelt and deep thanks go to my family, and particularly my parents, who have always been a great source of support, encouragement, inspiration, and love despite the self alienation that came as a result of pursuing this adventurous journey. They passionately shared all the pains, excitements, frustrations, and eventually the ultimate fruit with me.

I would like to extend my sincere thanks to all past and present members of the Statistical Signal Processing Laboratory and my friends. I am thankful for their sharing and caring, for the laughter and the friendship.

I am grateful for the Four Year Doctoral Fellowship from the University of British Columbia that supported me during my studies.

Finally, I confirm my debt to all unacknowledged parties whom I have slighted in this brief list of acknowledgments.
The people who have inspired and taught me are not only too numerous to list, but are often unknown to me, and I to them!

Omid Namvar
UBC, Vancouver
September 2014

To the loving memory of my mother,
to whom I never got to explain much of this,
and to my father,
who inspired me to start this journey

1 Introduction

“Not everything that can be counted counts, and not everything that counts can be counted.”
- Albert Einstein

1.1 Overview

Many social and economic situations involve decision making under uncertainty. Notionally, the simplest decision problems involve only one decision maker who has a choice between a number of alternatives. In the presence of uncertainty, each decision may lead to different outcomes with different probabilities. A rational decision maker seeks the alternative that yields the most desirable outcome. Oftentimes, an individual’s preferences over these outcomes are represented mathematically by a utility or reward function. As a result, acting instrumentally to best satisfy one’s preferences becomes equivalent to utility-maximizing behavior. In such a case, if the probabilities of outcomes are known, one can find the optimal decision by maximizing the expected utility. A simple example is gambling, where one must choose between a number of possible lotteries, each one having different payoffs and winning probabilities. In many other scenarios, however, individuals have to identify the optimal decision in the absence of knowledge about such probabilities, solely through interaction and some limited feedback from the environment.

The process of finding the optimal decision becomes even more complex when the decision problem involves various self-interested individuals whose decisions affect one another, in the sense that the outcome does not depend solely on the choice of a single actor, but also on the decisions of all others.
In such a case, to learn the optimal choice, the decision maker has to decide between two general types of decisions: do something which will ensure some known reward, or take the risk and try something new. Of course, animals and humans do learn in this way. Through imitation, exploration and reward signals, humans shape their behavior to achieve their goals and obtain their desired outcomes. In fact, such learning behavior is of critical importance, as many decision problems are either too complex to comprehend, or the changes that they undergo are too hard to predict.

The unifying theme of this thesis is to devise and study algorithms to learn and track the optimal decision in both interactive and non-interactive, uncertain, and possibly non-stationary environments. The problem of finding the optimal decision in non-interactive situations is a mathematical discrete stochastic optimization problem, and is an important concept in decision theory [42, 96]. The study of strategic decision making in an interactive environment, however, is the subject of game theory or, as Robert J. Aumann¹ calls it in [26], interactive decision theory.

¹ Robert J. Aumann was awarded the Nobel Memorial Prize in Economics in 2005 for his work on conflict and cooperation through game-theoretic analysis. He is the first to conduct a full-fledged formal analysis of the so-called infinitely repeated games.

The rest of this section gives an overview of these subjects, though in reverse order of complexity, beginning with decision making in interactive environments. Although it may seem counterintuitive, this order of presentation is chosen since, somewhat surprisingly, our inspiration for the proposed adaptive search procedures for discrete stochastic optimization problems, which are simpler in nature, originated from the study of game-theoretic learning algorithms developed for decision making in interactive situations.

1.1.1 Interactive Decision Making

What economists call game theory psychologists call the theory of social situations, which is a more accurate description of what game theory is about. The earliest example of a formal game-theoretic analysis is the study of a duopoly by Antoine Cournot in 1838. It was followed by a formal theory of games suggested in 1921 by the mathematician Emile Borel, and the “theory of parlor games” in 1928 by the mathematician John von Neumann. It was not, however, until the 1944 publication of Theory of Games and Economic Behavior by John von Neumann and Oskar Morgenstern [275] that game theory was born as a field in its own right. They defined a game as any interaction between agents that is governed by a set of rules specifying the possible moves for each participant, and a set of outcomes for each possible combination of moves [138]. After thrilling a whole generation of post-1970 economists, game theory is spreading like a bushfire through other disciplines. Examples include:

• Electrical engineering: Energy-aware sensor management in wireless sensor networks [120, 185], dynamic spectrum access in spectrum-agile cognitive radios [183, 210], interference mitigation [18] and channel allocation in cognitive radio networks [117, 118, 222], and autonomous demand-side management in smart grid [213].
• Computer science: Optimizing task scheduling involving selfish participants [223, 224], design and improvement of e-commerce systems [277], and distributed artificial intelligence [161].
• Political science: Explaining the commitment of political candidates to ideologies preferred by the median voter [83], and peace among democracies [196].
• Psychology: Explaining human interactive decision making based on the instrumental rationality assumption in game theory [75], and evolution in group-structured populations [262].
• Philosophy: Interpretation of justice, mutual aid, commitment, convention [257], and explaining the rational origins of the social contract, the logic of nuclear deterrence, and the theory of good habits [258].
• Biology: Explaining “limited war” in conflict among animals of the same species [260], and the evolution of communication among animals [261].

Two prominent game theorists, Robert J. Aumann and Sergiu Hart, explain the attraction of game theory in [27] the following way:

“Game theory may be viewed as a sort of umbrella or ‘unified field’ theory for the rational side of social science... [it] does not use different, ad hoc constructs... it develops methodologies that apply in principle to all interactive situations.”

Game theory provides a mathematical tool to model and analyze the complex behavior among interacting agents with possibly conflicting interests. Game-theoretic methods can derive rich dynamics through the interaction of simple components. Game theory can be used either as a descriptive tool, to predict the outcome of complex interactions by means of the game-theoretic concept of equilibrium, or as a prescriptive tool, to design systems around interaction rules. The essential goal of game theory is to describe the optimal outcome of a process involving multiple decision makers, where the outcome of each individual’s decision is affected not only by her own decision, but also by all others’ decisions. Solving the game, therefore, involves a complex process of players guessing what each other will do, while revealing as little information as possible about their own strategies [109].

There are two main branches of game theory: cooperative and non-cooperative game theory. Cooperative game theory is concerned with interactions wherein multiple individuals cooperate to reach a common goal.
For instance, multiple airlines may form alliances to provide convenient connections between as many airports as possible so as to increase profit [259]. Non-cooperative game theory, in contrast, deals largely with how intelligent individuals interact with one another in an effort to achieve their conflicting goals. For example, in oligopolistic markets, each firm’s profits depend not only upon its own decisions over pricing and levels of production, but also upon those of the other firms in the market [99]. Non-cooperative game theory is the branch that will be discussed in the first part of this thesis. Each decision maker is treated as a player with a set of possible actions and a utility for each action, which she wishes to maximize. The utility, however, is also a function of others’ actions, which makes the maximization problem nontrivial.
The following scene from the movie “A Beautiful Mind,” based on the life of the Nobel laureate in economics John Nash, explains the concept of a non-cooperative game very well. John Nash and his friends are sitting in a bar, where they encounter a group of girls, with a stunning blonde girl leading the group. They all find the blonde girl attractive, but Nash realizes something and says: “If we all go for the blonde and block each other, not a single one of us is going to get her. So then we go for her friends, but they will all give us the cold shoulder because no one likes to be second choice. But what if none of us goes for the blonde? We won’t get in each other’s way and we won’t insult the other girls. It’s the only way to win.”
Non-cooperative game theory has traditionally been used in economics and social sciences with a focus on fully rational interactions where strong assumptions are made on the information patterns available to individual agents. In many applications of practical interest, however, this is simply not feasible. No agent in a community of reasonable complexity possesses a global view of the entire agency.
Agents, in fact, have only local views, goals and knowledge that may interfere with, rather than support, other agents’ actions. Coordination is vital to prevent chaos during such conflicts [195, 283].
The game-theoretic concept of equilibrium describes a condition of global coordination where all decision makers are content with the social welfare realized as the consequence of their chosen strategies. The prominent equilibrium notions in non-cooperative games include: Nash equilibrium [218], correlated equilibrium [24, 25], and communication equilibrium [90, 140]. The Nash equilibrium and its refinements are without doubt the best-known game-theoretic equilibrium notion. One can classify Nash equilibria as pure strategy Nash equilibria, where all agents play a single action, and mixed strategy Nash equilibria, where at least one agent employs a randomized strategy. The underlying assumption in Nash equilibrium is that agents act independently. John F. Nash proved in his famous paper [218] that every game with a finite set of players and actions has at least one mixed strategy Nash equilibrium. However, Robert J. Aumann, who introduced the concept of correlated equilibrium, asserts in the following extract from [25] that
“Nash equilibrium does make sense if one starts by assuming that, for some specified reason, each player knows which strategies the other players are using.”
Evidently, this assumption is rather restrictive and, more importantly, is rarely true in any strategic interactive situation. He adds:
“Far from being inconsistent with the Bayesian view of the world, the notion of equilibrium is an unavoidable consequence of that view.
It turns out, though, that the appropriate equilibrium notion is not the ordinary mixed strategy equilibrium of Nash (1951), but the more general notion of correlated equilibrium.”
This, indeed, is the very reason why correlated equilibrium [25] is central to the analysis of strategic decision making in the first part of this thesis.
The correlated equilibrium is a generalization of the Nash equilibrium, and describes a condition of competitive optimality. There is much to be said for the correlated equilibrium. It is realistic in multi-agent learning, as the past history of agents’ decisions naturally correlates agents’ future actions. In contrast, Nash equilibrium assumes agents act independently, which is rarely true in any learning scenario. It is structurally simpler, and hence simpler to compute, as compared with the Nash equilibrium [219]. The coordination among agents in the correlated equilibrium can further lead to potentially higher social welfare than if agents make decisions independently as in Nash equilibrium [25]. Finally, no natural process is known to converge to a Nash equilibrium in a general non-cooperative game, apart from processes that are essentially equivalent to exhaustive search. There are, however, natural processes that do converge to correlated equilibrium under almost no structural assumption on the game model (e.g., [141, 143]).
The theory of learning in games formalizes the idea that equilibrium arises as a result of players learning from experience. It further examines how and what kind of equilibrium might arise as a consequence of a long process of adaptation and learning in an interactive environment [108]. Learning dynamics in games can typically be classified into Bayesian learning [2, 65], adaptive learning [142], and evolutionary dynamics [152, 198].
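The claim that a correlated equilibrium can yield higher social welfare than independent (Nash) play can be checked numerically on a small example. The following is a minimal sketch, assuming a two-player "chicken" game with illustrative payoffs that do not appear in this thesis; the names `max_ce_violation`, `welfare`, `traffic_light`, and `mixed_nash` are hypothetical. A joint distribution is a correlated equilibrium when no player can gain, in expectation, by replacing a recommended action j with another action k.

```python
from itertools import permutations

ACTIONS = (0, 1)  # 0 = yield, 1 = dare (illustrative labels)

# Assumed payoffs: U1[(a1, a2)] is player 1's payoff when player 1 plays a1
# and player 2 plays a2. The game is symmetric, so U2 mirrors U1.
U1 = {(0, 0): 6.0, (0, 1): 2.0, (1, 0): 7.0, (1, 1): 0.0}
U2 = {(a1, a2): U1[(a2, a1)] for (a1, a2) in U1}


def max_ce_violation(pi):
    """Largest expected gain from swapping a recommended action j for k.

    pi maps joint actions (a1, a2) to probabilities. The distribution is a
    correlated equilibrium if and only if the returned value is <= 0.
    """
    worst = float("-inf")
    for j, k in permutations(ACTIONS, 2):
        # Player 1 is told j and considers playing k instead.
        gain1 = sum(pi[(j, b)] * (U1[(k, b)] - U1[(j, b)]) for b in ACTIONS)
        # Player 2 is told j and considers playing k instead.
        gain2 = sum(pi[(a, j)] * (U2[(a, k)] - U2[(a, j)]) for a in ACTIONS)
        worst = max(worst, gain1, gain2)
    return worst


def welfare(pi):
    """Expected sum of both players' payoffs under the joint distribution pi."""
    return sum(p * (U1[a] + U2[a]) for a, p in pi.items())


# A mediator ("traffic light") that never recommends the joint action (dare, dare).
traffic_light = {(0, 0): 1 / 3, (0, 1): 1 / 3, (1, 0): 1 / 3, (1, 1): 0.0}

# Mixed Nash equilibrium of this game: each player yields with probability 2/3,
# independently of the other.
q = (2 / 3, 1 / 3)
mixed_nash = {(a1, a2): q[a1] * q[a2] for a1 in ACTIONS for a2 in ACTIONS}

if __name__ == "__main__":
    for name, pi in (("traffic light", traffic_light), ("mixed Nash", mixed_nash)):
        print(name, "violation:", max_ce_violation(pi), "welfare:", welfare(pi))
```

Under these assumed payoffs, both distributions satisfy the correlated equilibrium constraints, but the correlated one attains a total expected payoff of 10 compared with 28/3 for the independent mixed strategy, because the mediator removes the mutually costly joint action.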
The first part of this thesis focuses on adaptive learning models in which decision makers try to maximize their own welfare while simultaneously learning about others’ play. The question is then when self-interested learning and adaptation will result in the emergence of equilibrium behavior. The literature on game-theoretic learning has relied on the idea that learning rules should strike a balance between performance and complexity [108]. That is, simple learning rules are expected to perform well in simple environments, whereas larger and more complex environments call for more sophisticated learning mechanisms.
Multi-Agent System Design and Analysis
The game-theoretic learning procedures have a self-organization feature that is appealing in many multi-agent systems of practical interest: Simple devices with limited awareness can be equipped with configurable utility functions and adaptive routines. While each unit in these systems is not capable of sophisticated behavior on its own, it is the interaction among the constituents that leads to systems that are resilient to failure, and that are capable of adjusting their behavior in response to changes in their environment [249]. By properly specifying the utility functions, these units can be made to exhibit many desired equilibrium behaviors [209]. The game-theoretic design approach echoes natural systems: Insects have long since evolved the proper procedures and utility preferences to accomplish collective tasks. The same can be said of humans, who orient their utilities according to economics [297]. These discoveries have motivated rigorous efforts towards a deeper understanding of information processing and game-theoretic learning over complex multi-agent systems.
The self-organizing feature of the game-theoretic design helps to realize decentralized autonomous systems.
By spreading intelligence throughout the system, such systems can eliminate the need to transport information to and from a central point, while still allowing local information exchange to any desired degree. Such a decentralized structure brings about several benefits as well as challenges. The benefits include robustness, scalability, self-configuration, and low communication overhead, while the challenges include limited sensing capability and computational intelligence of devices, and restrictive analytical assumptions [209].
Research Goals
In the first part of the thesis, the aim is to formulate a general game-theoretic framework that incorporates the following two components into the standard non-cooperative game model:
1. a graph-theoretic model for the information flow among decision makers;
2. a regime switching model for the time variations underlying the interactive situations.
We then proceed to devise learning and adaptation methods suited to the proposed model. More precisely, the aim is to answer the following question:
When agents are self-interested and possess limited sensing and communication capabilities, can a network of such agents achieve sophisticated global behavior that adapts to the random unpredictable changes underlying the environment?
To answer the above question, we focus on regret-based adaptive learning procedures [145] and borrow techniques from the theories of stochastic approximation [40, 192], weak convergence [80, 192], Lyapunov stability of dynamical systems [179], and Markov switched systems [291, 293]. Among the above tools, the theory of stochastic approximation is perhaps the best known in the signal processing community, and particularly in the area of adaptive signal processing. In control/systems engineering too, stochastic approximation is the main paradigm for on-line algorithms for system identification and adaptive control.
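To make the role of stochastic approximation in tracking concrete, the following is a minimal sketch of a constant step-size recursion that follows the mean of a noisy signal whose mean jumps once. It is a generic illustration and not one of the algorithms developed in this thesis; the function name `track_mean`, the step size 0.05, the Gaussian noise model, and the jump from 1.0 to 4.0 are all assumptions chosen for the example.

```python
import random


def track_mean(observations, step_size=0.05, initial=0.0):
    """Constant step-size stochastic approximation.

    Each iterate moves a small fraction (step_size) of the way from the
    current estimate toward the newest observation:
        estimate <- estimate + step_size * (observation - estimate)
    A constant step keeps the recursion responsive to changes, at the price
    of a small residual fluctuation around the true mean.
    """
    estimate = initial
    path = []
    for y in observations:
        estimate += step_size * (y - estimate)
        path.append(estimate)
    return path


rng = random.Random(0)
# Assumed environment: the underlying mean jumps from 1.0 to 4.0 halfway through.
observations = [rng.gauss(1.0, 0.5) for _ in range(2000)]
observations += [rng.gauss(4.0, 0.5) for _ in range(2000)]

path = track_mean(observations)

if __name__ == "__main__":
    print("estimate just before the jump:", path[1999])
    print("estimate at the end:", path[-1])
```

A decreasing step size would average out the noise more effectively in a stationary environment, but would respond ever more slowly to the jump; the constant step is the simple choice when the goal is tracking rather than convergence to a fixed point.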
Borkar explains in [50] that the frequent use of stochastic approximation methods is not accidental:
“Stochastic approximation has several intrinsic traits that make it an attractive framework for adaptive schemes. It is designed for uncertain (read ‘stochastic’) environments, where it allows one to track the ‘average’ or ‘typical’ behaviour of such an environment. It is incremental, i.e., it makes small changes in each step, which ensures a graceful behaviour of the algorithm. This is a highly desirable feature of any adaptive scheme. Furthermore, it usually has low computational and memory requirements per iterate, another desirable feature of adaptive systems. Finally, it conforms to our anthropomorphic notion of adaptation: It makes small adjustments so as to improve a certain performance criterion based on feedbacks received from the environment.”
Motivated by the benefits of the game-theoretic design, the next goal is to examine this adaptive learning procedure for autonomous energy-aware activation in a network of nodes, which aim to solve an inference problem in a fully distributed manner. Nodes possess adaptation and learning abilities and interact with each other through local in-network processing. Each node is not only capable of experiencing the environment directly, but also receives information through interactions with its neighbors, and processes this information to drive its learning process. Such adaptive networks have been extensively studied in [248, 249] and shown to improve the estimation performance, yet yield savings in computation, communication and energy expenditure.
The question that we pose is:
Can one make such adaptive networks even more power-efficient by allowing each node to activate, sense the environment, and process local information only when its contribution outweighs the associated energy costs?
The intertwined role of adaptation and learning, crucial to the self-configuration of nodes, makes game-theoretic learning an appealing theoretical framework. The aim is to convey a design guideline for self-configuration mechanisms in such multi-agent systems in situations where agents cooperate to accomplish a global task while pursuing diverging local interests.
1.1.2 Stochastic Optimization With Discrete Decision Variables
The second part of this thesis is concerned with identifying the best decision among a set of candidate decisions in the presence of uncertainty. In its simplest form, this problem can be formulated mathematically as a discrete stochastic optimization of the form
\[
\max_{s \in S} F(s) = \mathbb{E}\{ f(s, x_n(s)) \} \tag{1.1}
\]
where $S$ denotes the set of available decisions, $x_n(s)$ is a stochastic process that represents the uncertainty in the outcome observed by choosing $s$, and $f$ is a deterministic function that measures the “performance” of the decision. This topic has received tremendous attention from the research community in several disciplines in the past two decades. Examples include:
• Electrical engineering: Finding the optimal spreading codes of users in a code-division multiple-access system [186], selecting the optimal subset of antennas in a multiple-input multiple-output system [41], topology control in mobile ad hoc networks [133], and transmit diversity and relay selection for decode-and-forward cooperative multiple-input multiple-output systems [74].
• Supply chain management: Choosing the best supplier [82], finding the optimal inventory restocking policy [103, 267], and selecting among several supply chain configuration designs [102].
• Logistics and transportation networks: Selecting the vehicle route for distribution of products to customers [73], and allocating resources in a transportation network [207, 237].
• Manufacturing engineering: Finding the optimal machine quantity in a factory [274], and vehicle allocation for an automated materials handling system [159].
• Management science: Project portfolio management [16, 17], and selecting among several project activity alternatives [234].
• Operations research: Deciding on resources required for operating an emergency department of a hospital to reduce patient time in the system [3].
If the distribution of the random process $x_n(s)$ in (1.1) is known for each candidate decision $s$, the expectation in the objective function can be evaluated analytically. Subsequently, the optimal decision can be found using standard integer programming techniques. However, oftentimes little is known about the structure of the objective function in applications of practical interest. For instance, the stochastic profile of the random elements that affect the problem may be unknown, or the relation between the decision variable and the associated outcome may be so complex that it cannot be formulated as a mathematical function. In such cases, the decision maker has to probe different candidates with the hope of obtaining a somewhat accurate estimate of their objective values. This is referred to as exploration. Discovering new possibilities, conducting research, varying product lines, risk taking, and innovation all fall under the realm of exploration.
These problems with unknown structure are particularly interesting because they address optimization of the performance of complex systems that are realistically represented via simulation models.
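As a point of reference for problem (1.1), the simplest way to probe candidates is to simulate every one of them the same number of times and pick the best sample average. The sketch below illustrates this baseline under stated assumptions: the simulator `simulate`, its quadratic mean with peak at s = 3, the noise level, and the budget of 2000 runs per candidate are hypothetical choices made only for the example, and the routine is not one of the adaptive search schemes proposed in this thesis.

```python
import random


def simulate(s, rng):
    """Hypothetical noisy simulation of decision s.

    Assumed for illustration: the true mean performance is -(s - 3)^2, so the
    optimum is s = 3, and each run is corrupted by Gaussian noise.
    """
    return -(s - 3) ** 2 + rng.gauss(0.0, 2.0)


def brute_force_search(candidates, runs_per_candidate, rng):
    """Estimate F(s) by a sample average at every candidate, then take the argmax."""
    estimates = {}
    for s in candidates:
        samples = [simulate(s, rng) for _ in range(runs_per_candidate)]
        estimates[s] = sum(samples) / runs_per_candidate
    best = max(estimates, key=estimates.get)
    return best, estimates


rng = random.Random(1)
candidates = list(range(7))
runs_per_candidate = 2000
best, estimates = brute_force_search(candidates, runs_per_candidate, rng)
total_runs = len(candidates) * runs_per_candidate

if __name__ == "__main__":
    print("selected decision:", best)
    print("simulation runs spent:", total_runs)
```

The total number of simulation runs grows as the size of the candidate set times the number of runs per candidate, and the same effort is spent on poor candidates as on promising ones, which is the inefficiency that adaptive allocation of the simulation budget aims to remove.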
Of course, one can argue that the uncertainty underlying the decision problem in such systems is eliminated by performing a large number of simulation runs at different candidates to obtain an accurate estimate of the objective function. However, such simulations are typically computationally expensive. Therefore, there are two difficulties involved in solving these optimization problems with discrete decision variables: (i) the optimization problem itself is NP-hard, and (ii) experimenting on a trial decision can be too (computationally) “expensive” to be repeated numerous times.
In view of the above discussion and the great practical significance of simulation-based discrete optimization problems, it is of interest to devise adaptive search algorithms that enable localizing the globally optimal candidate with as little expense as possible. The second part of this thesis is devoted to sampling-based random search solutions to the discrete stochastic optimization problem (1.1). The appeal of sampling-based methods is that they often approximate well, with a relatively small number of samples, problems with a large number of scenarios; see [55] and [199] for numerical reports. The random search methods are adaptive in the sense that they use information gathered during previous iterations to adaptively decide how to exhaust the simulation budget in the current iteration. A proper solution has to maintain an appropriate balance between exploration, exploitation, and estimation. Exploration refers to searching globally for promising candidates within the entire search space, exploitation involves local search of promising subregions of the set of candidate solutions, and estimation refers to obtaining more precise estimates at desirable candidates.
Research Goals
In the second part of the thesis, the aim is to study simulation-based discrete stochastic optimization problems with the following two extensions:
1. the data collected via simulation experiments (or feedback from the environment) is temporally correlated;
2. the problem is non-stationary and either the stochastic profile of the random elements or the objective function evolves randomly over time.
We then proceed to devise sampling-based adaptive search algorithms that are both capable of tracking the non-stationary global optimum, and efficient in the sense that they localize the optimum by taking and evaluating as few samples as possible from the search space. The resulting algorithms can thus be deployed as online control mechanisms that enable self-configuration of large-scale stochastic systems. To this end, we focus on sampling strategies that are inspired by adaptive learning algorithms in game theory, and borrow techniques from the theories of stochastic approximation [40, 192], Lyapunov stability of dynamical systems [179], Markov switched systems [291, 293], and weak convergence methods [80, 192].
1.2 Main Contributions
This section briefly describes the major novel contributions of the chapters comprising this thesis. These contributions fall into three general categories: novel modeling approaches, novel adaptive learning algorithms, and the analytical approaches applied to evaluate the performance of the proposed adaptive learning algorithms in each chapter. These contributions are reviewed below for each chapter in the order that they appear in the thesis. A more detailed description of the contributions of each chapter is given in the individual chapters.
1.2.1 Interactive Decision Making in Changing Environments
Chapter 2 is inspired by the work of Hart & Mas-Colell in [141, 143], and focuses on learning algorithms in interactive situations where decision makers form social groups and disclose information of their decisions only within their social groups. This new information flow pattern
is motivated by the emergence of social networking services, such as Twitter, Google+, and Facebook, which have led to new ways of sharing information within interested communities. It calls for new adaptive learning methodologies that exploit these valuable pieces of information to facilitate coordination of these units by speeding up the resolution of uncertainties. Fast and unpredictable changes in trends also emphasize the intertwined role of adaptation and learning, and call for algorithms that are agile in responding to such changes based only on limited communication with fellow group members and feedback from the environment.
The main contributions of Chapter 2 are summarized below:
1. A graph-theoretic neighborhood monitoring model that captures formation of social groups. Each decision maker shares her decisions with her fellow group members only after they are made, and is oblivious to the existence of others whose decisions may affect the outcome of her decision.
2. A regime switching model that captures random evolution of parameters affecting the decision problem or the interactive environment (or both). The decision makers are oblivious to the dynamics of such evolution, and only realize it through their interactions once a change occurs. This model is mainly meant for trackability analysis of the devised adaptive learning scheme.
3. A regret-matching type adaptive learning algorithm that prescribes how to make decisions at each stage based only on the realizations of the outcome of previous decisions, and the exchange of information within the social group. The proposed rule-of-thumb learning strategy is simply a non-linear adaptive filtering algorithm that seeks low internal regret [48]. The internal regret measures the loss incurred by a decision relative to the other available decisions; for example, “What would your overall satisfaction have been if, every time you bought coffee from Starbucks, you had bought from a local store?”
4. Performance analysis of the proposed adaptive learning algorithm. We prove that, if the decision maker individually follows the proposed regret-based learning strategy and the random evolution of the parameters underlying the game occurs on the same timescale as the adaptation speed of the algorithm, she will experience a worst-case regret of at most ε (for a small number ε) after sufficiently many repeated plays of the game, and is able to keep it at that level as the game evolves. Further, if all players follow the proposed adaptive learning algorithm independently and the parameters of the game evolve on a slower timescale than the adaptation speed of the algorithm, their global behavior will track the regime switching set of correlated equilibria. Differently put, decision makers can coordinate their strategies in a distributed fashion so that the distribution of their joint behavior belongs to the correlated equilibria set.
5. Numerical evaluation and comparison with adaptive variants of existing learning algorithms
In the regime switching setting, it is shown that the proposed learning rulebenefits from this excess information and yields faster adaptability as compared with theregret-matching [143] and fictitious play [105] procedures.These results set the stage for Chapter 3 to design an energy-aware activation controlmechanism and examine the benefits of the developed framework and the proposed adaptivelearning algorithm in an application that has recently attracted much attention in the signalprocessing and machine learning societies.1.2.2 Energy-Aware Sensing Over Adaptive NetworksChapter 3 is inspired by the recent literature on cooperative diffusion strategies [204, 249, 250]to solve global inference and optimization problems in a distributed fashion. It aims to provideguidelines on the game-theoretic design process for multi-agent systems in which agents trade-offtheir local interests with the contribution to the global task assigned to them. The applicationwe focus on is a distributed solution to a centralized parameter estimation task via the diffusionleast mean squares (LMS) filter proposed in [64, 204]. The diffusion protocol implements acooperation strategy among sensors by allowing each sensor to share its estimate of the trueparameter with its neighbors. These estimates are then fused and fed into an LMS-type adaptivefilter to combine it with sensor’s new measurements, and output the new estimate. Besidesthe stabilizing effect on the network, this cooperation scheme has been proved to improvethe estimation performance, and yield savings in computation, communication and energyexpenditure [249].The main contributions of Chapter 3 are summarized below:1. An ordinary differential equation (ODE) approach for convergence analysis of diffusion LMS.By properly rescaling the periods at which measurements and data fusion take place, thediffusion LMS adaptive filter is reformulated as a classical stochastic approximation algo-rithm [192]. 
This new formulation allows us to use the ODE method, which is perhaps the most powerful and versatile tool in the analysis of stochastic recursive algorithms, to obtain a simpler derivation of the known convergence results. Other advantages include: (i) one can consider models with correlated noise in the measurements of sensors, and (ii) the weak convergence methods [192] can be applied to analyze the tracking capability in the case of a time-varying true parameter.
2. A game-theoretic formulation for the energy-aware activation control problem. Sensors are enabled to periodically make decisions as to whether to activate or sleep. Associated with each decision is a reward or penalty that captures the spatial-temporal correlation among sensors’ measurements, and trades off each sensor’s contribution to the estimation task against the associated energy costs. This problem is formulated as a non-cooperative game that is repeatedly played among sensors.
3. An energy-aware diffusion LMS adaptive filter. A novel two timescale stochastic approximation algorithm is proposed that combines the diffusion LMS (slow timescale) with a game-theoretic learning functionality (fast timescale) that, based on the adaptive learning algorithm of Chapter 2, prescribes when the sensor should activate and run the diffusion LMS adaptive filter. Existing game-theoretic treatments in sensor networks tend to focus on activities such as communication and routing [120, 185, 281]. To the best of our knowledge, this is the first energy saving mechanism that directly interacts with the parameter estimation algorithm.
4. Convergence analysis of the energy-aware diffusion LMS algorithm.
We prove that if each sensor individually follows the proposed energy-aware diffusion LMS algorithm, the local estimates converge to the true parameter across the network, yet the global activation behavior along the way tracks the evolving set of correlated equilibria of the underlying activation control game.
5. Numerical evaluation. Finally, simulation results verify theoretical findings and illustrate the performance of the proposed scheme, in particular, the trade-off between the performance metrics and energy savings.
1.2.3 Discrete Stochastic Optimization in Changing Environments
The second part of this thesis concerns finding the optimal decision in uncertain, dynamic, and non-interactive situations. Fast and unpredictable changes in trends in recent years call for adaptive algorithms that are capable of responding to such changes in a timely manner. When little is known about the structure of such decision problems and the objectives cannot be evaluated exactly, one has to experience each candidate decision several times so as to obtain a somewhat accurate estimate of its objective value. In real life, however, some experiments may be too expensive to be repeated numerous times. Many systems of practical interest are also so complex that simulating them numerous times at a candidate configuration, to obtain a more accurate estimate of its objective value, is computationally prohibitive. Therefore, it is of interest to design specialized search techniques that are both attracted to the globally optimal candidate, and efficient, in the sense that they localize the optimum with as little computational expense as possible.
The adaptive search algorithms presented and analyzed in the second part are inspired by adaptive learning algorithms in game theory. Chapter 4 introduces an adaptive search algorithm that relies on a smooth best-response sampling strategy, and is inspired by fictitious play [106] learning rules in game theory.
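The general idea of a smooth best-response sampling strategy can be illustrated with a logit (Boltzmann) rule: candidates with higher estimated performance are sampled with exponentially higher probability, while every candidate retains a positive chance of being explored. The following is a generic sketch under stated assumptions and should not be read as the algorithm of Chapter 4; the function names, the temperature 0.3, the step size 0.05, and the test objective `noisy_objective` are all hypothetical.

```python
import math
import random


def smooth_best_response_search(simulate, candidates, iterations,
                                temperature=0.3, step_size=0.05, seed=0):
    """Adaptive sampling with a logit (smooth best-response) rule.

    At each iteration one candidate is sampled with probability proportional
    to exp(estimate / temperature), simulated once, and its running estimate
    is updated with a constant step size.
    """
    rng = random.Random(seed)
    estimates = {s: 0.0 for s in candidates}
    visits = {s: 0 for s in candidates}
    for _ in range(iterations):
        top = max(estimates.values())  # subtracted for numerical stability
        weights = [math.exp((estimates[s] - top) / temperature)
                   for s in candidates]
        s = rng.choices(candidates, weights=weights, k=1)[0]
        y = simulate(s, rng)
        estimates[s] += step_size * (y - estimates[s])
        visits[s] += 1
    return estimates, visits


def noisy_objective(s, rng):
    """Assumed test problem: mean performance -(s - 3)^2 plus Gaussian noise."""
    return -(s - 3) ** 2 + rng.gauss(0.0, 1.0)


candidates = list(range(7))
estimates, visits = smooth_best_response_search(noisy_objective, candidates, 5000)
most_sampled = max(visits, key=visits.get)

if __name__ == "__main__":
    print("most frequently sampled candidate:", most_sampled)
    print("visit counts:", visits)
```

In this toy problem most of the simulation budget ends up at the candidate with the highest mean. The temperature controls the balance: a lower value concentrates sampling on the current best estimate, whereas a higher value spreads samples more evenly and reacts faster if the optimum moves.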
The main contributions of Chapter 4 are summarized below:
1. A regime switching model that captures random evolution of parameters affecting the stochastic behavior of the elements underlying the discrete stochastic optimization problem or the objective function (or both). The decision maker is oblivious to the dynamics of such evolution, and only realizes it through repeated observations of the outcome of her decisions. This model is mainly meant for analyzing the tracking capability of the devised adaptive search schemes.
2. A smooth best-response adaptive search scheme. Inspired by stochastic fictitious play learning rules in game theory [106], we propose a class of adaptive search algorithms that relies on a smooth best-response sampling strategy. The proposed algorithm assumes no functional properties such as sub-modularity, symmetry, or exchangeability on the objective function, and relies only on the data collected via observations, measurements or simulation experiments. It further allows temporal correlation of the collected data, which is more realistic in practice.
3. Performance analysis of the smooth best-response adaptive search algorithm. We prove that, if the evolution of the parameters underlying the discrete stochastic optimization problem occurs on the same timescale as the updates of the adaptive search algorithm, it can properly track the global optimum by showing that a regret measure can be made and kept arbitrarily small infinitely often. Further, if the time variations occur on a slower timescale than the adaptation rate of the proposed algorithm, the most frequently sampled element of the search space tracks the global optimum as it jumps over time. This in turn implies that the proposed scheme exhausts most of its simulation budget on the global optimum.
4. Numerical evaluation and comparison with existing random search and upper confidence bound methods. Two scenarios are considered: static, and regime switching discrete stochastic optimization.
In the static setting, the numerical studies illustrate faster convergence to the global optimum as compared with the random search [9] and the upper confidence bound [22] algorithms. In the regime switching setting, the proposed scheme yields faster adaptability as compared with the two adaptive variants of the random search and upper confidence bound methods, presented in [290] and [115], respectively.
The sampling-based adaptive search scheme presented in Chapter 4 relies on a sampling strategy that is of best-response type. That is, it keeps sampling the candidate which is, thus far, conceived to guarantee the highest “performance.” This is an appealing feature as one aims to localize the globally optimal candidate with as little computational expense as possible. Chapter 5 presents a class of regret-based adaptive search algorithms that is of better-reply type, and is inspired by the regret-matching adaptive learning rule presented in Chapter 2. That is, it keeps sampling and evaluating all those candidates that the decision maker has found promising based on the limited data collected thus far. Somewhat surprisingly, it will be shown in Chapter 5 that, by properly choosing the potential function in the sampling strategy, this algorithm exhibits the same performance characteristics as the smooth best-response adaptive search algorithm of Chapter 4.
The main contributions of Chapter 5 are summarized below:
1. A regret-based adaptive search scheme. Inspired by the regret-matching learning rule in Chapter 2, we propose a class of regret-based adaptive search algorithms that relies on a potential-based sampling strategy. The proposed algorithm assumes no functional properties such as sub-modularity, symmetry, or exchangeability on the objective function, and allows correlation among the collected data, which is more realistic in practice.
2. Performance analysis of the regret-based adaptive search algorithm.
We prove that, if the evolution of the parameters underlying the discrete stochastic optimization problem occurs on the same timescale as the updates of the adaptive search algorithm, it can properly track the global optimum by showing that a regret measure can be made and kept arbitrarily small infinitely often. Further, if the time variations occur on a slower timescale than the adaptation rate of the proposed algorithm, the most frequently sampled candidate tracks the global optimum as it jumps over time. Finally, the proportion of experiments completed at a non-optimal candidate solution is inversely proportional to how far its associated objective value is from the global optimum.
3. A discrete-event simulation optimization of an inventory management system. We run a discrete-event simulation for a classic (s, S) inventory restocking policy to illustrate the performance of the proposed algorithm in a practical application. Comparison against the random search methods in [9, 290], the simulated annealing algorithm in [5], and the upper confidence bound schemes of [22, 115] confirms faster convergence and superior efficiency. Numerical studies further show that, although the sampling strategies differ in nature, the performance characteristics of the regret-based adaptive search are quite similar to those of the adaptive search algorithm of Chapter 4.
1.3 Related Work
This section is devoted to the literature review of progress and advances in the fields and subjects connected to the main chapters in the order that they appear in this thesis.
1.3.1 Game-Theoretic Learning
The basic game-theoretic approach in this thesis is well documented in the mathematics and economics literature. Comprehensive texts on the subject include [27, 28, 106, 229]. Parallel to the growing interest in social networks, a new game-theoretic paradigm has been developed to incorporate some aspects of social networks such as information exchange and influence structure into a game formalism.
One such example is the so-called graphical games, where each player’s influence is restricted to his immediate neighbors [173, 174]. Early experimental studies on these games indicate significant network effects. For instance, Kearns et al. present in [175] the results of experiments in which human subjects solved a version of the vertex coloring problem^2 that was converted into a graphical game. Each participant had control of a single vertex in the network. The task was to coordinate (indirectly) with neighbors so that they all had the same color. The interested reader is further referred to [121, 184] for recent advances in interactive sensing and decision making by social agents.

The correlated equilibrium arguably provides a natural way to capture conformity to social norms [61]. It can be interpreted as a mediator instructing people to take actions according to some commonly known probability distribution. Such a mediator can be thought of as a social norm that assigns roles to individuals in a society [284]. If it is in the interests of each individual to assume the role assigned to him by the norm, then the probability distribution over roles is a correlated equilibrium [25, 81, 90]. The fact that actions are conditioned on signals or roles indicates how conformity can lead to coordinated actions within social groups [167, 254]. Norms of behavior are important in various real-life behavioral situations such as public good provision^3, resource allocation, and the assignment of property rights.

The works [62, 86] further report results from an experiment that explores the empirical validity of correlated equilibrium. It is found in [86] that when the private recommendations from the mediator are not available to the players, the global behavior is characterized well by mixed-strategy Nash equilibrium.
Their main finding, however, was that players follow recommendations from the third party only if those recommendations are drawn from a correlated equilibrium that is “payoff-enhancing” relative to the available Nash equilibria.

Hart & Schmeidler provide an elementary proof of the non-emptiness of the set of correlated equilibria in [146] using the minimax theorem. Several mechanisms have been proposed to realize the correlated equilibrium, using either a central authority who recommends or verifies proposed actions [24, 109, 193], or a verification scheme among the players [29, 31, 35, 91, 131, 265].

^2 The vertex coloring problem is an assignment of labels (colors) to the vertices of a graph such that no two adjacent vertices share the same label. The model has applications in scheduling, register allocation, pattern matching, and community identification in social networks [208, 264].

^3 The public good game is a standard of experimental economics. In the basic game, players secretly choose how many of their private tokens to put into a public pot and keep the rest. The tokens in this pot are multiplied by a factor (greater than one) and this “public good” payoff is evenly divided among players.

The simplest information flow model in a game is one in which players’ decisions can be observed by all other players at the end of each round. The most well-known adaptive learning procedure for such a setting, namely, fictitious play [241], was introduced in 1951 and has been extensively studied later in [36, 105, 106, 151]. In fictitious play, the player behaves as if she is Bayesian. That is, she believes that the opponents’ play corresponds to draws from an unknown stationary distribution and simply best-responds to her belief about such a distribution^4. Another simple model, namely, stochastic fictitious play [106], forms beliefs as in fictitious play, but chooses actions according to a stochastic best-response function.
This new model brings about two advantages: (i) it avoids the discontinuity inherent in fictitious play, where a small change in the belief can lead to an abrupt change in behavior; (ii) it is Hannan-consistent [137]: its time-average payoff is at least as good as maximizing against the time-average of opponents’ play, which is not true for fictitious play. Hannan-consistent strategies have been obtained in several works; see, e.g., [92, 94, 98, 105]. Fudenberg & Kreps prove in [104] that, in games with a unique mixed strategy Nash equilibrium, play under this learning scheme converges to the Nash equilibrium. These results were then extended to a general class of 2 × 2 games (2 players, 2 possible actions each) in [171], n × m games in which players have identical payoff functions in [214], 2 × 2 games with countably many Nash equilibria, and several classes of n × 2 coordination and anti-coordination games in [37]. These works are further closely connected to the literature on worst case analysis in computer science [89, 97, 200, 276].

Regret-matching [141–143, 145] as a strategy of play in long-run interactions has been known, for over a decade, to guarantee convergence to the correlated equilibria set. The regret-based adaptive procedure in [141] assumes the same information model as fictitious play, whereas the regret-based reinforcement learning algorithm in [143] assumes a set of isolated players who neither know the game, nor observe the decisions of other players, and rely only on their realized payoffs. In the learning, experimental, and behavioral literature, there are various models that are similar to the regret-matching procedure. Examples include [59, 88, 242, 255], to name a few, with [88, 242] being the closest. Regret measures also feature in the recent neuroeconomics literature on decision-making [60, 76], and have further been incorporated into the utility functions to provide alternative theories of decision making under uncertainty [34, 202].
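
As a rough illustration of the regret-matching idea referenced above, the following Python sketch shows one common textbook form of the rule under the same information model as fictitious play (each player can evaluate what every alternative action would have earned). The function names, the regret bookkeeping `avg_regret`, and the normalization constant `mu` are assumptions made for this sketch; it is not claimed to reproduce the exact procedures in [141, 143].

```python
import random


def update_regret(avg_regret, played, payoffs_if, t):
    """Update the running-average regret matrix after round t (t >= 1).

    avg_regret[j][k] averages, over all rounds so far, the gain the player
    would have obtained by playing k in the rounds where it actually played j.
    payoffs_if[k] is the payoff action k would have earned this round
    (an assumption: opponents' actions are observed, so it can be computed).
    """
    n = len(avg_regret)
    for j in range(n):
        for k in range(n):
            gain = (payoffs_if[k] - payoffs_if[played]) if j == played else 0.0
            avg_regret[j][k] += (gain - avg_regret[j][k]) / t


def regret_matching_step(current, avg_regret, mu, rng):
    """Choose the next action given the current one.

    The player switches to k with probability proportional to the positive
    part of its regret for not having played k, and otherwise repeats the
    current action. mu is a hypothetical constant chosen large enough that
    the switching probabilities sum to less than one.
    """
    n = len(avg_regret)
    probs = [0.0] * n
    for k in range(n):
        if k != current:
            probs[k] = max(avg_regret[current][k], 0.0) / mu
    probs[current] = 1.0 - sum(probs)  # remaining mass: stay with the same action
    next_action = rng.choices(range(n), weights=probs)[0]
    return next_action, probs
```

In this sketch the probability mass left on the current action is what produces the tendency to repeat the previous decision.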
Another interesting aspect captured by the regret-matching procedure is the fact that people stick to their actions for a disproportionately long time as in the “status quo bias” [245].

Evolutionary game theory also provides a rich framework for modeling the dynamics of adaptive opponent strategies for large populations of players. The interested reader is referred to [152, 246] for recent surveys.

^4 Obviously, if all players make decisions according to fictitious play, the actual environment is not stationary; hence, players have the wrong model of the environment.

1.3.2 Adaptive Networks

Adaptive networks are comprised of a collection of agents with adaptation and learning abilities, which are linked together through a topology, and interact to solve inference and optimization problems in a fully distributed and online manner. Adaptive networks are well-suited to model several forms of complex behavior exhibited by biological [58, 170] and social networks [87, 164] such as fish schooling [232], prey-predator maneuvers [136, 212], bird flight formations [162], bee swarming [33, 165], bacteria motility [233, 282], and social and economic interactions [129]. These studies have brought forward notable examples of complex systems that derive their sophistication from coordination among simpler units, and from the aggregation and processing of decentralized pieces of information. A comprehensive treatment of the adaptation and learning capabilities of such networks can be found in [249].

The cooperation strategies, which enable fully distributed implementations of the global inference or optimization task in such networks, can be classified as: (i) incremental strategies [43, 166, 203, 239], (ii) consensus strategies [44, 79, 228, 240, 268], and (iii) diffusion strategies [64, 69, 70, 204, 251]. In the first class of strategies, information is passed from one agent to the next over a cyclic path until all agents are visited.
In contrast, in the latter two, cooperation is enforced among multiple agents rather than between two adjacent neighbors. It has been shown in [269] that diffusion strategies outperform consensus strategies. Therefore, we concentrate on diffusion strategies to implement cooperation among agents in Chapter 3. The diffusion strategies have been used previously for distributed estimation [64, 204], distributed collaborative decision making [119, 270], and distributed Pareto optimization [70].

The literature on energy saving mechanisms for a distributed implementation of an estimation task mostly focuses on how to transmit information [205, 253, 288], rather than how to sense the environment; see [281] for a survey. There are only a few works [120, 185] that propose game-theoretic methods for energy-efficient data acquisition. However, none explicitly takes into account the diffusion of information and network connectivity structure to study direct interaction of the game-theoretic activation mechanism with the parameter estimation algorithm.

1.3.3 Simulation-Based Discrete Optimization

The research and practice in simulation-based discrete optimization have been summarized in several review papers; see, e.g., [101, 227]. These papers also survey optimization add-ons for discrete-event simulation software, which have been undergoing fast development over the past decade, e.g., OptQuest®^5 and SIMUL8®^6. Andradóttir also presents a review on simulation optimization techniques in [10], mainly focusing on gradient estimation, stochastic approximation, and random search methods.

^5 OptQuest® is a registered trademark of OptTek Systems, Inc. (www.optquest.com).

^6 www.simul8.com

If the number of candidate solutions is small, say less than 20, then ranking, selection, and multiple comparisons procedures can be used [32, 147, 150, 158]. In [127], ranking and selection and multiple comparison procedures are reviewed.
Further details on how such procedures are constructed, and a review of the key results that are useful in designing such procedures, are given in [180]. The common feature of these methods is that at their termination they provide a statistical guarantee regarding the quality of the chosen system. For high dimensional problems, random search and ordinal optimization methods have been proposed, e.g., [149, 156, 285]. Additional references on the topic include [77, 78, 148].

Random search usually assumes very little about the topology of the underlying optimization problem and the objective function observations. Andradóttir provides an overview of the random search methods in [12], and discusses their convergence and desirable features that ensure attractive empirical performance. Some random search methods spend significant effort to simulate each newly visited state at the initial stages to obtain an estimate of the objective function. Then, deploying a deterministic optimization mechanism, they search for the global optimum; see, e.g., [68, 153, 181, 244]. The adaptive search algorithms of Chapters 4 and 5 are related to another class, namely, discrete stochastic approximation methods [192, 263], which distribute the simulation effort through time, and proceed cautiously based on the limited information available at each time. Algorithms from this class primarily differ in the choice of the strategy that prescribes how to sample candidates iteratively from the search space. The sampling strategies can be classified as: (i) point-based, leading to methods such as simulated annealing [5, 95, 116, 135, 238], tabu search [126], stochastic ruler [6, 7, 287], stochastic comparison and descent algorithms [9, 11, 14, 290]; (ii) set-based, leading to methods such as branch-and-bound [225, 226], nested partitions [236, 256], stochastic comparison and descent algorithms [155]; and (iii) population-based, leading to methods such as genetic algorithms.
The random search method has further been applied to solve simulation optimization problems with continuous decision variables [15].

Random search algorithms typically generate candidate solutions from the neighborhood of a selected solution in each iteration. Some algorithms have a fixed neighborhood structure, e.g., [5, 6, 8, 9, 130, 287]; others change their neighborhood structure based on the information gained through the optimization process, e.g., [236, 256]. Roughly speaking, they use a large neighborhood at the beginning of the search when knowledge about the location of good solutions is limited, and the neighborhood shrinks as more and more information on the objective function is collected. When good solutions are clustered together, which is often true in simulation, algorithms with an adaptive neighborhood structure work better than those with a fixed neighborhood structure. With respect to the size of the feasible set, [6, 8] can solve problems with a countably infinite number of feasible solutions, while others are suited to problems with finite feasible sets.

The convergence properties of simulation-based discrete optimization algorithms can also be divided into three categories: no guaranteed convergence, locally convergent, and globally convergent. Most algorithms used in commercial software are heuristics, such as tabu search and scatter search [125] in OptQuest, and provide no convergence guarantee. Random search algorithms, e.g., [6, 9, 236, 256, 287], are globally convergent. Any globally convergent algorithm that assumes no structure on the objective function requires visiting every feasible solution infinitely often to guarantee convergence. However, when the computational budget is low, requiring the algorithm to visit a large number of solutions, if not every solution, is unreasonable. Therefore, [155] proposes a locally convergent scheme that focuses its sampling effort in the current most promising region of the feasible set.
In [154], a coordinate search method is proposed that is also almost surely locally convergent, but is not of random search type.

Although the traditional role of ranking and selection methods is to select the best system among a small number of simulation alternatives, they have recently been embedded into random search methods to tackle problems with large feasible sets. For instance, the authors of [49] suggest applying their ranking and selection procedure to identify the best design from among “good” alternatives identified by a random search method. In [236], the ranking and selection procedure is used to aid the random search algorithm to make better moves more efficiently. More recently, [13] has developed ranking and selection procedures that identify the system with the best primary performance measure among a finite number of simulated systems in the presence of a stochastic constraint on a single real-valued secondary performance measure.

The concept of regret, well known in decision theory, has also been introduced to the realm of random sampling [84, 85]. These methods are of particular interest in the multi-armed bandit literature [22, 23, 52, 252, 273], which is concerned with optimizing the cumulative objective function values realized over a period of time [22, 23], and the pure exploration problem [21, 53, 54, 110, 111], which involves finding the best arm after a given number of arm pulls. In such problems, the regret value of a candidate arm measures the worst-case consequence that might possibly result from selecting another arm. Such a regret is sought to be minimized via devising random sampling schemes that establish a proper balance between exploration and exploitation. These methods, similar to the random search methods, usually assume that the problem is static in the sense that the arms’ reward distributions are fixed over time, and exhibit reasonable efficiency mostly when the size of the search space is relatively small.
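
To make the exploration-exploitation balance mentioned above concrete, the following Python sketch implements the widely known UCB1 index rule for a static bandit. It is offered only as an illustration under stated assumptions: the function names, the `pull` callback, and the choice of UCB1 itself are made for this sketch and are not claimed to match the specific schemes in the cited works.

```python
import math


def ucb1_select(counts, means, t):
    """Return the arm to pull at round t (t >= 1) under the UCB1 index.

    counts[a] is how often arm a has been pulled, means[a] its sample mean.
    """
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # pull every arm once before using the index
    return max(
        range(len(counts)),
        # sample mean (exploitation) plus a confidence width (exploration)
        key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
    )


def run_ucb1(pull, n_arms, horizon):
    """Run UCB1 for `horizon` rounds; pull(arm) returns an observed reward."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, horizon + 1):
        arm = ucb1_select(counts, means, t)
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    return counts, means
```

Because the confidence width shrinks only as an arm is pulled, rarely sampled arms keep being revisited occasionally, while most of the sampling effort concentrates on the arm with the highest estimated reward. The sketch assumes fixed reward distributions, in line with the static setting described above.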
For upper confidence bound policies for non-stationary bandit problems, see [115].

This work also connects well with earlier efforts [114, 163], which view optimizing the performance of a system comprising several decision variables, each taking value in a finite set, as a non-cooperative game with identical payoff functions, and propose heuristics based on fictitious play to reach the Nash equilibrium.

1.4 Thesis Outline

The rest of this thesis is divided into two parts and five chapters as outlined below:

Part I concerns interactive decision making in an uncertain and randomly changing environment, and is comprised of two chapters:

• Chapter 2 provides a review of non-cooperative games and their well-known solution concepts. It then introduces a class of regime switching non-cooperative games with neighborhood monitoring, and proposes a regret-based adaptive learning algorithm that exploits the excess information received from fellow neighbors to facilitate global coordination. It is then shown that, under certain conditions on the speed of time variations, the global behavior emerging from individual decision makers following the proposed algorithm can properly track the regime switching set of the correlated equilibria. This chapter is concluded with a numerical example to illustrate and compare the performance of the proposed adaptive learning scheme.

• Chapter 3 provides guidelines on the game-theoretic design of multi-agent systems that can benefit from the framework and algorithm developed in Chapter 2. It starts with the introduction of the diffusion least mean squares (LMS) algorithm that has recently attracted attention in the signal processing community. The diffusion LMS is then reformulated as a classical stochastic approximation algorithm. This enables using the well-known ODE approach for convergence analysis, which leads to a simpler derivation of the known results.
It is then combined with a game-theoretic energy-aware activation mechanism based on the adaptive learning algorithm in Chapter 2. Finally, the performance of the proposed energy-aware diffusion LMS is analyzed, and numerical results are provided to substantiate the theoretical findings.

Part II concerns the problem of a single decision maker who seeks the optimal choice in the presence of uncertainty and non-stationarities underlying the environment.

• Chapter 4 introduces a class of regime switching discrete stochastic optimization problems, whose objective functions have unknown structure. An adaptive search algorithm is proposed that relies on a smooth best-response sampling strategy. Subsequently, it is shown that the proposed algorithm is globally convergent, and efficiency guarantees are provided. This chapter is concluded with numerical examples that illustrate and compare the proposed scheme with existing methods in terms of both convergence and efficiency.

• Chapter 5 presents a regret-based adaptive search algorithm that is inspired by the regret-matching adaptive learning algorithm of Chapter 2. The algorithm differs from the smooth best-response adaptive search in Chapter 4 in that it relies on a novel potential-based sampling strategy that is of better-reply type. Subsequently, it is shown that the proposed algorithm is globally convergent and some efficiency guarantees are provided. This chapter is concluded with two numerical examples that illustrate the performance and efficiency of the proposed algorithm.
In the second example, the algorithm is tested in a discrete-event simulation optimization of an inventory management system.

Finally, Chapter 6 briefly summarizes the results, provides concluding remarks, and comments on the future directions for research in the areas related to the material presented in this thesis.

Part I
Adaptation and Learning in Non-Cooperative Games

2 Adaptive Learning in Regime Switching Non-Cooperative Games

“Rational behavior requires theory. Reactive behavior requires only reflex action.”
—W. Edwards Deming

2.1 Introduction

This chapter focuses on decision problems involving multiple self-interested individuals whose decisions affect one another. The study of strategic decision making in such interactive situations is the subject of non-cooperative game theory, which characterizes the optimal solution by an equilibrium state. The game-theoretic concept of equilibrium describes a condition of global coordination where all decision makers are content with the social welfare realized as the consequence of their choices. Reaching an equilibrium, however, involves a complex process of individuals guessing what each other will do. The question is then whether, and which, local learning strategies will lead to the global equilibrium behavior.

Among the two prominent equilibrium notions in non-cooperative game theory, namely, Nash equilibrium [218] and correlated equilibrium [25], this chapter focuses on the latter. The set of correlated equilibria describes a condition of competitive optimality among decision makers, and is arguably the most natural attractive set for adaptive learning algorithms. Several advantages make correlated equilibrium more appealing: It is realistic in multi-agent learning as the past history of individuals’ decisions naturally correlates their future actions. In contrast, Nash equilibrium assumes agents act independently, which is rarely true in any learning scenario.
It is structurally and computationally simpler than the Nash equilibrium [219]. The coordination among agents in the correlated equilibrium can further lead to potentially higher social welfare than if agents make decisions independently, as in Nash equilibrium [25]. Finally, no general procedure is known to converge to a Nash equilibrium in non-cooperative games, other than those essentially equivalent to exhaustive search. In contrast, there exist adaptive learning algorithms that converge to correlated equilibrium under almost no assumption on the game model. Indeed, Hart and Mas-Colell observe, while describing such adaptive algorithms in [141], that:

“At this point a question may arise: Can one actually guarantee that the smaller set of Nash equilibria is always reached? The answer is definitely “no.” On the one hand, in our procedure, as in most others, there is a natural coordination device: the common history, observed by all players. It is thus reasonable to expect that, at the end, independence among players will not obtain. On the other hand, the set of Nash equilibria is a mathematically complex set (a set of fixed-points; by comparison, the set of correlated equilibria is a convex polytope), and simple adaptive procedures cannot be expected to guarantee the global convergence to such a set.”

2.1.1 Learning in Non-Cooperative Games

The theory of learning in games formalizes the idea that equilibrium arises as a result of players learning from experience. This chapter focuses on adaptive learning models in which individuals try to increase their own payoffs while simultaneously learning about others’ strategies. Regret-matching [141, 143] is one such adaptive procedure, and has been known to guarantee convergence to the correlated equilibria set [25]. The regret-matching adaptive procedure in [141] assumes that all individuals share their decisions once they are made.
This procedure is viewed as particularly simple and intuitive as no sophisticated updating, prediction, or fully rational behavior is essential on the part of agents. The procedure can be simply described as follows: At each period, an individual may either continue with the same action as the previous period, or switch to other actions, with probabilities that are proportional to a “regret measure.” Willingness to proceed with the previous decision mimics the “inertia” that is present in the human decision making process.

Assuming that each decision maker is capable of observing the decisions of all others is rather impractical and unrealistic. It, however, sheds light and provides insight on the grounds of the decision making process, and how it can be deployed in interactive situations with more complex patterns of information flow. The reinforcement learning procedure in [143] assumes that decision makers are isolated, and can only observe the outcome of their decisions via a feedback from the environment. The disadvantages of relying only on the realized outcomes are very low convergence speed, and convergence to an approximate correlated equilibrium, namely, correlated ε-equilibrium.

2.1.2 Chapter Goal

Motivated by social networking services, such as Twitter, Google+, and Facebook, and wireless-enabled electronics, in this chapter, we focus on scenarios where self-interested decision makers subscribe to social groups within which they share information about their past experiences. Individuals are in fact oblivious to the existence of other actors, except their fellow group members, whose decisions affect the outcome of their choices.
The question that this chapter poses and answers is then:

Can one devise an adaptive learning algorithm that captures the emerging information flow patterns to ensure faster coordination among agents’ strategies, and adaptability to the regime switching underlying the decision problem?

To answer this question, we focus on regret-matching adaptive procedures and take our inspiration from Hart & Mas-Colell’s works [141, 143]. The literature on learning in games verifies the balance between the complexity of the learning rules and the decision problem: Simple learning rules are expected to perform well in simple environments, whereas larger and more complex environments call for more sophisticated learning procedures.

2.1.3 Main Results

The main results in this chapter are summarized below:

1. A graph-theoretic neighborhood monitoring model that captures formation of social groups, and a regime switching model that captures random and unpredictable evolution of parameters affecting the decision problem or the interactive environment (or both).

2. A regret-matching type adaptive learning algorithm that prescribes how to make decisions at each stage based only on the realizations of the outcome of previous decisions, and the exchange of information within the social group. The proposed rule-of-thumb learning strategy is simply a non-linear adaptive filtering algorithm that seeks low internal regret [48].

3. Performance analysis of the proposed adaptive learning algorithm. We prove that, if the decision maker individually follows the proposed regret-based learning strategy and the random evolution of the parameters underlying the game occurs on the same timescale as the adaptation speed of the algorithm, she will experience a worst case regret of at most ε after sufficient repeated plays of the game, and is able to keep it as the game evolves.
Further, if all players start following the proposed adaptive learning algorithm and the parameters of the game evolve on a slower timescale than the adaptation speed of the algorithm, their global behavior will track the regime switching set of correlated ε-equilibria.

4. Numerical evaluation and comparison with existing schemes in the literature. Two scenarios are considered: static, and regime switching games. In the static setting, the numerical studies illustrate that an order of magnitude faster coordination can be obtained by exploiting the information disclosed within social groups, as compared with the regret-matching reinforcement learning algorithm [143], even for a small game comprising three players. In the regime switching setting, it is shown that the proposed learning rule benefits from this excess information, and yields faster adaptability as compared with the regret-matching [143] and fictitious play [105] procedures.

It is well known in the theory of stochastic approximations that, if the underlying parameters change too drastically, there is no chance one can track the time-varying properties via an adaptive algorithm. Such a phenomenon is known as trackability; see [40] for related discussions. On the other hand, if the parameters evolve on a slower timescale as compared to the stochastic approximation algorithm, they remain constant in the fast timescale and their variation can be asymptotically ignored. The performance analysis in this chapter considers the non-trivial and more challenging case where the random evolution of the non-cooperative game model occurs on the same timescale as the updates of the adaptive learning algorithm. In contrast to the standard treatment of stochastic approximation algorithms, where a limit mean ordinary differential equation (ODE) captures the dynamics of the discrete-time iterates, the limiting dynamical system will be shown to follow a switched differential inclusion in this case.
Differential inclusions are generalizations of ODEs, and arise naturally in game-theoretic learning since the strategies according to which others play are unknown.

The performance analysis proceeds as follows: First, by a combined use of an updated treatment on stochastic approximation [192], and Markov-switched systems [291, 293], we show that the limiting behavior of the discrete-time iterates of the adaptive learning algorithm converges weakly to a system of interconnected differential inclusions modulated by a continuous-time Markov chain. Next, using Lyapunov function methods [197], it is shown that the limiting dynamical system is globally asymptotically stable almost surely, and its global attractors are characterized. Up to this point, we have shown that each agent asymptotically experiences at most ε regret by following the proposed adaptive learning scheme. In the final step, it will be shown that the global behavior emergent from such limit individual dynamics in the game is attracted to the correlated ε-equilibria set when the game evolves on a slower timescale.

2.1.4 Chapter Organization

The rest of this chapter is organized as follows: Section 2.2 introduces a class of regime switching non-cooperative games with neighborhood monitoring after reviewing the standard non-cooperative game theory. It further introduces the two prominent solution concepts and elaborates on their distinct properties. Section 2.3 then presents the proposed adaptive learning procedure and highlights its characteristics. Subsequently, the local and global behavior emerging from individuals following the proposed adaptive learning scheme is characterized in Section 2.4. A step-by-step proof of the main results is summarized in Section 2.5. The case of the slowly regime switching game is analyzed in Section 2.6. Finally, numerical examples are provided in Section 2.7 followed by the closing remarks in Section 2.8. The proofs are relegated to Section 2.9 for clarity of presentation.

2.2 Non-Cooperative Game Theory

This section starts with introducing the standard representation of non-cooperative games in Section 2.2.1. We then proceed to introduce and elaborate on two prominent solution concepts of such games in Section 2.2.2. Finally, Section 2.2.3 presents the regime switching game model.

2.2.1 Strategic Form Non-Cooperative Games

The initial step in performing a non-cooperative game-theoretic analysis is to frame the situation in terms of all conceivable actions of agents and their associated payoffs. The standard representation of a non-cooperative game $\mathcal{G}$, known as normal form or strategic form game [215, 229], is comprised of three elements:

$$\mathcal{G} = \left(\mathcal{K}, \left(\mathcal{A}^k\right)_{k \in \mathcal{K}}, \left(U^k\right)_{k \in \mathcal{K}}\right). \tag{2.1}$$

Each element is described below:

1. Player Set: $\mathcal{K} = \{1, \dots, K\}$. A player essentially models an entity that is entitled to making decisions, and whose decisions affect others’ decisions. Players may be people, organizations, firms, etc., and are indexed by $k \in \mathcal{K}$.

2. Action Set: $\mathcal{A}^k = \{1, \dots, A^k\}$. It denotes the set of actions, also referred to as pure strategies, available to player $k$ at each decision point. A generic action taken by player $k$ is denoted by $a^k$. The actions of players may range from deciding to establish or abolish links with other agents [30] to choosing among different products/technologies [295].

A generic joint action profile of all players is denoted by

$$\mathbf{a} = \left(a^1, \dots, a^K\right) \in \mathcal{A}^{\mathcal{K}}, \quad \text{where} \quad \mathcal{A}^{\mathcal{K}} = \mathcal{A}^1 \times \cdots \times \mathcal{A}^K$$

and $\times$ denotes the Cartesian product. Following the common notation in game theory, one can rearrange the action profile as

$$\mathbf{a} = \left(a^k, \mathbf{a}^{-k}\right), \quad \text{where} \quad \mathbf{a}^{-k} = \left(a^1, \dots, a^{k-1}, a^{k+1}, \dots, a^K\right)$$

denotes the action profile of all players excluding player $k$.

3. Payoff Function: $U^k : \mathcal{A}^{\mathcal{K}} \to \mathbb{R}$. It is a bounded function that determines the payoff to player $k$ as a function of the action profile $\mathbf{a}$ taken by all players. The interpretation of such a payoff is the aggregated rewards and costs associated with the chosen action as the outcome of the interaction. The payoff function can be quite general. For instance, it could reflect reputation or privacy, using the models in [134, 217], or benefits and costs associated with maintaining links in a social network, using the model in [30, 294]. It could also reflect benefits of consumption and the costs of production, download, and upload in content production and sharing over peer-to-peer networks [128, 231], or the capacity available to users in communication networks [168, 194].

In contrast to the pure strategy that specifies a possible decision or action, a mixed strategy is a probability distribution that assigns to each available action in the action set a likelihood of being selected. More precisely, a mixed strategy for player $k$ is of the form

$$\mathbf{p}^k = \left(p^k_1, \dots, p^k_{A^k}\right), \quad \text{where} \quad 0 \le p^k_i \le 1, \quad \text{and} \quad \sum_{i \in \mathcal{A}^k} p^k_i = 1. \tag{2.2}$$

A mixed strategy can be interpreted as how frequently each action is to be played in a repeated game. Note that, even if the set of pure strategies is finite, there are infinitely many mixed strategies. The set of all mixed strategies available to player $k$ forms an $A^k$-simplex

$$\Delta \mathcal{A}^k = \left\{\mathbf{p} \in \mathbb{R}^{A^k} ;\ 0 \le p_i \le 1,\ \sum_{i \in \mathcal{A}^k} p_i = 1\right\}. \tag{2.3}$$

A player uses a mixed strategy only when: (i) she is indifferent between several pure strategies; (ii) keeping the opponent guessing is desirable, i.e., the opponent can benefit from knowing the next move; or (iii) the player needs to experiment in order to learn her optimal strategy. The significance of mixed strategies becomes more clear when existence of Nash equilibrium is discussed in Section 2.2.2.

In framing the game-theoretic analysis, another important issue is the information available to individuals at each decision point. This essentially has bearing on the timing of interactions and is ascertained via answering the following questions:

• Are the interactions repeated?
Repeated games refer to a situation where the same base game, referred to as the stage game, is repeatedly played at successive decision points. For instance, suppliers and buyers make deals repeatedly, nations engage in ongoing trades, bosses try to motivate workers on an ongoing basis, etc. Repeated interaction allows for socially beneficial outcomes, essentially substituting for the ability to make binding agreements. It further allows learning, as the outcome of each interaction conveys information about actions and preferences of other players.

• Do players make decisions simultaneously or sequentially? Simultaneous move games arise when players make decisions simultaneously, without being aware of the choices and preferences of others. For instance, two firms independently decide whether to develop and market a new product, or individuals independently choose which social club to join. In contrast, moving after another player in sequential games gives the advantage of knowing and accounting for the previous players' actions prior to taking an action.

• What does each player realize/observe once the interaction is concluded? Embodied in the answer to this question is the social knowledge of each player. With the prevalence of social networks, individuals may obtain information about the preferences and choices of (some of) the other decision makers in addition to the payoffs that they realize as the outcome of making choices. A particular choice made by a self-interested rational player is understood to be a best-response to his limited perception of the outside world. Observing such choices, other players could adjust their strategies so as to maximize their own payoffs by decreasing those of others.

In this chapter, we concentrate on simultaneous move non-cooperative games that are infinitely repeated.
The simplest model for the information flow and social knowledge of players is one where agents observe the past decisions of all other agents playing the same game. This chapter focuses on a more sophisticated model where agents form social groups within which they share their past experiences.

2.2.2 Solution Concepts: Correlated vs. Nash Equilibrium

Having framed the interactions as a non-cooperative game, one can then make predictions about the behavior of decision makers. Strategic dominance is the most powerful analytical tool to make behavioral predictions. A dominant strategy is one that ensures the highest payoff for an agent irrespective of the action profile of other players. A more precise definition is provided below.

Definition 2.1 (Dominant Strategy). An action $a^k_d \in \mathcal{A}^k$ is a dominant strategy for player $k$ if, for all other actions $a^k \in \mathcal{A}^k - \{a^k_d\}$:
\[
U^k\left(a^k_d, \mathbf{a}^{-k}\right) > U^k\left(a^k, \mathbf{a}^{-k}\right), \quad \forall \mathbf{a}^{-k} \in \mathcal{A}^{-k}. \tag{2.4}
\]

Dominant strategies make game-theoretic analysis relatively easy. They are also powerful from the decision maker's perspective since no prediction of others' behavior is required. However, such strategies do not always exist; hence, one needs to resort to equilibrium notions in order to make behavioral predictions.

The Nash equilibrium [218] and its refinements are without doubt the most well-known game-theoretic equilibrium notions in both the economics and engineering literature. One can classify Nash equilibria into two types:
(i) Pure strategy Nash equilibrium, where all agents are playing pure strategies;
(ii) Mixed strategy Nash equilibrium, where at least one agent is playing a mixed strategy.
The underlying assumption in Nash equilibrium is that players act independently. That is, the probability distribution on the action profiles is a product measure.
More precisely,
\[
p\left(i_1, \cdots, i_K\right) := P\left(a^1 = i_1, \cdots, a^K = i_K\right) = \prod_{k=1}^{K} p^k(i_k) \tag{2.5}
\]
where $p^k(i_k)$ denotes the probability of agent $k$ playing action $i_k$.

A Nash equilibrium is a strategy profile with the property that no single player can benefit by deviating unilaterally to another strategy; see below for precise definitions.

Definition 2.2 (Nash Equilibrium). An action profile $\mathbf{a} = \left(a^k, \mathbf{a}^{-k}\right)$ is a pure strategy Nash equilibrium if
\[
U^k\left(a^k, \mathbf{a}^{-k}\right) \ge U^k\left(\hat{a}^k, \mathbf{a}^{-k}\right), \quad \forall \hat{a}^k \in \mathcal{A}^k,\ \forall k \in \mathcal{K}. \tag{2.6}
\]
A profile of mixed strategies $\mathbf{p} = \left(\mathbf{p}^1, \cdots, \mathbf{p}^K\right)$ forms a mixed strategy Nash equilibrium if, for all $\hat{\mathbf{p}}^k \in \Delta\mathcal{A}^k$ (see (2.3)) and $k \in \mathcal{K}$,
\[
U^k\left(\mathbf{p}^k, \mathbf{p}^{-k}\right) \ge U^k\left(\hat{\mathbf{p}}^k, \mathbf{p}^{-k}\right) \tag{2.7}
\]
where $\mathbf{p}^{-k}$ is the joint mixed strategy of all players excluding player $k$ (see (2.5)), and
\[
U^k\left(\mathbf{p}^k, \mathbf{p}^{-k}\right) = \sum_{a^k, \mathbf{a}^{-k}} p^k(a^k)\, p^{-k}(\mathbf{a}^{-k})\, U^k\left(a^k, \mathbf{a}^{-k}\right).
\]

John Nash proved in his famous paper [218] that every game with a finite set of players and actions has at least one mixed strategy Nash equilibrium. However, Robert Aumann, who introduced the concept of correlated equilibrium, asserts in [25] that Nash equilibrium makes sense only in situations where, for some specified reason, each player knows others' strategies. Evidently, this assumption is rather restrictive and, more importantly, is rarely true in any strategic interactive situation. Aumann further argues in [25] that, from the Bayesian rationality viewpoint, the appropriate equilibrium notion is not the ordinary mixed strategy Nash equilibrium, but the more general notion of correlated equilibrium. This, indeed, is the very reason why correlated equilibrium [25] is adopted as the solution concept in this chapter.

Aumann also noted in [24] that every mixed strategy Nash equilibrium induces a correlated equilibrium. This also implies that for every game there are correlated equilibria. However, Hart and Schmeidler noted in [146] that correlated equilibria are defined by a finite set of linear inequalities, and showed (without using Nash's result) that the feasible set induced by these inequalities is non-empty for all games.
Their result also deals with games with an infinite number of players, for which there is no Nash equilibrium in general.

Correlated equilibrium is a generalization of the well-known Nash equilibrium and describes a condition of competitive optimality. It can most naturally be viewed as imposing equilibrium conditions on the joint action profile of agents (rather than on individual actions as in Nash equilibrium). An intuitive interpretation of correlated equilibrium is coordination in decision-making, as described by the following example.

Example 2.1. Consider a traffic game where two drivers meet at an intersection and have to decide whether to "go" or "wait". If they simultaneously decide to go, they will crash and both incur large losses. If one waits and the other one goes, the one that passes the intersection receives positive utility, whereas the one who waits receives zero utility. Finally, if they both decide to wait, they both will receive small negative utilities due to wasting their time. The natural solution is the traffic light. In game-theoretic terminology, the traffic light serves as a fair randomizing device that recommends actions to the agents: it signals one of the cars to go and the other one to wait. It is fair in the sense that at different times it makes different recommendations.

In light of the above example, suppose that a mediator is observing a non-cooperative game being repeatedly played among a number of agents. At each round of the game, the mediator gives each agent a private recommendation as to what action to choose. The recommendations are correlated as the mediator draws them from a joint probability distribution on the action profile of all agents; however, each agent is only given recommendations about her own action. Each agent can freely interpret the recommendations and decide whether to follow them.
A correlated equilibrium results if no agent wants to deviate from the provided recommendation. That is, in correlated equilibrium, agents' decisions are coordinated as if there exists a global coordinating device that all agents trust and follow.

Let us now formally define the set of correlated equilibria for the game $\mathcal{G}$, defined in Section 2.2.1.

Definition 2.3 (Correlated Equilibrium). Let $\mathbf{p}$ denote a probability distribution on the space of action profiles $\mathcal{A}^{\mathcal{K}}$, i.e.,
\[
0 \le p(\mathbf{a}) \le 1,\ \forall \mathbf{a} \in \mathcal{A}^{\mathcal{K}}, \quad \text{and} \quad \sum_{\mathbf{a}} p(\mathbf{a}) = 1. \tag{2.8}
\]
The set of correlated $\epsilon$-equilibria is the convex polytope
\[
\mathcal{C}_\epsilon = \left\{\mathbf{p} : \sum_{\mathbf{a}^{-k}} p^k\left(i, \mathbf{a}^{-k}\right)\left[U^k\left(j, \mathbf{a}^{-k}\right) - U^k\left(i, \mathbf{a}^{-k}\right)\right] \le \epsilon,\ \forall i, j \in \mathcal{A}^k,\ k \in \mathcal{K}\right\} \tag{2.9}
\]
where $p^k\left(i, \mathbf{a}^{-k}\right)$ denotes the probability of player $k$ playing action $i$ and the rest playing $\mathbf{a}^{-k}$. If $\epsilon = 0$, the convex polytope represents the set of correlated equilibria, and is denoted by $\mathcal{C}$.

Correlated equilibrium is viewed as the result of Bayesian rationality [24, 25]. Some advantages that make it even more appealing include:

• Realistic. Correlated equilibrium is realistic in multi-agent learning. Indeed, Hart and Mas-Colell observe in [143] that for most simple adaptive procedures, ". . . there is a natural coordination device: the common history, observed by all players. It is thus reasonable to expect that, at the end, independence among players will not obtain." In contrast, Nash equilibrium assumes agents act independently, which is rarely true in any learning scenario.

• Structural Simplicity. The correlated equilibria set constitutes a compact convex polyhedron, whereas the Nash equilibria are isolated points at the extrema of this set [219].

• Computational Simplicity. Computing a correlated equilibrium requires solving a linear feasibility problem (a linear program with null objective function), which can be done in polynomial time, whereas computing a Nash equilibrium requires finding fixed points.

• Payoff Gains.
The coordination among agents in the correlated equilibrium can lead to potentially higher payoffs than if agents take their actions independently (as required by Nash equilibrium) [25].

• Learning. There is no natural process that is known to converge to a Nash equilibrium in a general non-cooperative game that is not essentially equivalent to exhaustive search. There exist, however, simple procedures that do converge to correlated equilibria (the so-called law of conservation of coordination [144]) under almost no structural assumption on the game model, e.g., regret-matching [141].

Existence of a centralized coordinating device neglects the distributed essence of many applications of practical interest. Limited information at each agent (about the strategies of others) further complicates the process of computing correlated equilibrium. In fact, even if agents could compute correlated equilibria, they would need a mechanism that facilitates coordinating on the same equilibrium state in the presence of multiple equilibria, each describing, for instance, a stable coordinated behavior of manufacturers on targeting influential nodes in the competitive diffusion process [272]. This highlights the significance of adaptive learning algorithms that, through repeated interactive play and simple strategy adjustments by agents, ensure reaching correlated equilibrium. The most well-known of such algorithms, fictitious play, was first introduced in 1951 [241], and is extensively treated in [106]. It, however, requires monitoring the behavior of all other agents, which contradicts the information exchange structure in real-life applications.

The focus of this chapter is on the more recent regret-matching learning algorithms [39, 56, 141, 143]. We use tools from stochastic approximation [192] to adapt these algorithms to the information exchange structure in social networks, and from Markov-switched systems [291, 293] to track time-varying equilibria under random evolution of the parameters underlying the environment, e.g., market trends in adopting new technologies.

Figure 2.1 illustrates how the various notions of equilibrium are related in terms of their relative size and inclusion in other equilibria sets. As discussed earlier in this subsection, dominant strategies and pure strategy Nash equilibria do not always exist, the game of "Matching Pennies" being a simple example. Every finite game, however, has at least one mixed strategy Nash equilibrium. Therefore, the "nonexistence critique" does not apply to any notion that generalizes the mixed strategy Nash equilibrium in Figure 2.1. A Hannan consistent strategy (also known as a universally consistent strategy [105]) is one that ensures, no matter what other players do, that the player's average payoff is asymptotically no worse than if she were to play any constant strategy in all previous periods. Hannan consistent strategies guarantee no asymptotic external regrets and lead to the so-called coarse correlated equilibrium [216] notion that generalizes Aumann's correlated equilibrium.

2.2.3 Regime Switching Non-Cooperative Games

In applications such as cognitive networks [122, 185, 210], the non-cooperative game model formulated in Section 2.2.1 may evolve with time due to: (i) agents joining or leaving the game, and (ii) changes in agents' incentives (payoffs). Suppose that all time-varying parameters are finite-state and are collected into a vector. For simplicity, we index the possible values of this vector by $\mathcal{Q} = \{1, \ldots, \Theta\}$ and work with the indices instead. Evolution of the underlying parameters can then be captured by an integer-valued process $\theta[n] \in \mathcal{Q}$. It is assumed that $\theta[n]$, $n = 1, 2, \ldots$, is unobservable with unknown dynamics.

For better exposition, let us assume that the payoff functions of players vary with the process $\theta[n]$ in the rest of this chapter.
More precisely, the payoff function for each agent $k$ takes the form
\[
U^k\left(a^k, \mathbf{a}^{-k}, \theta[n]\right) : \mathcal{A}^k \times \mathcal{A}^{-k} \times \mathcal{Q} \to \mathbb{R}. \tag{2.10}
\]
The main assumption here is that agents do not observe $\theta[n]$, nor are they aware of its dynamics. Agents, however, may realize the time-dependency of the payoff functions, as taking the same action profile at different time instants may result in different payoffs.

As the payoff functions evolve with time following the jumps of $\theta[n]$, so does the correlated equilibria set, as illustrated in Figure 2.2. Definition 2.3 thus has to be modified in such a way that the dependence of the payoff function on the underlying time-varying parameter is properly captured.

Definition 2.4 (Regime-Switching Correlated Equilibrium). Let $\mathbf{p}$ denote a probability distribution on the space of action profiles $\mathcal{A}^{\mathcal{K}}$. The set of correlated $\epsilon$-equilibria of the regime-switching game is the convex polytope
\[
\mathcal{C}_\epsilon(\theta[n]) = \left\{\mathbf{p} : \sum_{\mathbf{a}^{-k}} p^k\left(i, \mathbf{a}^{-k}\right)\left[U^k\left(j, \mathbf{a}^{-k}, \theta[n]\right) - U^k\left(i, \mathbf{a}^{-k}, \theta[n]\right)\right] \le \epsilon,\ \forall i, j \in \mathcal{A}^k,\ \forall k \in \mathcal{K}\right\} \tag{2.11}
\]
where $p^k\left(i, \mathbf{a}^{-k}\right)$ denotes the probability of agent $k$ picking action $i$ and the rest playing $\mathbf{a}^{-k}$. If $\epsilon = 0$, the convex polytope represents the regime-switching set of correlated equilibria, and is denoted by $\mathcal{C}(\theta[n])$.

In the rest of this chapter, the theory of stochastic approximation [40, 192] and Markov-switched systems [291, 293] will be used to devise an adaptive learning algorithm based on the regret-matching procedures in [141–143] that ensures reaching and tracking such non-stationary equilibria.

2.3 Regret-Based Local Adaptation and Learning

This section presents an adaptive decision-making procedure that, if locally followed by every individual decision maker, leads to a sophisticated rational global behavior. The global behavior captures the decision strategies of all individuals taking part in the decision-making process, and will be defined precisely in the next section.
This procedure is viewed as particularly simple and intuitive as no sophisticated updating, prediction, or fully rational behavior is essential on the part of agents. The procedure can be simply described as follows: At each period, an agent may either continue with the same action as the previous period, or switch to other actions, with probabilities that are proportional to a "regret measure" [145]. Willingness to proceed with the previous decision mimics the "inertia" that exists in human decision-making.

In what follows, we define a simple and intuitive regret measure that allows the decision strategies to track the time variations underlying the game. Subsequently, an adaptive learning algorithm is presented that uses the theory of stochastic approximation to prescribe how individuals should update their regrets and, therefore, their decision strategies. The presented algorithm is largely dependent on the communication among agents and, hence, their level of "social knowledge," which can naturally be captured by a connectivity graph. Below, the connectivity graph and the associated neighborhood structure are formally defined.

Definition 2.5. The connectivity graph is a simple¹ graph $G = (\mathcal{E}, \mathcal{K})$, where agents form vertices of the graph, and
\[
(k, l) \in \mathcal{E} \iff \text{agents } k \text{ and } l \text{ exchange decisions, } a^k[n] \text{ and } a^l[n], \text{ respectively, at the end of period } n.
\]
The open and closed neighborhoods of each agent $k$ are, respectively, defined by
\[
N^k := \left\{l \in \mathcal{K};\ (k, l) \in \mathcal{E}\right\}, \quad \text{and} \quad N^k_c := N^k \cup \{k\}. \tag{2.12}
\]

¹ A simple graph, also called a strict graph [271], is an unweighted, undirected graph containing no self loops or multiple edges [124, p. 2].

[Figure 2.1: nested sets labeled Dominant Strategies, Pure Strategy Nash Equilibria, Mixed Strategy Nash Equilibria, Correlated Equilibria, Hannan Consistent Strategies.]
Figure 2.1: Equilibrium notions in non-cooperative games. Enlarging the equilibria set weakens the behavioral sophistication required on the player's part in order to reach equilibrium through repeated plays of the game.

[Figure 2.2: polytopes drawn along a time axis, labeled $\theta[n] = 3$, $\theta[n] = 1$, $\theta[n] = 2$.]
Figure 2.2: Polytope of correlated equilibria $\mathcal{C}(\theta[n])$ randomly evolving with time according to an unobserved process $\theta[n]$. Tracking such regime switching equilibria is attainable by each agent individually following the regret-based adaptive learning algorithm presented in Section 2.3 if the correlated equilibria set evolves on a timescale that is slower than that determined by the adaptation rate of the adaptive learning algorithm.

The above communication protocol allows agents to exchange information only with neighbors (e.g., subscribers of the same social group), and will be referred to as neighborhood monitoring [123]. This generalizes the perfect monitoring assumption, standard in the game theory literature [142]. Perfect monitoring constitutes the highest level of social knowledge, and can be described as follows: Once a decision is made by an agent, it can be observed by all other agents²,³. More precisely,
\[
\text{the graph } G \text{ is complete}^4 \text{ and } N^k_c = \mathcal{K}. \tag{2.13}
\]
Hart and Mas-Colell proposed an adaptive regret-matching procedure in [141] as a strategy of play in long-run interactions. They proved that, under the perfect monitoring assumption, if every decision maker follows the proposed procedure, their collective behavior converges to the set of correlated equilibria in static non-cooperative games.

2.3.1 Regret-Matching With Neighborhood Monitoring

Since the perfect monitoring assumption is quite restrictive and impractical in real-life applications, a natural extension is to devise a regret-based adaptive learning procedure that relies on the neighborhood monitoring model introduced above. Agents only share their past decisions with their fellow social group members.
They are, in fact, oblivious to the existence of agents other than their neighbors on the connectivity graph $G$ (see Definition 2.5), and are unaware that the outcome of their decisions depends on those of individuals outside their social group. In the rest of this section, we present and elaborate on a regret-matching adaptive learning algorithm in the presence of such social groups.

² Note that sharing actions differs from exchanging the strategies according to which the decisions are made. The latter is more cumbersome and less realistic; however, it conveys information about the payoff functions of opponents and, hence, can lead to faster coordination.

³ In certain situations, the aggregate decisions of all agents affect an individual agent's strategy update, e.g., the proportion of neighbors adopting a new technology, or the proportion of neighboring sensors that have chosen to activate. Therefore, one can assume that this aggregate information is available to each agent instead of individual decisions at the end of each period.

⁴ A simple graph is complete if every pair of distinct vertices is connected by a unique edge [45].

Social groups are characterized by the coalition of agents that perform the same localized task and interact locally. However, not only does the payoff of each agent depend on her local interaction and contribution to carrying out the local tasks, but it is also affected by the inherent interaction with those outside her social group. In particular, the payoff that each agent receives as the outcome of her decision comprises two terms:

1. Local payoff, which is associated with the 'local interaction' (e.g., performing a localization task) within social groups;

2. Global payoff, which pertains to the implicit strategic 'global interaction' with agents outside the social group.

More precisely, the payoff function can be expressed as [123]
\[
U^k\left(a^k, \mathbf{a}^{-k}, \theta[n]\right) := U^k_L\left(a^k, \mathbf{a}^{N^k}, \theta[n]\right) + U^k_G\left(a^k, \mathbf{a}^{S^k}, \theta[n]\right) \tag{2.14}
\]
where $\theta[n]$ is the integer-valued process that captures the time variations underlying the game model introduced in Section 2.2.3. Recall the open $N^k$ and closed $N^k_c$ neighborhoods from Definition 2.5, and let $S^k = \mathcal{K} - N^k_c$ denote the set of non-neighbors of agent $k$. In (2.14), the action profile of all decision makers excluding agent $k$, denoted by $\mathbf{a}^{-k}$, has been rearranged and grouped into two vectors $\mathbf{a}^{N^k}$ and $\mathbf{a}^{S^k}$ that, respectively, represent the joint action profiles of the neighbors and non-neighbors of agent $k$.

Now, suppose the game $\mathcal{G}$, defined in Section 2.2.3, with payoff function (2.14) is being repeatedly played over time. At each time $n = 1, 2, \ldots$, each agent $k$ makes a decision $a^k[n]$ according to a probability distribution $\mathbf{p}^k[n]$, and receives a payoff
\[
U^k_n\left(a^k[n]\right) := U^k\left(a^k[n], \mathbf{a}^{-k}[n], \theta[n]\right). \tag{2.15}
\]
The realized payoff is in fact non-stationary, partly due to the non-stationarity inherent in the game model and partly due to the other agents' action profile $\mathbf{a}^{-k}[n]$ changing with time. Now, suppose that agents know their local payoff functions at each time $n$. Since the non-stationarity underlying the game model is unobservable, we define
\[
U^k_{L,n}\left(a^k, \mathbf{a}^{N^k}\right) := U^k_L\left(a^k, \mathbf{a}^{N^k}, \theta[n]\right) \tag{2.16}
\]
and assume that $U^k_{L,n}$ is known to agent $k$ at each time $n$. Therefore, knowing the action profile of neighbors, agents are able to evaluate their local stage payoffs. In contrast, even if agents knew their global payoff functions, they could not directly compute global payoffs as they do not acquire the action profile of agents outside their social groups. It is, however, possible to evaluate the realized global payoffs via
\[
U^k_{G,n}\left(a^k[n]\right) := U^k_n\left(a^k[n]\right) - U^k_{L,n}\left(a^k[n], \mathbf{a}^{N^k}[n]\right) \tag{2.17}
\]
where $U^k_n$ and $U^k_{L,n}$ are defined in (2.15) and (2.16), respectively.

The regret-matching procedure that is adapted to the neighborhood monitoring model then proceeds as follows [123]: Define the local regret matrix $R^{L,k}[n]$, where each element $r^{L,k}_{ij}[n]$ is defined as
\[
\begin{aligned}
r^{L,k}_{ij}[n] ={}& (1-\mu)^{n-1}\left[U^k_{L,1}\left(j, \mathbf{a}^{N^k}[1]\right) - U^k_{L,1}\left(a^k[1], \mathbf{a}^{N^k}[1]\right)\right] I\left(a^k[1] = i\right) \\
&+ \mu \sum_{2 \le \tau \le n} (1-\mu)^{n-\tau}\left[U^k_{L,\tau}\left(j, \mathbf{a}^{N^k}[\tau]\right) - U^k_{L,\tau}\left(a^k[\tau], \mathbf{a}^{N^k}[\tau]\right)\right] I\left(a^k[\tau] = i\right)
\end{aligned} \tag{2.18}
\]
and is updated recursively via
\[
r^{L,k}_{ij}[n] = r^{L,k}_{ij}[n-1] + \mu\left(\left[U^k_{L,n}\left(j, \mathbf{a}^{N^k}[n]\right) - U^k_{L,n}\left(i, \mathbf{a}^{N^k}[n]\right)\right] \cdot I\left(a^k[n] = i\right) - r^{L,k}_{ij}[n-1]\right). \tag{2.19}
\]
In both (2.18) and (2.19), $0 < \mu \ll 1$ is a small parameter that represents the adaptation rate of the revised regret-matching procedure, and $I(X)$ denotes the indicator operator: $I(X) = 1$ if $X$ is true and $0$ otherwise. The second term on the r.h.s. of (2.19) is the instantaneous local regret, as it evaluates the loss in local payoff for not choosing action $j$ instead of the presently picked action $i$. The expression (2.18) has a clear interpretation as the discounted average local regret for not having played action $j$ in exactly those periods when action $i$ was picked. At each time $n$, the losses experienced $m < n$ periods before are weighted by $\mu(1-\mu)^m$. This particular choice of the weighting factor makes it possible to write the recursive updating rule (2.19) in the form of a classical stochastic approximation algorithm, which in turn permits using the powerful ordinary differential equation (ODE) approach [40, 192] for the tracking analysis in Section 2.5.

In contrast, since agent $k$ knows neither her global payoff function nor the decisions of those outside her social group, she is unable to perform the thought experiment to evaluate global payoffs for alternative actions as she does in (2.18). The global regret matrix $R^{G,k}[n]$ is thus defined based only on the realizations of the global payoffs, defined in (2.17).
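Before turning to the global regret, the relation between the discounted sum (2.18) and the recursion (2.19) can be checked numerically. The following is only a minimal sketch for a single agent, not the thesis implementation: the function names, the list-of-lists representation of the regret matrix, and the three-period history are hypothetical, and `local_payoffs[j]` is assumed to hold the local payoff that action j would have earned against the neighbors' observed actions in that period.

```python
def local_regret_step(R, played, local_payoffs, mu):
    """Apply recursion (2.19) once and return the new A x A regret matrix.

    R[i][j]          -- current discounted local regret r_ij
    played           -- action actually taken this period
    local_payoffs[j] -- local payoff action j would have earned against
                        the neighbours' observed actions this period
    """
    A = len(R)
    new_R = []
    for i in range(A):
        row = []
        for j in range(A):
            # Instantaneous regret is non-zero only in the row of the played action.
            inst = local_payoffs[j] - local_payoffs[played] if i == played else 0.0
            row.append(R[i][j] + mu * (inst - R[i][j]))
        new_R.append(row)
    return new_R


def local_regret_closed_form(history, A, mu):
    """Evaluate the discounted sum (2.18); history holds (played, local_payoffs) per period."""
    n = len(history)
    R = [[0.0] * A for _ in range(A)]
    for tau, (played, u) in enumerate(history, start=1):
        # Period 1 carries weight (1-mu)^(n-1); later periods mu*(1-mu)^(n-tau).
        weight = (1 - mu) ** (n - 1) if tau == 1 else mu * (1 - mu) ** (n - tau)
        for j in range(A):
            R[played][j] += weight * (u[j] - u[played])
    return R


def run_recursion(history, A, mu):
    """Start from the period-1 instantaneous regret, then iterate (2.19)."""
    first_played, first_u = history[0]
    R = [[0.0] * A for _ in range(A)]
    for j in range(A):
        R[first_played][j] = first_u[j] - first_u[first_played]
    for played, u in history[1:]:
        R = local_regret_step(R, played, u, mu)
    return R


# Hypothetical three-period history for an agent with three actions.
history = [(0, [1.0, 3.0, 2.0]), (1, [0.0, 1.0, 4.0]), (0, [2.0, 2.0, 5.0])]
recursive = run_recursion(history, A=3, mu=0.1)
closed = local_regret_closed_form(history, A=3, mu=0.1)
```

Under these assumptions the two matrices agree entry by entry, which is the sense in which (2.19) is a recursive form of (2.18). Rows of actions that were not played simply decay by the factor $1-\mu$, which is the exponential forgetting that lets the regrets follow a changing game.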
Each element $r^{G,k}_{ij}[n]$ of the global regret matrix is recursively updated via
\[
r^{G,k}_{ij}[n] = r^{G,k}_{ij}[n-1] + \mu\left[\frac{p^k_i[n]}{p^k_j[n]}\, U^k_{G,n}\left(a^k[n]\right) \cdot I\left(a^k[n] = j\right) - U^k_{G,n}\left(a^k[n]\right) \cdot I\left(a^k[n] = i\right) - r^{G,k}_{ij}[n-1]\right] \tag{2.20}
\]
where $0 < \mu \ll 1$ is the same adaptation rate as in (2.19). Here, the third term on the r.h.s. is similar to the third term on the r.h.s. of (2.19), and is simply the realized global payoff if action $i$ is picked. The second term, roughly speaking, estimates the global payoff if action $j$ replaced action $i$ in the same period⁵. To this end, it only uses realized payoffs of those periods when action $j$ was actually played. The normalization factor $p^k_i[n]/p^k_j[n]$, intuitively speaking, makes the lengths of the respective periods comparable [143]. Each element $r^{G,k}_{ij}[n]$ can be defined non-recursively by
\[
\begin{aligned}
r^{G,k}_{ij}[n] ={}& (1-\mu)^{n-1}\left[\frac{p^k_i[1]}{p^k_j[1]}\, U^k_{G,1}\left(a^k[1]\right) \cdot I\left(a^k[1] = j\right) - U^k_{G,1}\left(a^k[1]\right) \cdot I\left(a^k[1] = i\right)\right] \\
&+ \mu \sum_{2 \le \tau \le n} (1-\mu)^{n-\tau}\left[\frac{p^k_i[\tau]}{p^k_j[\tau]}\, U^k_{G,\tau}\left(a^k[\tau]\right) \cdot I\left(a^k[\tau] = j\right) - U^k_{G,\tau}\left(a^k[\tau]\right) \cdot I\left(a^k[\tau] = i\right)\right].
\end{aligned} \tag{2.21}
\]
Here, $p^k_i[\tau]$ denotes the probability of choosing action $i$ by agent $k$ at period $\tau$. In view of the above discussion, $r^{G,k}_{ij}[n]$ can be interpreted as the discounted average global regret for not having picked action $j$, every time action $i$ was picked in the past.

At each time $n$, the agent makes a decision $a^k[n]$ according to a decision strategy
\[
\mathbf{p}^k[n] = \left(p^k_1[n], \cdots, p^k_{A^k}[n]\right)
\]
which combines the local and global regrets as follows:
\[
p^k_i[n] = P\left(a^k[n] = i \,\middle|\, a^k[n-1] = j\right) =
\begin{cases}
(1-\delta)\min\left\{\dfrac{1}{\rho^k}\left|r^{L,k}_{ji}[n] + r^{G,k}_{ji}[n]\right|^+,\ \dfrac{1}{A^k}\right\} + \dfrac{\delta}{A^k}, & i \ne j, \\[2ex]
1 - \sum_{\ell \ne i} p^k_\ell[n], & i = j.
\end{cases} \tag{2.22}
\]
Here, $|x|^+ = \max\{0, x\}$, $\rho^k > 0$ is a large enough number to ensure $\mathbf{p}^k[n]$ is a valid probability distribution⁶, and $A^k$ represents the cardinality of agent $k$'s action space. Further, $0 < \delta < 1$ is the parameter which determines the relative weight of exploration versus exploitation. Note in (2.22) that $j$ represents the decision made in period $n-1$ and, therefore, is known at period $n$.
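The global regret recursion (2.20) and the decision strategy (2.22) can be sketched in a few lines. This is a rough single-agent illustration under stated assumptions, not the thesis code: the function names and the list-of-lists regret matrices are hypothetical, actions are indexed from 0, and the numerical values of the regrets, $\rho^k$, $\delta$, $\mu$ and the realized global payoff are made up for the example.

```python
import random


def decision_strategy(R_local, R_global, prev, rho, delta):
    """Strategy (2.22): switching probabilities given last period's action `prev`."""
    A = len(R_local)
    p = [0.0] * A
    for i in range(A):
        if i != prev:
            # Positive part of the combined regret for having played prev instead of i.
            regret = max(0.0, R_local[prev][i] + R_global[prev][i])
            p[i] = (1 - delta) * min(regret / rho, 1.0 / A) + delta / A
    # Inertia: whatever mass is left stays on the previous action.
    p[prev] = 1.0 - sum(p)
    return p


def global_regret_step(R, p, played, global_payoff, mu):
    """Recursion (2.20), driven only by the realized global payoff."""
    A = len(R)
    new_R = []
    for i in range(A):
        row = []
        for j in range(A):
            est = 0.0
            if played == j:
                # Importance-weighted estimate of the payoff j would have given.
                est += (p[i] / p[j]) * global_payoff
            if played == i:
                est -= global_payoff
            row.append(R[i][j] + mu * (est - R[i][j]))
        new_R.append(row)
    return new_R


# Hypothetical example with three actions; the agent played action 0 last period.
zeros = [[0.0] * 3 for _ in range(3)]
R_local = [[0.0, 2.0, -1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
p = decision_strategy(R_local, zeros, prev=0, rho=10.0, delta=0.1)
action = random.choices(range(3), weights=p)[0]
R_global = global_regret_step(zeros, p, played=action, global_payoff=1.5, mu=0.05)
```

Because each off-diagonal probability is capped at $(1-\delta)/A^k + \delta/A^k = 1/A^k$, the mass left on the previous action is at least $1/A^k$, and the exploration term $\delta/A^k$ keeps every $p^k_j[n]$ strictly positive, so the ratio in (2.20) is well defined in this sketch.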
The regret-matching procedure is illustrated in a block diagram in Figure 2.3.

Intuitively speaking, the regret-matching procedure governs the decision-making process by propensities to depart from the current decision. It is natural to postulate that, if a decision maker decides to switch her action, it should be to actions that are perceived as being more rewarding. The regret-matching procedure assigns positive probabilities to all such better choices. In particular, the more rewarding an alternative action seems, the higher will be the probability of choosing it next time. The probabilities of switching to different actions are thus proportional to their regrets relative to the current action, hence the name 'regret-matching'.

⁵ Strictly speaking, the limit sets of the process (2.19) and the true global regret updates if agents were able to observe the action profile of non-neighboring agents coincide. The details are discussed in Section 2.5.1. Alternatively, one can show that the conditional expectation of the difference between the global regret estimates and the true global regrets is zero; see [143] for further details.

⁶ It suffices to let $\rho^k > A^k\left|U^k_{\max} - U^k_{\min}\right|$, where $U^k_{\max}$ and $U^k_{\min}$ represent the upper and lower bounds on the payoff function $U^k$, respectively, and $A^k$ denotes the cardinality of the action space for agent $k$.

Let $\mathbf{1}_S$ denote an $S \times 1$ vector of ones. The regret-matching procedure with neighborhood monitoring is summarized below in Algorithm 2.1. This procedure is to be executed independently by individual agents, and is expressed for a generic agent $k$ in Algorithm 2.1.

2.3.2 Discussion and Intuition

This subsection elaborates on the distinct properties of the regret-matching procedure with neighborhood monitoring.

Decision Strategy. The randomized strategy (2.22) is simply a weighted average of two probability vectors: The first term, with weight $1 - \delta$, is proportional to the positive part of the combined regrets. Taking the minimum with $1/A^k$ guarantees that $\mathbf{p}^k[n]$ is a valid probability distribution, i.e., $\sum_i p^k_i[n] = 1$. The second term, with weight $\delta$, is just a uniform distribution over the action space [123, 143].

Adaptive Behavior. In (2.24) and (2.25), $\mu$ essentially introduces an exponential forgetting of the regrets experienced in the past and facilitates adaptivity to the evolution of the non-cooperative game model over time. As agents successively take actions, the effect of old experiences on their current decisions vanishes. As will be shown later in Section 2.4, this enables tracking time variations on a timescale that is as fast as the adaptation rate of Algorithm 2.1.

Inertia. The choice of $\rho^k$ guarantees that there is always a positive probability of playing the same action as in the last period. Therefore, $\rho^k$ can be viewed as an "inertia" parameter: A higher $\rho^k$ yields switching with lower probabilities. It plays a significant role in breaking away from bad cycles. Clearly, the speed of convergence to the correlated equilibria set will be closely related to this inertia parameter. This indeed is the very reason that makes convergence to the correlated equilibria set possible under (almost) no structural assumptions on the underlying game [141].

Better-response vs. Best-response. In light of the above discussion, the most distinctive feature of the regret-matching procedure, which differentiates it from other works such as [93, 106, 107], is that it implements a better-response rather than a best-response strategy⁷. The inertia assigns positive probabilities to any action that is just better.
Indeed, the behavior of a regret-matching decision maker is very far from that of a rational decision maker who makes optimal decisions given his (more or less well-formed) beliefs about the environment. Instead, it resembles the model of a reflex-oriented individual that reinforces decisions with "pleasurable" consequences [143].

Footnote 7: This has the additional effect of making the behavior continuous, without need for approximations [141].

[Figure 2.3: block diagram with the blocks Decision Strategy Update, Decision, Outcome, Interactive Environment, Information Exchange, Observation / Data Acquisition, Regret Update, and Social Sensor.]
Figure 2.3: Regret-based local adaptation and learning. Individuals update regrets after observing the outcome of the previous decisions and, accordingly, make next decisions. These new decisions will then be shared within social groups characterized by the connectivity graph.

Algorithm 2.1 Local Adaptation and Learning for Decision Makers Forming Social Groups

Initialization: Set the tunable exploration factor $0 < \delta < 1$, and inertia parameter $\rho^k > A^k \left| U^k_{\max} - U^k_{\min} \right|$, where $U^k_{\max}$ and $U^k_{\min}$ denote the upper and lower bounds on agent $k$'s payoff function $U^k$, respectively. Set $0 < \mu \ll 1$, and initialize
$$ p^k[0] = \frac{1}{A^k} \cdot \mathbf{1}_{A^k}, \quad \text{and} \quad R^{L,k}[0] = R^{G,k}[0] = 0. $$

Step 1: Action Selection. Choose $a^k[n] \sim p^k[n] = \left( p^k_1[n], \cdots, p^k_{A^k}[n] \right)$, where
$$ p^k_i[n] = \begin{cases} (1-\delta) \min\left\{ \frac{1}{\rho^k} \left| r^{L,k}_{ji}[n] + r^{G,k}_{ji}[n] \right|^+, \frac{1}{A^k} \right\} + \frac{\delta}{A^k}, & a^k[n-1] = j \neq i \\ 1 - \sum_{j \neq i} p^k_j[n], & a^k[n-1] = i \end{cases} \tag{2.23} $$
where $|x|^+ = \max\{0, x\}$.

Step 2: Local Information Exchange. Share decisions $a^k[n]$ with neighbors $N^k$.

Step 3: Local Regret Update.
$$ R^{L,k}[n+1] = R^{L,k}[n] + \mu \left[ B^{L,k}\left( a^k[n], a^{N^k}[n], n \right) - R^{L,k}[n] \right] \tag{2.24} $$
$B^{L,k} = \left[ b^{L,k}_{ij} \right]$, where
$$ b^{L,k}_{ij}\left( a^k[n], a^{N^k}[n], n \right) = \left[ U^k_{L,n}\left( j, a^{N^k} \right) - U^k_{L,n}\left( a^k[n], a^{N^k} \right) \right] I\left( a^k[n] = i \right). $$

Step 4: Global Regret Update.
$$ R^{G,k}[n+1] = R^{G,k}[n] + \mu \left[ B^{G,k}\left( a[n], n \right) - R^{G,k}[n] \right] \tag{2.25} $$
$B^{G,k} = \left[ b^{G,k}_{ij} \right]$, where
$$ b^{G,k}_{ij}\left( a[n], n \right) = \left[ \frac{p^k_i[n]}{p^k_j[n]} U^k_{G,n}\left( a^k[n] \right) I\left( a^k[n] = j \right) - U^k_{G,n}\left( a^k[n] \right) I\left( a^k[n] = i \right) \right]. $$

Recursion. Set $n \leftarrow n + 1$, and go to Step 1.

[Figure 2.4: diagram with the panels Regret-Matching Decision-Making Process and Information Exchange Within Social Group; labels: Adaptive Filter, Realized Payoff, Social Groups' Decisions.]
Figure 2.4: Game-theoretic learning with neighborhood monitoring. Agent 5 shares her decision with her social group members $N^5 = \{1, 2, 4\}$. The exchanged decisions, together with her realized payoff, are then fed into a regret-matching adaptive filter to prescribe the next action.

Computational Complexity: The computational burden (in terms of calculations per iteration) of the regret-matching procedure is small. It does not grow with the number of agents and is, hence, scalable. At each iteration, each agent needs to execute two multiplications, two additions, one comparison and two table lookups (assuming random numbers are stored in a table) to calculate the next decision. It is therefore suitable for implementation in many applications of practical interest such as unattended ground sensor networks, where individual agents possess limited local computational capability.

Decentralized Adaptive Filtering: As will be shown later in Section 2.4, if every agent follows Algorithm 2.1, the global behavior tracks the set of correlated equilibria. Theoretically, finding a correlated equilibrium in a game is equivalent to solving a linear feasibility problem; see (2.9). Algorithm 2.1 is thus an adaptive filtering algorithm for solving this problem, which is quite interesting due to its decentralized nature.

Reinforcement Learning of Isolated Agents. Assuming that no player is willing to form a social group, the above revised regret-matching procedure simply implements a reinforcement learning scheme that only requires the realized payoff to prescribe the next decision.
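To make the payoff-only update concrete, the following is a minimal Python sketch of one step of the global regret recursion (2.25), which uses nothing but the realized payoff and the current strategy. The function name `update_global_regret`, the nested-list matrix representation, and the sample numbers are assumptions introduced for illustration; the sketch also assumes every entry of `probs` is strictly positive, which the exploration factor $\delta > 0$ is meant to ensure.

```python
from typing import List

Matrix = List[List[float]]


def update_global_regret(regret: Matrix, played: int, payoff: float,
                         probs: List[float], mu: float) -> Matrix:
    """One constant-step-size update of the global regret matrix.

    regret[i][j] tracks the regret for not having played j whenever i was
    played. Only the realized payoff of the played action is used.
    """
    n = len(regret)
    updated = [row[:] for row in regret]
    for i in range(n):
        for j in range(n):
            b_ij = 0.0
            if played == j:
                # Importance-weighted estimate of the payoff of action j.
                b_ij += probs[i] / probs[j] * payoff
            if played == i:
                b_ij -= payoff
            # Exponential forgetting of past regrets with rate mu.
            updated[i][j] = regret[i][j] + mu * (b_ij - regret[i][j])
    return updated


# Illustrative numbers only: two actions, action 0 played with payoff 1.
zero = [[0.0, 0.0], [0.0, 0.0]]
updated = update_global_regret(zero, played=0, payoff=1.0,
                               probs=[0.5, 0.5], mu=0.1)
```

The local update (2.24) has the same exponential-forgetting structure; it differs only in that the instantaneous regret is computed from the local payoff function and the observed decisions of neighbors.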
The probability of choosing a particular action $i$ at time $n + 1$ gets "reinforced" by picking $i$ at time $n$, while the probabilities of other actions $j \neq i$ decrease. In fact, the higher the payoff realized as the consequence of choosing $i$ at time $n$, the greater will be this reinforcement [143].

Exploration vs. Exploitation. The second term, with weight $\delta$, simply forces every action to be played with some minimal frequency (strictly speaking, with probability $\delta/A^k$). The exploration factor $\delta$ is essential to be able to estimate the contingent payoffs using only the realized payoffs; it can, as well, be interpreted as exogenous statistical "noise." As will be discussed later, larger $\delta$ will lead to the convergence of the global behavior to a larger $\epsilon$-distance of the correlated equilibria set.

Markov Chain Construction. The sequence $\left\{ a^k[n], R^{L,k}[n], R^{G,k}[n] \right\}$ is a Markov chain with state space $\mathcal{A}^k$, and transition probability matrix
$$ \Pr\left( a^k[n] = j \,\middle|\, a^k[n-1] = i \right) = P_{ij}\left( R^{L,k}[n], R^{G,k}[n] \right). $$
The transition matrix of the Markov chain thus depends on the local and global regret matrices. It is continuous in both variables $R^{L,k}$ and $R^{G,k}$. It is also primitive and aperiodic for each pair $(R^{L,k}, R^{G,k})$. The primitiveness is essentially due to the exploration present in the decision strategy (2.23): at each time, a small probability of $\delta/A^k$ is placed on picking each action in the action set. It can be easily verified that $P^m_{ij}(R^{L,k}, R^{G,k}) \geq \delta^m / A^k$ for all $i, j \in \mathcal{A}^k$ and $m \geq 1$ in the transition matrix (2.26). The aperiodicity is further due to the inertia existent in the decision strategy. The parameter $\rho^k$ in (2.23) ensures that the sum of non-diagonal elements of the transition matrix remains strictly less than one at each time; therefore, $P^m_{ii}(R^{L,k}, R^{G,k}) > 0$ for all $m \geq 1$.

The above Markov chain is (conditionally) independent of the action profile of neighbors and non-neighbors; however, these profiles may be correlated.
More precisely, let
$$ h[n] = \left\{ a[\tau] \right\}_{\tau=1}^{n}, \quad \text{where } a[\tau] = \left( a^k[\tau], a^{N^k}[\tau], a^{S^k}[\tau] \right) $$
denote the history of decisions made by all agents up to time $n$. Then,
$$ \begin{aligned} & P\left( a^k[n] = i, a^{N^k}[n] = \mathbf{a}, a^{S^k}[n] = \mathbf{a}' \,\middle|\, h[n-1] \right) \\ &\quad = P\left( a^k[n] = i \,\middle|\, h[n-1] \right) P\left( a^{N^k}[n] = \mathbf{a}, a^{S^k}[n] = \mathbf{a}' \,\middle|\, h[n-1] \right) \\ &\quad = P_{a^k[n-1]\, i}\left( R^{L,k}[n], R^{G,k}[n] \right) P\left( a^{N^k}[n] = \mathbf{a}, a^{S^k}[n] = \mathbf{a}' \,\middle|\, h[n-1] \right). \end{aligned} \tag{2.26} $$
The sample path of this Markov chain $\{a^k[n]\}$ is fed back into the stochastic approximation algorithms that update $R^{L,k}[n]$ and $R^{G,k}[n]$, which in turn affect its transition matrix. This interpretation will be used in Section 2.9 when deriving the limit dynamical system representing the behavior of Algorithm 2.1.

Noise Corrupted Payoff Observations. If an agent's observations of realized payoffs are perturbed by noise, the learning procedure summarized in Algorithm 2.1 still yields the convergence properties that will be discussed next in Section 2.4 so long as the observations satisfy a weak law of large numbers. More precisely, let $\{\widehat{U}^k_n(a, \theta)\}$ denote the sequence of noisy observations of agent $k$'s payoff $U^k(a, \theta)$ for a particular joint action profile $a$ when $\theta[n] = \theta$ is held fixed. (Recall from Section 2.2.1 that $a = (a^k, a^{-k})$.) Then, the convergence results of Section 2.4 remain valid if, for all $a \in \mathcal{A}^{\mathcal{K}}$ and $\theta \in \mathcal{Q}$,
$$ \frac{1}{N} \sum_{n=\ell}^{N+\ell-1} E\left\{ \widehat{U}^k_n(a, \theta) \right\} \to U^k(a, \theta) \quad \text{in probability as } N \to \infty. $$

2.4 Emergence of Rational Global Behavior

This section characterizes the global behavior emerging from individuals following the local adaptation and learning algorithm presented in Section 2.3. Section 2.4.1 provides a formal definition for the global behavior of agents. Section 2.4.2 then postulates a hyper-model for the random unpredictable changes that the game model undergoes.
Finally, Section 2.4.3 presents the main theorem of this chapter, which shows that the local and global behavior manifested as the consequence of following the regret-matching algorithm with neighborhood monitoring can properly track the regime switching in the game model.

2.4.1 Global Behavior

The global behavior $z[n]$ at each time $n$ is defined as the discounted empirical frequency of joint play of all agents up to period $n$. Formally,
$$ z[n] = (1-\mu)^{n-1} e_{a[1]} + \mu \sum_{2 \leq \tau \leq n} (1-\mu)^{n-\tau} e_{a[\tau]} \tag{2.27} $$
where $e_{a[\tau]}$ is a unit vector on the space of all possible action profiles $\mathcal{A}^{\mathcal{K}}$ with the element corresponding to the joint play $a[\tau]$ being equal to one. The small parameter $0 < \mu \ll 1$ is the same as the adaptation rate of the regret-matching algorithm in Section 2.3. It introduces an exponential forgetting of the past decision profiles to enable adaptivity to the evolution of the game model. That is, the effect of the old game model on the decisions of individuals vanishes as agents repeatedly take actions. Given $z[n]$, the average payoff accrued by each agent can be straightforwardly evaluated, hence the name global behavior. It is more convenient to define $z[n]$ via the stochastic approximation recursion
$$ z[n] = z[n-1] + \mu \left[ e_{a[n]} - z[n-1] \right]. \tag{2.28} $$

The global behavior $z[n]$ is a system "diagnostic" and is only used for the analysis of the emergent collective behavior of agents. That is, it does not need to be computed by individual agents. In real-life applications such as smart sensor networks, however, a network controller can monitor $z[n]$, and use it to adjust agents' payoff functions to achieve the desired global behavior [210].

2.4.2 Hyper-Model for Regime Switching in the Game Model

A typical method for analyzing the performance of an adaptive algorithm is to postulate a hyper-model for the underlying time variations [40]. Here, we assume that the dynamics of $\theta[n]$ in (2.10) follow a discrete-time Markov chain with infrequent jumps.
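As a side check on the definition of the global behavior in Section 2.4.1, the closed form (2.27) and the recursion (2.28) can be compared numerically. The Python sketch below does so under illustrative assumptions: joint action profiles are encoded as integers $0, \ldots, M-1$, the recursion is started from $z[1] = e_{a[1]}$, and the function names and sample sequence are hypothetical.

```python
from typing import List


def global_behavior_recursive(profiles: List[int], n_profiles: int,
                              mu: float) -> List[float]:
    """Run z[n] = z[n-1] + mu * (e_{a[n]} - z[n-1]) starting from z[1] = e_{a[1]}."""
    z = [0.0] * n_profiles
    z[profiles[0]] = 1.0
    for profile in profiles[1:]:
        z = [(1.0 - mu) * value for value in z]
        z[profile] += mu
    return z


def global_behavior_closed_form(profiles: List[int], n_profiles: int,
                                mu: float) -> List[float]:
    """Evaluate the discounted empirical frequency directly from its definition."""
    n = len(profiles)
    z = [0.0] * n_profiles
    z[profiles[0]] += (1.0 - mu) ** (n - 1)
    for tau in range(2, n + 1):
        z[profiles[tau - 1]] += mu * (1.0 - mu) ** (n - tau)
    return z


# Illustrative sequence of joint plays over three possible profiles.
plays = [0, 2, 1, 2, 2]
z_rec = global_behavior_recursive(plays, n_profiles=3, mu=0.1)
z_cf = global_behavior_closed_form(plays, n_profiles=3, mu=0.1)
```

Both computations return the same vector, and its entries sum to one, so $z[n]$ remains a probability distribution over joint action profiles at every step.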
The following assumption formally characterizes the hyper-model.

Assumption 2.1. Let $\{\theta[n]\}$ be a discrete-time Markov chain with finite state space $\mathcal{Q} = \{1, 2, \ldots, \Theta\}$ and transition probability matrix
$$ P^{\varepsilon} := I_{\Theta} + \varepsilon Q \tag{2.29} $$
where $\varepsilon > 0$ is a small parameter, $I_{\Theta}$ denotes the $\Theta \times \Theta$ identity matrix, and $Q = [q_{ij}]$ is irreducible and represents the generator of a continuous-time Markov chain satisfying (see footnote 8)
$$ q_{ij} \geq 0 \text{ for } i \neq j, \qquad |q_{ij}| \leq 1 \text{ for all } i, j \in \mathcal{Q}, \qquad Q \mathbf{1}_{\Theta} = \mathbf{0}_{\Theta}. \tag{2.30} $$

It is important to emphasize that the above Markovian hyper-model is not used in the implementation of Algorithm 2.1 since it cannot be observed by agents. It is only used in this and the next section to analyze how well these algorithms can track regime switching in the game. Choosing $\varepsilon$ small enough in (2.29) ensures that the entries of $P^{\varepsilon}$ are all non-negative. Due to the dominating identity matrix in (2.29), the transition probability matrix is "close" to the identity matrix; hence, $\theta[n]$ varies slowly with time.

Footnote 8: Note that this is no loss of generality. Given an arbitrary non-truly absorbing Markov chain with transition probability matrix $P^{\varrho} = I + \varrho \bar{Q}$, one can form
$$ Q = \frac{1}{\bar{q}_{\max}} \cdot \bar{Q}, \quad \text{where } \bar{q}_{\max} = \max_{i \in \mathcal{M}_s} |\bar{q}_{ii}|. $$
To ensure that the two Markov chains generated by $Q$ and $\bar{Q}$ evolve on the same time scale, $\varepsilon = \varrho \cdot \bar{q}_{\max}$.

The small adaptation rate $\mu$ in Algorithm 2.1 defines how fast the regrets are updated. The impact of the switching rate of the hyper-model $\theta[n]$ on convergence properties of the proposed regret-matching scheme is captured by the relationship between $\mu$ and $\varepsilon$ in the transition probability matrix (2.29). If $0 < \varepsilon \ll \mu$, the game rarely changes and the correlated equilibria set is essentially fixed during the transient interval of the regret-matching algorithm. We thus practically deal with a static game model.
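The slowly switching hyper-model of Assumption 2.1 can be simulated in a few lines. The sketch below is only an illustration: the two-state generator `Q`, the value of $\varepsilon$, the seed, and the function names are assumptions chosen so that the conditions in (2.30) hold and $P^{\varepsilon}$ has non-negative entries.

```python
import random
from typing import List

Matrix = List[List[float]]


def transition_matrix(generator: Matrix, eps: float) -> Matrix:
    """Build P = I + eps * Q from a generator Q whose rows sum to zero."""
    size = len(generator)
    return [[(1.0 if i == j else 0.0) + eps * generator[i][j]
             for j in range(size)] for i in range(size)]


def simulate_regimes(generator: Matrix, eps: float, steps: int,
                     start: int = 0, seed: int = 0) -> List[int]:
    """Sample a path of the regime process theta[n] under P = I + eps * Q."""
    rng = random.Random(seed)
    P = transition_matrix(generator, eps)
    state = start
    path = [state]
    for _ in range(steps):
        state = rng.choices(range(len(P)), weights=P[state])[0]
        path.append(state)
    return path


# Hypothetical two-regime generator satisfying the conditions in (2.30).
Q = [[-1.0, 1.0], [0.5, -0.5]]
P = transition_matrix(Q, eps=0.01)
path = simulate_regimes(Q, eps=0.01, steps=1000)
jumps = sum(1 for prev, cur in zip(path, path[1:]) if prev != cur)
```

With a small $\varepsilon$ the sampled path stays in the same regime for long stretches, which is the "infrequent jumps" behavior the assumption is meant to capture.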
On the other hand, if $\varepsilon \gg \mu$, the game changes too drastically, and there is no chance one can track the regime switching correlated equilibria set via an adaptive stochastic approximation algorithm. In this chapter, we consider the following non-trivial and more challenging case.

Assumption 2.2 (Matching Speed). $\varepsilon = O(\mu)$ in the transition probability matrix $P^{\varepsilon}$ in Assumption 2.1.

The above condition states that the time-varying parameters underlying the game model evolve with time on a scale that is commensurate with that determined by the step-size $\mu$ in Algorithm 2.1. This leads to an asymptotic behavior that is fundamentally different as compared with the case $0 < \varepsilon \ll \mu$, and will be discussed in Section 2.5.

2.4.3 Asymptotic Local and Global Behavior

In what follows, the main theorem of this chapter is presented that reveals both the local and global behavior emerging from each agent individually following the regret-matching procedure with neighborhood monitoring. The regret matrices $R^{L,k}[n]$ and $R^{G,k}[n]$ in Algorithm 2.1 and $z[n]$ will be used as indicators of agents' local and global experience, respectively.

We use the so-called ordinary differential equation (ODE) method, which is perhaps the most powerful and versatile tool in the analysis of stochastic recursive algorithms, in order to characterize the asymptotic behavior of Algorithm 2.1. The ODE method was first introduced by Ljung in [201] and extensively developed in several later works [189, 190, 192]. The basic idea is that, via a 'local' analysis, the noise effects in the stochastic algorithm are averaged out so that the asymptotic behavior is determined by that of a 'mean' ODE or an ordinary differential inclusion [38, 188].
To get the limit mean dynamical system, in lieu of working with the discrete iterates directly, one works with continuous-time interpolations of the iterates. Accordingly, define by
$$ R^{L,k,\mu}(t) = R^{L,k}[n], \quad R^{G,k,\mu}(t) = R^{G,k}[n], \quad z^{\mu}(t) = z[n] \quad \text{for } t \in [n\mu, (n+1)\mu) \tag{2.31} $$
the piecewise constant interpolations of the discrete-time iterates associated with the regret-matching procedure in Section 2.3.1. The main theorem that follows uses these interpolated processes to characterize the asymptotic local and global behavior of Algorithm 2.1.

The following theorem simply asserts that, if an agent individually follows Algorithm 2.1, she will experience regret of at most $\epsilon$ after sufficient repeated plays of the game. Indeed, $\epsilon$ can be made and kept arbitrarily small by properly choosing the exploration parameter $\delta$ in Algorithm 2.1. This theorem further states that if, now, all agents start following Algorithm 2.1 independently, the global behavior tracks the regime switching correlated equilibria set if the random unpredictable regime switching underlying the game occurs sufficiently slowly. Differently put, agents can coordinate their strategies in a distributed fashion so that the distribution of their joint behavior belongs to the correlated equilibria set. From the game-theoretic point of view, we show that non-fully rational local behavior of individuals (due to utilizing a "better-response" rather than a "best-response" strategy) can lead to the manifestation of globally sophisticated and rational behavior.

In the following theorem, with slight abuse of notation, denote by $R^{L,k}[n]$ and $R^{G,k}[n]$ the regret matrices rearranged as vectors of length $(A^k)^2$, rather than $A^k \times A^k$ matrices, and let $R^{L,k,\mu}(\cdot)$ and $R^{G,k,\mu}(\cdot)$ represent the associated interpolated vector processes; see (2.31). Let further $\|\cdot\|$ denote the Euclidean norm, and $\mathbb{R}_-$ represent the negative orthant in the Euclidean space of appropriate dimension.

Theorem 2.1.
Suppose the regime switching game $G$, defined in Section 2.2.3, is being repeatedly played, and Assumptions 2.1 and 2.2 hold. Consider the neighborhood monitoring model in Section 2.3.1. Let $t_{\mu}$ be any sequence of real numbers satisfying $t_{\mu} \to \infty$ as $\mu \to 0$. Then, for each $\epsilon$, there exists an upper bound $\widehat{\delta}(\epsilon)$ on the exploration parameter $\delta$ such that, if every agent follows Algorithm 2.1 with $\delta < \widehat{\delta}(\epsilon)$, as $\mu \to 0$, the following results hold:

(i) Under Assumption 2.2, the combined local and global regrets $R^{L,k,\mu}(\cdot + t_{\mu}) + R^{G,k,\mu}(\cdot + t_{\mu})$ converge weakly to an $\epsilon$-distance of the negative orthant. That is, for any $\beta > 0$,
$$ \lim_{\mu \to 0} P\left( \mathrm{dist}\left[ R^{L,k,\mu}(\cdot + t_{\mu}) + R^{G,k,\mu}(\cdot + t_{\mu}), \mathbb{R}_- \right] - \epsilon > \beta \right) = 0 \tag{2.32} $$
where $\mathrm{dist}[\cdot, \cdot]$ denotes the usual distance function.

(ii) Suppose Assumption 2.2 is replaced with $0 < \varepsilon \ll \mu$. Then, $z^{\mu}(\cdot + t_{\mu})$ converges in probability to the correlated $\epsilon$-equilibria set $\mathcal{C}(\theta)$ for each $\theta \in \mathcal{Q}$ in the sense that
$$ \mathrm{dist}\left[ z^{\mu}(\cdot + t_{\mu}), \mathcal{C}(\theta) \right] = \inf_{z \in \mathcal{C}(\theta)} \left\| z^{\mu}(\cdot + t_{\mu}) - z \right\| \to 0 \quad \text{in probability.} \tag{2.33} $$

Proof. See Sections 2.5 and 2.6 for the detailed proof.

Note in the above theorem that the convergence arguments are to a set rather than a particular point in that set. In fact, once $z[n]$ is inside the regime switching polytope of correlated equilibria $\mathcal{C}(\theta[n])$, it can move around until the hyper-model $\theta[n]$ jumps, which may push $z[n]$ outside the correlated equilibria set of the new game model.

Remark 2.1 (Static Games). In the case of a static game, i.e., when $\theta[n] = \theta$ is held fixed, Algorithm 2.1 can be modified as follows: The constant step-size $\mu$ is replaced with a decreasing step-size $\mu[n] = 1/n$, and the constant exploration parameter $\delta$ is replaced with $1/n^{\alpha}$, where $0 < \alpha < 1/4$. With the above modifications, Algorithm 2.1 reduces to the reinforcement learning procedure proposed in [143] if no agent is willing to form a social group.
It is proved in [143] that, if $E = \emptyset$ in the connectivity graph in Definition 2.5 and all agents follow this algorithm, the global behavior converges almost surely to the set of correlated $\epsilon$-equilibria. Although not being the focus of this chapter, one can in fact use the results of [143] to show that, if all agents follow Algorithm 2.1 with $E \neq \emptyset$ and the above modifications, the global behavior still converges almost surely to the set of correlated $\epsilon$-equilibria. It will be shown in Section 2.7 that allowing information exchange among decision makers forming social groups can lead to an order of magnitude faster convergence as compared to the regret-matching procedure proposed in [143].

Note that the above modifications result in achieving stronger convergence results in the sense that one can now prove almost sure convergence rather than weak convergence. More precisely, there exists with probability one $N(\epsilon) > 0$ such that for $n > N(\epsilon)$, one can find a correlated equilibrium distribution at most at $\epsilon$-distance of $z[n]$ [141]. However, this will be achieved at the price that the agents can no longer respond to time variations in the game model.

Remark 2.2. The main assumption in this chapter is that $\theta[n]$ cannot be observed by decision makers, yet their global behavior can track the time-varying polytope of correlated $\epsilon$-equilibria. Assuming that agents observe $\theta[n]$ leads to the less challenging case: they can form $R^{L,k,\theta}[n]$ and $R^{G,k,\theta}[n]$ and run Algorithm 2.1 with the modifications described above separately for each $\theta \in \mathcal{Q}$. In such settings, in view of Remark 2.1, one can show that agents' collective behavior almost surely converges to $\mathcal{C}(\theta)$ for each $\theta \in \mathcal{Q}$.

Remark 2.3 (Generalized Regret-Matching Decision Strategies). The decision strategy (2.22) can be generalized similar to [139, Section 10.3]. Define a continuously differentiable and nonnegative potential function
$$ V\left( R^{L,k}, R^{G,k} \right) : \mathbb{R}^r \times \mathbb{R}^r \to \mathbb{R}, \quad \text{where } r = A^k \times A^k $$
such that:
1. $V\left( R^{L,k}, R^{G,k} \right) = 0$ if and only if $R^{L,k} + R^{G,k} \in \mathbb{R}^r_-$, where $\mathbb{R}^r_-$ represents the negative orthant in $r$-dimensional Euclidean space;
2. $\nabla V\left( R^{L,k}, R^{G,k} \right) \geq 0$;
3. $\left\langle \nabla V\left( R^{L,k}, R^{G,k} \right), R^{L,k} + R^{G,k} \right\rangle_F > 0$ if $R^{L,k} + R^{G,k} \notin \mathbb{R}^r_-$, where $\nabla V = [\nabla_{ij} V]$ such that $\nabla_{ij} V = \frac{\partial V}{\partial r^{L,k}(i,j)} + \frac{\partial V}{\partial r^{G,k}(i,j)}$ and $\langle \cdot, \cdot \rangle_F$ denotes the Frobenius inner product.

Then, if any agent employs a decision strategy of the form
$$ p^k[n] = (1 - \delta) \varphi^k[n] + \frac{\delta}{A^k} \cdot \mathbf{1}_{A^k} \tag{2.34} $$
in lieu of (2.22), where $\varphi^k[n]$ is a probability distribution on the action space of agent $k$, and is a solution to the set of linear feasibility constraints
$$ \sum_{j \in \mathcal{A}^k - \{i\}} \varphi^k_j[n] \nabla_{ji} V\left( R^{L,k}[n], R^{G,k}[n] \right) = \varphi^k_i[n] \sum_{j \in \mathcal{A}^k - \{i\}} \nabla_{ij} V\left( R^{L,k}[n], R^{G,k}[n] \right), \quad \text{for all } i, j \in \mathcal{A}^k $$
then the statements of Theorem 2.1 still hold.

2.5 Analysis of the Adaptive Learning Algorithm

This section is devoted to the proof of the main results summarized in Theorem 2.1. For better readability, details for each step of the proof are postponed to Section 2.9. The convergence analysis is based on [39] and is organized into three steps as follows: First, it will be shown in Section 2.5.1 that the limit system for the discrete-time iterates of Algorithm 2.1 is a regime switching system of interconnected differential inclusions. Differential inclusions arise naturally in game-theoretic learning, since the strategies according to which others play are unknown. Next, using multiple Lyapunov function methods [197], stability of the limit dynamical system is studied in Section 2.5.2, and its set of global attractors is characterized. Accordingly, asymptotic stability of the interpolated processes associated with the discrete-time iterates is proved. Up to this point, we have shown that each agent asymptotically experiences at most $\epsilon$ regret by following Algorithm 2.1, which is the first result in Theorem 2.1.
Section 2.6 will then show that the global behavior emergent from such limit individual dynamics in a slowly switching game is attracted to the correlated $\epsilon$-equilibria set.

2.5.1 Characterization of the Limit Individual Behavior

The first step in the proof uses weak convergence methods to characterize the limit individual behavior of agents following Algorithm 2.1 as a regime switching dynamical system represented by a system of interconnected differential inclusions. Differential inclusions are generalizations of ODEs [20]. Below, we provide a precise definition.

Definition 2.6. A differential inclusion is a dynamical system of the form
$$ \frac{d}{dt} X \in F(X) \tag{2.35} $$
where $X \in \mathbb{R}^r$ and $F : \mathbb{R}^r \to \mathbb{R}^r$ is a Marchaud map [19]. That is:
1. The graph and domain of $F$ are nonempty and closed.
2. The values $F(X)$ are convex.
3. The growth of $F$ is linear. That is, there exists $C > 0$ such that, for every $X \in \mathbb{R}^r$,
$$ \sup_{Y \in F(X)} \|Y\| \leq C \left( 1 + \|X\| \right) \tag{2.36} $$
where $\|\cdot\|$ denotes any norm on $\mathbb{R}^r$.

We borrow techniques from the theory of stochastic approximations [192] and work with the piecewise constant continuous-time interpolations of the discrete-time iterates of regrets. The regret matrices incorporate all information recorded in the global behavior vector $z[n]$, however, translated based on each agent's payoff function. Since it is more convenient to work with the regret matrices, we derive the limiting process associated with these regret matrices in what follows.

Different from the usual approach of stochastic approximation [40], where $\varepsilon = o(\mu)$, Assumption 2.2 supposes $\varepsilon = O(\mu)$. In this case, the standard ODE method cannot be carried over due to the fact that the game is now non-stationary, and the adaptation rate of the algorithm is the same as that of the timescale of the time variations underlying the game.
By a combined use of the updated treatment on stochastic approximations [192] and Markov-switched systems [291, 292], it will be shown that one can still obtain a limit system. However, very different from the existing literature, the limit system is no longer a single ODE, but a system of interconnected ordinary differential inclusions modulated by a continuous-time Markov chain. Hence, the limit is not deterministic, but stochastic.

Before proceeding with the proof, we shall study properties of the sequence of decisions made according to Algorithm 2.1. Recall from Section 2.3.1 that the sequence of decisions $\{a^k[n]\}$ simply forms a finite-state Markov chain. Standard results on Markov chains show that the transition probabilities in (2.22) admit (at least) one invariant measure, denoted by $\sigma^k$. The following lemma characterizes the properties of such an invariant measure.

Lemma 2.1. The invariant measure $\sigma^k(R^{L,k}, R^{G,k})$ of the transition probabilities (2.22) takes the form
$$ \sigma^k\left( R^{L,k}, R^{G,k} \right) = (1 - \delta) \chi^k\left( R^{L,k}, R^{G,k} \right) + \frac{\delta}{A^k} \cdot \mathbf{1} \tag{2.37} $$
where $\chi^k(R^{L,k}, R^{G,k})$ satisfies
$$ \sum_{j \neq i} \chi^k_j \left| r^{L,k}_{ji} + r^{G,k}_{ji} \right|^+ = \chi^k_i \sum_{j \neq i} \left| r^{L,k}_{ij} + r^{G,k}_{ij} \right|^+ \tag{2.38} $$
and $|x|^+ = \max\{x, 0\}$.

Proof. Suppose $\chi^k$ is the invariant measure of the transition probabilities (2.22) when $\delta = 0$. Then,
$$ \begin{aligned} \chi^k_i &= \chi^k_i \left[ 1 - \sum_{j \neq i} \frac{\left| r^{L,k}_{ij} + r^{G,k}_{ij} \right|^+}{\rho^k} \right] + \sum_{j \neq i} \chi^k_j \frac{\left| r^{L,k}_{ji} + r^{G,k}_{ji} \right|^+}{\rho^k} \\ &= \chi^k_i - \frac{\chi^k_i}{\rho^k} \sum_{j \neq i} \left| r^{L,k}_{ij} + r^{G,k}_{ij} \right|^+ + \frac{1}{\rho^k} \sum_{j \neq i} \chi^k_j \left| r^{L,k}_{ji} + r^{G,k}_{ji} \right|^+. \end{aligned} $$
Canceling out $\chi^k_i$ on both sides, and multiplying by $\rho^k$, (2.38) follows. Now, if $\delta > 0$, the strategy (2.22) randomizes uniformly over the action space with probability $\delta$, and with the remaining probability $1 - \delta$ chooses the next action according to regrets. The invariant measure inherits the same property and, hence, it takes the form of (2.37).

Note that (2.38) together with
$$ \chi^k\left( R^k \right) \geq 0, \quad \text{and} \quad \sum_i \chi^k_i\left( R^k \right) = 1 $$
forms a linear feasibility problem (null objective function).
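The balance condition (2.38) can be checked numerically for a small example. The Python sketch below builds the $\delta = 0$ transition matrix used in the proof of Lemma 2.1 from a hypothetical combined regret matrix and finds its stationary distribution by power iteration (a swapped-in numerical technique, not the linear feasibility or duality argument referred to in the text). The regret values, $\rho^k = 4$, and all function names are assumptions for illustration; the sketch also assumes $\rho^k$ is large enough that every row is a valid distribution and that the chain is irreducible.

```python
from typing import List

Matrix = List[List[float]]


def stationary_no_exploration(regret: Matrix, rho: float,
                              iters: int = 5000) -> List[float]:
    """Stationary distribution chi of the delta = 0 regret-matching chain.

    regret[i][j] is the combined regret r_ij for switching from i to j.
    """
    n = len(regret)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j != i:
                P[i][j] = max(0.0, regret[i][j]) / rho
        # Inertia: the diagonal keeps the remaining probability mass.
        P[i][i] = 1.0 - sum(P[i])
    chi = [1.0 / n] * n
    for _ in range(iters):
        chi = [sum(chi[j] * P[j][i] for j in range(n)) for i in range(n)]
    return chi


def with_exploration(chi: List[float], delta: float) -> List[float]:
    """Mix chi with the uniform distribution, in the form stated in (2.37)."""
    n = len(chi)
    return [(1.0 - delta) * value + delta / n for value in chi]


# Hypothetical combined regret matrix for three actions.
R = [[0.0, 1.0, 0.5], [0.2, 0.0, 0.3], [0.4, 0.1, 0.0]]
chi = stationary_no_exploration(R, rho=4.0)
sigma = with_exploration(chi, delta=0.1)
# Residual of the balance condition (2.38) for each action i.
residuals = [
    sum(chi[j] * max(0.0, R[j][i]) for j in range(3) if j != i)
    - chi[i] * sum(max(0.0, R[i][j]) for j in range(3) if j != i)
    for i in range(3)
]
```

The residuals vanish up to floating-point error, and both `chi` and `sigma` sum to one, in line with the statement of the lemma.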
Existence of $\chi^k(R^k)$ can alternatively be proved using strong duality; see [51, Section 5.8.2] for details.

Before proceeding with the theorem, a few definitions and notation are in order. Let us first define weak convergence, which is a generalization of convergence in distribution to a function space, and is central to the tracking analysis (see footnote 9).

Footnote 9: The interested reader is referred to [192, Chapter 7] for further details on weak convergence and related matters.

Definition 2.7 (Weak Convergence). Let $Z[n]$ and $Z$ be $\mathbb{R}^r$-valued random vectors. $Z[n]$ converges weakly to $Z$, denoted by $Z[n] \Rightarrow Z$, if for any bounded and continuous function $\Omega(\cdot)$, $E\,\Omega(Z[n]) \to E\,\Omega(Z)$ as $n \to \infty$.

Let $n^k$ denote a probability measure over the joint action space of the open neighborhood of agent $k$, and denote by $\Delta \mathcal{A}^{N^k}$ the simplex of all such measures. The expected local payoff to agent $k$ given the joint strategy $n^k$ is then defined by
$$ U^k_L\left( a^k, n^k \right) := \sum_{a^{N^k}} n^k\left( a^{N^k} \right) U^k_L\left( a^k, a^{N^k} \right) \tag{2.39} $$
where $n^k(a^{N^k})$ denotes the probability of all neighbors of agent $k$ jointly picking $a^{N^k}$. It is now time to present the theorem that characterizes the limit system associated with the regret iterates in Algorithm 2.1.

Theorem 2.2. Consider the interpolated processes $R^{L,k,\mu}(\cdot)$ and $R^{G,k,\mu}(\cdot)$ defined in (2.31). Then, under Assumptions 2.1 and 2.2, as $\mu \to 0$,
$$ \left( R^{L,k,\mu}(\cdot), R^{G,k,\mu}(\cdot), \theta^{\mu}(\cdot) \right) \Rightarrow \left( R^{L,k}(\cdot), R^{G,k}(\cdot), \theta(\cdot) \right) $$
where $(R^{L,k}(\cdot), R^{G,k}(\cdot))$ is a solution of the system of interconnected differential inclusions
$$ \begin{cases} \dfrac{dR^{L,k}}{dt} \in H^{L,k}\left( R^{L,k}, R^{G,k}, \theta(t) \right) - R^{L,k} \\[2mm] \dfrac{dR^{G,k}}{dt} = H^{G,k}\left( R^{L,k}, R^{G,k}, \theta(t) \right) - R^{G,k} \end{cases} \tag{2.40} $$
and
$$ \begin{aligned} H^{L,k} &= \left[ h^{L,k}_{ij} \right], \quad h^{L,k}_{ij} := \left\{ \left[ U^k_L\left( j, n^k, \theta(t) \right) - U^k_L\left( i, n^k, \theta(t) \right) \right] \sigma^k_i\left( R^{L,k}, R^{G,k} \right) ;\ n^k \in \Delta \mathcal{A}^{N^k} \right\} \\ H^{G,k} &= \left[ h^{G,k}_{ij} \right], \quad h^{G,k}_{ij} := \left[ U^k_{G,t}\left( j, \theta(t) \right) - U^k_{G,t}\left( i, \theta(t) \right) \right] \sigma^k_i\left( R^{L,k}, R^{G,k} \right). \end{aligned} $$
Here, $U^k_{G,t}(\cdot)$ is the interpolated process of the global payoffs accrued from the game, $\theta(t)$ is a continuous-time Markov chain with the same generator matrix $Q$ as in Assumption 2.1, and the strategy $\sigma^k$ is given in (2.37).

Proof.
See Section 2.9.

The asymptotics of a stochastic approximation algorithm is typically captured by an ODE, which is in contrast to the limit system obtained in (2.40). An intuitive argument to justify this difference is as follows: Although agents observe $a^{N^k}$, they are oblivious to the strategies $n^k$ from which $a^{N^k}$ has been drawn, which may in fact vary over time. Different strategies $n^k$ form different trajectories of $R^{L,k}[n]$. The interpolated process $R^{L,k,\mu}(\cdot)$ thus follows the trajectory of a differential inclusion rather than an ODE. The same argument holds for $R^{G,k,\mu}(\cdot)$ except that the limit dynamics follow an ODE. This is due to agents being oblivious to the facts: (i) non-neighbors exist, and (ii) global payoffs are affected by non-neighbors' decisions. However, they realize the time-dependency of $U^k_{G,n}$ as taking the same action at various times results in different payoffs. The interconnection of the dynamics in (2.40) is hidden in $\sigma^k(R^{L,k}, R^{G,k})$. Note further in the above theorem that, since the r.h.s. in (2.40) is a compact convex set, the linear growth condition (2.36) in Definition 2.6 is trivially satisfied.

The Case of Observing $a^{S^k}[n]$. Suppose agent $k$ knows the function
$$ U^k_{G,n}\left( a^k, a^{S^k} \right) = U^k_G\left( a^k, a^{S^k}, \theta[n] \right) $$
at each time $n$ and, in contrast to the neighborhood monitoring assumption in Section 2.3.1, the non-neighbors' joint action profile $a^{S^k}[n]$ is revealed to agent $k$. In lieu of (2.25), agent $k$ can then use a stochastic approximation algorithm similar to (2.24) to update global regrets. More precisely, the global regret matrix is updated recursively via
$$ \widehat{R}^{G,k}[n+1] = \widehat{R}^{G,k}[n] + \mu \left[ \widehat{B}^{G,k}\left( a^k[n], a^{S^k}[n], n \right) - \widehat{R}^{G,k}[n] \right] \tag{2.41} $$
$\widehat{B}^{G,k} = \left[ \widehat{b}^{G,k}_{ij} \right]$, where
$$ \widehat{b}^{G,k}_{ij}\left( a^k[n], a^{S^k}[n], n \right) = \left[ U^k_{G,n}\left( j, a^{S^k} \right) - U^k_{G,n}\left( a^k[n], a^{S^k} \right) \right] I\left( a^k[n] = i \right). $$
Suppose agent $k$ employs a strategy $\widehat{p}^k[n]$ of the form (2.22), in which $R^{G,k}[n]$ is now replaced with $\widehat{R}^{G,k}[n]$.
Then, similar to Theorem 2.2, it can be shown that the limit dynamical system representing the iterates $(R^{L,k}[n], \widehat{R}^{G,k}[n])$ can be expressed as
$$ \begin{cases} \dfrac{dR^{L,k}}{dt} \in H^{L,k}\left( R^{L,k}, \widehat{R}^{G,k}, \theta(t) \right) - R^{L,k} \\[2mm] \dfrac{d\widehat{R}^{G,k}}{dt} \in H^{G,k}\left( R^{L,k}, \widehat{R}^{G,k}, \theta(t) \right) - \widehat{R}^{G,k} \end{cases} \tag{2.42} $$
where
$$ \begin{aligned} H^{L,k} &= \left[ h^{L,k}_{ij} \right], \quad h^{L,k}_{ij} := \left\{ \left[ U^k_L\left( j, n^k, \theta(t) \right) - U^k_L\left( i, n^k, \theta(t) \right) \right] \sigma^k_i\left( R^{L,k}, \widehat{R}^{G,k} \right) ;\ n^k \in \Delta \mathcal{A}^{N^k} \right\} \\ H^{G,k} &= \left[ h^{G,k}_{ij} \right], \quad h^{G,k}_{ij} := \left\{ \left[ U^k_G\left( j, s^k, \theta(t) \right) - U^k_G\left( i, s^k, \theta(t) \right) \right] \sigma^k_i\left( R^{L,k}, \widehat{R}^{G,k} \right) ;\ s^k \in \Delta \mathcal{A}^{S^k} \right\}. \end{aligned} $$
Here, $n^k$ and $s^k$ denote probability measures over the joint action spaces of agent $k$'s neighbors $\mathcal{A}^{N^k}$ and non-neighbors $\mathcal{A}^{S^k}$, respectively. The simplexes of all such measures are further denoted by $\Delta \mathcal{A}^{N^k}$ and $\Delta \mathcal{A}^{S^k}$, respectively. The expected local and global payoffs, given the probability measures $n^k$ and $s^k$, are defined similar to (2.39). The invariant measure $\sigma^k$ also possesses the properties discussed in Lemma 2.1.

Recall that
$$ U^k_{G,n}(i) = U^k_G\left( i, a^{S^k}[n], \theta[n] \right) $$
and knowing $a^{S^k}[n]$ only affects the ability to compute it. Therefore,
$$ U^k_{G,n}(i) \in \left\{ U^k_G\left( i, s^k, \theta[n] \right) ;\ s^k \in \Delta \mathcal{A}^{S^k} \right\} $$
and (2.42) represents the same continuous-time dynamics as (2.40). That is, $(R^{L,k}[n], R^{G,k}[n])$ and $(R^{L,k}[n], \widehat{R}^{G,k}[n])$ are stochastic approximations of the same differential inclusion (2.42). This property allows us to invoke Theorem 2.3 below, which proves that the strategy $p^k$, defined in (2.22), induces the same asymptotic behavior for the pair processes $(R^{L,k}[n], R^{G,k}[n])$ and $(R^{L,k}[n], \widehat{R}^{G,k}[n])$.

Theorem 2.3. Using the decision strategy $p^k$, defined in (2.22), the limit sets of the pair processes $(R^{L,k}[n], R^{G,k}[n])$ and $(R^{L,k}[n], \widehat{R}^{G,k}[n])$ coincide.

Proof. See Section 2.9.

The above theorem simply asserts that realizing stage global payoffs $U^k_{G,\tau}(a^k[\tau])$ provides sufficient information to construct an unbiased estimator of $\widehat{R}^{G,k}[n]$; hence, the two processes $(R^{L,k}[n], R^{G,k}[n])$ and $(R^{L,k}[n], \widehat{R}^{G,k}[n])$ exhibit the same asymptotic behavior.
This result will be used in Section 2.6 to prove convergence of the global behavior to the polytope of correlated $\epsilon$-equilibria.

2.5.2 Stability Analysis of the Limit Dynamical System

This step examines stability of the limit system that has been shown to represent the local behavior of agents in Section 2.5.1. The set of global attractors of this limit system is then shown to comprise an $\epsilon$-neighborhood of the negative orthant (see Figure 2.5a). Therefore, agents asymptotically experience at most $\epsilon$ regret via the simple local behavior prescribed by Algorithm 2.1.

We start by defining stability of switched dynamical systems; see [67] and [293, Chapter 9] for further details. In what follows, $\mathrm{dist}[\cdot, \cdot]$ denotes the usual distance function.

Definition 2.8. Consider the switched system
$$ \begin{aligned} & \dot{x}(t) = f\left( x(t), \theta(t) \right) \\ & x(0) = x_0, \quad \theta(0) = \theta_0, \quad x(t) \in \mathbb{R}^r, \quad \theta(t) \in \mathcal{Q}, \end{aligned} \tag{2.43} $$
where $\theta(t)$ is the switching signal, and $f : \mathbb{R}^r \times \mathcal{Q} \to \mathbb{R}^r$ is locally Lipschitz for all $\theta \in \mathcal{Q}$. A closed and bounded set $H \subset \mathbb{R}^r \times \mathcal{Q}$ is
1. stable in probability if for any $\eta > 0$ and $\gamma > 0$, there is a $\widehat{\gamma} > 0$ such that
$$ P\left( \sup_{t \geq 0} \mathrm{dist}\left[ (x(t), \theta(t)), H \right] < \gamma \right) \geq 1 - \eta, \quad \text{whenever } \mathrm{dist}\left[ (x_0, \theta_0), H \right] < \widehat{\gamma}; $$
2. asymptotically stable in probability if it is stable in probability and
$$ P\left( \lim_{t \to \infty} \mathrm{dist}\left[ (x(t), \theta(t)), H \right] = 0 \right) \to 1, \quad \text{as } \mathrm{dist}\left[ (x_0, \theta_0), H \right] \to 0; $$
3. asymptotically stable almost surely if
$$ \lim_{t \to \infty} \mathrm{dist}\left[ (x(t), \theta(t)), H \right] = 0 \quad \text{almost surely.} $$

To obtain the stability results, we unfold the regret matrices and rearrange their elements as a vector. With a slight abuse of notation, we continue to use the same notation for these regret vectors. The following theorem examines the stability of the deterministic subsystems pertaining to each $\theta$ when $\theta(t) = \theta$ is held fixed in (2.40). The slow switching condition then allows us to apply the method of multiple Lyapunov functions [197, Chapter 3] to analyze stability of the switched limit dynamical systems.

Theorem 2.4.
Consider the limit dynamical system (2.40) associated with the discrete-time iterates of Algorithm 2.1. Let
$$ \mathbb{R}_- = \left\{ (x, y) \in \mathbb{R}^{(A^k)^2} \times \mathbb{R}^{(A^k)^2} ;\ \left| x + y \right|^+ \leq \epsilon \mathbf{1} \right\}. \tag{2.44} $$
Let further
$$ \left( R^{L,k}(0), R^{G,k}(0) \right) = \left( R^{L,k}_0, R^{G,k}_0 \right) \quad \text{and} \quad \theta(0) = \theta_0. $$
Then, for each $\epsilon \geq 0$, there exists $\widehat{\delta}(\epsilon) \geq 0$ such that if $\delta \leq \widehat{\delta}(\epsilon)$ in (2.37) (or equivalently in the decision strategy (2.22)), the following results hold:

(i) If $\theta(t) = \theta$ is held fixed, the set $\mathbb{R}_-$ is globally asymptotically stable for the deterministic system associated with each $\theta \in \mathcal{Q}$, and
$$ \lim_{t \to \infty} \mathrm{dist}\left[ \left( R^{L,k}(t), R^{G,k}(t) \right), \mathbb{R}_- \right] = 0. \tag{2.45} $$

(ii) The set $\mathbb{R}_- \times \mathcal{Q}$ is globally asymptotically stable almost surely for the Markovian switched system. More precisely,

a) $P\left( \forall \eta > 0,\ \exists \gamma(\eta) > 0 \text{ s.t. } \mathrm{dist}\left[ \left( R^{L,k}_0, R^{G,k}_0 \right), \mathbb{R}_- \right] \leq \gamma(\eta) \Rightarrow \sup_{t \geq 0} \mathrm{dist}\left[ \left( R^{L,k}(t), R^{G,k}(t) \right), \mathbb{R}_- \right] < \eta \right) = 1$

b) $P\left( \forall \gamma, \eta' > 0,\ \exists T(\gamma, \eta') \geq 0 \text{ s.t. } \mathrm{dist}\left[ \left( R^{L,k}_0, R^{G,k}_0 \right), \mathbb{R}_- \right] \leq \gamma \Rightarrow \sup_{t \geq T(\gamma, \eta')} \mathrm{dist}\left[ \left( R^{L,k}(t), R^{G,k}(t) \right), \mathbb{R}_- \right] < \eta' \right) = 1.$

Proof. See Section 2.9.

The above theorem states that the set of global attractors of the limit Markovian switched dynamical system (2.40) is the same as that for all non-switching subsystems (when $\theta(t) = \theta$ is held fixed) and constitutes $\mathbb{R}_-$. This sets the stage for Section 2.6 where the global attractor sets (in regret space) are proved to represent the correlated $\epsilon$-equilibria polytope (in strategy space).

In Theorem 2.2, we considered $\mu$ small and $n$ large, but $\mu n$ remained bounded. This gives a limit Markovian switched system of differential inclusions for the sequence of interest as $\mu \to 0$. The next step is to study asymptotic stability. It will be established that the limit points of the Markovian switched system of differential inclusions (2.40) and the stochastic approximation iterates (2.24)-(2.25) coincide as $t \to \infty$. We thus consider the case where $\mu \to 0$ and $n \to \infty$, however, $\mu n \to \infty$ now. Nevertheless, instead of considering a two-stage limit by first letting $\mu \to 0$ and then $t \to \infty$, we study
$$ \left( R^{L,k,\mu}(t + t_{\mu}), R^{G,k,\mu}(t + t_{\mu}) \right) $$
and require $t_{\mu} \to \infty$ as $\mu \to 0$.
The following corollary asserts that the results of Theorem 2.4 also hold for the interpolated processes. Hence, the interpolated regret processes also converge to the set of global attractors of the associated limit dynamical system.

Corollary 2.1. Denote by $\{t_\mu\}$ any sequence of real numbers satisfying $t_\mu\to\infty$ as $\mu\to 0$. Suppose $\{R^{L,k}[n],R^{G,k}[n] : \mu>0,\ n<\infty\}$ is tight or bounded in probability. Then, for each $\epsilon\ge 0$, there exists $\hat{\delta}(\epsilon)\ge 0$ such that if $\delta\le\hat{\delta}(\epsilon)$ in (2.22),
\[
\big(R^{L,k,\mu}(\cdot+t_\mu),R^{G,k,\mu}(\cdot+t_\mu)\big)\to\mathcal{R}_- \quad\text{in probability},
\]
where $\mathcal{R}_-$ is defined in (2.44).

Proof. See Section 2.9.

Remark 2.4. It is well known that the asymptotic covariance of the limit dynamics provides information about the rate of convergence of stochastic recursive algorithms; see [192, Chapter 10]. This approach, however, is not applicable in our analysis of Algorithm 2.1 since the limit dynamical system is a differential inclusion rather than an ordinary differential equation; see (2.40). That is, for each initial condition, the sample path of regrets belongs to a set, and independent runs of the algorithm result in different sample paths. This prohibits deriving an analytical rate of convergence, though one expects the convergence rate to degrade as the cardinality of the player set increases.

2.6 The Case of Slow Regime Switching

This section is devoted to the proof of the second result in Theorem 2.1, and concerns the asymptotic behavior of Algorithm 2.1 when the random evolution of the parameters underlying the game occurs on a timescale that is much slower than that determined by the step-size $\mu$. More precisely, we replace Assumption 2.2 with the following assumption.

Assumption 2.3. $0<\varepsilon\ll\mu$ in the transition probability matrix $P^\varepsilon$ in Assumption 2.1.

The above assumption introduces a different timescale to Algorithm 2.1, which leads to an asymptotic behavior that is fundamentally different from the case $\varepsilon=O(\mu)$ that we analyzed in Section 2.5. 
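To make the timescale separation in Assumption 2.3 concrete, the following Python sketch builds $P^\varepsilon = I + \varepsilon Q$ and compares the mean sojourn time of the chain in each state with the rough averaging window $1/\mu$ of a constant step-size recursion. It is only an illustration under stated assumptions: the two-state generator is the one used in the numerical study of Section 2.7, whereas the function names and the particular values of $\varepsilon$ and $\mu$ are hypothetical choices made for this example.

```python
import numpy as np

def transition_matrix(generator, eps):
    """Return P^eps = I + eps * Q for a generator Q (rows sum to zero)."""
    Q = np.asarray(generator, dtype=float)
    return np.eye(Q.shape[0]) + eps * Q

def mean_sojourn_times(P):
    """Mean number of iterations spent in each state before a jump.

    The holding time in state i is geometric with success probability
    1 - P[i, i], so its mean is 1 / (1 - P[i, i]).
    """
    return 1.0 / (1.0 - np.diag(P))

# Generator of the two-state chain used in Section 2.7.
Q = [[-0.5, 0.5], [0.5, -0.5]]
mu = 1e-3   # step-size of the learning recursion (illustrative value)
eps = 1e-5  # switching speed chosen so that eps << mu (illustrative value)

P = transition_matrix(Q, eps)
sojourn = mean_sojourn_times(P)
window = 1.0 / mu  # rough memory length of a constant step-size average

# A ratio much larger than 1 means the regime is almost constant
# over one averaging window of the learning recursion.
ratio = sojourn / window
```

With these values each regime lasts on the order of $2\times 10^5$ iterations on average, about two hundred averaging windows, which is the sense in which the hyper-parameter can be treated as quasi-static.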
Under Assumption 2.3, $\theta(\cdot)$ is the slow component and $(R^{L,k}(\cdot),R^{G,k}(\cdot))$ is the fast transient in (2.40). It is well known that, in such two-timescale systems, the slow component is quasi-static (it remains almost constant) when analyzing the behavior on the fast timescale. Recall from (2.31) the interpolated processes associated with the regret matrices. The weak convergence argument then shows that
\[
\big(R^{L,k,\mu}(\cdot),R^{G,k,\mu}(\cdot),\theta^\mu(\cdot)\big)\Rightarrow\big((R^{L,k}(\cdot),R^{G,k}(\cdot)),\theta\big)
\]
as $\mu\to 0$ such that the limit $(R^{L,k}(\cdot),R^{G,k}(\cdot))$ is a solution to
\[
\begin{cases}
\dfrac{dR^{L,k}}{dt}\in H^{L,k}\big(R^{L,k},R^{G,k},\theta\big)-R^{L,k}\\[2mm]
\dfrac{dR^{G,k}}{dt}= H^{G,k}\big(R^{L,k},R^{G,k},\theta\big)-R^{G,k}
\end{cases} \tag{2.46}
\]
where
\[
\begin{aligned}
H^{L,k}&=\big[h^{L,k}_{ij}\big], & h^{L,k}_{ij}&:=\Big\{\big[U^k_L(j,n^k,\theta)-U^k_L(i,n^k,\theta)\big]\sigma^k_i\big(R^{L,k},R^{G,k}\big);\ n^k\in\Delta\mathcal{A}^{N^k}\Big\},\\
H^{G,k}&=\big[h^{G,k}_{ij}\big], & h^{G,k}_{ij}&:=\big[U^k_{G,t}(j,\theta)-U^k_{G,t}(i,\theta)\big]\sigma^k_i\big(R^{L,k},R^{G,k}\big).
\end{aligned}
\]
Here, as before, $n^k$ denotes the joint strategy of agent $k$'s neighbors, and $\Delta\mathcal{A}^{N^k}$ represents the simplex of all such strategies. Further, $\sigma^k$ is the stationary distribution of agent $k$'s decision strategy, which is characterized in Lemma 2.1. Technical details are omitted for brevity; see [192, Chapter 8] for details.

The first result in Theorem 2.4 shows that the set $\mathcal{R}_-$ is globally asymptotically stable for each $\theta\in\mathcal{Q}$. Subsequently, Corollary 2.1 shows that for each $\epsilon>0$, there exists $\hat{\delta}(\epsilon)>0$ such that if $\delta\le\hat{\delta}(\epsilon)$ in (2.22),
\[
\big(R^{L,k,\mu}(\cdot+t_\mu),R^{G,k,\mu}(\cdot+t_\mu)\big)\to\mathcal{R}_- \quad\text{in probability as } \mu\to 0,
\]
where $\{t_\mu\}$ is any sequence of real numbers satisfying $t_\mu\to\infty$ as $\mu\to 0$.

Under Assumption 2.3, one can further show that, if all agents follow the proposed adaptive learning algorithm, the global behavior tracks the regime switching polytope of correlated $\epsilon$-equilibria. The following theorem shows that convergence of the regrets of individual agents to an $\epsilon$-neighborhood of the negative orthant is a necessary and sufficient condition for convergence of the global behavior to the correlated $\epsilon$-equilibria set; see Figure 2.5.

Theorem 2.5. 
Recall the interpolated processes for global behavior $z^\mu(\cdot)$ and the local and global regrets $(R^{L,k,\mu}(\cdot),R^{G,k,\mu}(\cdot))$, defined in (2.31). Then,
\[
z^\mu(\cdot)\to\mathcal{C}_\epsilon(\theta) \quad\text{in probability}
\]
if and only if
\[
\big(R^{L,k,\mu}(\cdot),R^{G,k,\mu}(\cdot)\big)\to\mathcal{R}_- \quad\text{in probability for all decision makers } k\in\mathcal{K},
\]
where $\mathcal{R}_-$ is defined in (2.44).

Proof. The proof relies on how the regrets are defined. Associated with the discrete-time iterates (2.41), define the piecewise constant interpolated process
\[
\hat{R}^{G,k,\mu}(t)=\hat{R}^{G,k}[n] \quad\text{for } t\in[n\mu,(n+1)\mu).
\]
Recall further from (2.31) the interpolated processes for the local regret matrix $R^{L,k,\mu}(\cdot)$ and the global behavior $z^\mu(\cdot)$. Then, assuming that $\theta[\tau]=\theta$ is held fixed for $\tau\ge 0$,
\[
\begin{aligned}
r^{L,k,\mu}_{ij}(t)+\hat{r}^{G,k,\mu}_{ij}(t)
&=(1-\mu)^{n-1}\big[U^k_L(j,a^{N^k}[1],\theta)-U^k_L(i,a^{N^k}[1],\theta)\big]I(a^k[1]=i)\\
&\quad+\mu\sum_{2\le\tau\le n}(1-\mu)^{n-\tau}\big[U^k_L(j,a^{N^k}[\tau],\theta)-U^k_L(i,a^{N^k}[\tau],\theta)\big]I(a^k[\tau]=i)\\
&\quad+(1-\mu)^{n-1}\big[U^k_G(j,a^{S^k}[1],\theta)-U^k_G(i,a^{S^k}[1],\theta)\big]I(a^k[1]=i)\\
&\quad+\mu\sum_{2\le\tau\le n}(1-\mu)^{n-\tau}\big[U^k_G(j,a^{S^k}[\tau],\theta)-U^k_G(i,a^{S^k}[\tau],\theta)\big]I(a^k[\tau]=i)\\
&=\sum_{a^{-k}}z^{k,\mu}_{(i,a^{-k})}(t)\big[U^k_L(j,a^{N^k},\theta)-U^k_L(i,a^{N^k},\theta)+U^k_G(j,a^{S^k},\theta)-U^k_G(i,a^{S^k},\theta)\big]\\
&=\sum_{a^{-k}}z^{k,\mu}_{(i,a^{-k})}(t)\big[U^k(j,a^{-k},\theta)-U^k(i,a^{-k},\theta)\big]
\end{aligned} \tag{2.47}
\]
where $z^{k,\mu}_{(i,a^{-k})}(t)$ denotes the interpolated empirical distribution of the joint action profile in which agent $k$ chooses action $i$ and the rest play $a^{-k}$, and $a^{-k}=(a^{N^k},a^{S^k})$. On any convergent subsequence $\{z[n']\}_{n'\ge 0}\to\pi$, with slight abuse of notation, let
\[
z^\mu(t)=z[n'],\quad R^{L,k,\mu}(t)=R^{L,k}[n'],\quad\text{and}\quad \hat{R}^{G,k,\mu}(t)=\hat{R}^{G,k}[n'] \quad\text{for } t\in[n'\mu,n'\mu+\mu).
\]
Consequently,
\[
\lim_{t\to\infty} r^{L,k,\mu}_{ij}(t)+\hat{r}^{G,k,\mu}_{ij}(t)=\sum_{a^{-k}}\pi^k(i,a^{-k})\big[U^k(j,a^{-k},\theta)-U^k(i,a^{-k},\theta)\big] \tag{2.48}
\]
where $\pi^k(i,a^{-k})$ denotes the probability of agent $k$ choosing action $i$ and the rest playing $a^{-k}$. 
By virtue of Theorem 2.3, and using (2.48), we obtain
\[
\begin{aligned}
\lim_{t\to\infty} r^{L,k,\mu}_{ij}(t)+r^{G,k,\mu}_{ij}(t)
&=\lim_{t\to\infty} r^{L,k,\mu}_{ij}(t)+\hat{r}^{G,k,\mu}_{ij}(t)\\
&=\sum_{a^{-k}}\pi^k(i,a^{-k})\big[U^k(j,a^{-k},\theta)-U^k(i,a^{-k},\theta)\big].
\end{aligned} \tag{2.49}
\]
Finally, combining (2.49) with Corollary 2.1, and comparing the result with (2.11) completes the proof.

The above theorem, together with Corollary 2.1, completes the proof for the second result in Theorem 2.1.

[Figure 2.5: Local and global behavior of agents. (a) Regret space: regrets converge to the negative orthant locally. (b) A typical correlated equilibria polytope in a three-player game, with axes $U^1$, $U^2$, $U^3$. The red line depicts the trajectory of average payoffs accrued by all agents. It represents the global behavior of agents and converges to the correlated equilibria polytope.]

2.7 Numerical Study

This section illustrates the performance of the proposed adaptive learning algorithm with a numerical example. Consider the game $\mathcal{G}$, defined in (2.1), and suppose
\[
\mathcal{K}=\{1,2,3\}, \quad\text{and}\quad \mathcal{A}^k=\{0,1\} \ \text{ for all } k\in\mathcal{K}.
\]
The connectivity graph is given by
\[
G=(\mathcal{E},\mathcal{K}), \quad\text{where } \mathcal{E}=\{(1,2),(2,1)\}.
\]
The Markov chain $\theta[n]$ representing the random unpredictable jumps in the parameters of the game is further defined as follows:
\[
\mathcal{Q}=\{1,2\}, \quad\text{and}\quad P^\rho=I+\rho Q, \quad\text{where } Q=\begin{bmatrix}-0.5 & 0.5\\ 0.5 & -0.5\end{bmatrix}.
\]
Agents' local and global payoffs are given in normal form in Table 2.1.

Table 2.1: Agents' payoffs in normal form.

Local payoffs $(U^1_l,U^2_l)$:

              θ[n] = 1                 θ[n] = 2
              S2: 0      S2: 1         S2: 0      S2: 1
    S1: 0     (−1, 3)    (2, −1)       (1, −1)    (0, 3)
    S1: 1     (1, −1)    (1, 4)        (3, 3)     (−1, 0)

Global payoffs $(U^1_g,U^2_g,U^3_g)$, θ[n] = 1:

              S3: 0                        S3: 1
              S2: 0        S2: 1           S2: 0        S2: 1
    S1: 0     (−1, 3, 1)   (2, −1, 3)      (1, −1, 3)   (0, 3, 1)
    S1: 1     (1, −1, 3)   (1, 4, 1)       (3, 3, 1)    (−1, 0, 3)

Global payoffs $(U^1_g,U^2_g,U^3_g)$, θ[n] = 2:

              S3: 0                        S3: 1
              S2: 0        S2: 1           S2: 0        S2: 1
    S1: 0     (0, 6, 2)    (2, −4, 6)      (2, −2, 4)   (−4, 4, 10)
    S1: 1     (4, 0, 8)    (−2, 6, 6)      (0, 2, 10)   (4, −2, 4)

Algorithm 2.2 Unobserved Conditional Smooth Fictitious Play

Initialization: Set $\lambda$. Initialize $U^k[0]=0_{A^k\times A^k}$.

Step 1: Action Selection. Pick action $a^k[n]\sim p^k[n]=(p^k_1[n],\ldots,p^k_{A^k}[n])$: If $a^k[n-1]=j$,
\[
p^k[n]=\frac{1}{\sum_i\exp\big(\frac{1}{\lambda}U^k_{ji}[n]\big)}\Big[\exp\big(\tfrac{1}{\lambda}U^k_{j1}[n]\big),\ldots,\exp\big(\tfrac{1}{\lambda}U^k_{jA^k}[n]\big)\Big]. \tag{2.50}
\]

Step 2: Average Utility Update. For all $i,j\in\mathcal{A}^k$,
\[
U^k_{ij}[n]=U^k_{ij}[n-1]+\mu\Big[\frac{p^k_i[n]}{p^k_j[n]}U^k_n(a^k[n])\,I(a^k[n]=i)-U^k_{ij}[n-1]\Big] \tag{2.51}
\]
where $U^k_n(a^k[n])$ denotes the realized payoff by agent $k$ at time $n$.

Recursion. Set $n\leftarrow n+1$, and go to Step 1.

As benchmarks to compare the performance of Algorithm 2.1, we consider the following two algorithms which, to the best of our knowledge, are the only algorithms proved to converge to the correlated $\epsilon$-equilibria set in scenarios where not all agents exchange their past decisions:

1. Regret-based reinforcement learning [143]. Agents only observe the stream of payoffs, and rely on these perceived payoffs to extract information about other agents' behavior. In [143], it is proved that, for decreasing step-size $\mu[n]=1/n$, any limit point of $\frac{1}{n}\sum_{\tau\le n}e_{a[\tau]}$ is a correlated $\epsilon$-equilibrium of the stage game. For constant step-size $\mu$, techniques similar to those in [192] can be used to establish weak convergence of the discounted empirical frequency $z[n]$, defined in (3.38), to the set of correlated $\epsilon$-equilibria.

2. Unobserved conditional smooth fictitious play. Standard fictitious play requires each agent to record and update the empirical frequency of actions taken by other agents through time and best-respond to it; see [106] for an extensive treatment. Here, since agents do not observe the actions of non-neighbors, we present a variant which relies only on the realizations of the agents' payoffs. The procedure is summarized above in Algorithm 2.2. It is proved in [107] that the conditional smooth fictitious play is $\epsilon$-conditionally consistent. 
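The action-selection rule (2.50) of Algorithm 2.2 is a Boltzmann (softmax) distribution over the row of average utilities indexed by the previously played action. The Python sketch below illustrates only this step; the function names, the array layout `U[j, i]`, and the max-subtraction used for numerical stability are assumptions of this example and are not taken from the thesis.

```python
import numpy as np

def smooth_best_response(avg_utility_row, lam):
    """Softmax with temperature lam, as in the action-selection rule (2.50)."""
    scaled = np.asarray(avg_utility_row, dtype=float) / lam
    # Subtracting the maximum leaves the distribution unchanged
    # but prevents overflow in exp for small lam.
    scaled = scaled - scaled.max()
    weights = np.exp(scaled)
    return weights / weights.sum()

def select_action(U, prev_action, lam, rng):
    """Sample the next action given the previous one (hypothetical helper).

    U[j, i] is assumed to hold the average utility estimate for action i
    conditioned on the previous action being j.
    """
    p = smooth_best_response(U[prev_action], lam)
    return rng.choice(len(p), p=p), p

rng = np.random.default_rng(0)
U = np.zeros((2, 2))  # zero initialization, as in Algorithm 2.2
action, p = select_action(U, prev_action=0, lam=1.0, rng=rng)
```

With all estimates at zero the rule is uniform over actions, and as $\lambda$ decreases it concentrates on the action with the largest estimate, which is the usual exploration and exploitation trade-off of smooth fictitious play.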
The authors in [106] also prove that $\epsilon$-conditional consistency is equivalent to reaching correlated $\epsilon$-equilibrium when $|\mathcal{A}^k|=2$ for all $k\in\mathcal{K}$.

The above procedures make no use of the information disclosed within social groups. Throughout this section, we set $\mu=0.001$ and $\delta=0.1$ in Algorithm 2.1, and $\lambda=1$ in Algorithm 2.2 above. Figure 2.6 illustrates how the distance to the correlated $\epsilon$-equilibrium polytope decreases with $n$ in the static game $\mathcal{G}$ when $\theta[n]=1$ is held fixed for $n\ge 1$. The distance diminishes for both algorithms as the learning proceeds. However, comparing the slopes of the lines in Figure 2.6, $m_1=-0.182$ for Algorithm 2.1 and $m_2=-0.346$ for regret-based reinforcement learning [143], shows that exploiting information disclosed within social groups can lead to an order of magnitude faster convergence to the polytope of correlated $\epsilon$-equilibrium in a small game involving three players.

Figure 2.7 considers a slowly switching game where the payoffs undergo a jump from $\theta=1$ to $\theta=2$ at $n=250$. Each point on the graphs is an average over 100 independent runs of the algorithms. Figure 2.7 illustrates the sample paths of the average payoffs of agents. As shown, all algorithms respond to the jump of the Markov chain. However, Algorithm 2.1 outperforms the others as it approaches the expected payoffs in correlated equilibrium faster.

Figures 2.8 and 2.9 illustrate the proportion of time that each algorithm spends out of the polytope of correlated $\epsilon$-equilibrium, which is a measure of the expected time taken to adapt to the random jumps in payoffs. Figure 2.8 shows the proportion of the iterations spent out of the correlated $\epsilon$-equilibria set when $\varepsilon=0.001$. As can be seen, both algorithms spend more time in the switching correlated $\epsilon$-equilibria polytope as time goes by. However, Algorithm 2.1 exhibits superior performance for both small and large $n$. Figure 2.9 also illustrates the time spent out of the correlated $\epsilon$-equilibria set for different values of $\varepsilon$ when $\mu=0.001$. 
As expected, the performance of both algorithms degrades as the speed of the Markovian jumps increases. This is typical in stochastic approximation algorithms since the dynamics of the Markov chain are not accounted for in the algorithm.

[Figure 2.6: Distance to correlated equilibrium in a static game: $\theta[n]=1$, for $n\ge 1$. Axes: iteration number $n$ versus distance to correlated equilibrium, both on logarithmic scales. Curves: regret-based adaptive learning with neighborhood monitoring; regret-based reinforcement learning [143].]

[Figure 2.7: Average overall payoff in a slowly switching game. The blue, red, and black lines illustrate the sample paths of average payoffs for agents 1, 2, and 3, respectively. The dashed lines represent the expected payoffs in correlated equilibrium. Axes: iteration number $n$ versus average overall utility. Curves: regret-based adaptive learning with neighborhood monitoring; regret-based reinforcement learning [143]; unobserved conditional smooth fictitious play.]

[Figure 2.8: Proportion of iterations spent out of the correlated $\epsilon$-equilibria set for $\varepsilon=10^{-3}$. Axes: iteration number $n$ versus proportion of time spent out of the correlated $\epsilon$-equilibrium. Curves: regret-based reinforcement learning [143]; regret-based adaptive learning with neighborhood monitoring.]

[Figure 2.9: Proportion of iterations spent out of the correlated $\epsilon$-equilibria set versus the switching speed $\varepsilon$ of the hyper-model. Axes: $\varepsilon$ (logarithmic scale) versus percentage of time spent out of the correlated $\epsilon$-equilibrium. An annotation marks the point where the speed of Markovian switching $\varepsilon$ matches the algorithm step-size $\mu$.]

2.8 Closing Remarks

This chapter considered the design and analysis of an adaptive learning algorithm that is suited to the information flow patterns emerging as a consequence of the recent popularity of social networking services. The proposed adaptive learning strategy was based on a natural measure of regret which, intuitively, evaluates the possible increase in the welfare of the decision maker had she chosen a different course of decisions in the past. The regret was used to determine an individual's propensity to switch to a different action the next time she faces the same decision problem, and as such, embodies commonly used rules of behavior. Despite its deceptive simplicity, the adaptive learning algorithm leads to a behavior that is similar to that obtained from fully rational considerations, namely, correlated equilibria, in the long run. Further, it is capable of responding to the random and unpredictable changes occurring in the environment by properly tracking the evolving correlated $\epsilon$-equilibria. Finally, the numerical examples confirmed that, as one would expect, the more decision makers share information about their past decisions with each other, the faster their coordination and, hence, their convergence to correlated equilibria.

It is then natural to ask whether it is possible to use variants of the adaptive learning algorithm to guarantee convergence to and tracking of Nash equilibria. The answer seems to be negative. The reason for the failure turns out to be quite instructive, as summarized in the following extract from [145]:

“[The reason for failure] is not due to some aspect of the adaptive property, but rather to one aspect of the simplicity property: namely, that in the strategies under consideration players do not make use of the payoffs of the other players. While the actions of the other players may be observable, their objectives and reasons for playing those actions are not known (and at best can only be inferred). 
This simple informational restriction, which we call uncoupledness, is the key property that makes Nash equilibria very hard if not impossible to reach.”

The game-theoretic learning procedures have a self-organization feature that is appealing in many multi-agent systems of practical interest. Motivated by the development of wireless-enabled sensors, the results obtained thus far set the stage for the next chapter, which provides guidelines on the game-theoretic design process for multi-agent systems where agents trade off local interests with the contribution to the global task assigned to them. The particular application of interest is a fully decentralized solution to a centralized parameter estimation task via a network of wireless-enabled sensors which cooperate by exchanging and combining local estimates. Each sensor has limited sensing and communication capabilities. The adaptive learning algorithm proposed in this chapter is then deployed to devise an autonomous energy-aware activation mechanism for sensors that captures the spatial-temporal correlation among their measurements. The coordinated activation behavior of sensors then allows fast and successful estimation of the common parameter of interest at each sensor.

2.9 Proofs of Results

2.9.1 Proof of Theorem 2.2

Here, we consider a general setting where the limit system of the pair of processes $(x[n],y[n])$ constitutes a regime switching system of interconnected differential inclusions. Theorem 2.2 can be considered as a special case where the limit dynamics associated with $y[n]$ forms an ODE.

Consider the pair $(x[n],y[n])\in\mathbb{R}^r\times\mathbb{R}^r$ defined by
\[
\begin{aligned}
x[n+1]&=x[n]+\mu\big[U_1(\Phi[n],\Psi[n],\theta[n])-x[n]\big],\\
y[n+1]&=y[n]+\mu\big[U_2(\Phi[n],\Xi[n],\theta[n])-y[n]\big],\\
&\Phi[n],\Psi[n],\Xi[n]\in\mathcal{M}=\{1,\ldots,m\}, \ \text{and}\ \theta[n]\in\mathcal{Q}=\{1,\ldots,\Theta\}
\end{aligned} \tag{2.52}
\]
where $\theta[n]$ is the discrete-time Markov chain in Assumption 2.1. 
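Each line of the recursion (2.52) is a constant step-size exponential average of the observed signal. The short Python sketch below, with hypothetical function names and illustrative values, shows a scalar version of one iterate and checks it against the closed form $x[n]=u+(1-\mu)^n(x[0]-u)$ that holds when the observed signal is held constant at $u$. It is meant only to illustrate why the step-size $\mu$ sets the memory (roughly $1/\mu$ iterations) of the iterates, not to reproduce the coupled system analyzed here.

```python
def sa_step(x, observation, mu):
    """One iterate of x[n+1] = x[n] + mu * (observation - x[n])."""
    return x + mu * (observation - x)

def run_recursion(x0, observations, mu):
    """Apply the constant step-size recursion to a sequence of observations."""
    x = x0
    for u in observations:
        x = sa_step(x, u, mu)
    return x

mu = 0.01   # constant step-size (illustrative value)
n = 1000    # number of iterations (illustrative value)

# With a constant observation u = 1 and x[0] = 0, the iterate after n steps
# equals 1 - (1 - mu)^n, i.e. the initial error decays geometrically.
x_final = run_recursion(0.0, [1.0] * n, mu)
closed_form = 1.0 - (1.0 - mu) ** n
```

Because the step-size does not decay, the iterate never stops adapting, which is what allows the algorithm to track a signal whose mean jumps with the Markov chain $\theta[n]$.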
For simplicity, and without loss of generality, it is assumed that $\Phi[n]$, $\Psi[n]$, and $\theta[n]$ all take values from sets of the same cardinality. It is further assumed that the initial data $x[0]$ and $y[0]$ are independent of the step-size $\mu$. Suppose that $\{x[n],y[n],\Phi[n-1]\}$ is a Markov chain such that the transition matrix is given by
\[
P\big(\Phi[n]=j \,\big|\, \Phi[n-1]=i,\ x[n]=X,\ y[n]=Y\big)=P_{ij}(X,Y). \tag{2.53}
\]
Let $P(X,Y)=[P_{ij}(X,Y)]\in\mathbb{R}^{m\times m}$ denote the $(X,Y)$-dependent transition matrix of this Markov chain. Such a case is referred to as state-dependent noise in [192, p. 185]. To proceed, we make the following assumptions:

Assumption 2.4. The transition matrix $P(X,Y)$ possesses the following properties:
1. It is continuous in both variables.
2. For each $(X,Y)$, it is irreducible and aperiodic.

The second statement in the above assumption ensures the Markov chain is ergodic in the sense that there is a unique stationary distribution
\[
\varphi(X,Y)=(\varphi_1(X,Y),\ldots,\varphi_m(X,Y))
\]
such that
\[
[P(X,Y)]^n\to\mathbf{1}_m\varphi(X,Y) \quad\text{as } n\to\infty, \tag{2.54}
\]
which is a matrix with identical rows consisting of the stationary distribution. The convergence in fact takes place exponentially fast.

Assumption 2.5. $\{\Psi[n]\}$ and $\{\Xi[n]\}$ are sequences of bounded real-valued random variables that are independent of $\{\Phi[n]\}$ such that, for each $i\in\mathcal{M}$ and $\theta\in\mathcal{Q}$, there exist two sets $H_1(i,\theta)$ and $H_2(i,\theta)$ such that
\[
\begin{aligned}
\mathrm{dist}\Big[\frac{1}{n}\sum_{\ell=m}^{n+m-1}E_m\{U_1(i,\Psi[\ell],\theta)\},\ H_1(i,\theta)\Big]&\to 0 \quad\text{in probability},\\
\mathrm{dist}\Big[\frac{1}{n}\sum_{\ell=m}^{n+m-1}E_m\{U_2(i,\Xi[\ell],\theta)\},\ H_2(i,\theta)\Big]&\to 0 \quad\text{in probability}.
\end{aligned} \tag{2.55}
\]
Here, $\mathrm{dist}[\cdot,\cdot]$ denotes the usual distance function, and $E_m$ represents conditional expectation given the $\sigma$-algebra generated by $\{x[\ell],y[\ell],\Phi[\ell-1],\Psi[\ell-1],\Xi[\ell-1] : \ell\le m\}$. Assume further that $\theta[n]$ is independent of $\Phi[n]$, $\Psi[n]$, and $\Xi[n]$.

Assumption 2.6. Consider the sets $H_1(i,\theta)$ and $H_2(i,\theta)$ in (2.55). 
The sets
\[
\begin{aligned}
H_1(X,Y,\theta)&=\sum_{i\in\mathcal{M}}\varphi_i(X,Y)H_1(i,\theta),\\
H_2(X,Y,\theta)&=\sum_{i\in\mathcal{M}}\varphi_i(X,Y)H_2(i,\theta)
\end{aligned} \tag{2.56}
\]
are closed, convex, and upper semi-continuous (see [192, pp. 108-109]), where $\varphi(X,Y)$ is the stationary distribution defined in (2.54).

We shall now interpret the above assumptions in terms of the setup in Algorithm 2.1. In Assumption 2.5, $H_1(i,\theta)$ and $H_2(i,\theta)$ are the convex hulls corresponding to the mean of local and global regrets associated with various sample paths of neighbors' and non-neighbors' action profiles, respectively, conditioned on agent $k$ picking $a^k[\ell]=i$. More precisely, $H_1(i,\theta)$ is a set-valued $A^k\times A^k$ matrix with elements
\[
H_{1,pq}(i,\theta)=
\begin{cases}
\big\{U^k_L(q,n^k,\theta)-U^k_L(i,n^k,\theta);\ n^k\in\Delta\mathcal{A}^{N^k}\big\} & p=i,\\
0 & p\ne i.
\end{cases}
\]
One can similarly define $H_2(i,\theta)$ by replacing the local payoff function with the global payoff function. In Assumption 2.6, $H_1(X,Y,\theta)$ and $H_2(X,Y,\theta)$ are the convex hulls corresponding to the mean (with no conditioning on $a^k[\ell]=i$) of local and global regrets associated with various sample paths of neighbors' and non-neighbors' action profiles, respectively. More precisely, $H_1(X,Y,\theta)$ is the convex combination of the sets $H_1(i,\theta)$ such that the $ij$-th element is given by
\[
H_{1,ij}(X,Y,\theta)=\Big\{\big[U^k_L(j,n^k,\theta)-U^k_L(i,n^k,\theta)\big]\varphi_i(X,Y);\ n^k\in\Delta\mathcal{A}^{N^k}\Big\}.
\]
Similarly, $H_2(X,Y,\theta)$ can be defined by replacing the local payoff function with the global payoff function.

Before proceeding with the theorem, a few definitions and notations are in order.

Definition 2.9. Let $Z[n]$ be an $\mathbb{R}^r$-valued random vector. We say the sequence $\{Z[n]\}$ is tight if for each $\eta>0$, there exists a compact set $K_\eta$ such that $P(Z[n]\in K_\eta)\ge 1-\eta$ for all $n$.

The definitions of weak convergence (see Definition 2.7) and tightness extend to random elements in more general metric spaces. On a complete separable metric space, tightness is equivalent to relative compactness, which is known as Prohorov's Theorem [46]. By virtue of this theorem, we can extract convergent subsequences when tightness is verified. 
In what follows, we use a martingale problem formulation to establish the desired weak convergence. To this end, we first prove tightness. The limit process is then characterized using a certain operator related to the limit martingale problem. We refer the reader to [192, Chapter 7] for further details on weak convergence and related matters.

Define the piecewise constant interpolated processes
\[
x^\mu(t)=x[n],\quad y^\mu(t)=y[n] \quad\text{on } t\in[n\mu,n\mu+\mu).
\]
Similarly, define the piecewise constant interpolated process for the hyper-parameter
\[
\theta^\mu(t)=\theta[n] \quad\text{on } t\in[n\mu,n\mu+\mu).
\]
The following theorem characterizes the limit dynamical system associated with (2.52).

Theorem 2.6. Suppose Assumptions 2.4–2.6 hold. Then, $((x^\mu(\cdot),y^\mu(\cdot)),\theta^\mu(\cdot))$ is tight in $D([0,\infty):\mathbb{R}^{2r}\times\mathcal{Q})$, and any convergent subsequence has a limit $((x(\cdot),y(\cdot)),\theta(\cdot))$ that is a solution to
\[
\begin{cases}
\dot{x}\in H_1(x,y,\theta(t))-x,\\
\dot{y}\in H_2(x,y,\theta(t))-y.
\end{cases} \tag{2.57}
\]
Here, $\theta(\cdot)$ is a continuous-time Markov chain with generator $Q$, defined in Assumption 2.1.

Proof. The proof involves two steps:

Step 1. Tightness of $((x^\mu(\cdot),y^\mu(\cdot)),\theta^\mu(\cdot))$. Consider (2.52) for the sequence of $\mathbb{R}^{2r}$-valued vectors $\{(x[n],y[n])\}$. Noting the boundedness of $\{U_1(\Phi[n],\Psi[n],\theta[n])\}$ and $\{U_2(\Phi[n],\Xi[n],\theta[n])\}$, and using Hölder's and Gronwall's inequalities, for any $0<T<\infty$, we obtain
\[
\sup_{n\le T/\mu}E\left\|\begin{pmatrix}x[n]\\ y[n]\end{pmatrix}\right\|^2<\infty, \tag{2.58}
\]
where in the above and hereafter $t/\mu$ is understood to be the integer part of $t/\mu$ for each $t>0$, and $\|\cdot\|$ represents the Euclidean norm. Next, considering $(x^\mu(\cdot),y^\mu(\cdot))$, for any $t,s>0$, $\delta>0$, and $s<\delta$, it is fairly easy to verify that
\[
\begin{pmatrix}x^\mu(t+s)\\ y^\mu(t+s)\end{pmatrix}-\begin{pmatrix}x^\mu(t)\\ y^\mu(t)\end{pmatrix}
=\begin{pmatrix}\mu\sum_{\tau=t/\mu}^{(t+s)/\mu-1}\big[U_1(\Phi[\tau],\Psi[\tau],\theta[\tau])-x[\tau]\big]\\[1mm] \mu\sum_{\tau=t/\mu}^{(t+s)/\mu-1}\big[U_2(\Phi[\tau],\Xi[\tau],\theta[\tau])-y[\tau]\big]\end{pmatrix}. \tag{2.59}
\]
By virtue of the tightness criteria [190, Theorem 3, p. 
47] or [192, Chapter 7], it suffices to verify
\[
\lim_{\delta\to 0}\limsup_{\mu\to 0}E\left[\sup_{0\le s\le\delta}E^\mu_t\left\|\begin{pmatrix}x^\mu(t+s)\\ y^\mu(t+s)\end{pmatrix}-\begin{pmatrix}x^\mu(t)\\ y^\mu(t)\end{pmatrix}\right\|^2\right]=0 \tag{2.60}
\]
where $E^\mu_t$ denotes conditional expectation given the $\sigma$-algebra generated by the $\mu$-dependent past data up to time $t$. Noting the boundedness of $U_1(\cdot,\cdot,\theta)$ and $U_2(\cdot,\cdot,\theta)$ for all $\theta\in\mathcal{Q}$, it can be shown that
\[
E^\mu_t\left\|\begin{pmatrix}\mu\sum_{\tau=t/\mu}^{(t+s)/\mu-1}\big[U_1(\Phi[\tau],\Psi[\tau],\theta[\tau])-x[\tau]\big]\\[1mm] \mu\sum_{\tau=t/\mu}^{(t+s)/\mu-1}\big[U_2(\Phi[\tau],\Xi[\tau],\theta[\tau])-y[\tau]\big]\end{pmatrix}\right\|^2
\le C\mu^2\Big(\frac{t+s}{\mu}-\frac{t}{\mu}\Big)^2=O(s^2). \tag{2.61}
\]
Combining (2.59) and (2.61), the tightness criterion (2.60) is verified; therefore, $(x^\mu(\cdot),y^\mu(\cdot))$ is tight in $D([0,\infty):\mathbb{R}^{2r})$. Here, $D([0,\infty):\mathbb{R}^{2r})$ denotes the space of functions that are defined on $[0,\infty)$, take values in $\mathbb{R}^{2r}$, and are right continuous with left limits (namely, càdlàg functions), endowed with the Skorohod topology (see [192, p. 228]). In view of [290, Proposition 4.4], $\theta^\mu(\cdot)$ is also tight and $\theta^\mu(\cdot)\Rightarrow\theta(\cdot)$ such that $\theta(\cdot)$ is a continuous-time Markov chain with generator $Q$ (see Assumption 2.1). Therefore, $((x^\mu(\cdot),y^\mu(\cdot)),\theta^\mu(\cdot))$ is tight in $D([0,\infty):\mathbb{R}^{2r}\times\mathcal{Q})$.

Step 2. Characterization of the limit process. By virtue of the Prohorov theorem [192], one can extract a convergent subsequence. For notational simplicity, we still denote the subsequence by $(x^\mu(\cdot),y^\mu(\cdot))$ with limit $(x(\cdot),y(\cdot))$. Thus, $(x^\mu(\cdot),y^\mu(\cdot))$ converges weakly to $(x(\cdot),y(\cdot))$. Using the Skorohod representation theorem [192], with a slight abuse of notation, one can assume $(x^\mu(\cdot),y^\mu(\cdot))\to(x(\cdot),y(\cdot))$ w.p.1 and that the convergence is uniform on any finite interval. We now proceed to characterize the limit $(x(\cdot),y(\cdot))$ using martingale averaging methods.

To obtain the desired limit, it will be proved that the limit $((x(\cdot),y(\cdot)),\theta(\cdot))$ is the solution of the martingale problem with operator $\mathcal{L}$ defined as follows: For all $i\in\mathcal{Q}$,
\[
\begin{aligned}
\mathcal{L}f(x,y,i)&=\nabla_x^\top f(x,y,i)\big[H_1(x,y,i)-x\big]+\nabla_y^\top f(x,y,i)\big[H_2(x,y,i)-y\big]+Qf(x,y,\cdot)(i),\\
Qf(x,y,\cdot)(i)&=\sum_{j\in\mathcal{Q}}q_{ij}f(x,y,j),
\end{aligned} \tag{2.62}
\]
and, for each $i\in\mathcal{Q}$, $f(\cdot,\cdot,i):\mathbb{R}^r\times\mathbb{R}^r\to\mathbb{R}$ with $f(\cdot,\cdot,i)\in C^1_0$ ($C^1$ function with compact support). Further, $\nabla_x^\top f(x,y,i)$ and $\nabla_y^\top f(x,y,i)$ denote the gradients of $f(x,y,i)$ with respect to $x$ and $y$, respectively, and $H_1(x,y,i)$ and $H_2(x,y,i)$ are defined in (2.56). Using an argument similar to [291, Lemma 7.18], one can show that the martingale problem associated with the operator $\mathcal{L}$ has a unique solution. Thus, it remains to prove that the limit $(x(\cdot),y(\cdot),\theta(\cdot))$ is the solution of the martingale problem. To this end, it suffices to show that, for any positive integer $\kappa_0$, any $t,s>0$, $0<t_\iota\le t$ for all $\iota\le\kappa_0$, and any bounded continuous function $h(\cdot,\cdot,i)$ for all $i\in\mathcal{Q}$,
\[
\begin{aligned}
E\,h\big(x(t_\iota),y(t_\iota),\theta(t_\iota):\iota\le\kappa_0\big)\times\Big[&f\big(x(t+s),y(t+s),\theta(t+s)\big)-f\big(x(t),y(t),\theta(t)\big)\\
&-\int_t^{t+s}\mathcal{L}f\big(x(u),y(u),\theta(u)\big)\,du\Big]=0.
\end{aligned} \tag{2.63}
\]
To verify (2.63), we work with $(x^\mu(\cdot),y^\mu(\cdot),\theta^\mu(\cdot))$ and prove that the above equation holds as $\mu\to 0$.

By the weak convergence of $(x^\mu(\cdot),y^\mu(\cdot),\theta^\mu(\cdot))$ to $(x(\cdot),y(\cdot),\theta(\cdot))$ and the Skorohod representation, it can be seen that
\[
\begin{aligned}
&E\,h\big(x^\mu(t_\iota),y^\mu(t_\iota),\theta^\mu(t_\iota):\iota\le\kappa_0\big)\big[f\big(x^\mu(t+s),y^\mu(t+s),\theta^\mu(t+s)\big)-f\big(x^\mu(t),y^\mu(t),\theta^\mu(t)\big)\big]\\
&\quad\to E\,h\big(x(t_\iota),y(t_\iota),\theta(t_\iota):\iota\le\kappa_0\big)\big[f\big(x(t+s),y(t+s),\theta(t+s)\big)-f\big(x(t),y(t),\theta(t)\big)\big].
\end{aligned}
\]
Now, choose a sequence of integers $\{n_\mu\}$ such that $n_\mu\to\infty$ as $\mu\to 0$, but $\delta_\mu=\mu n_\mu\to 0$, and partition $[t,t+s]$ into subintervals of length $\delta_\mu$. Then,
\[
\begin{aligned}
&f\big(x^\mu(t+s),y^\mu(t+s),\theta^\mu(t+s)\big)-f\big(x^\mu(t),y^\mu(t),\theta^\mu(t)\big)\\
&=\sum_{\ell:\ell\delta_\mu=t}^{t+s}\big[f(x[\ell n_\mu+n_\mu],y[\ell n_\mu+n_\mu],\theta[\ell n_\mu+n_\mu])-f(x[\ell n_\mu],y[\ell n_\mu],\theta[\ell n_\mu])\big]\\
&=\sum_{\ell:\ell\delta_\mu=t}^{t+s}\big[f(x[\ell n_\mu+n_\mu],y[\ell n_\mu+n_\mu],\theta[\ell n_\mu+n_\mu])-f(x[\ell n_\mu],y[\ell n_\mu],\theta[\ell n_\mu+n_\mu])\big]\\
&\quad+\sum_{\ell:\ell\delta_\mu=t}^{t+s}\big[f(x[\ell n_\mu],y[\ell n_\mu],\theta[\ell n_\mu+n_\mu])-f(x[\ell n_\mu],y[\ell n_\mu],\theta[\ell n_\mu])\big]
\end{aligned} \tag{2.64}
\]
where $\sum_{\ell:\ell\delta_\mu=t}^{t+s}$ denotes the sum over $\ell$ in the range $t\le\ell\delta_\mu\le t+s$.

First, we consider the second term on the r.h.s. 
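of (2.64):
\[
\begin{aligned}
&\lim_{\mu\to 0}E\,h\big(x^\mu(t_\iota),y^\mu(t_\iota),\theta^\mu(t_\iota):\iota\le\kappa_0\big)\\
&\quad\times\Big[\sum_{\ell:\ell\delta_\mu=t}^{t+s}\big[f(x[\ell n_\mu],y[\ell n_\mu],\theta[\ell n_\mu+n_\mu])-f(x[\ell n_\mu],y[\ell n_\mu],\theta[\ell n_\mu])\big]\Big]\\
&=\lim_{\mu\to 0}E\,h\big(x^\mu(t_\iota),y^\mu(t_\iota),\theta^\mu(t_\iota):\iota\le\kappa_0\big)\\
&\quad\times\Big[\sum_{\ell:\ell\delta_\mu=t}^{t+s}\sum_{i_0=1}^{\Theta}\sum_{j_0=1}^{\Theta}\sum_{\tau=\ell n_\mu}^{\ell n_\mu+n_\mu-1}\big[f(x[\ell n_\mu],y[\ell n_\mu],j_0)P(\theta[\tau+1]=j_0\,|\,\theta[\tau]=i_0)\\
&\qquad\qquad-f(x[\ell n_\mu],y[\ell n_\mu],i_0)\big]I(\theta[\tau]=i_0)\Big]\\
&=E\,h\big(x(t_\iota),y(t_\iota),\theta(t_\iota):\iota\le\kappa_0\big)\Big[\int_t^{t+s}Qf\big(x(u),y(u),\theta(u)\big)\,du\Big].
\end{aligned} \tag{2.65}
\]
Concerning the first term on the r.h.s. of (2.64), we obtain:
\[
\begin{aligned}
&\lim_{\mu\to 0}E\,h\big(x^\mu(t_\iota),y^\mu(t_\iota),\theta^\mu(t_\iota)\big)\\
&\quad\times\Big[\sum_{\ell:\ell\delta_\mu=t}^{t+s}\big[f(x[\ell n_\mu+n_\mu],y[\ell n_\mu+n_\mu],\theta[\ell n_\mu+n_\mu])-f(x[\ell n_\mu],y[\ell n_\mu],\theta[\ell n_\mu+n_\mu])\big]\Big]\\
&=\lim_{\mu\to 0}E\,h\big(x^\mu(t_\iota),y^\mu(t_\iota),\theta^\mu(t_\iota)\big)\\
&\quad\times\Big[\sum_{\ell:\ell\delta_\mu=t}^{t+s}\big[f(x[\ell n_\mu+n_\mu],y[\ell n_\mu+n_\mu],\theta[\ell n_\mu])-f(x[\ell n_\mu],y[\ell n_\mu],\theta[\ell n_\mu])\big]\Big]\\
&=\lim_{\mu\to 0}E\,h\big(x^\mu(t_\iota),y^\mu(t_\iota),\theta^\mu(t_\iota)\big)\\
&\quad\times\Big[\sum_{\ell:\ell\delta_\mu=t}^{t+s}\nabla_xf^\top\big[x[\ell n_\mu+n_\mu]-x[\ell n_\mu]\big]+\nabla_yf^\top\big[y[\ell n_\mu+n_\mu]-y[\ell n_\mu]\big]\Big]\\
&=\lim_{\mu\to 0}E\,h\big(x^\mu(t_\iota),y^\mu(t_\iota),\theta^\mu(t_\iota)\big)\\
&\quad\times\Big[\sum_{\ell:\ell\delta_\mu=t}^{t+s}\delta_\mu\nabla_xf^\top\Big(\frac{1}{n_\mu}\sum_{j=1}^{m}\sum_{\tau=\ell n_\mu}^{\ell n_\mu+n_\mu-1}U_1(j,\Psi[\tau],\theta[\tau])I(\Phi[\tau]=j)-\frac{1}{n_\mu}\sum_{\tau=\ell n_\mu}^{\ell n_\mu+n_\mu-1}x[\tau]\Big)\\
&\qquad+\sum_{\ell:\ell\delta_\mu=t}^{t+s}\delta_\mu\nabla_yf^\top\Big(\frac{1}{n_\mu}\sum_{j=1}^{m}\sum_{\tau=\ell n_\mu}^{\ell n_\mu+n_\mu-1}U_2(j,\Xi[\tau],\theta[\tau])I(\Phi[\tau]=j)-\frac{1}{n_\mu}\sum_{\tau=\ell n_\mu}^{\ell n_\mu+n_\mu-1}y[\tau]\Big)\Big]
\end{aligned} \tag{2.66}
\]
where $\nabla_xf$ and $\nabla_yf$ denote the gradient vectors with respect to $x$ and $y$, respectively. For notational simplicity, we denote $\nabla_xf(x[\ell n_\mu],y[\ell n_\mu],\theta[\ell n_\mu])$ and $\nabla_yf(x[\ell n_\mu],y[\ell n_\mu],\theta[\ell n_\mu])$ simply by $\nabla_xf$ and $\nabla_yf$, respectively. Consider the term in (2.66) that involves $U_1$:
\[
\begin{aligned}
&\lim_{\mu\to 0}E\,h\big(x^\mu(t_\iota),y^\mu(t_\iota),\theta^\mu(t_\iota)\big)\\
&\quad\times\Big[\sum_{\ell:\ell\delta_\mu=t}^{t+s}\delta_\mu\nabla_xf^\top\Big(\sum_{j=1}^{m}\frac{1}{n_\mu}\sum_{\tau=\ell n_\mu}^{\ell n_\mu+n_\mu-1}U_1(j,\Psi[\tau],\theta[\tau])I(\Phi[\tau]=j)\Big)\Big]\\
&=\lim_{\mu\to 0}E\,h\big(x^\mu(t_\iota),y^\mu(t_\iota),\theta^\mu(t_\iota)\big)\Big[\sum_{\ell:\ell\delta_\mu=t}^{t+s}\delta_\mu\nabla_xf^\top\Big(\sum_{i_1=1}^{\Theta}\sum_{i_0=1}^{\Theta}\sum_{j=1}^{m}\frac{1}{n_\mu}\\
&\quad\times\Big[\sum_{\tau=\ell n_\mu}^{\ell n_\mu+n_\mu-1}E_{\ell n_\mu}\{U_1(j,\Psi[\tau],i_1)\}\,E_{\ell n_\mu}\{I(\theta[\tau]=i_1)\,|\,I(\theta[\ell n_\mu]=i_0)\}\,E_{\ell n_\mu}\{I(\Phi[\tau]=j)\}\Big]\Big)\Big].
\end{aligned} \tag{2.67}
\]
We will concentrate on the term involving the Markov chain $\theta[\tau]$. In view of Assumption 2.1, and by virtue of [290, Proposition 4.4], for large $\tau$ with $\ell n_\mu\le\tau\le\ell n_\mu+n_\mu$ and $\tau-\ell n_\mu\to\infty$, and some $\hat{\tau}>0$,
\[
\begin{aligned}
(I+\varepsilon Q)^{\tau-\ell n_\mu}&=Z\big((\tau-\ell n_\mu)\varepsilon\big)+O\big(\varepsilon+\exp(-\hat{\tau}(\tau-\ell n_\mu))\big),\\
\frac{dZ(t)}{dt}&=Z(t)Q,\quad Z(0)=I.
\end{aligned} \tag{2.68}
\]
For $\ell n_\mu\le\tau\le\ell n_\mu+n_\mu$, $\varepsilon=O(\mu)$ yields that $(\tau-\ell n_\mu)\varepsilon\to 0$ as $\mu\to 0$. For such $\tau$, $Z((\tau-\ell n_\mu)\varepsilon)\to I$. 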
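The approximation (2.68) can be checked numerically in a small case. The Python sketch below compares the $n$-step transition matrix $(I+\varepsilon Q)^n$ with $Z(n\varepsilon)$, and then shows that both are close to the identity when $n\varepsilon$ is small. It assumes the symmetric two-state generator of Section 2.7, for which $Z(t)=e^{tQ}$ has the simple closed form used in `two_state_Z`; the function names and the values of $\varepsilon$ and $n$ are hypothetical choices for this illustration.

```python
import numpy as np

def discrete_transition(Q, eps, n):
    """n-step transition matrix (I + eps * Q)^n of the slow Markov chain."""
    Q = np.asarray(Q, dtype=float)
    return np.linalg.matrix_power(np.eye(Q.shape[0]) + eps * Q, n)

def two_state_Z(a, t):
    """Solution of dZ/dt = Z Q, Z(0) = I, for Q = [[-a, a], [a, -a]].

    Q has eigenvalues 0 and -2a, which gives the closed form below.
    """
    d = np.exp(-2.0 * a * t)
    return 0.5 * np.array([[1.0 + d, 1.0 - d],
                           [1.0 - d, 1.0 + d]])

a = 0.5
Q = [[-a, a], [a, -a]]

# Moderate n * eps: the discrete chain is close to Z(n * eps).
eps, n = 1e-3, 100
gap = np.max(np.abs(discrete_transition(Q, eps, n) - two_state_Z(a, n * eps)))

# Small n * eps: the n-step transition matrix is close to the identity.
eps_small = 1e-6
drift = np.max(np.abs(discrete_transition(Q, eps_small, n) - np.eye(2)))
```

In this example both `gap` and `drift` are below $10^{-4}$, consistent with the statement that the chain barely moves over a window of $n_\mu$ iterations when $n_\mu\varepsilon\to 0$.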
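Therefore, by the boundedness of $U_1(\cdot,\cdot,i_1)$ for all $i_1\in\mathcal{Q}$, it follows that, as $\mu\to 0$,
\[
\begin{aligned}
&\frac{1}{n_\mu}\sum_{\tau=\ell n_\mu}^{\ell n_\mu+n_\mu-1}\big\|E_{\ell n_\mu}\{U_1(j,\Psi[\tau],i_1)\}\big\|\cdot\big|E_{\ell n_\mu}\{I(\Phi[\tau]=j)\}\big|\\
&\qquad\times\big|E_{\ell n_\mu}\{I(\theta[\tau]=i_1)\,|\,I(\theta[\ell n_\mu]=i_0)\}-I(\theta[\ell n_\mu]=i_0)\big|\to 0.
\end{aligned} \tag{2.69}
\]
Therefore, the limit in the first line of (2.67) can be replaced by
\[
\begin{aligned}
&\lim_{\mu\to 0}E\,h\big(x^\mu(t_\iota),y^\mu(t_\iota),\theta^\mu(t_\iota)\big)\\
&\quad\times\Big[\sum_{\ell:\ell\delta_\mu=t}^{t+s}\delta_\mu\nabla_xf^\top\Big(\sum_{j=1}^{m}\frac{1}{n_\mu}\sum_{\tau=\ell n_\mu}^{\ell n_\mu+n_\mu-1}U_1(j,\Psi[\tau],\theta[\tau])I(\Phi[\tau]=j)\Big)\Big]\\
&=\lim_{\mu\to 0}E\,h\big(x^\mu(t_\iota),y^\mu(t_\iota),\theta^\mu(t_\iota)\big)\Big[\sum_{\ell:\ell\delta_\mu=t}^{t+s}\delta_\mu\nabla_xf^\top\Big(\sum_{i_0=1}^{\Theta}\sum_{j=1}^{m}\frac{1}{n_\mu}\\
&\quad\times\Big[\sum_{\tau=\ell n_\mu}^{\ell n_\mu+n_\mu-1}E_{\ell n_\mu}\{U_1(j,\Psi[\tau],i_0)\}\,E_{\ell n_\mu}\{I(\Phi[\tau]=j)\}\,I(\theta[\ell n_\mu]=i_0)\Big]\Big)\Big]\\
&=\lim_{\mu\to 0}E\,h\big(x^\mu(t_\iota),y^\mu(t_\iota),\theta^\mu(t_\iota)\big)\\
&\quad\times\Big[\sum_{\ell:\ell\delta_\mu=t}^{t+s}\delta_\mu\nabla_xf^\top\Big(\sum_{i_0=1}^{\Theta}\sum_{j=1}^{m}\frac{1}{n_\mu}\sum_{\tau=\ell n_\mu}^{\ell n_\mu+n_\mu-1}\sum_{j_0=1}^{m}I(\Phi[\ell n_\mu]=j_0)\\
&\quad\times\Big[E_{\ell n_\mu}\{U_1(j,\Psi[\tau],i_0)\}\,\varphi_j(x[\ell n_\mu],y[\ell n_\mu])+E_{\ell n_\mu}\{U_1(j,\Psi[\tau],i_0)\}\\
&\qquad\times\Big(P^{(\tau-\ell n_\mu)}_{j_0j}(x[\ell n_\mu],y[\ell n_\mu])-\varphi_j(x[\ell n_\mu],y[\ell n_\mu])\Big)\Big]I(\theta[\ell n_\mu]=i_0)\Big)\Big].
\end{aligned} \tag{2.70}
\]
Here, $P^{(\tau-\ell n_\mu)}(x,y)$ represents the $(\tau-\ell n_\mu)$-step transition probability matrix. Except for the indicator function on the l.h.s. of (2.71), the rest of the terms in the summand are independent of $j_0$. Therefore,
\[
\sum_{j_0=1}^{m}I(\Phi[\ell n_\mu]=j_0)\,E_{\ell n_\mu}\{U_1(j,\Psi[\tau],i_0)\}\,\varphi_j(x[\ell n_\mu],y[\ell n_\mu])
=E_{\ell n_\mu}\{U_1(j,\Psi[\tau],i_0)\}\,\varphi_j(x[\ell n_\mu],y[\ell n_\mu]). \tag{2.71}
\]
By virtue of Assumption 2.4, regardless of the choice of $j_0$, as $\mu\to 0$ and $\tau-\ell n_\mu\to\infty$, for each fixed $(X,Y)$, $P^{(\tau-\ell n_\mu)}_{j_0j}(X,Y)-\varphi_j(X,Y)\to 0$ exponentially fast. Thus, the last line in (2.70) converges in probability to 0 as $\mu\to 0$, and
\[
\begin{aligned}
&\lim_{\mu\to 0}E\,h\big(x^\mu(t_\iota),y^\mu(t_\iota),\theta^\mu(t_\iota)\big)\\
&\quad\times\Big[\sum_{\ell:\ell\delta_\mu=t}^{t+s}\delta_\mu\nabla_xf^\top\Big(\sum_{j=1}^{m}\frac{1}{n_\mu}\sum_{\tau=\ell n_\mu}^{\ell n_\mu+n_\mu-1}U_1(j,\Psi[\tau],\theta[\tau])I(\Phi[\tau]=j)\Big)\Big]\\
&=\lim_{\mu\to 0}E\,h\big(x^\mu(t_\iota),y^\mu(t_\iota),\theta^\mu(t_\iota)\big)\\
&\quad\times\Big[\sum_{\ell:\ell\delta_\mu=t}^{t+s}\delta_\mu\nabla_xf^\top\Big(\sum_{i_0=1}^{\Theta}\sum_{j=1}^{m}\frac{1}{n_\mu}\sum_{\tau=\ell n_\mu}^{\ell n_\mu+n_\mu-1}\sum_{j_0=1}^{m}I(\Phi[\ell n_\mu]=j_0)\\
&\quad\times\Big[E_{\ell n_\mu}\{U_1(j,\Psi[\tau],i_0)\}\,\varphi_j(x[\ell n_\mu],y[\ell n_\mu])\,I(\theta[\ell n_\mu]=i_0)\Big]\Big)\Big].
\end{aligned} \tag{2.72}
\]
Note that $\theta[\ell n_\mu]=\theta^\mu(\mu\ell n_\mu)$. 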
By the weak convergence of $\theta^\mu(\cdot)$ to $\theta(\cdot)$, the Skorohod representation, using $\mu\ell n_\mu\to u$ and Assumption 2.5, it can be shown that
\[
\lim_{\mu\to 0}\mathrm{dist}\Big[\frac{1}{n_\mu}\sum_{\tau=\ell n_\mu}^{\ell n_\mu+n_\mu-1}E_{\ell n_\mu}\big\{U_1(j,\Psi[\tau],i_0)\,I(\theta^\mu(\mu\ell n_\mu)=i_0)\big\},\ H_1(j,i_0)\,I(\theta(u)=i_0)\Big]=0. \tag{2.73}
\]
By using the technique of stochastic approximation (see, e.g., [192, Chapter 8]), we also have
\[
\begin{aligned}
&E\,h\big(x^\mu(t_\iota),y^\mu(t_\iota),\theta^\mu(t_\iota):\iota\le\kappa_0\big)\Big[\sum_{\ell:\ell\delta_\mu=t}^{t+s}\delta_\mu\nabla_xf^\top\Big(\frac{1}{n_\mu}\sum_{\tau=\ell n_\mu}^{\ell n_\mu+n_\mu-1}x[\tau]\Big)\Big]\\
&\quad\to E\,h\big(x(t_\iota),y(t_\iota),\theta(t_\iota):\iota\le\kappa_0\big)\Big[\int_t^{t+s}\big[\nabla_xf(x(u),y(u),\theta(u))\big]^\top x(u)\,du\Big] \quad\text{as } \mu\to 0.
\end{aligned} \tag{2.74}
\]
In view of Assumption 2.6, combining (2.72)–(2.74) yields
\[
\dot{x}(t)\in H_1(x(t),y(t),\theta(t))-x(t). \tag{2.75}
\]
The limit of $y^\mu(\cdot)$ can be derived similarly. The details are omitted.

2.9.2 Proof of Theorem 2.3

Recall from Section 2.5.1 that
\[
(R^{L,k}[n],R^{G,k}[n]) \quad\text{and}\quad (R^{L,k}[n],\hat{R}^{G,k}[n])
\]
form discrete-time stochastic approximation iterates of the same differential inclusion (2.42). Assuming that agent $k$ adopts a strategy $p^k$ of the form (2.22), we look at the coupled systems
\[
(R^{L,k}(t),R^{G,k}(t)) \quad\text{and}\quad (R^{L,k}(t),\hat{R}^{G,k}(t)).
\]
It has to be emphasized that both systems apply the same strategy $p^k$, which uses $R^{G,k}[n]$, and not $\hat{R}^{G,k}[n]$. That is, the following two scenarios are run in parallel and compared: (i) The decision maker is oblivious to the existence of agents outside its neighborhood, as in Section 2.3.1, and only realizes its global payoff after making its decision. The agent employs a decision strategy $p^k$, which is updated via Algorithm 2.1. (ii) The decision maker knows the global payoff function and observes the decisions of non-neighbors, and hence updates its global regret matrix via (2.41). However, the strategy $p^k$ in the former case prescribes how to make the next decision.

In what follows, we use a similar approach as in [39, Section 7] to show that both scenarios induce the same asymptotic behavior. 
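Let
\[
D(t)=\big\|\hat{R}^{G,k}(t)-R^{G,k}(t)\big\|.
\]
Then, for all $0\le\lambda\le 1$,
\[
\begin{aligned}
D(t+\lambda)&=\big\|\hat{R}^{G,k}(t+\lambda)-R^{G,k}(t+\lambda)\big\|\\
&=\Big\|\hat{R}^{G,k}(t)+\lambda\cdot\frac{d}{dt}\hat{R}^{G,k}(t)-R^{G,k}(t)-\lambda\cdot\frac{d}{dt}R^{G,k}(t)+o(\lambda)\Big\|\\
&=\Big\|(1-\lambda)\big[\hat{R}^{G,k}(t)-R^{G,k}(t)\big]+\lambda\Big[\frac{d}{dt}\big(\hat{R}^{G,k}(t)-R^{G,k}(t)\big)+\hat{R}^{G,k}(t)-R^{G,k}(t)\Big]+o(\lambda)\Big\|\\
&\le(1-\lambda)D(t)+\lambda\Big\|\frac{d}{dt}\big(\hat{R}^{G,k}(t)-R^{G,k}(t)\big)+\hat{R}^{G,k}(t)-R^{G,k}(t)\Big\|+o(\lambda).
\end{aligned} \tag{2.76}
\]
Substituting from (2.42) for $d\hat{R}^{G,k}/dt$ and $dR^{G,k}/dt$, the second term on the r.h.s. of (2.76) vanishes. Therefore,
\[
D(t+\lambda)\le(1-\lambda)D(t)+o(\lambda).
\]
Consequently,
\[
\frac{d}{dt}D(t)\le-D(t)
\]
for almost every $t$, which yields
\[
D(-t)e^{-t}\ge D(0) \tag{2.77}
\]
for all $t\ge 0$. Note that $D(t)$ is bounded since both $\hat{R}^{G,k}(t)$ and $R^{G,k}(t)$ are bounded due to the boundedness of the global payoff function. Finally, since
\[
\lim_{t\to\infty}D(-t)e^{-t}=0, \quad\text{and}\quad D(t)\ge 0,
\]
one can conclude from (2.77) that
\[
D(t)=D(0)=0.
\]
Therefore, the limit sets coincide.

2.9.3 Proof of Theorem 2.4

Define the Lyapunov function
\[
V\big(R^{L,k},R^{G,k}\big)=\Big[\mathrm{dist}\big[R^{L,k}+R^{G,k},\mathcal{R}_-\big]\Big]^2=\sum_{i,j\in\mathcal{A}^k}\Big[\big|r^{L,k}_{ij}+r^{G,k}_{ij}\big|^+\Big]^2. \tag{2.78}
\]
Taking the time-derivative of (2.78) yields
\[
\frac{d}{dt}V\big(R^{L,k},R^{G,k}\big)=2\sum_{i,j\in\mathcal{A}^k}\big|r^{L,k}_{ij}+r^{G,k}_{ij}\big|^+\cdot\Big[\frac{d}{dt}r^{L,k}_{ij}+\frac{d}{dt}r^{G,k}_{ij}\Big].
\]
Next, replacing for $dr^{L,k}_{ij}/dt$ and $dr^{G,k}_{ij}/dt$ from (2.40), we obtain
\[
\begin{aligned}
\frac{d}{dt}V\big(R^{L,k},R^{G,k}\big)=2\sum_{i,j\in\mathcal{A}^k}\big|r^{L,k}_{ij}+r^{G,k}_{ij}\big|^+\times\Big[&\big[U^k_L(j,n^k,\theta)-U^k_L(i,n^k,\theta)\big]\sigma^k_i\\
&+\big[U^k_{G,t}(j,\theta)-U^k_{G,t}(i,\theta)\big]\sigma^k_i-\big[r^{L,k}_{ij}+r^{G,k}_{ij}\big]\Big].
\end{aligned} \tag{2.79}
\]
Substituting for $\sigma^k$ from (2.37) in (2.79) yields
\[
\begin{aligned}
\frac{d}{dt}V\big(R^{L,k},R^{G,k}\big)&=2(1-\delta)\sum_{i,j\in\mathcal{A}^k}\big|r^{L,k}_{ij}+r^{G,k}_{ij}\big|^+\Big[U^k_L(j,n^k,\theta)+U^k_{G,t}(j,\theta)-\big[U^k_L(i,n^k,\theta)+U^k_{G,t}(i,\theta)\big]\Big]\chi^k_i\\
&\quad+\frac{2\delta}{A^k}\sum_{i,j\in\mathcal{A}^k}\big|r^{L,k}_{ij}+r^{G,k}_{ij}\big|^+\Big[U^k_L(j,n^k,\theta)+U^k_{G,t}(j,\theta)-\big[U^k_L(i,n^k,\theta)+U^k_{G,t}(i,\theta)\big]\Big]\\
&\quad-2\sum_{i,j\in\mathcal{A}^k}\big|r^{L,k}_{ij}+r^{G,k}_{ij}\big|^+\big[r^{L,k}_{ij}+r^{G,k}_{ij}\big].
\end{aligned} \tag{2.80}
\]
First, we focus on the first term on the r.h.s. 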
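of (2.80):
$$
\begin{aligned}
&\sum_{i,j\in\mathcal{A}^k} \big|r^{L,k}_{ij} + r^{G,k}_{ij}\big|^+ \Big[U^k_L(j,\mathbf{n}^k,\theta) + U^k_{G,t}(j,\theta) - \big[U^k_L(i,\mathbf{n}^k,\theta) + U^k_{G,t}(i,\theta)\big]\Big]\chi^k_i \\
&\quad= \sum_{i,j\in\mathcal{A}^k} \big|r^{L,k}_{ij} + r^{G,k}_{ij}\big|^+ \big[U^k_L(j,\mathbf{n}^k,\theta) + U^k_{G,t}(j,\theta)\big]\chi^k_i
 - \sum_{i,j\in\mathcal{A}^k} \big|r^{L,k}_{ij} + r^{G,k}_{ij}\big|^+ \big[U^k_L(i,\mathbf{n}^k,\theta) + U^k_{G,t}(i,\theta)\big]\chi^k_i \\
&\quad= \sum_{j,i\in\mathcal{A}^k} \big|r^{L,k}_{ji} + r^{G,k}_{ji}\big|^+ \big[U^k_L(i,\mathbf{n}^k,\theta) + U^k_{G,t}(i,\theta)\big]\chi^k_j
 - \sum_{i,j\in\mathcal{A}^k} \big|r^{L,k}_{ij} + r^{G,k}_{ij}\big|^+ \big[U^k_L(i,\mathbf{n}^k,\theta) + U^k_{G,t}(i,\theta)\big]\chi^k_i \\
&\quad= \sum_{i\in\mathcal{A}^k} \big[U^k_L(i,\mathbf{n}^k,\theta) + U^k_{G,t}(i,\theta)\big]
 \times \Big[\sum_{j\in\mathcal{A}^k} \chi^k_j \big|r^{L,k}_{ji} + r^{G,k}_{ji}\big|^+ - \chi^k_i \sum_{j\in\mathcal{A}^k} \big|r^{L,k}_{ij} + r^{G,k}_{ij}\big|^+\Big].
\end{aligned}
\tag{2.81}
$$
In the second equality, we applied $\sum_{i,j} a_{ij} = \sum_{j,i} a_{ji}$ only to the first term. In the last equality, we changed the order of summation in the first term. Note further in (2.81) that
$$
\big|r^{L,k}_{ii} + r^{G,k}_{ii}\big|^+ = 0, \quad \text{for all } i \in \mathcal{A}^k. \tag{2.82}
$$
Combining (2.81), (2.82), and (2.38) shows that the first term on the r.h.s. of (2.80) is zero. Subsequently, applying
$$
\sum_{i,j\in\mathcal{A}^k} \big|r^{L,k}_{ij} + r^{G,k}_{ij}\big|^+ \big[r^{L,k}_{ij} + r^{G,k}_{ij}\big] = \sum_{i,j\in\mathcal{A}^k} \Big[\big|r^{L,k}_{ij} + r^{G,k}_{ij}\big|^+\Big]^2
$$
to the rest of the terms in (2.80), we obtain
$$
\frac{d}{dt}V\big(R^{L,k}, R^{G,k}\big) = -2V\big(R^{L,k}, R^{G,k}\big) + \frac{2\delta}{A^k} \sum_{i,j\in\mathcal{A}^k} \big|r^{L,k}_{ij} + r^{G,k}_{ij}\big|^+ \Big[U^k_L(j,\mathbf{n}^k,\theta) + U^k_{G,t}(j,\theta) - \big[U^k_L(i,\mathbf{n}^k,\theta) + U^k_{G,t}(i,\theta)\big]\Big].
$$
Finally, noting that both the local $U^k_L(\cdot,\cdot,\theta)$ and global $U^k_{G,t}(\cdot,\theta)$ payoff functions are bounded for all $\theta \in \mathcal{Q}$, one can find a constant $c(\theta)$ such that
$$
\frac{d}{dt}V\big(R^{L,k}, R^{G,k}\big) \le \frac{2c(\theta)\delta}{A^k} \sum_{i,j\in\mathcal{A}^k} \big|r^{L,k}_{ij} + r^{G,k}_{ij}\big|^+ - 2V\big(R^{L,k}, R^{G,k}\big). \tag{2.83}
$$
In (2.83), $c(\theta) > 0$ depends on the differences in payoffs for different actions $i$ and $j$ and can be specified based on the particular choice of the payoff function and the state of the underlying time variations. In this proof, however, we only require existence of such a bound, which is guaranteed since the payoff functions are bounded.

Now, suppose $\big|r^{L,k}_{ij} + r^{G,k}_{ij}\big|^+ > \epsilon > 0$ for all $i, j \in \mathcal{A}^k$. Then, one can choose $\delta$ small enough ($\delta < \epsilon A^k / 2c(\theta)$) such that
$$
2c(\theta)\delta / A^k - \big|r^{L,k}_{ij} + r^{G,k}_{ij}\big|^+ < 0
$$
and, hence,
$$
\frac{d}{dt}V\big(R^{L,k}, R^{G,k}\big) \le -V\big(R^{L,k}, R^{G,k}\big). \tag{2.84}
$$
This proves global asymptotic stability of each limit subsystem in (2.40) when $\theta(t) = \theta$ is held fixed. Therefore,
$$
\lim_{t\to\infty} \mathrm{dist}\big[R^{L,k}(t) + R^{G,k}(t), \mathbb{R}^-\big] = 0.
$$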
(2.85)Next, we proceed to study the stability of the Markovian switched limit dynamical system.We use the above Lyapunov function to extend [67, Corollary 12] and prove global asymptoticstability w.p.1 for the switching system (2.40). Before proceeding with the theorem, we shall752.9. Proofs of Resultsdefine the K∞-class functions.Definition 2.10. A continuous function a : R+ → R+ is class K∞ if: (i) it is strictly increas-ing, (ii) a(0) = 0, and (iii) a(x)→∞ as x→∞.Theorem 2.7 ([67, Corollary 12]). Consider the system (2.43) in Definition 2.8, where θ(t) isthe state of a continuous time Markov chain with generator Q and state space Q = {1, . . . ,Θ}.Defineq := maxθ∈Mθ|qθθ | and q˜ := maxθ,θ′∈Mθqθθ′.Suppose there exist continuously differentiable functions Vθ : Rr → R+, for each θ ∈ Q, twoclass K∞ function a1, a2, and a real number v > 1 such that the following hold:1. a1(dist[x,S]) ≤ Vθ(x) ≤ a2(dist[x,S]), ∀x ∈ Rr, θ ∈ Q,2. ∂Vθ∂x fθ(x) ≤ −λVθ(x), ∀x ∈ Rr, ∀θ ∈ Q,3. Vθ(x) ≤ vVθ′(x), ∀x ∈ Rr, θ, θ′ ∈ Q,4. (λ+ q˜)/q > v.Then, the switched system (2.43) is globally asymptotically stable almost surely.Since we use quadratic Lyapunov function in (2.78), hypothesis 2 in Theorem 2.7 is triviallysatisfied. Further, since the Lyapunov functions are the same for each subsystem θ ∈ Q,existence of v > 1 in hypothesis 3 is automatically guaranteed. Given that λ = 1 in hypothesis 2(see (2.84)), it remains to ensure that the generator of Markov chain Q satisfies hypothesis 4.This is trivially satisfied since, by (2.30), |qθθ′ | ≤ 1, for all θ, θ′ ∈ Q. This completes the proof.2.9.4 Proof of Corollary 2.5We only give an outline of the proof, which essentially follows from Theorems 2.2 and 2.4.The proof for the first result follows similarly.DefineR̂L,k,µ(·) = Rk,µ(·+ tµ) and R̂G,k,µ(·) = RG,k,µ(·+ tµ).Then, it can be shown that (R̂L,k,µ(·), R̂G,k,µ(·)) is tight. 
For any $T_1 < \infty$, take a weakly convergent subsequence of
$$
\big\{\hat{R}^{L,k,\mu}(\cdot),\ \hat{R}^{L,k,\mu}(\cdot - T_1),\ \hat{R}^{G,k,\mu}(\cdot),\ \hat{R}^{G,k,\mu}(\cdot - T_1)\big\}.
$$
Denote the limit by $\big(\hat{R}^{L,k}(\cdot), \hat{R}^{L,k}_{T_1}(\cdot), \hat{R}^{G,k}(\cdot), \hat{R}^{G,k}_{T_1}(\cdot)\big)$. Note that
$$
\hat{R}^{L,k}(0) = \hat{R}^{L,k}_{T_1}(T_1) \quad \text{and} \quad \hat{R}^{G,k}(0) = \hat{R}^{G,k}_{T_1}(T_1).
$$
The values of $\big\{\hat{R}^{L,k}_{T_1}(0), \hat{R}^{G,k}_{T_1}(0)\big\}$ may be unknown, but the set of all possible values (over all $T_1$ and convergent subsequences) belongs to a tight set. Using this, the stability condition, and Theorem 2.2, for any $\eta > 0$ there is a $T_\eta < \infty$ such that for all $T_1 > T_\eta$,
$$
\mathrm{dist}\big[\big(\hat{R}^{L,k}_{T_1}(T_1), \hat{R}^{G,k}_{T_1}(T_1)\big), \mathbb{R}^-\big] \ge 1 - \eta.
$$
This implies that
$$
\mathrm{dist}\big[\big(\hat{R}^{L,k}(0), \hat{R}^{G,k}(0)\big), \mathbb{R}^-\big] \ge 1 - \eta
$$
which establishes the desired result.

3 Energy-Aware Sensing Over Adaptive Networks

"Exactly what the computer provides is the ability not to be rigid and unthinking but, rather, to behave conditionally. That is what it means to apply knowledge to action: It means to let the action taken reflect knowledge of the situation, to be sometimes this way, sometimes that, as appropriate."
Allen Newell [220]

3.1 Introduction

Building upon the non-cooperative game-theoretic model and adaptive learning algorithm of Chapter 2, this chapter proposes an autonomous energy-aware mechanism for the activation of sensors that collaborate to solve a global parameter estimation problem. Consider a network of sensors observing temporal data of a common vector parameter of interest from different spatial sources with possibly different statistical profiles. In a centralized approach to the parameter estimation problem, the data or local estimates from all nodes would be communicated to a central processor where they would be fused. Such an approach calls for sufficient communication and energy resources to transmit the data back and forth between the sensors and the central processor. An alternative approach is a decentralized implementation where each node runs an adaptive filter for the estimation function and locally exchanges and fuses the neighboring estimates.
Such distributed estimation algorithms have appealing features such as scalability, robustness, and low power consumption. They are useful in a wide range of applications including environmental monitoring, event detection, resource monitoring, and target tracking [251].

3.1.1 Distributed Parameter Estimation Over Adaptive Networks

Adaptive networks are well-suited to perform decentralized estimation tasks. They consist of a collection of agents with processing and learning abilities, which are linked together through a connection topology and cooperate with each other through local interactions. This chapter focuses on an adaptive network of sensors that aims to solve a parameter estimation task. Each sensor takes measurements repeatedly and runs an LMS-type adaptive filter to update its estimate. Since the nodes have a common objective, it is natural to expect their collaboration to be beneficial in general. The cooperation among individual nodes is implemented via a diffusion strategy. Diffusion strategies have been shown in [269] to outperform consensus strategies, which have been widely used for implementing cooperation among agents that are assigned the same task. The resulting adaptive filter, namely, diffusion LMS [64, 204], can be described as follows: At successive times, each node:

(i) exchanges estimates with neighbors and fuses the collected data via a pre-specified combiner function;

(ii) uses the fused data and its temporal measurements to refine its estimate via an LMS-type adaptive filter.

The estimates are thus diffused among neighboring nodes so that the estimate at each node is a function of both its temporal data and the spatial data across the neighbors.
In doing so, an adaptive network structure is obtained where the network as a whole is able to respond in real-time to the temporal and spatial variations in the statistical profile of the data [63]. Due to measurement noise suppression through ensemble averaging, such diffusion LMS schemes improve the estimation performance, yet yield savings in computation, communication, and energy expenditure. The interested reader is referred to [64, 204] for a detailed mathematical model and performance study of the diffusion LMS adaptive filter.

3.1.2 Chapter Goal

In the literature, energy conservation arguments and error variance analysis are widely used to prove stability of the diffusion LMS adaptive filter and its generalizations; see [248] for an exposition. The first goal of this chapter is to derive a classical stochastic approximation formulation for the diffusion LMS so as to analyze its stability and consistency via an ODE approach. This enables a simpler derivation of the known stability results for the diffusion LMS adaptive filter, and allows using weak convergence methods in the case that the parameter of interest evolves with time.

We then tackle the question:

Can one make such adaptive networks even more power-efficient by allowing each node to activate, sense the environment, and process local information only when its contribution outweighs the associated energy costs?

Consistent with the decentralized essence of the diffusion LMS, the goal is to develop a distributed activation mechanism whereby nodes autonomously make activation decisions based on optimizing a utility function. This utility function has to capture the trade-off between the "value" of the new measurement and the costs associated with sensor activation. The benefits of developing such smart data sensing protocols are three-fold: staying asleep yields savings in energy, communication, and computational resources.
Naturally, there exists a trade-off between the convergence rate and how aggressively nodes activate and take measurements. This allows the network controller to adjust nodes' utilities such that the desired estimation/convergence expectations, suited to the particular application of interest, are met.

3.1.3 Main Results

We favor a game-theoretic approach due to the interdependence of individual nodes' activity. The channel quality, the required packet transmission energy, and the usefulness of a sensor's information all depend on the activity of other sensors in the network. As a result, each node must observe and react to other sensors' behavior in an optimal fashion. The intertwined role of adaptation and learning, crucial to the self-configuration of nodes, further makes game-theoretic learning an appealing theoretical framework. The adaptive learning algorithm in Chapter 2 then helps to devise an energy-aware diffusion LMS adaptive filter that allows nodes to autonomously make activation decisions.

The main results in this chapter are summarized below:

1. Simpler derivation of the stability results for the diffusion LMS adaptive filter. By properly rescaling the periods at which measurements and data fusion take place, the diffusion LMS is reformulated as a classical stochastic approximation algorithm, which in turn enables using the ODE method [50, 192] for its analysis. This leads to a simpler derivation of the known stability and consistency results.

2. A non-cooperative game-theoretic model for the energy-aware activation control problem. Each activation decision carries a reward or penalty that captures the spatial-temporal correlation among sensors' measurements, and trades off each sensor's contribution to the estimation task against the energy costs associated with it.

3. An energy-aware diffusion LMS adaptive filter.
A novel two-timescale adaptive filtering algorithm is proposed that combines the diffusion LMS (slow timescale) with a game-theoretic learning functionality (fast timescale), which prescribes when the sensors should activate and run the diffusion LMS adaptive filter. To the best of our knowledge, the proposed algorithm is the first that considers direct interaction of the adaptive filter, which performs the estimation task, and the activation mechanism, which concerns local efficiency.

4. Performance analysis of the proposed energy-aware diffusion LMS algorithm. We prove that if each sensor individually follows the proposed energy-aware diffusion LMS algorithm, the local estimates converge to the true parameter across the network, yet the global activation behavior along the way tracks the evolving set of correlated equilibria of the underlying energy-aware activation control game.

5. Numerical evaluation. Simulation results verify the theoretical findings and illustrate the performance of the proposed scheme, in particular, the trade-off between the performance metrics and energy saving.

The game-theoretic formulation and activation mechanism discussed in this chapter convey a design guideline: by carefully specifying the utility functions, a multi-agent system can be designed to exhibit many desired sophisticated behaviors by individual agents performing simple local behavior. Nature also provides an abundance of such sophisticated global behavior as the consequence of localized interactions among agents with limited sensing capabilities. For example, in bee swarms, about 5% of bees are informed, and this small fraction is able to guide the entire swarm to their new hive [33].

3.1.4 Chapter Organization

The rest of this chapter is organized as follows: Section 3.2 formulates the parameter estimation problem, presents the diffusion LMS, and provides a convergence analysis via the ODE method.
Section 3.3 then provides an overview of the design principles for an energy-aware diffusion LMS, formulates the activation control game, and presents and elaborates on the proposed energy-aware diffusion LMS algorithm. The global network performance metrics are subsequently defined in Section 3.4, and used to characterize the globally sophisticated behavior that emerges when nodes individually follow the proposed algorithm. Section 3.5 provides a detailed performance analysis for the energy-aware diffusion LMS algorithm. Finally, numerical examples are provided in Section 3.6 before the closing remarks in Section 3.7.

3.2 Diffusion Least Mean Squares

This section describes parameter estimation via diffusion LMS. We start by formulating the centralized parameter estimation problem in Section 3.2.1. The standard diffusion LMS algorithm, proposed in [204], is then described in Section 3.2.2 as a decentralized solution to this centralized problem. Section 3.2.3 elaborates on how to revise the diffusion protocol in the standard diffusion LMS so as to be able to rewrite it as a classical stochastic approximation algorithm. This new formulation sets the stage for Section 3.2.4, which uses the powerful ODE method [40, 192] to obtain a simpler (and much shorter) derivation of the known consistency results.

3.2.1 Centralized Parameter Estimation Problem

Consider a set of $K$ nodes $\mathcal{K} = \{1, \ldots, K\}$ spread over some geographic region with the common objective to estimate an $M \times 1$ unknown parameter vector $\boldsymbol{\psi}^o$ via noisy measurements
$$
y^k[n] = \mathbf{u}^k[n]\boldsymbol{\psi}^o + v^k[n], \quad k = 1, \ldots, K. \tag{3.1}
$$
Here, $\{\mathbf{u}^k[n]\}$ and $\{v^k[n]\}$ denote the sequences of $1 \times M$ random regressor vectors and zero-mean local measurement noises, respectively. Such linear models are well-suited to approximate input-output relations for many practical applications [247]. The centralized parameter estimation problem in the linear least mean square (LMS) sense can then be formulated as follows: Let
$$
\begin{aligned}
U^c[n] &:= \mathrm{col}\big(\mathbf{u}^1[n], \ldots, \mathbf{u}^K[n]\big) && (K \times M) \\
\mathbf{y}^c[n] &:= \mathrm{col}\big(y^1[n], \ldots, y^K[n]\big) && (K \times 1).
\end{aligned}
$$
The network of sensors then seeks to solve
$$
\min_{\boldsymbol{\psi}} E\big\|\mathbf{y}^c[n] - U^c[n]\boldsymbol{\psi}\big\|^2 \tag{3.2}
$$
where $E$ and $\|\cdot\|$ denote the expectation and Euclidean norm operators, respectively. Note here that the second-order moment $R^k_u := E\{\mathbf{u}^k[n]^\top \mathbf{u}^k[n]\}$ is allowed to vary across sensors.

Define the correlation and cross-correlation by
$$
\begin{aligned}
R^c_u &:= E\big\{U^c[n]^\top U^c[n]\big\} && (M \times M) \\
R^c_{yu} &:= E\big\{U^c[n]^\top \mathbf{y}^c[n]\big\} && (M \times 1)
\end{aligned}
$$
respectively.¹ It is well-known that the optimal solution $\boldsymbol{\psi}^o$ satisfies the orthogonality condition
$$
E\big\{U^c[n]^\top \big(U^c[n]\boldsymbol{\psi}^o - \mathbf{y}^c[n]\big)\big\} = 0
$$
which can also be expressed as the solution to the normal equations [247]
$$
R^c_u \boldsymbol{\psi}^o = R^c_{yu}. \tag{3.3}
$$
The goal is to develop a distributed stochastic approximation algorithm for parameter estimation that ensures the sequence of parameter estimates $\boldsymbol{\psi}^k[n]$ converges to (or tracks) the true parameter $\boldsymbol{\psi}^o$ ($\boldsymbol{\psi}^o[n]$ if it is evolving with time).

The diffusion least mean square (LMS) [204] is one such algorithm that adopts a peer-to-peer diffusion protocol to implement cooperation among individual sensors. This cooperation leads to savings in communication and energy resources and enables the network to respond in real-time to the temporal and spatial variations in the statistical profile of the data [248–250]. We thus focus on diffusion LMS as a decentralized solution to approximate $\boldsymbol{\psi}^o$ in (3.3) in the rest of this chapter.

¹ For complex-valued signals, $(\cdot)^\top$ is replaced with $(\cdot)^*$, which denotes the complex conjugate-transpose.

3.2.2 Standard Diffusion LMS

It is instructive to start with the standard diffusion LMS algorithm proposed in [204]. Before proceeding with the details of the algorithm, we spell out the conditions imposed on the measurement model (3.1) in [204]:

Assumption 3.1.
The sequence $\{\mathbf{u}^k[n], v^k[n]\}$ is temporally white and spatially independent. More precisely,
$$
\begin{aligned}
E\big\{\mathbf{u}^k[n]^\top \mathbf{u}^l[n']\big\} &= R^k_u \cdot \delta_{nn'} \cdot \delta_{kl} \\
E\big\{v^k[n] v^l[n']\big\} &= \sigma^2_{v,k} \cdot \delta_{nn'} \cdot \delta_{kl}
\end{aligned}
\tag{3.4}
$$
where $R^k_u$ is positive-definite and $\delta_{ij}$ is the Kronecker delta function: $\delta_{ij} = 1$ if $i = j$, and $0$ otherwise. The noise sequence $\{v^k[n]\}$ is further uncorrelated with the regressor sequence $\{\mathbf{u}^l[n']\}$. That is, for all $n$, $n'$, $k$, and $l$,
$$
E\big\{\mathbf{u}^l[n']^\top v^k[n]\big\} = 0. \tag{3.5}
$$

The diffusion LMS algorithm can be simply described as an LMS-type adaptive filter with a cooperation strategy based on a peer-to-peer diffusion protocol. Sensors form social groups within which estimates are exchanged. Sensors then fuse the collected data and combine it with the local measurements to refine their estimates. Due to limited communication capabilities, each node $k$ can only communicate with neighbors determined by the connectivity graph of the network.

Definition 3.1. The connectivity graph $G = (\mathcal{E}, \mathcal{K})$ is a simple graph, where nodes constitute the vertices of the graph, and
$$
(k, l) \in \mathcal{E} \ \text{ if there exists a link between nodes } k \text{ and } l. \tag{3.6}
$$
The open neighborhood $\mathcal{N}^k$ and closed neighborhood $\mathcal{N}^k_c$ are then defined as in (2.12).

Remark 3.1. For simplicity of presentation, we assume the connectivity graph $G$ is fixed and strongly connected, i.e., there exists a path between each pair of nodes. Intermittent link/node failures can be captured by a random graph model where the probability that two nodes are connected is simply the probability of successful transmission times the indicator function that shows the two nodes are neighbors in the underlying fixed graph. This makes the problem more realistic in large-scale networks since inexpensive nodes are bound to fail at random times. Motivated by wireless sensor network applications, the link failures can further be spatially correlated (due to the interference among wireless communication channels), but are independent over time. In such cases, mean connectedness of the random graph is sufficient.
That is, it is sufficient to ensure that $\lambda_2(\bar{L}) > 0$, where $\bar{L} := E\{L[n]\}$. Here, $\{L[n]\}$ denotes the sequence of Laplacians of the random graph process, and $\lambda_2(\bar{L})$ represents the second smallest eigenvalue of $\bar{L}$. For a comprehensive treatment of this matter, we refer the reader to [172].

Define a stochastic weight matrix $W = [w_{ij}]$ with the following properties:
$$
0 \le w_{kl} \le 1 \ \text{ for all } k, l \in \mathcal{K}, \quad W\mathbf{1}_K = \mathbf{1}_K, \quad \text{and} \quad W^\top = W \tag{3.7}
$$
where $\mathbf{1}_K$ denotes a $K \times 1$ vector of ones. The peer-to-peer diffusion protocol can then be described as follows: At each period $n$, each node $k$ exchanges its estimate $\boldsymbol{\psi}^k[n]$ with neighbors over the graph $G$. The local estimates $\boldsymbol{\psi}^l[n]$, $l \in \mathcal{N}^k_c$, are then fused by means of a linear combiner with weights that are provided by the $k$th row of the weight matrix $W$. More precisely, the fused local estimate for each node $k$ can be expressed as
$$
\bar{\boldsymbol{\psi}}^k[n] = \sum_{l \in \mathcal{N}^k_c} w_{kl}\, \boldsymbol{\psi}^l[n]. \tag{3.8}
$$
Since each node $k$ has a different neighborhood, the above fusion rule helps fuse estimates across the network into the estimate at node $k$.

The standard diffusion LMS algorithm [204] integrates the above diffusion protocol with an LMS filter. It requires each node to iteratively update its estimate via a two-step recursion as follows:
$$
\begin{aligned}
&1.\quad \boldsymbol{\psi}^k[n+1] = \bar{\boldsymbol{\psi}}^k[n] + \varepsilon^k\, \mathbf{u}^k[n]^\top \big[y^k[n] - \mathbf{u}^k[n]\bar{\boldsymbol{\psi}}^k[n]\big] \\
&2.\quad \bar{\boldsymbol{\psi}}^k[n+1] = \sum_{l \in \mathcal{N}^k_c} w_{kl}\, \boldsymbol{\psi}^l[n+1]
\end{aligned}
\tag{3.9}
$$
with local step-size $0 < \varepsilon^k < 1$. In general, the step-size $\varepsilon^k$ could vary across nodes. However, for simplicity of presentation and with no loss of generality, we assume that the local step-sizes are all the same and equal to $\varepsilon$ across the network in the rest of this chapter. In light of (3.9), the estimate at each node is a function of both its temporal measurements and the spatial data across the neighbors. This enables the adaptive network to respond in real-time to the temporal and spatial variations in the statistical profile of the data [204].
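The two-step recursion (3.9) can be illustrated with a short simulation. The following is only a minimal sketch under stated assumptions, not the implementation used in [204] or in this thesis: the ring topology, the helper names `ring_weights` and `diffusion_lms`, the Gaussian regressors with identity covariance, and all numerical values are hypothetical choices made for illustration. The ordering (fuse, then adapt) is chosen to agree with the global recursion (3.11).

```python
import numpy as np


def ring_weights(K, self_weight=0.5):
    """Symmetric, row-stochastic combiner W for a ring of K >= 3 nodes.

    Hypothetical choice satisfying (3.7): entries in [0, 1], W 1 = 1, W = W^T.
    """
    W = np.zeros((K, K))
    side = (1.0 - self_weight) / 2.0
    for k in range(K):
        W[k, k] = self_weight
        W[k, (k - 1) % K] += side
        W[k, (k + 1) % K] += side
    return W


def diffusion_lms(W, psi_true, eps=0.05, n_iter=3000, noise_std=0.1, seed=0):
    """Run a fuse-then-adapt diffusion LMS and return the K x M local estimates.

    Assumes white Gaussian regressors (R_u = I) and white Gaussian noise,
    i.e. the setting of Assumption 3.1; these are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    K, M = W.shape[0], psi_true.size
    psi = np.zeros((K, M))  # row k holds the estimate at node k
    for _ in range(n_iter):
        fused = W @ psi  # fusion step, cf. (3.8)
        u = rng.standard_normal((K, M))  # one 1 x M regressor per node
        y = u @ psi_true + noise_std * rng.standard_normal(K)  # model (3.1)
        err = y - np.einsum("km,km->k", u, fused)  # local prediction errors
        psi = fused + eps * u * err[:, None]  # LMS adaptation step
    return psi


if __name__ == "__main__":
    W = ring_weights(6)
    psi_true = np.array([1.0, -2.0, 0.5])
    estimates = diffusion_lms(W, psi_true)
    print("largest entrywise error:", np.max(np.abs(estimates - psi_true)))
```

With these settings, every node's estimate should settle close to the true parameter; shrinking `eps` reduces the steady-state error at the price of slower convergence, which is the trade-off discussed around the step-size bound.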
It is well-known that, to ensure mean-square stability, the step-size in (3.9) must be sufficiently small and satisfy $0 < \varepsilon < 2/\lambda_{\max}(R^k_u)$ [64, 204, 248]. Making the step-size too small, however, results in data assimilation and fusion of neighboring estimates being performed on different timescales. If $0 < \varepsilon \ll 1$, assimilation of local measurements constitutes the slow timescale, and the data fusion is performed on the fast timescale.

The local updates (3.9) together give rise to a global state-space representation as follows: Define
$$
\begin{aligned}
\mathbf{v}[n] &:= \mathrm{col}\big(v^1[n], \ldots, v^K[n]\big) && (K \times 1) \\
\boldsymbol{\Psi}^o &:= \mathrm{col}\big(I_M\boldsymbol{\psi}^o, \ldots, I_M\boldsymbol{\psi}^o\big) && (KM \times 1) \\
\mathbf{U}[n] &:= \mathrm{diag}\big(\mathbf{u}^1[n], \ldots, \mathbf{u}^K[n]\big) && (K \times KM)
\end{aligned}
$$
where $I_M$ denotes the $M \times M$ identity matrix. Then, the linear measurement model (3.1) can be written as
$$
\mathbf{y}[n] = \mathbf{U}[n]\boldsymbol{\Psi}^o + \mathbf{v}[n]. \tag{3.10}
$$
Further, let
$$
\begin{aligned}
\boldsymbol{\Psi}[n] &:= \mathrm{col}\big(\boldsymbol{\psi}^1[n], \ldots, \boldsymbol{\psi}^K[n]\big) && (KM \times 1) \\
\mathbf{W} &:= W \otimes I_M && (KM \times KM)
\end{aligned}
$$
where $\otimes$ denotes the Kronecker product, and $W$ is the weight matrix defined in (3.7). The global diffusion LMS update across the network can then be expressed in a more compact state-space form as
$$
\boldsymbol{\Psi}[n] = \mathbf{W}\boldsymbol{\Psi}[n-1] + \varepsilon I_{KM}\mathbf{U}[n]^\top\big(\mathbf{y}[n] - \mathbf{U}[n]\mathbf{W}\boldsymbol{\Psi}[n-1]\big). \tag{3.11}
$$
A classical stochastic approximation algorithm is of the form
$$
\boldsymbol{\Psi}[n] = \boldsymbol{\Psi}[n-1] + \varepsilon\, g\big(\boldsymbol{\Psi}[n-1], \mathbf{v}[n]\big), \quad 0 < \varepsilon \ll 1 \tag{3.12}
$$
where $\mathbf{v}[n]$ denotes a noise sequence and $\varepsilon$ is the step-size [40, 192]. Clearly, the global recursion (3.11) for diffusion LMS is not in such form unless $W = I$, which corresponds to the case where no data fusion is being performed on local estimates.

3.2.3 Revised Diffusion LMS

Here, the objective is to derive an equivalent diffusion strategy that allows one to express the global diffusion LMS update rule (3.11) as a classical stochastic approximation recursion of the form (3.12). The advantages of this new formulation are threefold:

1. Both data assimilation and fusion take place on the same timescale and, hence, one can deal with correlated sequences $\{\mathbf{u}^k[n]\}$ and $\{v^k[n]\}$ in the measurement model.

2.
Simpler derivation of the known results can be obtained by employing the powerful ordi-nary differential equation (ODE) method [40, 192].3. One can use weak convergence methods [192, Chapter 8] to show responsiveness of thediffusion LMS to the time-variations underlying the true parameter.We start by describing the desired diffusion protocol that allows to reformulate (3.11) asa classical stochastic approximation algorithm. The equivalence between this new formulationand the standard diffusion LMS will then be discussed.Define a new weight matrixW = IK + εC (3.13)whereC1K = 0K , C> = C, and cij ≥ 0 for i 6= j.Note that W is dominated by the diagonal elements representing the weights that nodes puton their own estimate. Smaller step-sizes require smaller fusion weights to ensure data fusionand assimilation of measurements are performed at the same timescale. This is in contrast tothe weight matrix W in (3.7).LetW := W ⊗ IM (KM ×KM)C := C ⊗ IM (KM ×KM).The global recursion for the diffusion LMS algorithm across the network using the revised datafusion rule can be then written asΨ[n] = Ψ[n− 1] + ε IKM[CΨ[n− 1] +U [n]>(y[n]−U [n]WΨ[n− 1])]. (3.14)Note that the local updates of the revised diffusion LMS adaptive filter is essentially the sameas (3.9) with the exception that the weight matrix W is now replaced with W . This reformu-lation only enables us to use the powerful ODE method in analyzing convergence and trackingproperties.The following lemma provides the condition under which the diffusion LMS algorithms (3.9)and (3.14) are equivalent. Below, we use log(·) to denote the matrix logarithm. For two matricesA and B, B = log(A) if and only if A = exp(B) where exp(B) denotes the matrix exponential.Theorem 3.1. As the step-size ε→ 0, the global recursions (3.11) and (3.14) are equivalent ifand only ifC = ln(W )/ε in (3.13). (3.15)Proof. Suppose ε time units elapse between two successive iterations of (3.11). Then, the datafusion term (the first term on the r.h.s. 
of (3.11)) can be conceived as a discretization of the863.2. Diffusion Least Mean SquaresODEdΨdt = W˜Ψ, where exp(εW˜)= W .That is, W˜ is the matrix logarithm of W normalized by ε. Taking Taylor expansion yieldsW = I + εW˜ + o(ε), where o(ε)→ 0 as ε→ 0. (3.16)Therefore, as ε → 0, the global recursion for the standard diffusion LMS (3.9) is equivalentto (3.14) if and only if C = W˜ = ln(W )/ε. Using the properties of the Kronecker product, itis straight forward to show that this condition on the weight matrix in the global state-spacerepresentation is equivalent to C = ln(W )/ε for the local weight matrix; see (3.7) and (3.13).Using a Taylor series expansion argument, it can be shown thatC = limε→0exp(εC)− Iε = limε→0W − Iε . (3.17)Subsequently,C1 = limε→0W1− I1ε = 0since W is a stochastic matrix; see (3.7). Finally, since all non-diagonal elements on the r.h.s.of (3.17) are non-negative, so are the non-diagonal elements of C, and the proof is complete.The above lemma establishes that (3.11) can be approximated by (3.14) for small step-sizes0 < ε 1. This has the added benefit of allowing to relax the independence assumption of thesequence {uk[n], vk[n]}. Let Fkn denote the σ-algebra generated by {uk[τ ], yk[τ ], τ < n}, anddenote the conditional expectation with respect to Fkn by Ekn. Then, Assumption 3.1 can berelaxed to allow correlated data without affecting the convergence properties as follows:Assumption 3.2. The sequence {uk[n], yk[n]} is bounded, and there exists a symmetric matrixRku for all k ∈ K such that Rku = E{uk[n]>uk[n]}. Further, for some constant B > 0,∥∥∥∥∥∞∑τ=nEkn{uk[τ ]>vk[τ ]}∥∥∥∥∥≤ B, and∥∥∥∥∥∞∑τ=nEkn{uk[τ ]>uk[τ ]}−Rku∥∥∥∥∥F≤ Bwhere ‖ · ‖ and ‖ · ‖F denote the Euclidean and Frobenius norms, respectively.The above condition allows us to work with correlated signals whose remote past and distantfuture are asymptotically independent. 
With suitable regularity conditions (such as uniform integrability), a large class of processes satisfy the conditions of the above assumption. Examples include:

1. Finite-state Markov chains: Let $\{x_n\}$ take values in a finite set. Assuming that the Markov chain $\{x_n\}$ has memory $p$,
$$
E[x_n \mid x_{n-1}, x_{n-2}, \ldots, x_1] = E[x_n \mid x_{n-1}, x_{n-2}, \ldots, x_{n-p}], \quad \text{for } n > p.
$$
A useful model is the finite-state random walk noise process
$$
v^k[n] = v^k[n-1] + \sigma(\Delta t)^{1/2} z[n]
$$
where $\sigma$ is the random walk standard deviation parameter, $\Delta t$ is the desired time step, and $z[n]$ is a white noise.

2. Moving average processes MA($p$):
$$
x[n] = z[n] + \sum_{\tau=1}^{p} \gamma_\tau z[n-\tau]
$$
where $\{z[n]\}$ is the sequence of (Gaussian) white noise error terms, and $\gamma_1, \ldots, \gamma_p$ are parameters of the model. For instance, the measurement noise sequence $v^k[n]$ in (3.1) may follow
$$
v^k[n] = w[n] - \gamma w[n-1], \quad w[n] \sim N(0, \sigma^2)
$$
where $N(0, \sigma^2)$ denotes a zero-mean normal distribution with variance $\sigma^2$.

3. Stationary autoregressive processes AR($p$):
$$
x[n] = \sum_{i=1}^{p} a_i x[n-i] + w[n], \quad w[n] \sim N(0, \sigma^2)
$$
if the roots of the polynomial $z^p - \sum_{i=1}^{p} a_i z^{p-i} = 0$ lie inside the unit circle.

Note further that, to obtain the desired results, the distributions of the sequences $\{\mathbf{u}^k[n]\}$ and $\{v^k[n]\}$ need not be known.

Remark 3.2. The boundedness assumption in Assumption 3.2 is a mild restriction. When dealing with recursive procedures in practice, in lieu of (3.9), one often uses a projection or truncation algorithm. For instance, one may use
$$
\boldsymbol{\psi}^k[n] = \pi_{\mathcal{I}}\Big[\boldsymbol{\psi}^k[n-1] + \varepsilon\big(\mathbf{u}^k[n]^\top\big[y^k[n] - \mathbf{u}^k[n]\boldsymbol{\psi}^k[n-1]\big]\big)\Big], \tag{3.18}
$$
where $\pi_{\mathcal{I}}$ is a projection operator and $\mathcal{I}$ is a bounded set. When the iterates fall outside the set $\mathcal{I}$, they are projected back; see [192] for an extensive discussion.

3.2.4 Convergence Analysis via ODE Method

It has been well known since the 1970s that the limiting behavior of a stochastic approximation algorithm is typically captured by a deterministic ODE.
This is in fact the basis of the widely used ODE method, which is perhaps the most powerful and versatile approach for the analysis of stochastic algorithms. The basic idea is that the stochastic approximation iterates are treated as a noisy discretization of a limiting ODE. The limiting behavior is then established by proving that the iterates converge to a solution of the ODE in an appropriate sense. The rest of this section is devoted to the convergence analysis of the diffusion LMS algorithm via the ODE approach.

Define the estimation error
$$
\tilde{\boldsymbol{\Psi}}[n] = \boldsymbol{\Psi}[n] - \boldsymbol{\Psi}^o.
$$
The global diffusion LMS recursion can be rewritten in terms of the estimation error as follows:
$$
\tilde{\boldsymbol{\Psi}}[n] = \tilde{\boldsymbol{\Psi}}[n-1] + \varepsilon I_{KM}\Big[\mathbf{C}\tilde{\boldsymbol{\Psi}}[n-1] + \mathbf{U}[n]^\top\big(\mathbf{v}[n] - \mathbf{U}[n]\mathbf{W}\tilde{\boldsymbol{\Psi}}[n-1]\big)\Big]. \tag{3.19}
$$
To derive (3.19), we used $C\mathbf{1}_K = \mathbf{0}_K$. Recall that $\mathbf{C} = C \otimes I_M$, where $C$ is defined in (3.13). Following the ODE approach, in lieu of working with the discrete iterates directly, we take a continuous-time interpolation of the iterates. Define the piecewise constant interpolated process associated with the discrete-time estimation error iterates as
$$
\tilde{\boldsymbol{\Psi}}^\varepsilon(t) = \tilde{\boldsymbol{\Psi}}[n], \quad \text{for } t \in [n\varepsilon, (n+1)\varepsilon). \tag{3.20}
$$
The above interpolated process belongs to the space of functions that are defined on $[0, \infty)$, take values in $\mathbb{R}^{KM}$, and are right continuous with left limits (namely, càdlàg functions), endowed with the Skorohod topology (see [192, p. 228]). We use $D([0, \infty) : \mathbb{R}^{KM})$ to denote such a space, whose topology allows us to "wiggle the space and time a bit".

The following theorem provides us with the evolution of the tracking error. It shows that $\tilde{\boldsymbol{\Psi}}[n]$ evolves dynamically so that its trajectory follows an ODE. Since the ODE is asymptotically stable, the errors decay exponentially fast to 0 as time grows.

Theorem 3.2. Consider the interpolated process $\tilde{\boldsymbol{\Psi}}^\varepsilon(\cdot)$, defined in (3.20). The following assertions hold:

1. Under Assumption 3.2, $\tilde{\boldsymbol{\Psi}}^\varepsilon(\cdot)$ is tight in $D([0, \infty) : \mathbb{R}^{KM})$.

2.
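Suppose Assumption 3.2 is replaced with
$$
\begin{aligned}
&\text{(i)}\quad \frac{1}{m}\sum_{\tau=n}^{n+m} E^k_n\big\{\mathbf{u}^k[\tau]^\top v^k[\tau]\big\} \to 0 \\
&\text{(ii)}\quad \frac{1}{m}\sum_{\tau=n}^{n+m} E^k_n\big\{\mathbf{u}^k[\tau]^\top \mathbf{u}^k[\tau]\big\} \to R^k_u
\end{aligned}
\qquad \text{in probability as } m \to \infty.
$$
Then, as $\varepsilon \to 0$, $\tilde{\boldsymbol{\Psi}}^\varepsilon(\cdot)$ converges weakly to $\tilde{\boldsymbol{\Psi}}(\cdot)$, which is the solution of the ODE
$$
\frac{d\tilde{\boldsymbol{\Psi}}(t)}{dt} = (\mathbf{C} - \mathbf{R}_u)\tilde{\boldsymbol{\Psi}}(t), \quad t \ge 0, \quad \tilde{\boldsymbol{\Psi}}(0) = \tilde{\boldsymbol{\Psi}}[0] \tag{3.21}
$$
where $\mathbf{R}_u := \mathrm{diag}\big(R^1_u, \ldots, R^K_u\big)$ is a $KM \times KM$ matrix.

3. The linear system (3.21) is globally asymptotically stable, and
$$
\lim_{t\to\infty}\big\|\tilde{\boldsymbol{\Psi}}(t)\big\| = 0. \tag{3.22}
$$

4. Denote by $t_\varepsilon$ any sequence of real numbers satisfying $t_\varepsilon \to \infty$ as $\varepsilon \to 0$. Then, $\tilde{\boldsymbol{\Psi}}^\varepsilon(\cdot + t_\varepsilon)$ converges in probability to 0. That is, for any $\delta > 0$,
$$
\lim_{\varepsilon\to 0} P\Big(\big\|\tilde{\boldsymbol{\Psi}}^\varepsilon(\cdot + t_\varepsilon)\big\| \ge \delta\Big) = 0. \tag{3.23}
$$

Proof. To prove part 1, we apply the tightness criterion [187, p. 47]. It suffices to show that, for any $\delta > 0$ and $t, s > 0$ such that $s < \delta$,
$$
\lim_{\delta\to 0}\limsup_{\varepsilon\to 0} E\Big\{\sup_{0\le s\le\delta} E^\varepsilon_t\big\|\tilde{\boldsymbol{\Psi}}^\varepsilon(t+s) - \tilde{\boldsymbol{\Psi}}^\varepsilon(t)\big\|^2\Big\} = 0
$$
where $E^\varepsilon_t$ denotes the conditional expectation with respect to the σ-algebra generated by the ε-dependent past data up to time $t$. In view of Assumption 3.2,
$$
E^\varepsilon_t\big\|\tilde{\boldsymbol{\Psi}}^\varepsilon(t+s) - \tilde{\boldsymbol{\Psi}}^\varepsilon(t)\big\|^2 \le K\varepsilon^2\Big(\frac{t+s}{\varepsilon} - \frac{t}{\varepsilon}\Big)^2 = O(s^2). \tag{3.24}
$$
The tightness result follows by first taking $\limsup_{\varepsilon\to 0}$ and then $\lim_{\delta\to 0}$ in (3.24). Tightness is equivalent to sequential compactness on any complete separable metric space. Thus, by virtue of Prohorov's theorem, we may extract a weakly convergent subsequence. For notational simplicity, still denote the subsequence by $\tilde{\boldsymbol{\Psi}}^\varepsilon(\cdot)$ and use $\tilde{\boldsymbol{\Psi}}(\cdot)$ to denote its limit.

The proof of part 2 uses stochastic averaging theory and follows from the standard argument in [192, Chapter 8] to characterize the limit $\tilde{\boldsymbol{\Psi}}(\cdot)$ of the convergent subsequence $\tilde{\boldsymbol{\Psi}}^\varepsilon(\cdot)$. Details are similar to the derivations in Section 2.9, and are omitted for brevity. We merely note that, in view of (3.13), $\mathbf{W} = I_{KM} + \varepsilon\mathbf{C}$. Looking at the last term on the r.h.s. of (3.19), only $I_{KM}$ survives in the limit. More precisely, the term $\varepsilon\mathbf{C}$ is averaged out to 0.

To prove part 3, we use standard results on Lyapunov stability of linear systems.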
In view of (3.13), since
\[ |c_{kk}| = \sum_{l\neq k} c_{kl} = \sum_{l\neq k} |c_{kl}| \quad \text{for all } k \]
the symmetric matrix $C$ is weakly diagonally dominant, hence, is negative semi-definite. Consequently, the global matrix $\boldsymbol{C} = C \otimes I_M$ is negative semi-definite [157]. On the other hand, $R_u$ is symmetric, $R_u = R_u^\top$, and positive-definite. Therefore, $\boldsymbol{C} - R_u$ is negative-definite. The set of global attractors of (3.21) is then characterized by solving
\[ (\boldsymbol{C} - R_u)\tilde{\boldsymbol{\Psi}} = 0. \tag{3.25} \]
Since all eigenvalues of the negative-definite matrix $\boldsymbol{C} - R_u$ are strictly negative,
\[ \det(\boldsymbol{C} - R_u) = \prod_i \lambda_i(\boldsymbol{C} - R_u) \neq 0, \]
the matrix $\boldsymbol{C} - R_u$ is nonsingular, and (3.25) has the unique solution $\tilde{\boldsymbol{\Psi}} = 0$. This concludes the desired result in (3.22).

In Part 1, we considered $\varepsilon$ small and $n$ large, but $\varepsilon n$ remained bounded. This gave a limit ODE for the sequence of interest as $\varepsilon \to 0$. Part 4 is an asymptotic stability argument, for which we need to establish the limit as first $\varepsilon \to 0$ and then $t \to \infty$. The consequence is that the limit points of the ODE and the small step-size diffusion LMS recursions coincide as $t \to \infty$. However, in lieu of considering a two-stage limit by first letting $\varepsilon \to 0$ and then $t \to \infty$, we look at $\tilde{\boldsymbol{\Psi}}^\varepsilon(t + t_\varepsilon)$ and require $t_\varepsilon \to \infty$. The interested reader is referred to [192, Chapter 8] for an extensive treatment.

The above theorem simply addresses consistency of the diffusion LMS. Statement 1 deals with the case where $\varepsilon \to 0$ and $n$ is large while $\varepsilon n$ remains bounded, whereas Statement 4 concerns asymptotic consistency. Here, $\tilde{\boldsymbol{\Psi}}^\varepsilon(\cdot + t_\varepsilon)$ essentially looks at the asymptotic behavior of the sequence of estimates $\tilde{\boldsymbol{\Psi}}[n]$ as $n \to \infty$. The requirement $t_\varepsilon \to \infty$ as $\varepsilon \to 0$ means that we look at $\tilde{\boldsymbol{\Psi}}[n]$ for a small $\varepsilon$, and after $n \gg 1/\varepsilon$ iterations. It shows that, for a small $\varepsilon$, and with an arbitrarily high probability, the sequence of estimates at each node $\psi^k[n]$ eventually spends nearly all of its time in a $\delta$-neighborhood of the true parameter $\psi^o$ such that $\delta \to 0$ as $\varepsilon \to 0$. Note that, for a small $\varepsilon$, $\psi^k[n]$ may escape from such a $\delta$-neighborhood. However, if such an escape ever occurs, it will be a “large deviations” phenomenon, that is, a rare event.
The order of the escape time (if it ever occurs) is often of the form $\exp(b/\varepsilon)$ for some constant $b > 0$; see [192, Section 7.2] for further details.

Remark 3.3. In case of a slowly time-varying true parameter $\psi^o[n]$, it can be shown that, if $\psi^o[n]$ evolves on the same timescale as the adaptation rate $\varepsilon$ of the diffusion LMS (i.e., the true parameter dynamics are matched to the diffusion LMS stochastic approximation algorithm), $\psi^k[n]$ can track its variations. Such tracking capability is shown in [289] for the standard LMS algorithm when the true parameter is non-stationary and evolves on the same timescale as the adaptation speed of the algorithm.

3.3 Energy-Aware Diffusion LMS

This section presents an energy-aware variant of the diffusion LMS algorithm described in Section 3.2.3 by embedding a distributed game-theoretic activation control mechanism based on the adaptive learning procedure of Chapter 2. A qualitative overview of the design principles is provided in Section 3.3.1. Section 3.3.2 then formulates the game-theoretic model by which nodes make activation decisions. Finally, Section 3.3.3 presents the energy-aware diffusion LMS algorithm.

3.3.1 Overview of the Approach

Aside from scalability and a stabilizing effect on the network, deploying cooperative schemes such as the diffusion protocol (3.8) can lead to energy savings, which in turn prolong the lifespan of sensors [204, 250]. The main idea in this section is to equip the sensors with an activation mechanism that, taking into account the spatial-temporal correlation of their measurements, prescribes sensors to sleep when the energy cost of acquiring a new measurement outweighs its contribution to the estimation task.
This will result in even more savings in energy consumption and increased battery life of sensors.

Consistent with the decentralized essence of the diffusion LMS, we resort to distributed activation mechanisms, which facilitate self-configuration of nodes and also contribute to network robustness and scalability. Each node will be equipped with a pre-configured (or configurable) utility function and a simple adaptive filtering algorithm. The adaptive filter controls local functionality with minimal message passing. While individual nodes possess limited communication and processing capability, it is their coordinated behavior that leads to the manifestation of rational global behavior. The proposed formulation conveys a design guideline: by carefully specifying the utility functions, these networked nodes can be made to exhibit many desired behaviors by individual nodes performing simple local tasks.

The key feature that characterizes our study as a game is the interdependence of individual nodes' behavior. The usefulness of a node's information, the channel quality, and the required packet transmission energy all depend on the activity of other nodes in the network. Each node repeatedly faces a non-cooperative game where the actions are whether to “activate” or “sleep” and, accordingly, receives a reward. When a node is active, it updates its estimate and performs fusion of its own estimate with those received from neighboring nodes, whereas inactive nodes do not update their estimates. More precisely, each node runs local updates of the form
\[ \psi^k[n] = \begin{cases} \bar{\psi}^k[n-1] + \varepsilon\, u^k[n]^\top\big(y^k[n] - u^k[n]\bar{\psi}^k[n-1]\big), & \text{if active} \\ \psi^k[n-1], & \text{if sleep} \end{cases} \]
where $\bar{\psi}^k[n-1]$ is the fused local estimate, defined in (3.9). Each individual node's reward depends not only on the actions of other nodes but also on their estimates of the true parameter.
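As a minimal sketch of the active/sleep update described above, the following Python function performs one local step for a single node. It assumes that an active node starts from its fused estimate and that a sleeping node simply keeps its previous estimate; the function name `local_update` and its argument names are hypothetical and do not come from the thesis.

```python
import numpy as np

def local_update(psi_prev, psi_fused_prev, u, y, eps, active):
    """One local step for a single node (illustrative sketch only).

    psi_prev       : previous local estimate, shape (M,)
    psi_fused_prev : previous fused estimate, shape (M,)  (assumed input)
    u              : regressor, shape (M,)
    y              : scalar measurement
    eps            : small step size of the diffusion LMS
    active         : True if the node chose to activate, False if it sleeps
    """
    if not active:
        # Sleeping node: the estimate is left unchanged.
        return np.array(psi_prev, dtype=float)
    psi_fused_prev = np.asarray(psi_fused_prev, dtype=float)
    u = np.asarray(u, dtype=float)
    # LMS correction: step size times regressor times prediction error.
    innovation = y - u @ psi_fused_prev
    return psi_fused_prev + eps * innovation * u
```

For example, with a zero fused estimate, regressor `[1.0, 2.0]`, measurement `1.0` and `eps = 0.1`, an active node moves along the regressor direction, while a sleeping node returns its previous estimate untouched.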
In this case, no pre-computed policy is given; that is, nodes learn their activation policies through repeated interactions and exchanging information with neighbors. Nodes adapt their policies as the parameter estimates across the network vary over time. The activation mechanism is a simple non-linear adaptive filtering algorithm based on the regret-matching adaptive learning procedure proposed in Chapter 2.

This activation mechanism is embedded into the diffusion LMS algorithm such that the overall energy-aware diffusion LMS forms a two-timescale stochastic approximation algorithm: the fast timescale corresponds to the game-theoretic activation mechanism and the slow timescale is the diffusion LMS. Intuitively, in such two-timescale algorithms, the fast timescale will see the slow component as quasi-static while the slow component sees the fast one near equilibrium. In light of the above discussion, the main theoretical finding is that, by each node individually following the proposed energy-aware diffusion LMS algorithm, the global activation behavior of nodes at each time is within the polytope of correlated $\epsilon$-equilibria of the underlying activation control game, while the estimate at each node converges to (or tracks) the (slowly time-varying) true parameter of interest. Figure 3.1 illustrates the local and global behavior of the energy-aware diffusion LMS.

[Figure 3.1: (a) Local adaptation and learning; (b) Global behavior. Energy-aware diffusion LMS is asymptotically consistent, yet the global activation behavior along the way tracks the correlated $\epsilon$-equilibria set of the activation control game.]

3.3.2 Activation Control Game

The problem of each node $k$ is to successively pick action $a^k[n]$ from the set
\[ \mathcal{A} = \{0\ (\text{sleep}),\ 1\ (\text{activate})\} \tag{3.26} \]
to strategically optimize a utility function. This utility function captures the trade-off between energy expenditure and the “value” of the node's contribution. Let $\boldsymbol{a} = (a^1, \ldots, a^K)$ denote the joint activation decisions of all nodes. Then,
\[ \boldsymbol{a} \in \mathcal{A}^K := \times_{k=1}^K \mathcal{A}, \]
i.e., $\boldsymbol{a}$ belongs to the space of all possible joint action profiles. Consistent with the notation used in Section 2.3.1, define the set of non-neighbors of node $k$ by $S^k := \mathcal{K} \setminus N_c^k$. The utility for each node $k$ is a bounded function $U^k : \mathcal{A}^K \times \mathbb{R}^{KM} \to \mathbb{R}$, and is comprised of a local and a global term:
\[ U^k\big(a^k, \boldsymbol{a}^{-k}, \boldsymbol{\Psi}\big) = U_L^k\big(a^k, \boldsymbol{a}^{N^k}, \boldsymbol{\Psi}\big) + U_G^k\big(a^k, \boldsymbol{a}^{S^k}\big). \tag{3.27} \]
Here, the action profile of all nodes excluding node $k$, denoted by $\boldsymbol{a}^{-k}$, has been rearranged and grouped into two vectors $\boldsymbol{a}^{N^k}$ and $\boldsymbol{a}^{S^k}$ that, respectively, represent the joint action profile of neighbors and non-neighboring nodes of node $k$. Note further that the utility function for each node depends on the parameter estimates across the network.

The local utility function $U_L^k$ captures the trade-off between the value of the measurements collected by node $k$ and the energy costs associated with it. We assume that each node is only capable of low-power communication (see footnote 2). The neighbors of each node are thus located within a pre-specified range. If too many of node $k$'s neighbors activate simultaneously, excessive energy is consumed due to the spatial-temporal correlation of node measurements. This implies that the data collected by node $k$ is less valuable. Additionally, the probability of successful transmission reduces due to channel congestion. (Equivalently, to keep the success rate fixed, the required packet transmission energy should increase.)
On the other hand, if too few of node $k$'s neighbors activate, their fused estimates lack “innovation.” The local utility of node $k$ is then given by:
\[ U_L^k\big(a^k, \boldsymbol{a}^{N^k}, \boldsymbol{\Psi}\big) = C_{l,1}\Big[1 - \exp\big(-\gamma_l\big\|\psi^k - \bar{\psi}^k\big\|^2 s(\eta^k)\big)\Big]\cdot a^k - C_{l,2}\Big[E_{Tx}(\eta^k) + E_{Act}\Big]\cdot a^k \tag{3.28} \]
where
\[ \eta^k\big(a^k, \boldsymbol{a}^{N^k}\big) = a^k + \sum_{l\in N^k} a^l. \tag{3.29} \]
In (3.28), $C_{l,1}$ and $\gamma_l$ are the pricing parameters related to the “reward” associated with the estimation task; $C_{l,2}$ is the pricing parameter related to the energy costs associated with broadcasting measurements, $E_{Tx}$, specified in (A.4) in Appendix A, and with activation, $E_{Act}$, which can be calculated from Table A.1 in Appendix A as follows:
\[ E_{Act} = \frac{1}{F}\Big[(1 - k_m)\big(P_{CPU} - P_{CPU_0} + P_S - P_{S_0}\big) - k_c P_{RX}\big(T_{CCA} + T_{RXwu}\big)\Big]. \tag{3.30} \]
Here, $F$ is the frequency of nodes' activation decisions, and $k_m$ and $k_c$ are the (small) proportions of time spent sensing the environment and channel, respectively, in sleep mode. The rest of the parameters are defined in Table A.1 in Appendix A. Finally, $s(\eta^k)$ denotes the probability of successful transmission. Higher activity $\eta^k$ of nodes neighboring node $k$ lowers its local utility due to:
i) reducing success of transmission attempts;
ii) inefficient usage of energy resources due to spatial-temporal correlation of measurements.
Nodes are thus motivated to activate when the majority of neighbors are in the sleep mode and/or their estimate $\psi^k$ is far from the local aggregated estimate $\bar{\psi}^k$.

Footnote 2: The ZigBee/IEEE 802.15.4 standard is currently a leading choice for low-power communication in wireless sensor networks. It employs a CSMA/CA scheme for multiple access data transmission. In networks with tight energy constraints, the non-beacon-enabled (unslotted CSMA/CA) mode is preferable as the node receivers do not need to switch on periodically to synchronize to the beacon.

The global utility function $U_G^k$ concerns connectivity and diffusion of estimates across the network. Let $R_r^k$ denote the set of nodes within radius $r$ from node $k$ excluding itself and its neighbors $N^k$ on the network graph.
Define the number of active nodes in $R_r^k$ by
\[ \zeta^k\big(a^k, \boldsymbol{a}^{S^k}\big) = a^k + \sum_{l\in R_r^k} a^l. \tag{3.31} \]
The global utility of node $k$ is then given by
\[ U_G^k\big(a^k, \boldsymbol{a}^{S^k}\big) = C_g\cdot\exp\big(-\gamma_g\zeta^k\big)\cdot a^k. \tag{3.32} \]
In (3.32), $C_g$ and $\gamma_g$ are the pricing parameters. Higher $\zeta^k$ lowers the global utility due to:
i) less importance of node $k$'s contribution to the diffusion of estimates across the network;
ii) reducing the success of transmission for other nodes in the neighborhood, which affects global diffusion of estimates.
Each node $k$ is thus motivated to activate when the majority of the nodes in its geographic region $R_r^k$ are in the sleep mode.

Each node $k$ realizes
\[ \eta^k[n] = \eta^k\big(a^k[n], \boldsymbol{a}^{N^k}[n]\big) \]
as a consequence of receiving estimates $\psi^l[n]$ from neighbors $l \in N^k$ and is therefore able to evaluate its local utility at each period $n$. Nodes, however, do not observe the actions of non-neighbors; hence, they do not realize $\zeta^k[n]$ and are unable to evaluate global utilities. To this end, some sort of centralization is required by implementing cluster heads that monitor nodes' activity in their locale and deliver global utilities to the nodes. Hereafter, we use
\[ U_{G,n}^k(a^k) = U_G^k\big(a^k, \boldsymbol{a}^{S^k}[n]\big) \]
to denote the realized global utility for node $k$ at period $n$ by choosing action $a^k[n]$. Note that $U_{G,n}^k$ is a time-varying function due to the action profile of non-neighboring nodes changing over time.

Remark 3.4. Note that each node $k$ only requires the cumulative activity $\zeta^k$ in its geographic region to evaluate its global utility. Adopting a ZigBee/IEEE 802.15.4 CSMA/CA protocol for broadcasting estimates across neighbors, $\zeta^k$ can be estimated by tracking the proportion of successful clear channel assessment (CCA) attempts. The interested reader is referred to [185, Section II-C] for a discussion and an algorithm for estimating node activity.
This eliminates the need for inclusion of cluster heads to deliver global utilities to the nodes.

The contribution of each node to the parameter estimation task, the success of transmissions within neighborhoods, and the network connectivity all depend on the activity of other nodes in the network. Due to such interdependence, the activation control problem can be formulated as a non-cooperative repeated game with neighborhood monitoring, defined in Section 2.3.1, where nodes constitute the player set and action sets are identical.

Remark 3.5. A cooperative game formulation of the energy-aware activation problem for bearings-only localization is given in [120]. The equilibrium notion of that game, namely, the core [230], then simply provides prescriptions about which nodes should collaborate with each other and the number of measurements to be collected by each node at a particular period comprising several time slots. In contrast, here, the collaboration among nodes is enforced by the network connectivity graph, and nodes only make decisions about their activity. The non-cooperative activation control game formulation also allows nodes to decide whether to activate on a single time slot basis, which, intuitively, should yield improvements in coordination among nodes' activation behavior. Each node decides whether to activate so as to balance its lifetime (competition) against contribution to the parameter estimation task (cooperation). The cooperation among nodes is enforced by imposing incentives for collecting measurements and sharing estimates with neighboring nodes.

3.3.3 Game-Theoretic Activation Control Mechanism

In view of the game-theoretic model for the activation control of diffusion LMS in Section 3.3.2, we proceed here to present the energy-aware diffusion LMS algorithm.
Since the activation control game matches the model described in Section 2.3.1, we adopt Algorithm 2.1 as the activation control mechanism, and embed it into the diffusion LMS algorithm such that the overall energy-aware diffusion LMS forms a two-timescale stochastic approximation algorithm: the fast timescale corresponds to the game-theoretic activation mechanism (step-size $\mu$) and the slow timescale is the diffusion LMS (step-size $\varepsilon = O(\mu^2)$), which carries out the parameter estimation task.

Define $|x|^+ = \max\{0, x\}$. The energy-aware diffusion LMS algorithm is summarized below in Algorithm 3.1.

Algorithm 3.1 Energy-Aware Diffusion LMS

Initialization: Set the tunable exploration parameter $0 < \delta < 1$, adaptation rates $0 < \varepsilon, \mu \ll 1$ such that $\varepsilon = O(\mu^2)$, and $\rho^k > 2\big|U_{\max}^k - U_{\min}^k\big|$, where $U_{\max}^k$ and $U_{\min}^k$ denote the upper and lower bounds on node $k$'s utility function. Set $C$ in the weight matrix (3.13). Initialize
\[ p^k[0] = [0.5, 0.5]^\top, \quad R^{L,k}[0] = R^{G,k}[0] = 0_{2\times 2}, \quad \text{and} \quad \psi^k[0] = \bar{\psi}^k[0] = 0_{M\times 1}. \]

Step 1: Node Activation. Choose $a^k[n] \in \{0\ (\text{Sleep}),\ 1\ (\text{Activate})\}$ according to the randomized strategy $p^k[n] = \big(p_1^k[n], p_2^k[n]\big)$, where
\[ p_i^k[n] = \begin{cases} (1-\delta)\min\Big\{\frac{1}{\rho^k}\Big|r^{L,k}_{a^k[n-1],i}[n] + r^{G,k}_{a^k[n-1],i}[n]\Big|^+, \frac{1}{2}\Big\} + \frac{\delta}{2}, & a^k[n-1] \neq i \\ 1 - \sum_{j\neq i} p_j^k[n], & a^k[n-1] = i \end{cases} \tag{3.33} \]
and $a^k[n-1]$ denotes the action picked in the last period.

Step 2: Diffusion LMS.
\[ \psi^k[n+1] = \psi^k[n] + \varepsilon\, u^k[n]^\top\big[y^k[n] - u^k[n]\psi^k[n]\big]\cdot a^k[n]. \tag{3.34} \]

Step 3: Estimate Exchange. If $a^k[n] = 1$: (i) transmit $\psi^k[n]$ to neighbors $N^k$ and collect $\psi^l[n]$, $l \in N^k$; (ii) set
\[ a^l[n] = \begin{cases} 1, & \text{if node } k \text{ receives node } l\text{'s estimate } \psi^l[n] \\ 0, & \text{otherwise} \end{cases} \]
and
\[ \hat{\psi}^l[n] = \begin{cases} \text{received estimate}, & \text{if } a^l[n] = 1 \\ \psi^l[n-1], & \text{if } a^l[n] = 0. \end{cases} \]
If $a^k[n] = 0$, go to Step 5.

Step 4: Fusion of Local Estimates.
\[ \bar{\psi}^k[n] = \sum_{l\in N_c^k} w_{kl}\hat{\psi}^l[n] \tag{3.35} \]
where the weight matrix $W = [w_{kl}]$ is defined in (3.13).

Step 5: Regret Update. Run ‘Step 3’ in Algorithm 2.1.

Recursion. Set $n \leftarrow n + 1$, and go to Step 1.

Remark 3.6. A useful variant of the above algorithm can be developed by allowing nodes to fuse (or even decide whether to fuse) neighbors' data while in sleep mode. More precisely, in lieu of (3.9), each node performs local diffusion LMS updates of the form
\[ \psi^k[n] = \psi^k[n-1] + \varepsilon\, u^k[n]^\top\big[y^k[n] - u^k[n]\psi^k[n-1]\big]\cdot a^k[n]. \tag{3.36} \]
This helps to diffuse data across the network when acquiring measurements is power hungry.

3.4 Global Network Performance

This section deals with the global performance analysis of the energy-aware diffusion LMS algorithm summarized in Algorithm 3.1. First, two global performance diagnostics are defined in Section 3.4.1 that capture the efficiency and consistency aspects of nodes' behavior across the network. These global measures are then used in Section 3.4.2 to characterize the sophisticated globally rational behavior that emerges from nodes individually following Algorithm 3.1.

3.4.1 Global Behavior

The network global behavior can be captured by two inter-connected discrete-time processes:

1. Network-Wide Diffusion LMS Recursion. Define:
\[ \boldsymbol{M}[n] = \mathrm{diag}\big(a^1[n]I_M, \ldots, a^K[n]I_M\big) \quad (KM \times KM). \]
Following the same lines as (3.14), the global estimation error update associated with Algorithm 3.1 can be expressed as
\[ \tilde{\boldsymbol{\Psi}}[n] = \tilde{\boldsymbol{\Psi}}[n-1] + \varepsilon\boldsymbol{M}[n]\Big[\boldsymbol{C}\tilde{\boldsymbol{\Psi}}[n-1] + \boldsymbol{U}[n]^\top\big(\boldsymbol{v}[n] - \boldsymbol{U}[n]\boldsymbol{W}\tilde{\boldsymbol{\Psi}}[n-1]\big)\Big]. \tag{3.37} \]
In light of (3.37), Algorithm 3.1 can be viewed as a synchronous distributed stochastic approximation algorithm [44, 192]: at each time $n$, some (but not necessarily all) elements of the estimates vector $\boldsymbol{\Psi}[n]$ are updated separately by different nodes spread across the network. The common clock that synchronizes the system is implicit in the game-theoretic activation mechanism employed by nodes. An asynchronous implementation of Algorithm 3.1 is feasible, but is out of the scope of this chapter.

2. Global Activation Behavior. The collective activation behavior of nodes, $\boldsymbol{z}[n]$, is defined as the discounted empirical frequency of joint activation decisions of all nodes up to period $n$. Formally,
\[ \boldsymbol{z}[n] = (1-\mu)^{n-1}\boldsymbol{e}_{\boldsymbol{a}[1]} + \mu\sum_{2\le\tau\le n}(1-\mu)^{n-\tau}\boldsymbol{e}_{\boldsymbol{a}[\tau]} \tag{3.38} \]
where $\boldsymbol{e}_{\boldsymbol{a}[\tau]}$ denotes a unit vector on the space of all possible action profiles $\mathcal{A}^K = \times_{k=1}^K\mathcal{A}$ with the element corresponding to $\boldsymbol{a}[\tau]$ being equal to one. In (3.38), similar to Section 2.4.1, the small parameter $\mu$ introduces an exponential forgetting of the past activation profile of sensors to enable adaptivity to the evolution of the local parameter estimates. That is, the effect of the old local parameter estimates on the activation strategy of nodes vanishes as they repeatedly make new activation decisions. It is convenient to rewrite $\boldsymbol{z}[n]$ as a stochastic approximation recursion
\[ \boldsymbol{z}[n] = \boldsymbol{z}[n-1] + \mu\big(\boldsymbol{e}_{\boldsymbol{a}[n]} - \boldsymbol{z}[n-1]\big). \tag{3.39} \]

In view of the above two performance diagnostics, analyzing performance of the network of nodes is nontrivial and quite challenging as (3.37) and (3.38) are closely coupled: the estimates $\boldsymbol{\Psi}[n] = \tilde{\boldsymbol{\Psi}}[n] + \boldsymbol{\Psi}^o$ affect the utility of nodes, hence, their choice of action, that is, $\boldsymbol{e}_{\boldsymbol{a}[n]}$ in (3.39). On the other hand, nodes' activation strategies $p^k$ are functions of $\boldsymbol{z}[n]$, which in turn enter $\boldsymbol{M}[n]$ in (3.37).

3.4.2 Main Result: From Individual to Global Behavior

We now characterize the global behavior that emerges from each node individually following Algorithm 3.1 via the well-known ODE approach. Thus, in lieu of working directly with the discrete iterates of the two diagnostics introduced in Section 3.4.1, we examine continuous-time piecewise constant interpolations of the iterates, which are defined as follows:
\[ \tilde{\boldsymbol{\Psi}}^\varepsilon(t) = \tilde{\boldsymbol{\Psi}}[n], \quad \text{and} \quad \boldsymbol{z}^\mu(t) = \boldsymbol{z}[n] \quad \text{for } t \in [n\mu, (n+1)\mu). \tag{3.40} \]
The following theorem then asserts that, by following Algorithm 3.1:
i) the local estimates converge to the true parameter across the network;
ii) along the way to convergence, the global activation behavior belongs to the time-varying polytope of correlated $\epsilon$-equilibria of the underlying activation control game.

Theorem 3.3. Consider the activation control game in Section 3.3.2, and suppose every node follows Algorithm 3.1.
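Then, for each $\epsilon > 0$, there exists $\hat{\delta}(\epsilon)$ such that if $\delta < \hat{\delta}(\epsilon)$ in Algorithm 3.1 (Step 1), the following results hold:

1. Let $\{t_\mu\}$ denote any sequence of real numbers satisfying $t_\mu \to \infty$ as $\mu \to 0$. For any $\tilde{\boldsymbol{\Psi}}$, $\boldsymbol{z}^\mu(\cdot + t_\mu)$ converges in probability as $\mu \to 0$ to the correlated $\epsilon$-equilibrium set $\mathcal{C}_\epsilon(\tilde{\boldsymbol{\Psi}} + \boldsymbol{\Psi}^o)$ in the sense that, for any $\beta > 0$,
\[ \lim_{\mu\to 0} P\Big(\mathrm{dist}\big[\boldsymbol{z}^\mu(\cdot + t_\mu), \mathcal{C}_\epsilon(\tilde{\boldsymbol{\Psi}} + \boldsymbol{\Psi}^o)\big] > \beta\Big) = 0 \tag{3.41} \]
where $\mathrm{dist}[\cdot,\cdot]$ denotes the usual distance function.

2. Under Assumption 3.2, $\tilde{\boldsymbol{\Psi}}^\varepsilon(\cdot)$ converges weakly as $\varepsilon \to 0$ to $\tilde{\boldsymbol{\Psi}}(\cdot)$ that is a solution of the differential inclusion
\[ \frac{d\tilde{\boldsymbol{\Psi}}(t)}{dt} \in \Big\{\boldsymbol{M}(\pi^c)(\boldsymbol{C} - R_u)\tilde{\boldsymbol{\Psi}}(t);\ \pi^c \in \mathcal{C}_\epsilon\big(\tilde{\boldsymbol{\Psi}}(t) + \boldsymbol{\Psi}^o\big)\Big\} \tag{3.42} \]
where
\[ \boldsymbol{M}(\pi^c) = \mathrm{diag}\big(m^1(\pi^c)I_M, \ldots, m^K(\pi^c)I_M\big) \]
and
\[ I_{KM} > \boldsymbol{M}(\pi^c) > \kappa\cdot I_{KM} \quad \text{for some } 1 > \kappa > 0. \]

3. The limit dynamical system (3.42) is globally asymptotically stable and
\[ \lim_{t\to\infty}\big\|\tilde{\boldsymbol{\Psi}}(t)\big\| = 0. \]

4. $\tilde{\boldsymbol{\Psi}}^\varepsilon(\cdot + t_\mu)$ converges in probability to 0. That is, for any $\beta > 0$,
\[ \lim_{\varepsilon\to 0} P\Big(\big\|\tilde{\boldsymbol{\Psi}}^\varepsilon(\cdot + t_\mu)\big\| > \beta\Big) = 0. \tag{3.43} \]

In the above theorem, statements 1 and 2 simply show that non-fully rational local behavior of individual nodes leads to sophisticated globally rational behavior, where all nodes pick actions from the polytope of correlated $\epsilon$-equilibria, hence coordinating their activation decisions. Statement 3 then asserts that the limit system representing the energy-aware diffusion LMS iterates is stable, and 0 constitutes its global attractor. Statement 4 finally shows that, under such coordinated activation of nodes, the diffusion LMS remains asymptotically consistent. The sketch of the proof is relegated to Section 3.5.

Discussion and Extensions.

1) Nodes may face random communication delays in receiving estimates from neighbors. Let $d^{lk}[n]$ denote the random delay node $k$ faces in receiving the estimate from node $l$ at time $n$. Then, node $k$ has access to $\psi^l[n - d^{lk}[n]]$ at time $n$, but not $\psi^l[\tau]$ for $\tau > n - d^{lk}[n]$.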
The results of [192, Chapter 12] ensure that Theorem 3.3 is still valid if delays are bounded or not arbitrarily large in the sense that $(n - d^{lk}[n])/n \to 1$ almost surely.

2) While the true parameter $\psi^o$, connectivity graph $G$, and weight matrix $W$ are assumed to be fixed in Section 3.2, the constant step-sizes $\varepsilon$ and $\mu$ in Algorithm 3.1 make it responsive to slowly time-varying $\psi^o[n]$, $G[n]$, and $W[n]$. For exposition, suppose only $\psi^o[n]$ evolves with time according to a discrete-time Markov chain with finite state space $\mathcal{M}_{\psi^o} = \{\psi_1^o, \ldots, \psi_S^o\}$ and transition matrix
\[ P^\gamma = I_S + \gamma Q, \quad 0 < \gamma \ll 1 \tag{3.44} \]
where $Q = [q_{ij}]$ is the generator of a continuous-time Markov chain satisfying
\[ Q\mathbf{1}_S = \mathbf{0}_S, \quad \text{and} \quad q_{ij} \ge 0 \ \text{for } i \neq j. \]
The small parameter $\gamma$ specifies how fast $\psi^o[n]$ evolves with time. Then, if the Markov chain jumps are on the same timescale as the adaptation rate of the diffusion LMS updates, i.e., $\gamma = O(\varepsilon)$, each node can properly track $\psi^o[n]$ by employing Algorithm 3.1, though the tracking speed depends largely on how aggressively nodes activate the diffusion LMS. The above Markov chain is one of the most general hyper-models available for a finite-state model.

3.5 Convergence Analysis of the Energy-Aware Diffusion LMS

This section is devoted to the global performance analysis of the energy-aware diffusion LMS, and provides proofs of the results presented in Theorem 3.3. We will use the results of Section 3.2.4, which show asymptotic consistency for the diffusion LMS without the activation control mechanism via the ODE approach. For brevity and readability, the technical details are omitted, and only a sketch of the proof is presented.

The proof is divided into two steps, which are conceptually similar to the steps followed in Section 2.5. Each step is presented in a separate subsection in what follows.

3.5.1 Limit System Characterization

Let
\[ \mathcal{I}_n = \{k \mid a^k[n] = 1\} \]
denote the set of nodes that choose to activate at period $n$.
One can unfold iteration $n$ of (3.37) by replacing it with $|\mathcal{I}_n|$ distinct iterations, where each involves only one node $k \in \mathcal{I}_n$ updating its estimate. Relabeling iteration indices of this unfolded variant by $m$ results in a new algorithm that behaves identically to Algorithm 3.1. Since nodes explore with some probability $\delta > 0$ at each period, the stationary distribution places a probability of at least $\delta$ on “activation.” Therefore,
\[ \liminf_{m\to\infty}\frac{1}{m}\sum_{\tau=0}^{m} a^k[\tau] \ge \delta, \quad \text{for all } k \in \mathcal{K} \tag{3.45} \]
which simply asserts that all nodes take measurements and update their estimates relatively often (but in a competitively optimal sense described by the correlated $\epsilon$-equilibria), and is key to the proof of Theorem 3.3.

For convenience, take $\varepsilon = \mu^2$. Note that an interval of length $T$ in the $\mu$ timescale involves $T/\mu$ iterates, whereas the same number of iterations takes $T\varepsilon/\mu = T\mu$ in the $\varepsilon$ timescale. To analyze the asymptotics of the joint process $(\boldsymbol{z}^\mu(\cdot), \tilde{\boldsymbol{\Psi}}^\mu(\cdot))$, it is instructive to first look at the singularly perturbed coupled system
\[ \begin{cases} \dfrac{d\tilde{\boldsymbol{\Psi}}(t)}{dt} = \mu\boldsymbol{M}(\boldsymbol{z}(t))[\boldsymbol{C} - R_u]\tilde{\boldsymbol{\Psi}}(t) \\[2mm] \dfrac{d\boldsymbol{z}(t)}{dt} \in \Big\{\sigma^k\big(\boldsymbol{z}(t), \tilde{\boldsymbol{\Psi}}(t) + \boldsymbol{\Psi}^o\big)\times\nu^{-k};\ \nu^{-k} \in \Delta\mathcal{A}^{-k}\Big\} - \boldsymbol{z}(t) \end{cases} \tag{3.46} \]
in the limit as $\mu \to 0$. Here, $\sigma^k$ represents the randomized activation strategy of node $k$, which will be defined later in the proof. Further, $\nu^{-k}$ denotes a randomized joint strategy of all nodes excluding node $k$, which belongs to the space of all possible such strategies, denoted by $\Delta\mathcal{A}^{-k}$. In addition,
\[ \boldsymbol{M}(\boldsymbol{z}(t)) := \mathrm{diag}\big(m^1(\boldsymbol{z}(t))I_M, \ldots, m^K(\boldsymbol{z}(t))I_M\big) \quad (KM \times KM) \tag{3.47} \]
where
\[ I_{KM} > \boldsymbol{M}(\boldsymbol{z}(t)) > \kappa\cdot I_{KM}, \quad \text{for some } 1 > \kappa > 0. \tag{3.48} \]
Each element $m^k(\boldsymbol{z}(t))$ in (3.47) is the relative instantaneous rate at which node $k$ activates. More precisely,
\[ m^k(\boldsymbol{z}(t)) = \sum_{\boldsymbol{a}^{-k}} z_{(a^k = 1,\,\boldsymbol{a}^{-k})}(t) \tag{3.49} \]
where $z_{(a^k = 1,\,\boldsymbol{a}^{-k})}(t)$ denotes the empirical frequency of node $k$ choosing to “activate” and the rest picking the joint profile $\boldsymbol{a}^{-k}$ at time $t$.
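The marginalization in (3.49) and the block-diagonal construction in (3.47) can be sketched in a few lines of Python. This is only an illustration: representing the empirical frequency vector as a dictionary keyed by joint action tuples, and the helper names `activation_rates` and `activation_matrix`, are assumptions made here for readability, not notation from the thesis.

```python
import numpy as np

def activation_rates(z, num_nodes):
    """Marginal activation rate of each node, in the spirit of (3.49).

    z         : dict mapping joint action tuples in {0, 1}^K to their
                (discounted) empirical frequency; assumed to sum to one
    num_nodes : number of nodes K
    """
    rates = np.zeros(num_nodes)
    for profile, freq in z.items():
        for k in range(num_nodes):
            if profile[k] == 1:      # node k is active in this joint profile
                rates[k] += freq     # sum over the actions of all other nodes
    return rates

def activation_matrix(rates, param_dim):
    """Block-diagonal matrix diag(m^1 I_M, ..., m^K I_M), as in (3.47)."""
    return np.kron(np.diag(rates), np.eye(param_dim))
```

For instance, with two nodes and frequencies 0.7 on profile (1, 0), 0.2 on (0, 1) and 0.1 on (1, 1), node 1 is active 80% of the time and node 2 is active 30% of the time, and each rate is repeated along M consecutive diagonal entries of the resulting matrix.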
In view of (3.49), it can be readily seen that (3.48) is a direct consequence of (3.45).

3.5.2 Stability Analysis of the Limit System

In view of (3.46), $\tilde{\boldsymbol{\Psi}}(\cdot)$ is the slow component and $\boldsymbol{z}(\cdot)$ is the fast transient. Therefore, $\tilde{\boldsymbol{\Psi}}(\cdot)$ is quasi-static (it remains almost constant) while analyzing the behavior of $\boldsymbol{z}(\cdot)$. The weak convergence argument shows that $(\boldsymbol{z}^\mu(\cdot), \tilde{\boldsymbol{\Psi}}^\mu(\cdot))$ converges weakly to $(\boldsymbol{z}(\cdot), \tilde{\boldsymbol{\Psi}}(\cdot))$, as $\mu \to 0$, such that the limit is a solution to
\[ \begin{cases} \dfrac{d\tilde{\boldsymbol{\Psi}}(t)}{dt} = 0 \\[2mm] \dfrac{d\boldsymbol{z}(t)}{dt} \in \Big\{\sigma^k\big(\boldsymbol{z}(t), \tilde{\boldsymbol{\Psi}}(t) + \boldsymbol{\Psi}^o\big)\times\nu^{-k};\ \nu^{-k} \in \Delta\mathcal{A}^{-k}\Big\} - \boldsymbol{z}(t). \end{cases} \tag{3.50} \]
Technical details are tedious and omitted for brevity; see [192, Chapter 8] for more details.

The next step is then to prove asymptotic stability and characterize the set of global attractors of
\[ \frac{d\boldsymbol{z}(t)}{dt} \in \Big\{\sigma^k\big(\boldsymbol{z}(t), \tilde{\boldsymbol{\Psi}} + \boldsymbol{\Psi}^o\big)\times\nu^{-k};\ \nu^{-k} \in \Delta\mathcal{A}^{-k}\Big\} - \boldsymbol{z}(t). \tag{3.51} \]
Note in (3.51) that the slow component $\tilde{\boldsymbol{\Psi}}(t) = \tilde{\boldsymbol{\Psi}}$ is held fixed in the fast component since $d\tilde{\boldsymbol{\Psi}}(t)/dt = 0$ in (3.50). The stability analysis of (3.51) has been performed in detail in Section 2.5. Recall the local $R_L^k[n]$ and global $R_G^k[n]$ regret matrices. There exists a close connection between $\boldsymbol{z}[n]$ and the pair $(R_L^k[n], R_G^k[n])$ in that the pair regret process records the global activation behavior of the network translated, based on each node's utility function, into regrets. Since it is more convenient to work with $(R_L^k[n], R_G^k[n])$, in Section 2.5 we considered the limit system associated with this pair process and studied its asymptotic stability, in lieu of analyzing stability of the limit dynamical system associated with $\boldsymbol{z}[n]$. Theorem 2.4 simply implies that the fast dynamics (3.51) is globally asymptotically stable for each fixed $\tilde{\boldsymbol{\Psi}}$ and the set of correlated $\epsilon$-equilibria $\mathcal{C}_\epsilon(\tilde{\boldsymbol{\Psi}} + \boldsymbol{\Psi}^o)$ constitutes its global attractor set.
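Then, for sufficiently small values of $\mu$, we expect $\boldsymbol{z}(t)$ to closely track $\mathcal{C}_\epsilon(\tilde{\boldsymbol{\Psi}}(t) + \boldsymbol{\Psi}^o)$ for $t > 0$.

This in turn suggests looking at the differential inclusion
\[ \frac{d\tilde{\boldsymbol{\Psi}}(t)}{dt} \in \Big\{\boldsymbol{M}(\pi^c)(\boldsymbol{C} - R_u)\tilde{\boldsymbol{\Psi}}(t);\ \pi^c \in \mathcal{C}_\epsilon\big(\tilde{\boldsymbol{\Psi}}(t) + \boldsymbol{\Psi}^o\big)\Big\} \tag{3.52} \]
which captures the dynamics of the slow parameter $\tilde{\boldsymbol{\Psi}}(t)$ when the fast parameter $\boldsymbol{z}(t)$ is equilibrated at $\mathcal{C}_\epsilon(\tilde{\boldsymbol{\Psi}}(t) + \boldsymbol{\Psi}^o)$. The following lemma proves asymptotic stability of (3.52), and characterizes its global attractor.

Lemma 3.1. The non-autonomous differential inclusion (3.52) is globally asymptotically stable, and
\[ \lim_{t\to\infty}\big\|\tilde{\boldsymbol{\Psi}}(t)\big\| = 0. \tag{3.53} \]

Proof. The proof follows from that of Theorem 3.2. The diagonal matrix $\boldsymbol{M}(\pi^c)$ is positive-definite since all of its diagonal elements are positive. Since $\boldsymbol{M}(\pi^c)$ is symmetric and positive-definite and $\boldsymbol{C} - R_u$ is symmetric and negative-definite, all eigenvalues of $\boldsymbol{M}(\pi^c)(\boldsymbol{C} - R_u)$ are real and strictly negative (the product is similar to the symmetric negative-definite matrix $\boldsymbol{M}(\pi^c)^{1/2}(\boldsymbol{C} - R_u)\boldsymbol{M}(\pi^c)^{1/2}$); see also [157]. Therefore, (3.52) is globally asymptotically stable. Further,
\[ \det\big(\boldsymbol{M}(\pi^c)(\boldsymbol{C} - R_u)\big) \neq 0. \]
Therefore, the matrix $\boldsymbol{M}(\pi^c)(\boldsymbol{C} - R_u)$ is non-singular, the equation
\[ \boldsymbol{M}(\pi^c)(\boldsymbol{C} - R_u)\tilde{\boldsymbol{\Psi}} = 0 \]
has the unique solution $\tilde{\boldsymbol{\Psi}} = 0$, and 0 is the unique global attractor of (3.52) for any $\pi^c \in \mathcal{C}_\epsilon(\tilde{\boldsymbol{\Psi}}(t) + \boldsymbol{\Psi}^o)$.

The intuition provided here indeed carries over to the discrete-time iterates of the energy-aware diffusion LMS in Algorithm 3.1: the activation mechanism views the diffusion LMS as quasi-static while the diffusion LMS views the game-theoretic activation mechanism as almost equilibrated.

3.6 Numerical Study

This section is devoted to the numerical evaluation of the proposed energy-aware diffusion LMS adaptive filter in a small network of nodes. Section 3.6.1 first answers the question: why is it impossible to obtain an analytical rate of convergence for the proposed two-timescale stochastic approximation algorithm? We then elaborate on our semi-analytic approach to illustrate global network performance in Section 3.6.2.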
Finally, the simulation setup and the numerical resultsare presented in Section 3.6.3.3.6.1 Why Is Derivation of Analytical Convergence Rate Impossible?The rate of convergence refers to the asymptotic properties of the normalized errors about thelimit point ψo. It is well known that the asymptotic covariance of the limit process providesinformation about the rate of convergence of stochastic recursive algorithms; see [192, Chapter10]. For instance, in the standard single node LMS with constant step-size ε, the error ψ[n]−ψois asymptotically normally distributed with mean 0 and covariance εΣ, whereΣ = E{ξ[0]ξ[0]>}+∑∞τ=1 E{ξ[0]ξ[τ ]>}+∑∞τ=1 E{ξ[τ ]ξ[0]>}(3.54)and ξ[τ ] = u[τ ]>(y[τ ]− u[τ ]ψo). If the noise is i.i.d., the covariance can be computed viaΣ = E{u[n]>u[n](y[n]− u[n]ψo)2}= Ruσ2v (3.55)where Ru = E{u[n]>u[n]}, and σ2v = E{v[n]2}. The figures of merit such as the asymptoticexcess mean square error (EMSE) can then be obtained as follows:EMSE = 12Tr(Σ) (3.56)where Tr(·) denotes the trace operator.This approach, however, is not applicable in our analysis of the energy-aware diffusion LMSsince the limiting process for the diffusion LMS is a differential inclusion rather than an ordinarydifferential equation—see (3.52). That is, for each initial condition, the sample path of errorsbelongs to a set, and independent runs of the algorithm result in different sample paths. Thisprohibits deriving an analytical rate of convergence, and accordingly analytical figure of merits,for the energy-aware diffusion LMS algorithm, though it is obvious that its performance isupper bounded by the standard diffusion LMS with no activation control mechanism3. In whatfollows, we resort to Monte Carlo simulations in order to illustrate and compare the rate ofconvergence and performance of the energy-aware diffusion LMS.3The interested reader is referred to [64, Section IV-D] and [204] for the derivation of EMSE in standarddiffusion LMS and related results.1053.6. 
3.6.2 Semi-Analytic Approach to Numerical Study

In view of the discussion in Section 3.6.1, we conduct a semi-analytic numerical study in this section to demonstrate the performance of Algorithm 3.1. Recall from Section 3.5 that the limiting process for the discrete-time iterates of the energy-aware diffusion LMS is the differential inclusion given in (3.52). This in turn suggests that, at each time $n$, the diagonal elements of $M(z[n])$ can be picked according to (3.49) using a vector $z[n]$ drawn from the set of correlated $\epsilon$-equilibria $\mathcal{C}(\Psi[n])$. Recall, however, that $z[n]$ converges to the set $\mathcal{C}(\Psi[n])$ rather than to a particular point in that set. In fact, even if $\Psi[n] = \Psi$ is held fixed, $z[n]$ can generally move around in the polytope $\mathcal{C}(\Psi)$. Since the game-theoretic activation mechanism is run on the faster timescale, its behavior in the slower timescale (diffusion LMS) can be modeled by arbitrarily picking points from the polytope $\mathcal{C}(\Psi[n])$ and constructing $M(z[n])$ accordingly.

We implement the above idea in the simulations by using Matlab's cprnd function, which draws samples with uniform distribution from the interior of a polytope defined by a system of linear inequalities $Ax < b$. Note that, theoretically, finding a correlated $\epsilon$-equilibrium in a game is equivalent to solving a linear feasibility problem of the form $A\pi < I$ (see Definition 2.3). Using a uniform distribution for sampling from $\mathcal{C}(\Psi[n])$ is no loss of generality since it assumes no prior on the interior points of $\mathcal{C}(\Psi[n])$, and matches the essence of convergence to a set. Once the sample $\pi^c[n]$, which is essentially a probability distribution on the joint activation decisions of nodes, is taken, it is marginalized on each node to obtain individual nodes' activation strategies. This reduces the complexity of our numerical study to a one-level (rather than a two-level) stochastic simulation.

3.6.3 Numerical Example

Figure 3.2 depicts the network topology we study in this example.
We define $C = [c_{kl}]$ in (3.7) as

\[
c_{kl} =
\begin{cases}
\dfrac{1}{|\mathcal{N}_k^c|}, & l \in \mathcal{N}_k^c, \\
0, & \text{otherwise.}
\end{cases} \tag{3.57}
\]

This is the most natural choice, and is sufficient for our demonstration purposes. One can, however, employ a smart adaptive weighting scheme to improve performance; see [204] for further details. Figure 3.2 further illustrates the statistics of each node's regressors, generated by a Gaussian Markov source with local autocorrelation function of the form $r(\tau) = \sigma_{u,k}^2 \alpha_{u,k}^{|\tau|}$, where $\alpha_{u,k}$ is the correlation index. In simulations, we assume nodes are equipped with the Chipcon CC2420 transceiver chipset, which implements the CSMA/CA protocol for exchanging estimates with neighbors. The reader is referred to Appendix A for a detailed model description and an expression for $E_{\mathrm{Tx}}(\cdot)$ in (3.28). We further set $k_c = k_m = 0.01$, meaning that sleeping sensors only wake up 1% of the time for monitoring. We further assume the noise $v_k[n]$ is i.i.d. with $\sigma_{v,k}^2 = 10^{-3}$ for all $k \in \mathcal{K}$, and $\varepsilon = 0.01$.

[Figure 3.2: Network topology and nodes' regressors statistics. (a) Network topology (adopted from [204]). (b) Nodes' regressors statistics: $\sigma_{u,k}^2$ and $\alpha_{u,k}$ versus sensor index.]

[Figure 3.3: Trade-off between energy expenditure and estimation accuracy after 1000 iterations. Horizontal axis: $C_{l,2}$. Vertical axes: proportion of active periods (%) and $\mathrm{EMSE}^{\mathrm{net}}$ (dB) after $10^3$ iterations. Curves are shown for $\gamma_l = \gamma_g = 1$ and $\gamma_l = \gamma_g = 2$.]

[Figure 3.4: Network excess mean square error $\mathrm{EMSE}^{\mathrm{net}}$ versus iteration $k$. Curves: diffusion LMS (simulation), diffusion LMS (theory), energy-aware diffusion LMS with $C_{l,1} = C_g = 2$, and energy-aware diffusion LMS with $C_{l,1} = C_g = 1$.]

Define the network excess mean square error:

\[
\mathrm{EMSE}^{\mathrm{net}} := \frac{1}{K}\sum_{k=1}^{K} \mathrm{EMSE}_k
\]

where $\mathrm{EMSE}_k$ denotes the EMSE for node $k$.
$\mathrm{EMSE}^{\mathrm{net}}$ is simply obtained by averaging

\[
\mathrm{EMSE}^{\mathrm{net}} = \frac{1}{K}\sum_{k=1}^{K} \big|u_k[n]\big(\psi^o - \psi_k[n-1]\big)\big|^2.
\]

Figure 3.3 demonstrates the trade-off between energy expenditure in the network and the rate of convergence of the diffusion LMS algorithm in terms of $\mathrm{EMSE}^{\mathrm{net}}$ after 1000 iterations. As the energy consumption parameter $C_{l,2}$ increases, nodes become more conservative and, accordingly, activate less frequently due to receiving lower utilities; see (3.28). This reduces the average proportion of active nodes and increases $\mathrm{EMSE}^{\mathrm{net}}$, since fewer measurements are recorded and neighboring estimates are fused less frequently. Increasing the pricing parameters $\gamma_l$ in (3.28) and $\gamma_g$ in (3.32) has the same effect, as can be observed in Figure 3.3. The global performance of Algorithm 3.1 is further compared with the standard diffusion LMS [204] in Figure 3.4. As the pricing parameters $C_{l,1}$ in (3.28), associated with the contribution of nodes to local parameter estimation, and $C_g$ in (3.32), corresponding to the contribution of nodes to local diffusion, increase, nodes activate more frequently. As shown in Figure 3.4, this improves the rate of convergence of the energy-aware diffusion LMS algorithm across the network.

3.7 Closing Remarks

Building upon the adaptive learning algorithm of Chapter 2 and cooperative diffusion strategies in adaptive networks, this chapter considered energy-aware parameter estimation via the diffusion least mean squares (LMS) adaptive filter by equipping nodes with a game-theoretic activation mechanism. We analyzed the stability and consistency properties of the diffusion LMS adaptive filter via the powerful ODE method. After carefully formulating the energy-aware activation control problem as a slowly switching non-cooperative game with neighborhood monitoring, it was shown how the adaptive learning algorithm of Chapter 2 can be combined with the diffusion LMS adaptive filter so that the combination functions as an energy-aware parameter estimation filter.
The resulting energy-aware diffusion LMS retains the appealing features of the diffusion LMS adaptive filter, such as scalability and stabilizing effect on the network, improved estimation performance, and savings in computation, communication and energy expenditure, yet coordinates activation of nodes such that the global activation behavior at each time is within a small $\epsilon$-distance of the correlated equilibria set of the underlying activation control game.

Therefore, while individual nodes possess limited communication, sensing and processing capability, it is their coordinated behavior that leads to the manifestation of rational global behavior which can adapt itself to complex data relations such as the spatial-temporal correlation of sensors' measurements. The proposed formulation conveys a design guideline: by carefully specifying the utility functions, these networked devices can coordinate themselves to exhibit desirable global behavior.

The research so far in the signal processing community has mostly focused on the well-known Nash equilibrium when dealing with game-theoretic analysis of distributed multi-agent systems. This is despite the advantages of the correlated equilibrium, such as structural and computational simplicity, and potentially higher "rewards" to individual agents. Further, the independence assumption underlying the definition of Nash equilibrium is rarely true in multi-agent learning scenarios since the environment naturally correlates agents' actions. The correlated equilibrium can most naturally be viewed as imposing equilibrium conditions on the joint action profile of agents, rather than on individual agents' actions as in Nash equilibrium. An intuitive interpretation is coordination in decision-making, as described by the following example: Suppose, at the beginning of each period, each node receives a private recommendation $a_k$ exogenously as to whether to "activate" or "sleep", where the joint recommendation to all nodes $a = (a_1, \ldots, a_K)$ is drawn from a distribution. Such a joint distribution correlates nodes' actions; hence the name correlated equilibrium. A correlated equilibrium results if every node realizes that the recommendation is a best response to the random estimated activity of others, provided that others, as well, follow their recommendations. This implies coordinated activity of nodes once they reach the correlated equilibria set.

In light of the above discussion, it is now time for researchers in many fields to exploit these benefits to enhance and augment the design of aware, flexible, and self-regulating systems.

Part II
Adaptive Randomized Search for Discrete Stochastic Optimization

4 Smooth Best-response Adaptive Search for Discrete Stochastic Optimization

"It is perhaps natural that the concept of best or optimal decisions should emerge as the fundamental approach for formulating decision problems."
D. G. Luenberger [206]

4.1 Introduction

The second part of this thesis concerns finding the optimal decision among a set of feasible alternatives in the presence of uncertainty, but in the absence of any interaction with other individuals. This problem is mathematically formulated as a discrete stochastic optimization problem, and arises in the design and analysis of many real-life discrete-event systems such as supply chain management [82, 102, 103], communication networks [41, 133, 186], manufacturing systems [159, 274], and transportation networks [207, 237].

4.1.1 Simulation-Based Optimization

In many real-life situations, due to the unexplained randomness and complexity involved, there typically exists no explicit relation between the performance measure of interest and the underlying decision variables.¹ Computational advances have particularly benefited the area of optimization for stochastic discrete-event systems, where the complexities necessitate simulation in order to estimate performance.
In such cases, one can only observe "samples" of the system performance through a simulation process [100]. Computer simulations are used extensively as models of real systems to evaluate output responses in many areas including supply chain management, finance, manufacturing, engineering design, and medical treatment [169]. The problem of finding the optimal design/decision can then be seen as effectively combining simulation and optimization.

¹ In the experimental design literature, the performance measure is usually referred to as the response, and the parameters as factors.

The so-called simulation-based discrete stochastic optimization [10, 100, 243] is an emerging field which integrates optimization techniques into simulation analysis. The corresponding objective function is an associated measurement of an experimental simulation. Due to the complexity of the simulations, the objective function may be difficult and expensive to evaluate. Moreover, the inaccuracy of the objective function values often complicates the optimization process. Therefore, deterministic optimization tools are not useful. Further, derivative-dependent methods are not applicable as, in many cases, derivative information is unavailable. To summarize, there are two difficulties involved in solving such problems: (i) the optimization problem itself is NP-hard, and (ii) experimenting on a trial feasible solution can be too computationally expensive to be repeated numerous times.

4.1.2 Randomized Search Methods

In Part II, we focus on sampling-based random search methods to find the global optimum of a discrete stochastic optimization problem that undergoes random unpredicted changes. The appeal of sampling-based methods is that they often approximate well, with a relatively small number of samples, problems with a large number of scenarios; see [55] and [199] for numerical reports.
Random search algorithms typically generate candidate solutions from the neighborhood of a selected solution in each iteration. Some algorithms have a fixed neighborhood, while others change their neighborhood structure based on the information gained through the optimization process. The sampled candidate is then evaluated and, based on the estimate of its objective value, the neighborhood or the sampling strategy is modified for the next period. The random search methods are adaptive in the sense that they use information gathered during previous iterations to adaptively decide how to exhaust the simulation budget in the current iteration. Random search algorithms usually assume no structure on the objective function, and are globally convergent.

Many systems of practical interest are too complex to be simulated numerous times at a candidate configuration in order to obtain a more accurate estimate of its objective value; doing so is computationally prohibitive. It is therefore of interest to design specialized search techniques that are both attracted to the globally optimum candidate and efficient. The efficiency of a random search method is typically evaluated by the balance established between exploration, exploitation, and estimation. Exploration refers to searching globally for promising candidates within the entire search space, exploitation involves local search of promising subregions of the set of candidate solutions, and estimation refers to obtaining more precise estimates at desirable candidates.

4.1.3 Chapter Goals

This chapter focuses on the following two extensions of a standard simulation-based discrete stochastic optimization problem:

1. The data collected via simulation experiments is temporally correlated, which is more realistic in practice.
2. The elements of the problem, such as the profile of the stochastic events or the objective function, undergo random unpredicted changes.

The goal is to devise a class of sampling-based adaptive search algorithms that are both capable of tracking the non-stationary global optimum, and efficient in the sense that they localize the optimum by taking and evaluating as few samples as possible from the search space. The resulting algorithms can thus be deployed as online control mechanisms that enable self-configuration of large-scale stochastic systems.

4.1.4 Main Results

When little is known about the structure of a discrete stochastic optimization problem and the objectives cannot be evaluated exactly, one has to evaluate each feasible solution several times so as to obtain a somewhat accurate estimate of its objective value. This is where the role of the sampling strategy in a randomized search algorithm becomes pivotal. The sampling strategy that we propose in this chapter is inspired by adaptive learning algorithms in interactive situations. We particularly take our inspiration from the stochastic fictitious play learning rule [106] in non-cooperative game theory.

The main results in this chapter are summarized below:

1. A regime switching model that captures random evolution of parameters affecting the stochastic behavior of the elements underlying the discrete stochastic optimization problem or the objective function (or both). This model is mainly meant for analyzing the tracking capability of the devised adaptive search scheme. The typical independence assumption among the data collected via simulation experiments is further relaxed, which is more realistic in practice.

2. A smooth best-response adaptive search scheme. A class of adaptive search algorithms is proposed that relies on a smooth best-response sampling strategy. The proposed algorithm
The proposed algorithmassumes no functional properties such as sub-modularity, symmetry, or exchangeability onthe objective function, and relies only on the data collected via observations, measurementsor simulation experiments.3. Performance analysis of the smooth best-response adaptive search algorithm. We prove that,if the evolution of the parameters underlying the discrete stochastic optimization problem1144.1. Introductionoccur on the same timescale as the updates of the adaptive search algorithm, it can properlytrack the global optimum by showing that a regret measure can be made and kept arbitrarilysmall infinitely often. Further, if now the time variations occur on a slower timescale thanthe adaptation rate of the proposed algorithm, the most frequently sampled element of thesearch space tracks the global optimum as it jumps over time. This in turn implies that theproposed scheme exhausts most of its simulation budget on the global optimum.4. Numerical evaluation verifies faster convergence to the global optimum, and improved effi-ciency as compared with the random search methods in [9, 290], and the upper confidencebound schemes in [22, 115].It is well known that, if the random changes underlying the discrete stochastic optimizationproblem occurs too rapidly, there is no chance one can track the time-varying optima via anadaptive search algorithm [290]. Such a phenomenon is known as trackability in the literatureof adaptive filtering [40]. On the other hand, if the changes are much slower as compared toupdates of the adaptive search algorithm, the discrete stochastic optimization problem can beapproximated by a stationary problem, and the changes can be ignored. 
This chapter focuseson the non-trivial case where the changes occur on the same timescale as the updates of theadaptive search algorithm, and prove that the proposed scheme properly tracks the time-varyingglobal optimum.The tracking analysis proceeds as follows: First, by a combined use of weak convergencemethods [192] and treatment on Markov switched systems [291, 293], we show that the limitdynamical system representing the discrete-time iterates of the proposed adaptive search algo-rithm is a switched ordinary differential equation (ODE). This is in contrast to the standardtreatment of stochastic approximation algorithms, where the limiting dynamics converge to a de-terministic ODE. By using Lyapunov function methods for randomly switched systems [67, 197],the stability analysis then shows that the limit system is asymptotically stable almost surely.This in turn establishes that the limit points of the switching ODE and the discrete-time iter-ates of the algorithm coincide. The final step concludes the tracking and efficiency results bycharacterizing the set of global attractors of the limit dynamical system.4.1.5 Chapter OrganizationThe rest of this chapter is organized as follows: Section 4.2 describes the standard discretestochastic optimization problem. Section 4.3 then presents the regime switching discrete stochas-tic optimization, and formalizes the main assumptions posed on the problem. In Section 4.4,the smooth best-response adaptive search scheme is presented and elaborated. The main the-orem of this chapter entailing the tracking and efficiency properties is given in Section 4.5. Astep-by-step proof of the main results is summarized in Section 4.6. The case of slowly evolvingdiscrete stochastic optimization problem is analyzed in Section 4.7. Finally, numerical examples1154.2. Discrete Stochastic Optimization Problemare provided in Section 4.8 followed by the closing remarks in Section 4.9. 
The detailed proofs are relegated to Section 4.10 for clarity of presentation.

4.2 Discrete Stochastic Optimization Problem

In its simplest setting, a discrete stochastic optimization problem is formulated as

\[
\min_{s \in \mathcal{S}} F(s) := E\{F(s, x_n(s))\} \tag{4.1}
\]

where each element is described below:

1. Feasible Set: $\mathcal{S} = \{1, 2, \ldots, S\}$ is discrete and finite. This may be due to the nature of the problem itself, a reduction through other analyses, or a simplification due to practical considerations. For example, in an $(s, S)$ inventory problem (see, e.g., [296]), it may be that order quantities are restricted by the supplier and, through rough analytical calculations, good estimates of upper and lower bounds on the optimal restocking order can be found [100].

The problem is unconstrained in the sense that no feasibility constraint exists. For a constrained variant, one can either solve the set of feasibility constraints (if possible) in advance to obtain $\mathcal{S}$, or use the method of discrete Lagrange multipliers (see, e.g., [278, 279]). The feasible set can be conceived as the set of possible configurations of a system subject to stochastic inputs.

2. Stochastic Events: $x_n(s)$ is a bounded $\mathbb{R}^r$-valued stationary random process representing the stochastic inputs to the system, whose distribution may vary across the set of candidate solutions $\mathcal{S}$.

3. Objective Function: $F : \mathcal{S} \times \mathbb{R}^r \to \mathbb{R}$ denotes the bounded performance function of interest for the stochastic system. In view of (4.1), one is therefore interested in optimizing the expected performance of a stochastic system over a discrete set of feasible design alternatives.

In many practical applications, the random process $x_n(s)$ has finite variance but unknown distribution; hence, the expectation cannot be evaluated analytically. Stochastic simulation has proven to be a powerful tool in such cases owing to its great modeling flexibility. That is, one can obtain a realization of $F(s, x_n(s))$, for any choice of $s$, via a simulation experiment.

Example 4.1.
Consider the problem of optimizing the buffer sizes in a queueing network comprising multiple stations with buffers. Such a network may represent an assembly line in the manufacturing industry, networked processors in parallel computing, or a communication network. Let $s$ and $x_n(s)$ denote the index to the vector of buffer sizes and the random vector of service times at different stations, respectively. The function $F(s, x_n(s))$ is often set to the amortized cost of buffers minus the revenues due to the processing speed (cf. [287], [192, Chapter 2.5]). In this case, it is very time consuming to write an explicit formula for $F(s, x_n(s))$ in closed form. Furthermore, when a multi-station multi-server queueing network is subject to some recycling and a complex priority system, it would be impossible to model the network analytically and obtain an explicit formula for $F(s, x_n(s))$ [132, p. 455].

Let

\[
f_n(s) = F(s, x_n(s)). \tag{4.2}
\]

It is typically assumed that the sample mean of $F(s, x_n(s))$ is a strongly consistent estimator of $F(s)$. More precisely, let

\[
\hat{F}_N(s) := \frac{1}{N}\sum_{n=1}^{N} f_n(s) \tag{4.3}
\]

denote the sample average over $N$ simulation data collected on candidate solution $s$. Then, for all $s \in \mathcal{S}$,

\[
\hat{F}_N(s) \to F(s) \quad \text{almost surely as } N \to \infty. \tag{4.4}
\]

The above assumption is satisfied by most simulation outputs [155]. It reduces to Kolmogorov's strong law of large numbers if $\{f_n(s)\}$ is independent and identically distributed (i.i.d.), and follows from ergodicity if $\{f_n(s)\}$ is ergodic.

A brute force method of solving the discrete stochastic optimization problem (4.1) involves an exhaustive enumeration:

1. For each candidate solution $s$, compute $\hat{F}_N(s)$, defined in (4.3), via simulation for large $N$.
2. Pick $s^* = \arg\min_{s \in \mathcal{S}} \hat{F}_N(s)$.

The above procedure is highly inefficient since one has to perform a large number of simulation experiments at each feasible alternative; however, those performed on non-promising candidates do not contribute to finding the global optimum and are wasted. The main idea is thus to develop a novel adaptive search scheme that is both attracted to the global optima set and efficient, in the sense that it spends less time simulating the non-promising alternatives [235, Chapter 5].

4.3 Non-stationary Discrete Stochastic Optimization Problem

This chapter considers two extensions of the above problem:

1. It allows for $\{f_n(s)\}$ to be a correlated sequence as long as it satisfies the weak law of large numbers.
2. It studies a non-stationary variant where the global optimum jumps with time due to underlying time variations in the problem.

Such problems arise in a broad range of practical applications where the goal is to track the optimal operating configuration of a stochastic system subject to exogenous time inhomogeneity. More precisely, we consider discrete stochastic optimization problems of the form

\[
\min_{s \in \mathcal{S}} F(s, \theta[n]) := E\big\{F\big(s, x_n(s, \theta[n]), \theta[n]\big)\big\} \tag{4.5}
\]

where the feasible set $\mathcal{S} = \{1, 2, \ldots, S\}$ is defined as before. All time-varying parameters underlying the system, e.g., the profile of stochastic events or the objective function (or both), are assumed to be finite-state and absorbed into a vector. For simplicity, we index these vectors by $\mathcal{Q} = \{1, \ldots, \Theta\}$, and work with the indices instead. Evolution of the underlying parameters can then be captured by the integer-valued process $\theta[n] \in \mathcal{Q}$. It is assumed that $\theta[n]$, $n = 1, 2, \ldots$, is unobservable with unknown dynamics. The random process $x_n(s, \theta)$, as before, represents the stochastic events occurring in the system, except that now its distribution may vary across the state-space of the underlying time-varying parameters as well.
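As an illustration of the model in (4.5), the following sketch generates a regime path $\theta[n]$ and the corresponding noisy objective samples. It is only a toy instance under explicit assumptions: the text above leaves the dynamics of $\theta[n]$ unspecified, so the Markov chain, its transition matrix `trans`, the table of mean objective values `means`, and the Gaussian noise are all hypothetical choices made here for demonstration. Candidates and regimes are indexed from 0 rather than 1.

```python
import random

def simulate_regime_switching(num_steps, trans, means, noise_std=0.5, seed=0):
    """Generate theta[n] from an assumed Markov chain and samples f_n(s, theta[n])."""
    rng = random.Random(seed)
    num_regimes = len(trans)
    theta = 0                                  # assumed initial regime
    path, samples = [], []
    for _ in range(num_steps):
        path.append(theta)
        # One noisy objective sample per candidate under the current regime.
        samples.append([m + rng.gauss(0.0, noise_std) for m in means[theta]])
        # Move to the next regime according to row theta of the transition matrix.
        theta = rng.choices(range(num_regimes), weights=trans[theta])[0]
    return path, samples

# Hypothetical two-regime problem with three candidates; rows of `means`
# play the role of F(s, theta) for theta = 0 and theta = 1.
trans = [[0.99, 0.01], [0.01, 0.99]]
means = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
path, samples = simulate_regime_switching(1000, trans, means)

# The minimizer of F(., theta[n]) along the path jumps whenever theta[n] does.
optimizers = [min(range(3), key=lambda s: means[th][s]) for th in path]
print("regime visits:", path.count(0), path.count(1))
```

In an actual search procedure only the sample at the chosen candidate would be observed at each step, and `path` would remain hidden; both are exposed here purely to visualize the non-stationarity.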
Notice that the set of optimizers of (4.5), denoted by $\mathcal{M}(\theta[n])$, is non-stationary and evolves following the sample path of $\theta[n]$. For example, changes in the market trend may affect the demand for a particular product and, hence, the optimal restocking strategy in an inventory management system. Tracking such time-varying sets lies at the very heart of applications of adaptive stochastic approximation algorithms.

Since only samples of the objective function values are realized via simulation, we use

\[
f_n(s, \theta[n]) := F\big(s, x_n(s, \theta[n]), \theta[n]\big). \tag{4.6}
\]

Therefore, the non-stationary discrete stochastic optimization problem in (4.5) becomes

\[
\min_{s \in \mathcal{S}} E\{f_n(s, \theta[n])\}. \tag{4.7}
\]

Note that each stochastic process $f_n(s, \theta[n])$ in (4.7) is no longer stationary even though $f_n(s, i)$ is stationary for each $i \in \mathcal{Q}$. Intuitively, in lieu of one sequence, we have $\Theta$ sequences $\{f_n(s, i) : i \in \mathcal{Q}\}$ for each feasible decision $s \in \mathcal{S}$. One may imagine that there is a random environment that is dictated by the process $\theta[n]$. If $\theta[n]$ takes a value, say, $i$, at time $n$, then $f_n(s, i)$ will be the sequence of data output by simulation.²

² The interested reader is referred to [221] for a detailed treatment of the asymptotic properties of such stochastic processes.

We make the following assumption on the sequence of data obtained via simulation experiments.

Assumption 4.1. Let $E_\ell$ denote the conditional expectation given $\mathcal{F}_\ell$, the $\sigma$-algebra generated by $\{f_n(s, i), \theta[n] : i \in \mathcal{Q}, s \in \mathcal{S}, n < \ell\}$. Then, for each $s \in \mathcal{S}$ and for all $\theta \in \mathcal{Q}$:

(i) $\{f_n(s, \theta)\}$ is a bounded, stationary, and real-valued sequence of random variables.

(ii) For any $\ell \geq 0$, there exists $F(s, \theta)$ such that

\[
\frac{1}{N}\sum_{n=\ell}^{N+\ell-1} E_\ell\{f_n(s, \theta)\} \to F(s, \theta) \quad \text{in probability as } N \to \infty. \tag{4.8}
\]

The above condition allows us to work with correlated sequences of simulation data (on a particular feasible solution) whose remote past and distant future are asymptotically independent.
The result follows from the mixing condition with appropriate mixing rates. Examples include sequences of i.i.d. random variables with bounded variance, martingale difference sequences with finite second moments, moving average processes driven by a martingale difference sequence, mixing sequences in which remote past and distant future are asymptotically independent, and certain non-stationary sequences such as a function of a Markov chain (cf. [155, 286]).

Since the sequence $\{\theta[n]\}$ cannot be observed, we suppress the dependence in (4.7) on $\theta[n]$, and keep using $f_n(s)$ in the rest of this chapter to denote the objective function value at feasible solution $s$ realized at time $n$.

4.4 Smooth Best-Response Adaptive Randomized Search

This section introduces a class of stochastic approximation algorithms that, relying on smooth best-response strategies [39, 151], prescribes how to sample from the search space so as to efficiently learn and track the non-stationary global optimum. To this end, Section 4.4.1 defines the class of smooth best-response sampling strategies and outlines its distinct properties. The proposed adaptive search algorithm is then presented in Section 4.4.2.

4.4.1 Smooth Best-Response Sampling Strategy

A randomized search algorithm repeatedly takes samples from the search space according to a randomized sampling strategy (a probability distribution over the search space). These samples are then evaluated via real-time simulation experiments, using which the sampling strategy and the estimate of the global optimum are revised. The sampling strategy that we propose, namely, the smooth best-response sampling strategy, is inspired by learning algorithms in games [106, 151], and is defined as follows.

Definition 4.1. Let

\[
\Delta\mathcal{S} = \Big\{ p \in \mathbb{R}^S \,;\; p_i \geq 0,\ \sum_{i \in \mathcal{S}} p_i = 1 \Big\} \tag{4.9}
\]

denote the simplex of all probability distributions on the search space $\mathcal{S}$.
Choose a perturbation function $\rho(\sigma) : \mathrm{int}(\Delta\mathcal{S}) \to \mathbb{R}$, where $\mathrm{int}(\Delta\mathcal{S})$ denotes the interior of the simplex $\Delta\mathcal{S}$, such that:

1. $\rho(\cdot)$ is $C^1$ (i.e., continuously differentiable), strictly concave, and $|\rho| \leq 1$;

2. $\|\nabla\rho(\sigma)\| \to \infty$ as $\sigma$ approaches the boundary of $\Delta\mathcal{S}$:

\[
\lim_{\sigma \to \partial(\Delta\mathcal{S})} \|\nabla\rho(\sigma)\| = \infty
\]

where $\|\cdot\|$ denotes the Euclidean norm, and $\partial(\Delta\mathcal{S})$ represents the boundary of the simplex $\Delta\mathcal{S}$.

Given a belief vector

\[
\phi = \mathrm{col}(\phi_1, \ldots, \phi_S) \in \mathbb{R}^S
\]

the smooth best-response sampling strategy is then defined by

\[
b^\gamma(\phi) := \arg\min_{p \in \Delta\mathcal{S}} \Big( \sum_{i \in \mathcal{S}} p_i \phi_i - \gamma\rho(p) \Big), \quad 0 < \gamma < \Delta F \tag{4.10}
\]

where $\gamma$ is the parameter that determines the exploration weight, and $\Delta F$ denotes the greatest difference in objective function values among the candidate solutions.

The conditions imposed on the perturbation function $\rho$ lead to the following distinct properties of the resulting strategy (4.10):

i) The strict concavity condition ensures uniqueness of $b^\gamma(\cdot)$.

ii) The boundary condition implies $b^\gamma(\cdot)$ belongs to the interior of the simplex $\Delta\mathcal{S}$.

The smooth best-response sampling strategy (4.10) approaches the unperturbed best-response strategy $\arg\min_{i \in \mathcal{S}} \phi_i$ when $\gamma \to 0$, and the uniform randomization when $\gamma \to \infty$. As is natural in any learning scenario, one expects the adaptive discrete stochastic optimization algorithm to exhibit exploration so as to learn the stochastic behavior that affects the performance measure of interest in the system. The smooth best-response sampling strategy (4.10) exhibits exploration using the idea of adding random values to the beliefs,³ as explained below: Suppose $\beta = (\beta_1, \ldots, \beta_S)$ is a random vector that takes values in $\mathbb{R}^S$ according to some strictly positive distribution, and represents the random disturbances added to the beliefs.

³ This is in contrast to picking candidate solutions at random with a small probability, which is common in game-theoretic learning such as Algorithm 2.1 and multi-armed bandit algorithms.
Given the belief vector $\boldsymbol{\phi}$, suppose one samples each feasible solution $i$ following the choice probability function
$$C_i(\boldsymbol{\phi}) = P\Big( \operatorname*{argmin}_{s \in \mathcal{S}} [\phi_s - \beta_s] = i \Big).$$
Using the Legendre transformation, [151, Theorem 2.1] shows that, regardless of the distribution of the random vector $\boldsymbol{\beta}$, a deterministic representation of the form
$$\operatorname*{argmin}_{\mathbf{p} \in \Delta\mathcal{S}} \big( \mathbf{p} \cdot \boldsymbol{\phi} - \gamma\rho(\mathbf{p}) \big)$$
can be obtained for the best-response strategy with added random disturbances. That is, the sampling strategy (4.10) with deterministic perturbation is equivalent to a strategy that minimizes the sum of the belief and the random disturbance.

The smooth best-response strategy constructs a genuine randomized strategy. This is an appealing feature since it circumvents the discontinuity inherent in algorithms of pure best-response type (i.e., $\gamma = 0$ in (4.10)). In such algorithms, small changes in the belief $\boldsymbol{\phi}$ can lead to an abrupt change in the sampling behavior of the algorithm. Such switching in the dynamics of the algorithm complicates the convergence analysis.

Example 4.2. An example of the function $\rho$ in Definition 4.1 is the entropy function [106, 107]
$$\rho(\mathbf{p}) = -\sum_{i \in \mathcal{S}} p_i \ln(p_i)$$
which gives rise to the well-known Boltzmann exploration strategy with constant temperature
$$b_i^\gamma(\boldsymbol{\phi}) = \frac{\exp(-\phi_i/\gamma)}{\sum_{j \in \mathcal{S}} \exp(-\phi_j/\gamma)}. \tag{4.11}$$
Such a strategy is also used in the context of learning in games, where it is widely known as logistic fictitious play [105] or the logit choice function [151].

We shall now define the belief vector and describe its update mechanism. The belief is a vector
$$\boldsymbol{\phi}[n] = \operatorname{col}(\phi_1[n], \ldots, \phi_S[n]) \in \mathbb{R}^S$$
where each element $\phi_i[n]$ is the estimate at time $n$ of the objective function value at candidate $i$. Let $s[n]$ be the candidate sampled from the search space $\mathcal{S}$ at time $n$ according to the smooth best-response sampling strategy $\mathbf{b}^\gamma(\boldsymbol{\phi}[n])$. Then, each element of the belief vector is recursively updated via
$$\phi_i[n+1] = \phi_i[n] + \mu \left[ \frac{f_n(s[n])}{b_i^\gamma(\boldsymbol{\phi}[n])} \cdot I(s[n] = i) - \phi_i[n] \right], \quad 0 < \mu \ll 1 \tag{4.12}$$
where $I(\cdot)$ denotes the indicator function: $I(X) = 1$ if the statement $X$ is true, and 0 otherwise. Further, $f_n(s[n])$ represents the simulated objective function value at the sampled candidate at time $n$. The normalization factor $1/b_i^\gamma(\boldsymbol{\phi}[n])$, intuitively speaking, makes the length of the periods during which each element $i$ has been sampled comparable to the total time elapsed. The discount factor $\mu$ is a small parameter that places more weight on more recent simulations, and is necessary since we aim to track time-varying global optima. It can be easily verified that $\phi_i[n]$ in (4.12) can be defined directly using the output of simulation experiments up to time $n$ as follows:
$$\phi_i[n] := (1-\mu)^{n-1} \left[ \frac{f_1(s[1])}{b_i^\gamma(\boldsymbol{\phi}[1])} \right] I(s[1] = i) + \mu \sum_{2 \le \tau \le n} (1-\mu)^{n-\tau} \left[ \frac{f_\tau(s[\tau])}{b_i^\gamma(\boldsymbol{\phi}[\tau])} \right] I(s[\tau] = i). \tag{4.13}$$
This verifies that each component of the belief vector has the interpretation of a discounted average of the objective function at the corresponding candidate solution. We shall emphasize that (4.13) relies only on the data obtained from simulation experiments; it requires neither the system model nor the realizations of $\theta[n]$ (which captures the exogenous non-stationarities).

4.4.2 Adaptive Search Algorithm

The smooth best-response adaptive search can be simply described as an adaptive sampling scheme. Based on the belief $\boldsymbol{\phi}[n]$, it prescribes taking a sample $s[n]$ from the search space according to the smooth best-response sampling strategy $\mathbf{b}^\gamma(\boldsymbol{\phi}[n])$ in (4.10). A simulation experiment is then performed to evaluate the sample and obtain $f_n(s[n])$, which is in turn fed into a stochastic approximation algorithm to update the belief and, hence, the sampling strategy.
The proposed algorithm relies only on the simulation data and requires minimal computational resources per iteration: it needs only one simulation experiment per iteration, as compared to two in [9], or $S$ (the size of the search space) in [235, Chapter 5.3]. Yet, as evidenced by the numerical evaluations in Section 4.8, it achieves performance gains in terms of the tracking speed.

Let $\mathbf{0}_S$ denote a column zero vector of size $S$. The smooth best-response adaptive search algorithm is summarized below in Algorithm 4.1.

Remark 4.1. We shall emphasize that Algorithm 4.1 assumes no knowledge of $\theta[n]$ or its dynamics. If $\theta[n]$ were observed, one could form and update beliefs $\boldsymbol{\phi}^\theta[n]$ independently for each $\theta \in \mathcal{Q}$, and use $\mathbf{b}^\gamma(\boldsymbol{\phi}^{\theta'}[n])$ to select $s[n]$ once the system switched to $\theta'$.

Algorithm 4.1 Smooth Best-Response Adaptive Search

Initialization: Choose $\rho$ according to the conditions in Definition 4.1. Set the adaptation rate $0 < \mu \ll 1$, and the exploration parameter $0 < \gamma < \Delta F$, where $\Delta F$ is an estimate of the greatest difference in objective function values. Initialize $\boldsymbol{\phi}[0] = \mathbf{0}_S$.

Step 1: Sampling. Sample $s[n] \sim \mathbf{b}^\gamma(\boldsymbol{\phi}[n])$, where
$$\mathbf{b}^\gamma(\boldsymbol{\phi}[n]) = \operatorname*{argmin}_{\mathbf{p} \in \Delta\mathcal{S}} \Big[ \sum_{i \in \mathcal{S}} p_i \phi_i[n] - \gamma\rho(\mathbf{p}) \Big]. \tag{4.14}$$

Step 2: Evaluation. Perform simulation to obtain $f_n(s[n])$.

Step 3: Belief Update.
$$\boldsymbol{\phi}[n+1] = \boldsymbol{\phi}[n] + \mu \big[ \mathbf{g}\big(s[n], \boldsymbol{\phi}[n], f_n(s[n])\big) - \boldsymbol{\phi}[n] \big] \tag{4.15}$$
where $\mathbf{g} = \operatorname{col}(g_1, \ldots, g_S)$, and
$$g_i\big(s[n], \boldsymbol{\phi}[n], f_n(s[n])\big) := \frac{f_n(s[n])}{b_i^\gamma(\boldsymbol{\phi}[n])} \cdot I(s[n] = i). \tag{4.16}$$

Recursion. Set $n \leftarrow n + 1$ and go to Step 1.

Remark 4.2. The choice of adaptation rate $\mu$ is of critical importance. Ideally, one would want $\mu$ to be small when the belief is close to the vector of true objective values, and large otherwise. Choosing the best $\mu$ is, however, not straightforward as it depends on the dynamics of the time-varying parameters (which are unknown). One enticing solution is to use a time-varying adaptation rate $\mu[n]$, and an algorithm which recursively updates it as the underlying parameters evolve.
The updating algorithm for $\mu[n]$ will in fact be a separate stochastic approximation algorithm which works in parallel with the adaptive search algorithm. Such adaptive step-size stochastic approximations are studied in [40, 191] and [192, Section 3.2], and remain outside the scope of this chapter.

4.5 Tracking the Non-Stationary Global Optima

This section is devoted to characterizing asymptotic properties of the smooth best-response adaptive search algorithm. To this end, Section 4.5.1 introduces a regret measure that plays a key role in the analysis to follow. Section 4.5.2 then postulates a hyper-model for the time variations underlying the discrete stochastic optimization problem. The main theorem of this chapter is finally presented in Section 4.5.3; it asserts that, if the underlying time variations occur on the same timescale as the adaptation speed of Algorithm 4.1, the non-stationary global optima set can be properly tracked.

4.5.1 Performance Diagnostic

The regret for an on-line learning algorithm is defined as the opportunity loss, and compares the performance of an algorithm, selecting among several alternatives, to the performance of the best of those alternatives in hindsight. This is in contrast to the decision theory literature, where the regret typically compares the average realized payoff for a particular course of actions with the payoff that would have been obtained if a different course of actions had been chosen [48]. Accordingly, we define
$$r[n] := (1-\mu)^{n-1} \big[ f_1(s[1]) - F_{\min}(\theta[1]) \big] + \mu \sum_{2 \le \tau \le n} (1-\mu)^{n-\tau} \big[ f_\tau(s[\tau]) - F_{\min}(\theta[\tau]) \big] \tag{4.17}$$
where
$$F_{\min}(\theta) := \min_{s \in \mathcal{S}} F(s, \theta). \tag{4.18}$$
In (4.17), $\{s[\tau]\}$ is the sequence of feasible solutions sampled and evaluated by Algorithm 4.1, and $F(s, \theta)$ is defined in (4.8).
Similar to (4.12), since the algorithm is meant to track non-stationary global optima, the regret $r[n]$ is defined as the discounted average performance obtained using the proposed sampling scheme as compared with the time-varying global optima. It can be easily verified that $r[n]$ can be recursively updated via
$$r[n+1] = r[n] + \mu \big[ f_n(s[n]) - F_{\min}(\theta[n]) - r[n] \big]. \tag{4.19}$$
We shall emphasize that $r[n]$ is just a diagnostic that quantifies the tracking capability of the proposed random search scheme, and does not need to be evaluated iteratively by the algorithm. Therefore, no knowledge of $\theta[n]$ is required.

4.5.2 Hyper-Model for Underlying Time Variations

A typical method for analyzing the performance of an adaptive algorithm is to postulate a hyper-model for the underlying time variations [40]. Here, we assume that the dynamics of $\theta[n]$ in (4.5) follow a discrete-time Markov chain with infrequent jumps. The following assumption formally characterizes the hyper-model. It is the same as Assumption 2.1, and is repeated only for self-containedness.

Assumption 4.2. Let $\{\theta[n]\}$ be a discrete-time Markov chain with finite state space $\mathcal{Q} = \{1, 2, \ldots, \Theta\}$ and transition probability matrix
$$P^\varepsilon := I_\Theta + \varepsilon Q. \tag{4.20}$$
Here, $\varepsilon > 0$ is a small parameter, $I_\Theta$ denotes the $\Theta \times \Theta$ identity matrix, and $Q = [q_{ij}]$ is irreducible and represents the generator of a continuous-time Markov chain satisfying$^4$
$$q_{ij} \ge 0 \ \text{for } i \ne j, \qquad |q_{ij}| \le 1 \ \text{for all } i, j \in \mathcal{Q}, \qquad Q \mathbf{1}_\Theta = \mathbf{0}_\Theta. \tag{4.21}$$

We shall emphasize that the above Markovian model is not used in the implementation of Algorithm 4.1 since it cannot be observed by agents. It is only used in this section to analyze how well the proposed adaptive search algorithm can track regime-switching global optima. Choosing $\varepsilon$ small enough in (4.20) ensures that the entries of $P^\varepsilon$ are all non-negative.
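To make the phrase "small enough" concrete, the following short check uses only the conditions in (4.21); it is a worked illustration, not a statement taken from the thesis.

```latex
\begin{align*}
  [P^\varepsilon]_{ij} &= \varepsilon\, q_{ij} \;\ge\; 0
      && \text{for } i \ne j, \text{ since } q_{ij} \ge 0, \\
  [P^\varepsilon]_{ii} &= 1 + \varepsilon\, q_{ii}
      \;=\; 1 - \varepsilon \sum_{j \ne i} q_{ij}
      && \text{since } Q\mathbf{1}_\Theta = \mathbf{0}_\Theta, \\
  [P^\varepsilon]_{ii} &\ge 1 - \varepsilon \;\ge\; 0
      && \text{whenever } 0 < \varepsilon \le 1, \text{ since } |q_{ii}| \le 1, \\
  \sum_{j} [P^\varepsilon]_{ij} &= 1 + \varepsilon \sum_{j} q_{ij} \;=\; 1 .
\end{align*}
```

Under the normalization $|q_{ij}| \le 1$, any $0 < \varepsilon \le 1$ therefore yields a valid transition probability matrix, and the probability of leaving state $i$ in one step is $\varepsilon |q_{ii}|$.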
Due to the dominating identity matrix in (4.20), the transition probability matrix is "close" to the identity matrix; therefore, $\theta[n]$ varies slowly with time.

$^4$ Note that this is no loss of generality. Given an arbitrary non-truly absorbing Markov chain with transition probability matrix $P^\mu = I + \varrho \bar{Q}$, one can form
$$Q = \frac{1}{q_{\max}} \cdot \bar{Q}, \quad \text{where } q_{\max} = \max_{i \in \mathcal{M}_s} |\bar{q}_{ii}|.$$
To ensure that the two Markov chains generated by $Q$ and $\bar{Q}$ evolve on the same time scale, $\varepsilon = \varrho \cdot q_{\max}$.

The small adaptation rate $\mu$ in Algorithm 4.1 defines how fast the beliefs are updated. The impact of the switching rate of the hyper-model $\theta[n]$ on the tracking properties of the proposed adaptive search scheme is captured by the relationship between $\mu$ and $\varepsilon$ in the transition probability matrix (4.20). If $0 < \varepsilon \ll \mu$, the global optimum rarely changes and is essentially fixed during the transient interval of the adaptive search algorithm. We thus practically deal with a static discrete stochastic optimization problem. On the other hand, if $\varepsilon \gg \mu$, the optimization problem changes too rapidly, and there is no chance one can track the regime-switching global optimum via an adaptive stochastic approximation algorithm. In this chapter, we consider the following nontrivial and more challenging case.

Assumption 4.3. $\varepsilon = O(\mu)$ in the transition probability matrix $P^\varepsilon$ in Assumption 4.2.

The above condition states that the time-varying parameters underlying the discrete stochastic optimization problem evolve with time on a scale that is commensurate with that determined by the step-size of Algorithm 4.1.

4.5.3 Main Result: Tracking Performance

In what follows, we use the so-called ordinary differential equation (ODE) method [50, 188, 192] to analyze the tracking capability of Algorithm 4.1. Thus, in lieu of working with the discrete iterates directly, we examine the continuous-time piecewise constant interpolation of the regret iterates, defined as follows:
$$r^\mu(t) = r[n], \quad \text{for } t \in [n\mu, (n+1)\mu). \tag{4.22}$$
This will enable us to obtain a limit ODE that represents the asymptotic dynamics of the discrete-time iterates as $\mu \to 0$. Then, asymptotic stability of the set of global attractors of the derived ODE yields the desired tracking result. These details are, however, relegated to a later section which provides the detailed analysis.

Let $|x|^+ = \max\{0, x\}$. The following theorem is the main result of this chapter and implies that the sequence $\{s[n]\}$ generated by Algorithm 4.1 most frequently samples from the global optima set.

Theorem 4.1. Suppose Assumptions 4.1–4.3 hold, and let $\{t_\mu\}$ be any sequence of real numbers satisfying $t_\mu \to \infty$ as $\mu \to 0$. Then, for any $\eta > 0$, there exists $\gamma(\eta) > 0$ such that, if $\gamma \le \gamma(\eta)$ in (4.14), then
$$|r^\mu(\cdot + t_\mu) - \eta|^+ \to 0 \quad \text{in probability}.$$
That is, for any $\delta > 0$,
$$\lim_{\mu \to 0} P\big( r^\mu(\cdot + t_\mu) - \eta \ge \delta \big) = 0. \tag{4.23}$$

Proof. See Section 4.6 for the detailed proof.

The above theorem evidences the tracking capability of Algorithm 4.1 by looking at the worst-case regret, and showing that it will be asymptotically less than some arbitrarily small value$^5$. Clearly, ensuring a smaller worst-case regret requires sampling the global optimizer more frequently. Put differently, the higher the difference between the objective value of a feasible solution and the global optimum, the lower must be the empirical frequency of sampling that solution to maintain the regret below a certain small value. Here, $r^\mu(\cdot + t_\mu)$ essentially looks at the asymptotic behavior of $r[n]$. The requirement $t_\mu \to \infty$ as $\mu \to 0$ means that we look at $r[n]$ for a small $\mu$ and large $n$ with $n\mu \to \infty$. It shows that, for a small $\mu$, and with an arbitrarily high probability, $r[n]$ eventually spends nearly all of its time in the interval $[0, \eta + \delta)$ such that $\delta \to 0$ as $\mu \to 0$. Note that, for a small $\mu$, $r[n]$ may escape from the interval $[0, \eta + \delta)$.
However, if such an escape ever occurs, it will be a "large deviations" phenomenon, that is, a rare event. The order of the escape time (if it ever occurs) is often of the form $\exp(c/\mu)$ for some $c > 0$; see [192, Section 7.2] for details.

$^5$ This result is equivalent to the Hannan consistency notion [137] in repeated games, however, in a non-stationary setting.

Remark 4.3. The choice of the exploration factor $\gamma$ essentially determines the size of the random disturbances $\boldsymbol{\beta}$ in (4.11), and affects both the efficiency and convergence rate of the proposed algorithm, as evidenced in Figure 4.2. Larger $\gamma$ increases the exploration weight versus exploitation; hence, more time is spent on simulating non-promising elements of the search space. In practice, one can initially start with $\gamma = \Delta F$, where $\Delta F$ is a rough estimate of the greatest difference in objective function values based on the outputs of preliminary simulation experiments. To achieve an asymptotic regret of at most $\eta$ in Theorem 4.1, one can then periodically solve for $\gamma(\eta)$ in
$$\mathbf{b}^{\gamma(\eta)}(\boldsymbol{\phi}[n]) \cdot \boldsymbol{\phi}[n] = \min_{i \in \mathcal{S}} \phi_i[n] + \eta, \tag{4.24}$$
and reset $0 < \gamma < \gamma(\eta)$ in the smooth best-response strategy (4.14).

Note that, to allow adaptivity to the time variations underlying the problem, Algorithm 4.1 selects non-optimal states with some small probability. Thus, one would not expect the sequence of samples to eventually spend all its time in the global optima before it undergoes a jump. In fact, $\{s[n]\}$ may visit each feasible solution infinitely often. Instead, the sampling strategy implemented by Algorithm 4.1 ensures that the empirical frequency of sampling non-optimal elements stays very low.

Remark 4.4. In the traditional analysis of adaptive filtering algorithms, one typically works with a time-varying parameter whose trajectories change slowly but continuously. In comparison, here, we allow the parameters underlying the stochastic system to undergo jump changes with possibly large jump sizes.
If such jumps occur on a faster timescale (an order of magnitude faster) than the adaptation rate $\mu$ of the stochastic approximation recursion (4.15), then Algorithm 4.1 will only be able to find the system configuration that optimizes the performance of the "mean system" dictated by the stationary distribution of the integer-valued process $\theta[n]$.

4.6 Analysis of the Smooth Best-Response Adaptive Search

This section is devoted to both asymptotic and non-asymptotic analysis of the adaptive search algorithm, and proves the main result presented in Theorem 4.1.

In this section, we work with the following two diagnostics:

1. Regret $r[n]$: It quantifies the ability of the adaptive search algorithm to track the random unpredictable jumps of the global optimum, and is defined in Section 4.5.1.

2. Empirical Sampling Distribution $\mathbf{z}[n]$: Efficiency of a discrete stochastic optimization algorithm is defined as the percentage of times that the global optimum is sampled [235]. To quantify efficiency of the smooth best-response sampling strategy (4.10), we define the empirical sampling distribution by a vector
$$\mathbf{z}[n] = (z_1[n], \ldots, z_S[n]) \in \mathbb{R}^S$$
where each element $z_i[n]$ records the percentage of times element $i$ was sampled and simulated up to time $n$. It is iteratively updated via the stochastic approximation recursion
$$\mathbf{z}[n+1] = \mathbf{z}[n] + \mu \big[ \mathbf{e}_{s[n]} - \mathbf{z}[n] \big], \quad 0 < \mu \ll 1 \tag{4.25}$$
where $\mathbf{e}_i$ denotes the unit vector in the $S$-dimensional Euclidean space whose $i$-th component is equal to one. The small parameter $\mu$ introduces an exponential forgetting of the past sampling frequencies and is the same as the adaptation rate of the stochastic approximation update rule for the beliefs in (4.12). The empirical sampling frequency $\mathbf{z}[n]$ can be compactly defined using the non-recursive expression
$$\mathbf{z}[n] := (1-\mu)^{n-1} \mathbf{e}_{s[1]} + \mu \sum_{2 \le \tau \le n} (1-\mu)^{n-\tau} \mathbf{e}_{s[\tau]} \tag{4.26}$$
and has two functions:

1. At each time $n$, the global optimum is estimated by the index of the maximizing component of the vector $\mathbf{z}[n]$.
2. The maximizing component itself then quantifies efficiency of the proposed smooth best-response sampling scheme.

The rest of this section is organized as follows. Section 4.6.1 characterizes the limit system associated with discrete-time iterates of the adaptive search scheme and the two performance diagnostics defined above. Section 4.6.2 then proves that such a limit system is globally asymptotically stable with probability one and characterizes its global attractors. The analysis up to this point considers $\mu$ small and $n$ large, but $\mu n$ remains bounded. Finally, to obtain the result presented in Theorem 4.1, we let $\mu n$ go to infinity and conclude asymptotic stability of the interpolated process associated with the regret measure in Section 4.6.3. For better readability, details for each step of the proof are relegated to Section 4.10. Certain technical details are also passed over; however, adequate citations are provided for the interested reader.

4.6.1 Characterization of the Mean Switching ODE

This subsection uses weak convergence methods to derive the limit dynamical system associated with the iterates $(r[n], \mathbf{z}[n])$. We start by defining weak convergence, which forms the basis of our tracking analysis.

Definition 4.2 (Weak Convergence). Let $Z[n]$ and $Z$ be $\mathbb{R}^r$-valued random vectors. $Z[n]$ converges weakly to $Z$, denoted by $Z[n] \Rightarrow Z$, if for any bounded and continuous function $\Omega(\cdot)$, $E\Omega(Z[n]) \to E\Omega(Z)$ as $n \to \infty$.

Define
$$\mathbf{F}(\theta) := \operatorname{col}\big( F(1, \theta), \cdots, F(S, \theta) \big) \tag{4.27}$$
where $F(s, \theta)$ is defined in (4.8). Let further
$$\hat{\boldsymbol{\phi}}[n] := \boldsymbol{\phi}[n] - \mathbf{F}(\theta[n]) \tag{4.28}$$
denote the deviation error in tracking the true objective function values via the simulation data at time $n$. Finally, define
$$\mathbf{x}[n] := \begin{bmatrix} \hat{\boldsymbol{\phi}}[n] \\ r[n] \\ \mathbf{z}[n] \end{bmatrix}. \tag{4.29}$$
In view of (4.15) and (4.19), it can be easily verified that $\mathbf{x}[n]$ follows the recursion
$$\mathbf{x}[n+1] = \mathbf{x}[n] + \mu \Big[ \mathbf{A}\big(s[n], \hat{\boldsymbol{\phi}}[n], f_n(s[n])\big) - \mathbf{x}[n] \Big] + \begin{bmatrix} \mathbf{F}(\theta[n]) - \mathbf{F}(\theta[n+1]) \\ \mathbf{0}_{S+1} \end{bmatrix} \tag{4.30}$$
where
$$\mathbf{A}\big(s[n], \hat{\boldsymbol{\phi}}[n], f_n(s[n])\big) = \begin{bmatrix} \mathbf{g}\big(s[n], \hat{\boldsymbol{\phi}}[n] + \mathbf{F}(\theta[n]), f_n(s[n])\big) - \mathbf{F}(\theta[n]) \\ f_n(s[n]) - F_{\min}(\theta[n]) \\ \mathbf{e}_{s[n]} \end{bmatrix}. \tag{4.31}$$
In (4.31), $\mathbf{g}(\cdot, \cdot, \cdot)$ is defined in (4.16), and $F_{\min}(\cdot)$ and $\mathbf{F}(\cdot)$ are defined in (4.18) and (4.27), respectively.

As is widely done in the analysis of stochastic approximations, we work with the piecewise constant interpolated processes
$$\mathbf{x}^\mu(t) = \mathbf{x}[n], \quad \text{and} \quad \theta^\mu(t) = \theta[n], \quad \text{for } t \in [n\mu, (n+1)\mu). \tag{4.32}$$
The following theorem characterizes the limit dynamical system representing the stochastic approximation iterates $\mathbf{x}[n]$ as a Markovian switching ODE.

Theorem 4.2. Consider the recursion (4.30) and suppose Assumptions 4.1–4.3 hold. Then, as $\mu \to 0$, the interpolated pair process $(\mathbf{x}^\mu(\cdot), \theta^\mu(\cdot))$ converges weakly to $(\mathbf{x}(\cdot), \theta(\cdot))$ that is a solution of
$$\frac{d\mathbf{x}}{dt} = \mathbf{G}(\mathbf{x}, \theta(t)) - \mathbf{x}, \tag{4.33}$$
where
$$\mathbf{G}(\mathbf{x}, \theta(t)) = \begin{bmatrix} \mathbf{0}_S \\ \mathbf{b}^\gamma\big(\hat{\boldsymbol{\phi}} + \mathbf{F}(\theta(t))\big) \cdot \mathbf{F}(\theta(t)) - F_{\min}(\theta(t)) \\ \mathbf{b}^\gamma\big(\hat{\boldsymbol{\phi}} + \mathbf{F}(\theta(t))\big) \end{bmatrix}.$$
Here, $\mathbf{0}_S$ denotes an $S \times 1$ zero vector, $F_{\min}(\cdot)$ and $\mathbf{F}(\cdot)$ are defined in (4.18) and (4.27), respectively, and $\theta(t)$ denotes a continuous-time Markov chain with generator $Q$ as in Assumption 4.2.

Proof. See Section 4.10.

The above theorem asserts that the asymptotic behavior of Algorithm 4.1 can be captured by an ODE modulated by a continuous-time Markov chain $\theta(t)$. At any given instant, the Markov chain dictates which regime the system belongs to, and the system then follows the corresponding ODE until the modulating Markov chain jumps into a new state.
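The behavior within a single regime can be illustrated numerically. The sketch below integrates (4.33) by explicit Euler steps with $\theta$ held fixed; it assumes the Boltzmann strategy (4.11) for $\mathbf{b}^\gamma$, and the function names, step size, and the objective vector in the usage note are hypothetical choices made for illustration.

```python
import math


def boltzmann(values, gamma):
    """Boltzmann strategy (4.11) evaluated at the given vector of values."""
    shift = min(values)  # numerical stabilization, does not change the result
    weights = [math.exp(-(v - shift) / gamma) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def euler_mean_ode(F, gamma, phi_hat0, r0, z0, dt=0.01, horizon=20.0):
    """Explicit Euler integration of dx/dt = G(x, theta) - x for one fixed regime.

    F is the vector of true objective values in that regime; x = (phi_hat, r, z).
    """
    phi_hat, r, z = list(phi_hat0), r0, list(z0)
    f_min = min(F)
    for _ in range(round(horizon / dt)):
        b = boltzmann([p + f for p, f in zip(phi_hat, F)], gamma)
        drift_r = sum(bi * fi for bi, fi in zip(b, F)) - f_min
        phi_hat = [p + dt * (0.0 - p) for p in phi_hat]   # first block of G is zero
        r = r + dt * (drift_r - r)                        # regret component
        z = [zi + dt * (bi - zi) for zi, bi in zip(z, b)]  # sampling distribution
    return phi_hat, r, z
```

Starting from any initial condition, for example `F = [1.0, 0.2, 0.8]` and `gamma = 0.3`, the deviation error decays to zero, while the regret and the sampling distribution settle at the values determined by $\mathbf{b}^\gamma(\mathbf{F}(\theta))$; a regime jump would simply restart this relaxation with a different `F`.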
This modulation by $\theta(t)$ explains the regime-switching nature of the system under consideration.

4.6.2 Stability Analysis of the Mean Switching ODE

We next proceed to analyze stability of the limit system that has been shown in Section 4.6.1 to capture the dynamics of the discrete-time iterates of Algorithm 4.1.

In view of (4.33), there exists no explicit interconnection between the dynamics of $r(t)$ and $\mathbf{z}(t)$. To obtain the result presented in Theorem 4.1, we need to look only at the stability of
$$\frac{d}{dt} \begin{bmatrix} \hat{\boldsymbol{\phi}} \\ r \end{bmatrix} = \begin{bmatrix} \mathbf{0}_S \\ \mathbf{b}^\gamma\big(\hat{\boldsymbol{\phi}} + \mathbf{F}(\theta(t))\big) \cdot \mathbf{F}(\theta(t)) - F_{\min}(\theta(t)) \end{bmatrix} - \begin{bmatrix} \hat{\boldsymbol{\phi}} \\ r \end{bmatrix}. \tag{4.34}$$
Let us start by looking at the evolution of $\hat{\boldsymbol{\phi}}(t)$, the deviation error in tracking the objective function values. In view of (4.34), $\hat{\boldsymbol{\phi}}(t)$ evolves according to the deterministic ODE
$$\frac{d\hat{\boldsymbol{\phi}}}{dt} = -\hat{\boldsymbol{\phi}}. \tag{4.35}$$
Note that the dynamics of $\hat{\boldsymbol{\phi}}(t)$ are independent of the dynamics of $r(t)$. Consequently,
$$\hat{\boldsymbol{\phi}}(t) = \hat{\boldsymbol{\phi}}_0 \exp(-t).$$
That is, (4.35) is asymptotically stable, and $\hat{\boldsymbol{\phi}}(t)$ decays exponentially fast to $\mathbf{0}_S$ as $t \to \infty$. This essentially establishes that realizing $f_n(s[n])$ provides sufficient information to construct an unbiased estimator of the vector of true objective function values$^6$.

$^6$ It can be shown that the sequence $\{\boldsymbol{\phi}[n]\}$ induces the same asymptotic behavior as the beliefs developed using the brute force scheme [235, Chapter 5.3], discussed in Section 4.2.

Next, substituting the global attractor $\hat{\boldsymbol{\phi}} = \mathbf{0}_S$ into the limit switching ODE associated with the regret $r(t)$ in (4.34), we analyze stability of
$$\frac{dr}{dt} = \mathbf{b}^\gamma\big(\mathbf{F}(\theta(t))\big) \cdot \mathbf{F}(\theta(t)) - F_{\min}(\theta(t)) - r. \tag{4.36}$$
Before proceeding with the theorem, let us define asymptotic stability w.p.1 (see [293, Chapter 9] and [67] for further details). In what follows, $\operatorname{dist}[\cdot, \cdot]$ denotes the usual distance function.

Definition 4.3. Consider the Markovian switched system
$$\dot{Y}(t) = h(Y(t), \theta(t)), \qquad Y(0) = Y_0, \ \theta(0) = \theta_0, \ Y(t) \in \mathbb{R}^r, \ \theta(t) \in \mathcal{Q} \tag{4.37}$$
where $\theta(t)$ is a continuous-time Markov chain with generator $Q$, and $h(\cdot, i)$ is locally Lipschitz for each $i \in \mathcal{Q}$.
A closed and bounded set $\mathcal{H} \subset \mathbb{R}^r \times \mathcal{Q}$ is asymptotically stable almost surely if
$$\lim_{t \to \infty} \operatorname{dist}\big[ (Y(t), \theta(t)), \mathcal{H} \big] = 0 \quad \text{a.s.}$$

Let
$$\mathbb{R}_{[0,\eta)} = \{ r \in \mathbb{R};\ 0 \le r < \eta \}. \tag{4.38}$$
The stability analysis is broken down into two steps. First, the stability of each deterministic subsystem, obtained by fixing the switching signal $\theta(t) = \theta$, is examined. The set of global attractors is shown to comprise $\mathbb{R}_{[0,\eta)}$ for all $\theta \in \mathcal{Q}$. The slow switching condition then allows us to apply the method of multiple Lyapunov functions [197, Chapter 3] to analyze stability of the switched system.

Theorem 4.3. Consider the limit Markovian switched dynamical system (4.36). Let $r(0) = r_0$ and $\theta(0) = \theta_0$. For any $\eta > 0$, there exists $\gamma(\eta)$ such that, if $0 < \gamma < \gamma(\eta)$ in (4.14), the following results hold:

1. If $\theta(t) = \theta$ is held fixed, the set $\mathbb{R}_{[0,\eta)}$ is globally asymptotically stable for each deterministic limit system $\theta \in \mathcal{Q}$. Furthermore,
$$\lim_{t \to \infty} \operatorname{dist}\big( r(t), \mathbb{R}_{[0,\eta)} \big) = 0. \tag{4.39}$$

2. For the Markovian switched system, the set $\mathbb{R}_{[0,\eta)} \times \mathcal{Q}$ is globally asymptotically stable almost surely. In particular,

a) $P\Big( \forall \epsilon > 0,\ \exists \gamma(\epsilon) > 0 \ \text{s.t.} \ \operatorname{dist}\big[r_0, \mathbb{R}_{[0,\eta)}\big] \le \gamma(\epsilon) \Rightarrow \sup_{t \ge 0} \operatorname{dist}\big[r(t), \mathbb{R}_{[0,\eta)}\big] < \epsilon \Big) = 1$

b) $P\Big( \forall \gamma, \epsilon' > 0,\ \exists T(\gamma, \epsilon') \ge 0 \ \text{s.t.} \ \operatorname{dist}\big[r_0, \mathbb{R}_{[0,\eta)}\big] \le \gamma \Rightarrow \sup_{t \ge T(\gamma, \epsilon')} \operatorname{dist}\big[r(t), \mathbb{R}_{[0,\eta)}\big] < \epsilon' \Big) = 1.$

Proof. See Section 4.10.

The above theorem states that the set of global attractors of the Markovian switched ODE (4.36) is the same as that for all deterministic ODEs, obtained by holding $\theta(t) = \theta$, and guarantees an asymptotic regret of at most $\eta$. This sets the stage for Section 4.6.3, which combines the above result with Theorem 4.2 to conclude the desired tracking result in Theorem 4.1.

4.6.3 Asymptotic Stability of the Interpolated Processes

This section completes the proof of Theorem 4.1 by looking at the asymptotic stability of the interpolated process $\mathbf{y}^\mu(t)$, defined as follows:
$$\mathbf{y}^\mu(t) = \mathbf{y}[n] := \begin{bmatrix} \hat{\boldsymbol{\phi}}[n] \\ r[n] \end{bmatrix} \quad \text{for } t \in [n\mu, (n+1)\mu). \tag{4.40}$$
In Theorem 4.2, we considered $\mu$ small and $n$ large, but $\mu n$ remained bounded.
This gives a limit switched ODE for the sequence of interest as $\mu \to 0$. The next step is to study asymptotic stability and establish that the limit points of the switched ODE and the stochastic approximation algorithm coincide as $t \to \infty$. We thus consider the case where $\mu \to 0$ and $n \to \infty$, but now with $\mu n \to \infty$. Nevertheless, instead of considering a two-stage limit by first letting $\mu \to 0$ and then $t \to \infty$, we study $\mathbf{y}^\mu(t + t_\mu)$ and require $t_\mu \to \infty$ as $\mu \to 0$. The following corollary concerns asymptotic stability of the interpolated process, and asserts that the interpolated regret process will asymptotically be confined to the interval $[0, \eta)$.

Corollary 4.1. Let
$$\mathcal{Y}_\eta = \big\{ \operatorname{col}(\mathbf{0}_S, r);\ r \in \mathbb{R}_{[0,\eta)} \big\}. \tag{4.41}$$
Denote by $\{t_\mu\}$ any sequence of real numbers satisfying $t_\mu \to \infty$ as $\mu \to 0$. Assume $\{\mathbf{y}[n] : \mu > 0, n < \infty\}$ is tight or bounded in probability. Then, for each $\eta \ge 0$, there exists $\gamma(\eta) \ge 0$ such that if $\gamma \le \gamma(\eta)$ in (4.14),
$$\mathbf{y}^\mu(\cdot + t_\mu) \to \mathcal{Y}_\eta \quad \text{in probability as } \mu \to 0. \tag{4.42}$$

Proof. See Section 4.10.

4.7 The Case of Slow Random Time Variations

This section is devoted to the analysis of the adaptive search scheme when the random evolution of the parameters underlying the discrete stochastic optimization problem occurs on a timescale that is much slower than the adaptation rate of the adaptive search algorithm with step-size $\mu$. More precisely, we replace Assumption 4.3 with the following assumption.

Assumption 4.4. $0 < \varepsilon \ll \mu$ in the transition probability matrix $P^\varepsilon$ in Assumption 4.2.

The above assumption introduces a different timescale to Algorithm 4.1, which leads to an asymptotic behavior that is fundamentally different from the case $\varepsilon = O(\mu)$ analyzed in Section 4.6. Under Assumption 4.4, $\theta(\cdot)$ is the slow component and $\mathbf{x}(\cdot)$ is the fast transient in (4.33). It is well known that, in such two-timescale systems, the slow component is quasi-static (it remains almost constant) while analyzing the behavior of the fast timescale.
The weak convergence argument then shows that $(\mathbf{x}^\mu(\cdot), \theta^\mu(\cdot))$ converges weakly to $(\mathbf{x}(\cdot), \theta)$ as $\mu \to 0$ such that the limit $\mathbf{x}(\cdot)$ is a solution to
$$\frac{d\mathbf{x}}{dt} = \mathbf{G}(\mathbf{x}, \theta) - \mathbf{x}, \tag{4.43}$$
where
$$\mathbf{G}(\mathbf{x}, \theta) = \begin{bmatrix} \mathbf{0}_S \\ \mathbf{b}^\gamma\big(\hat{\boldsymbol{\phi}} + \mathbf{F}(\theta)\big) \cdot \mathbf{F}(\theta) - F_{\min}(\theta) \\ \mathbf{b}^\gamma\big(\hat{\boldsymbol{\phi}} + \mathbf{F}(\theta)\big) \end{bmatrix}. \tag{4.44}$$
Our task is then to investigate stability of this limit system. Recall (4.29). In view of (4.44), there exists no explicit interconnection between the dynamics of $r$ and $\mathbf{z}$ in (4.43). Therefore, we start by looking at
$$\frac{d}{dt} \begin{bmatrix} \hat{\boldsymbol{\phi}} \\ r \end{bmatrix} = \begin{bmatrix} \mathbf{0}_S \\ \mathbf{b}^\gamma\big(\hat{\boldsymbol{\phi}} + \mathbf{F}(\theta)\big) \cdot \mathbf{F}(\theta) - F_{\min}(\theta) \end{bmatrix} - \begin{bmatrix} \hat{\boldsymbol{\phi}} \\ r \end{bmatrix}.$$
The first component is asymptotically stable, and any trajectory $\hat{\boldsymbol{\phi}}(t)$ decays exponentially fast to $\mathbf{0}_S$ as $t \to \infty$. The first result in Theorem 4.3 also shows that, for the second component, the set $\mathbb{R}_{[0,\eta)}$ is globally asymptotically stable for all $\theta \in \mathcal{Q}$.

It remains to analyze stability of
$$\frac{d}{dt} \begin{bmatrix} \hat{\boldsymbol{\phi}} \\ \mathbf{z} \end{bmatrix} = \begin{bmatrix} \mathbf{0}_S \\ \mathbf{b}^\gamma\big(\hat{\boldsymbol{\phi}} + \mathbf{F}(\theta)\big) \end{bmatrix} - \begin{bmatrix} \hat{\boldsymbol{\phi}} \\ \mathbf{z} \end{bmatrix}.$$
The first component is asymptotically stable. Substituting the global attractor $\hat{\boldsymbol{\phi}} = \mathbf{0}_S$, we look at
$$\frac{d\mathbf{z}}{dt} = \mathbf{b}^\gamma\big(\mathbf{F}(\theta)\big) - \mathbf{z}$$
which is also globally asymptotically stable. Further, any trajectory $\mathbf{z}(t)$ converges exponentially fast to $\mathbf{b}^\gamma(\mathbf{F}(\theta))$ as $t \to \infty$.

Combining the above steps, one can obtain the same asymptotic stability result as that presented in Corollary 4.1. Using a similar argument as in Corollary 4.1, it can also be shown that
$$\mathbf{z}^\mu(\cdot + t_\mu) \to \mathbf{b}^\gamma\big(\mathbf{F}(\theta)\big) \quad \text{in probability as } \mu \to 0$$
where $\{t_\mu\}$ is any sequence of real numbers satisfying $t_\mu \to \infty$ as $\mu \to 0$. Therefore, for any $\delta > 0$,
$$\lim_{\mu \to 0} P\Big( \big\| \mathbf{z}^\mu(\cdot + t_\mu) - \mathbf{b}^\gamma\big(\mathbf{F}(\theta)\big) \big\| \ge \delta \Big) = 0 \tag{4.45}$$
where $\|\cdot\|$ denotes the Euclidean norm. Feeding in the vector of true objective function values $\mathbf{F}(\theta)$ for any $\theta$, the smooth best-response strategy in Definition 4.1 outputs a randomized strategy with the highest probability assigned to the global optimum. This in turn implies that the maximizing element of the empirical sampling distribution $\mathbf{z}^\mu(\cdot + t_\mu)$ represents the global optimum. The following corollary summarizes this result.

Corollary 4.2. Denote the most frequently sampled feasible solution by
$$s_{\max}[n] = \operatorname*{argmax}_{i \in \mathcal{S}} z_i[n]$$
where $z_i[n]$ is the $i$-th component of $\mathbf{z}[n]$, defined in (4.26). Further, define the continuous-time interpolated sequence
$$s^\mu_{\max}(t) = s_{\max}[n] \quad \text{for } t \in [n\mu, (n+1)\mu).$$
Then, under Assumptions 4.1, 4.2 and 4.4, $s^\mu_{\max}(\cdot + t_\mu)$ converges in probability to the global optima set $\mathcal{M}(\theta)$ for each $\theta \in \mathcal{Q}$. That is, for any $\delta > 0$,
$$\lim_{\mu \to 0} P\Big( \operatorname{dist}\big[ s^\mu_{\max}(\cdot + t_\mu), \mathcal{M}(\theta) \big] \ge \delta \Big) = 0 \tag{4.46}$$
where $\operatorname{dist}[\cdot, \cdot]$ denotes the usual distance function.

The above corollary simply asserts that, for a small $\mu$ and for any $\theta \in \mathcal{Q}$, the most frequently sampled element, with a high probability, eventually spends nearly all its time in the global optima set.

4.8 Numerical Study

This section illustrates the performance of the adaptive search scheme in Algorithm 4.1, henceforth referred to as AS, using the numerical example presented in [9, 11]. The performance of the AS scheme will be compared with the following two algorithms from the literature:

1. Random search (RS) [9, 290]. RS is a modified hill-climbing algorithm. The RS algorithm is summarized below in Algorithm 4.2.

Algorithm 4.2 Random search (RS) [9, 290]

Initialization. Select $s[0] \in \mathcal{S}$. Set $\pi_i[0] = 1$ if $i = s[0]$, and 0 otherwise. Set $s^*[0] = s[0]$ and $n = 1$.

Step 1: Random search. Sample a feasible solution $s'[n]$ uniformly from the set $\mathcal{S} - s[n-1]$.

Step 2: Evaluation and Acceptance. Simulate to obtain $f_n(s[n-1])$ and $f_n(s'[n])$, and set
$$s[n] = \begin{cases} s'[n], & \text{if } f_n(s'[n]) > f_n(s[n-1]), \\ s[n-1], & \text{otherwise.} \end{cases}$$

Step 3: Occupation Probability Update.
$$\boldsymbol{\pi}[n] = \boldsymbol{\pi}[n-1] + \mu \big[ \mathbf{e}_{s[n]} - \boldsymbol{\pi}[n-1] \big], \quad 0 < \mu < 1 \tag{4.47}$$
where $\mathbf{e}_i$ is a unit vector with the $i$-th element being equal to one.

Step 4: Global Optimum Estimate.
$$s^*[n] \in \operatorname*{argmax}_{i \in \mathcal{S}} \pi_i[n].$$

Recursion. Let $n \leftarrow n + 1$, and go to Step 1.

For static discrete stochastic optimization problems, the constant step-size $\mu$ in (4.47) is replaced with the decreasing step-size $\mu[n] = 1/n$, which gives the RS algorithm in [9].
Each iteration of the RS algorithm requires one random number selection, $O(S)$ arithmetic operations, one comparison and two independent simulation experiments.

2. Upper confidence bound (UCB) [22, 115]. The UCB algorithms belong to the family of "follow the perturbed leader" algorithms. Let $B$ denote an upper bound on the objective function and $\xi > 0$ be a constant. The UCB algorithm is summarized in Algorithm 4.3.

For static discrete stochastic optimization problems, one has to set $\mu = 1$ in (4.48), which gives the UCB1 algorithm in [22]; otherwise, the discount factor $\mu$ has to be chosen in the interval $(0, 1)$. Each iteration of the UCB algorithm requires $O(S)$ arithmetic operations, one maximization, and one evaluation of the objective function $f_n(s)$.

Algorithm 4.3 Upper confidence bound (UCB) [22, 115]

Initialization. For all $i \in \mathcal{S}$, simulate to obtain $f_0(i)$, and set $\bar{f}_i[0] = f_0(i)$, $n_i = 1$, and $s^*[0] \in \operatorname{argmax}_{i \in \mathcal{S}} \bar{f}_i[0]$. Select $0 < \mu \le 1$, and set $n = 1$.

Step 1: Sampling and Evaluation. Sample a candidate solution
$$s[n] = \operatorname*{argmax}_{i \in \mathcal{S}} \left[ \bar{f}_i[n] + 2B \sqrt{\frac{\xi \ln(M[n] + 1)}{m_i[n]}} \right]$$
where
$$\bar{f}_i[n] = \frac{1}{m_i[n]} \sum_{\tau=1}^{n} \mu^{n-\tau} f_\tau(s[\tau]), \qquad m_i[n] = \sum_{\tau=1}^{n} \mu^{n-\tau} I(s[\tau] = i), \qquad M[n] = \sum_{i=1}^{S} m_i[n]. \tag{4.48}$$
Simulate $f_n(s[n])$.

Step 2: Global Optimum Estimate.
$$s^*[n] \in \operatorname*{argmax}_{i \in \mathcal{S}} \bar{f}_i[n].$$

Recursion. Let $n \leftarrow n + 1$, and go to Step 1.

In comparison to the RS and UCB algorithms, using $\rho(x)$ as in Example 4.2, the AS scheme presented in this chapter requires $O(S)$ arithmetic operations, one random number selection, and one evaluation of the objective function $f_n(s)$ at each iteration.

We start with a static discrete stochastic optimization example, and then proceed to the regime-switching setting.

4.8.1 Example 1: Static Discrete Stochastic Optimization

Suppose the demand $d[n]$ for a particular product has a Poisson distribution with parameter $\lambda$:
$$d[n] \sim f(s; \lambda) = \frac{\lambda^s \exp(-\lambda)}{s!}.$$
The sequence $\{d[n]\}$ is assumed to be i.i.d. for comparison purposes, since the RS and UCB methods assume independence of observations.
The objective is then to find the number that maximizes the demand probability, subject to the constraint that at most $S$ units can be ordered. This problem can be formulated as a discrete deterministic optimization problem:
$$\operatorname*{argmax}_{s \in \{0, 1, \ldots, S\}} \left[ f(s; \lambda) = \frac{\lambda^s \exp(-\lambda)}{s!} \right] \tag{4.49}$$
which can be solved analytically. Here, the aim is to solve the following stochastic variant: Find
$$\operatorname*{argmin}_{s \in \{0, 1, \ldots, S\}} -E\{ I(d[n] = s) \} \tag{4.50}$$
where $I(X)$ is the indicator function: $I(X) = 1$ if the statement $X$ is true, and 0 otherwise. Clearly, the sets of global optimizers of (4.49) and (4.50) coincide. This enables us to check the results obtained via the adaptive search schemes.

We consider the following two cases for the rate parameter $\lambda$:

(i) $\lambda = 1$. The set of global optimizers will be $\mathcal{M} = \{0, 1\}$;
(ii) $\lambda = 10$. The set of global optimizers will be $\mathcal{M} = \{9, 10\}$.

For each case, we further study how the size of the search space affects the performance by considering two cases: (i) $S = 10$, and (ii) $S = 100$. Since the problem is static, in the sense that $\mathcal{M}$ is fixed for each case, one can use the results of [36] to show that if the exploration factor $\gamma$ in (4.10) decreases to zero sufficiently slowly, the sequence of samples $\{s[n]\}$ converges almost surely to the global minimum. More precisely, we consider the following modifications to Algorithm 4.1:

- The constant step-size $\mu$ in (4.15) is replaced by the decreasing step-size $\mu[n] = \frac{1}{n}$.
- The exploration factor $\gamma$ in (4.14) is replaced by $\frac{1}{n^\alpha}$, $0 < \alpha < 1$.

By the above construction, $\{s[n]\}$ will eventually become reducible with singleton communicating class $\mathcal{M}$. That is, the sequence of samples $\{s[n]\}$ eventually spends all its time in the global optimum. This is in contrast to Algorithm 4.1 in the regime-switching setting, which will be discussed in the next example. In this example, we set $\alpha = 0.2$ and $\gamma = 0.01$.
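The two optimizer sets stated above can be checked directly from (4.49). The sketch below is only a verification aid with a hypothetical function name; it drops the common factor $\exp(-\lambda)$ and uses exact rational arithmetic so that ties between adjacent demand sizes are detected reliably.

```python
from fractions import Fraction
from math import factorial


def poisson_modes(lam, max_units):
    """Return the set of maximizers of lam**s * exp(-lam) / s! over s = 0..max_units."""
    # exp(-lam) is common to every s, so comparing lam**s / s! exactly is enough.
    weights = [Fraction(lam) ** s / factorial(s) for s in range(max_units + 1)]
    best = max(weights)
    return {s for s, w in enumerate(weights) if w == best}
```

For an integer rate $\lambda \ge 1$, the values at $s = \lambda - 1$ and $s = \lambda$ coincide because $\lambda^{\lambda}/\lambda! = \lambda^{\lambda-1}/(\lambda-1)!$, which is why each case has two global optimizers.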
We further set B = 1 and ξ = 0.5 in the UCB method.

To give a fair comparison of computational costs, we use the number of simulation experiments performed by each scheme when evaluating its performance. All algorithms start at a randomly chosen element s[0], and move towards M. The speed of convergence, however, depends on the rate parameter λ and the search space size S + 1 in all three methods. Close scrutiny of Table 4.1 leads to the following observations: In all three schemes, the speed of convergence decreases when either S or λ (or both) increases. However, the effect of increasing λ is more substantial since the objective function values of the worst and best feasible solutions get closer when λ = 10. Given an equal number of simulation experiments, a higher percentage of runs in which a particular method has converged to the global optimum indicates faster convergence. Table 4.1 confirms that the AS algorithm ensures faster convergence than the RS method in each case and reaches 100% no later than the UCB method, although UCB attains higher percentages for some small and intermediate n.

Table 4.1: Percentage of independent runs of algorithms that converged to the global optimum set in n iterations.

(a) λ = 1

Iteration          S = 10                S = 100
    n          AS    RS   UCB        AS    RS   UCB
     10        55    39    86        11     6    43
     50        98    72    90        30    18    79
    100       100    82    95        48    29    83
    500       100    96   100        79    66    89
   1000       100   100   100        93    80    91
   5000       100   100   100       100    96    99
  10000       100   100   100       100   100   100

(b) λ = 10

Iteration          S = 10                S = 100
    n          AS    RS   UCB        AS    RS   UCB
     10        29    14    15         7     3     2
    100        45    30    41        16     9    13
    500        54    43    58        28    21    25
   1000        69    59    74        34    26    30
   5000        86    75    86        60    44    44
  10000        94    84    94        68    49    59
  20000       100    88   100        81    61    74
  50000       100    95   100        90    65    81

To evaluate and compare efficiency of the algorithms, the sample path of the number of simulation experiments performed on non-optimal solutions, i.e., $1 - \sum_{i\in\mathcal{M}} z_i[n]$, is plotted in
Figure 4.1, when λ = 1 and S = 100. As can be seen, since the RS method randomizes among all (except the previously sampled) feasible solutions at each iteration, it performs approximately 98% of the simulations on non-optimal elements. Further, the UCB algorithm switches to its exploitation phase after a longer period of exploration as compared to the AS algorithm. Figure 4.1 thus verifies that, among the three methods, the AS algorithm strikes the best balance between exploration of the search space and exploitation of the simulation data. Figure 4.2 further illustrates the sensitivity of both the convergence speed and the efficiency of the AS scheme to the exploration parameter γ. As can be seen, larger γ increases both the number of simulation experiments that the algorithm requires to find the global optimum and the proportion of simulations performed on non-optimal candidate solutions.

Figure 4.1: Proportion of simulations expended on non-optimal solutions (λ = 1, S = 100). (Horizontal axis: number of simulations n, logarithmic scale; vertical axis: proportion of simulations on non-optimal states; curves: AS, RS, UCB.)

4.8.2 Example 2: Regime Switching Discrete Stochastic Optimization

Consider the discrete stochastic optimization problem described in Section 4.8.1 with the exception that now λ(θ[n]) jumps between 1 and 10 according to a Markov chain {θ[n]} with state space Q = {1, 2}, and transition probability matrix

$$ P^{\varepsilon} = I + \varepsilon Q, \quad Q = \begin{bmatrix} -0.5 & 0.5 \\ 0.5 & -0.5 \end{bmatrix}. \tag{4.51} $$

Figure 4.2: Sensitivity to γ when λ = 1 and S = 10. The set of global optimizers is M = {0, 1}. (Top) Convergence speed: estimate of the global optimum versus the number of simulations n, for γ = 0.01, 0.1, and 1. (Bottom) The distribution of simulation experiments (efficiency) after $n = 10^4$ iterations: empirical sampling frequency $z(10^4)$ versus demand size s, for γ = 0.01, 0.1, and 1.
Figure 4.3: Sample path of the estimate of the global optimum. The set of global optimizers is M(1) = {0, 1}, for $0 < n < 10^3$, and M(2) = {9, 10}, for $10^3 \le n < 10^5$. (Horizontal axis: number of simulations n, logarithmic scale; vertical axis: estimate of the global optimum; curves: AS, RS, UCB.)

The regime switching discrete stochastic optimization problem that we aim to solve is then (4.50), where S = 10, and

$$ d[n] \sim f(s;\lambda(\theta[n])) = \frac{\lambda^{s}(\theta[n])\exp(-\lambda(\theta[n]))}{s!}, \quad \lambda(1) = 1, \ \text{and} \ \lambda(2) = 10. \tag{4.52} $$

The sets of global optimizers are then M(1) = {0, 1} and M(2) = {9, 10}, respectively. In the rest of this section, we assume γ = 0.1, and µ = ε = 0.01.

Figure 4.3 shows the tracking capability of the algorithms for a slow Markov chain {θ[n]} that undergoes a jump from θ = 1 to θ = 2 at $n = 10^3$. As can be seen, contrary to the RS scheme, both the AS and UCB methods properly track the changes; however, the AS algorithm responds faster to the jump. Figure 4.4 further illustrates the proportion of simulation experiments performed on non-optimal feasible solutions as λ jumps. Evidently, among the three methods, the AS algorithm provides the best balance between exploration and exploitation, and properly responds to the regime switching. Finally, Figure 4.5 compares the efficiency of the algorithms for different values of ε in (4.51), which determines the speed of evolution of the hyper-model. Each point on the graph is an average over 100 independent runs of $10^6$ iterations of the algorithms. As expected, the estimate of the global optimum spends more time in the global optimum for all methods as the speed of time variations decreases. The estimate provided by the AS algorithm, however, differs from the true global optimum less frequently as compared with the RS and UCB methods.
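The regime switching demand model of (4.51) and (4.52) can be simulated along the following lines. This Python sketch is an illustration under stated assumptions rather than the simulation code behind the reported figures: the function names are hypothetical, regimes are indexed from 0 in the code (so index 0 stands for θ = 1 and index 1 for θ = 2), and Knuth's method is used as a simple stand-in Poisson sampler.

```python
import math
import random


def transition_matrix(eps, Q):
    """P^eps = I + eps * Q for a generator Q given as a list of rows."""
    size = len(Q)
    return [[(1.0 if i == j else 0.0) + eps * Q[i][j] for j in range(size)]
            for i in range(size)]


def simulate_regimes(P, num_steps, rng, start=0):
    """Sample path of the Markov chain with transition matrix P (0-based states)."""
    state, path = start, []
    for _ in range(num_steps):
        path.append(state)
        u, cumulative = rng.random(), 0.0
        for nxt, prob in enumerate(P[state]):
            cumulative += prob
            if u < cumulative:
                state = nxt
                break
    return path


def poisson_sample(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def switching_demand(path, rates, rng):
    """Demand d[n] drawn from Poisson(rates[theta[n]]) along the regime path."""
    return [poisson_sample(rng, rates[theta]) for theta in path]


if __name__ == "__main__":
    rng = random.Random(0)
    Q = [[-0.5, 0.5], [0.5, -0.5]]
    P = transition_matrix(0.01, Q)
    regimes = simulate_regimes(P, 10_000, rng)
    demand = switching_demand(regimes, rates=[1.0, 10.0], rng=rng)
    print("fraction of time in regime with rate 10:", sum(regimes) / len(regimes))
```

With ε = 0.01 the chain leaves its current state with probability 0.5ε = 0.005 per step, so each regime persists for 200 iterations on average; smaller ε gives longer sojourns and an easier tracking problem.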
Figure 4.4: Proportion of simulation experiments on non-optimal feasible solutions when the global optimum jumps at $n = 10^3$.

Figure 4.5: Proportion of the time that global optimum estimate does not match the true global optimum versus the switching speed of the set of global optimizers. (Horizontal axis: speed of Markovian switching ε, from $10^{-3}$ to $10^{-1}$; vertical axis: % of time spent out of global optima; curves: AS, RS, UCB; the plot marks the point where the speed of Markovian switching ε matches the algorithm step-size µ.)

4.9 Concluding Remarks

Inspired by stochastic fictitious play learning rules in game-theory, this chapter presented a smooth best-response adaptive search algorithm for simulation-based discrete stochastic optimization problems that undergo random unpredictable changes. These problems are of practical interest in many fields such as engineering, supply chain management, logistics and transportation, management science, and operations research. The salient features of the proposed scheme include:

1. It assumes no functional properties such as sub-modularity, symmetry, or exchangeability on the objective function.
2. It allows temporal correlation of the data obtained from simulation experiments, which is more realistic in practice.
3. It can track non-stationary global optimum when it evolves on the same timescale as the updates of the adaptive search algorithm, and spends most of its simulation effort on the global optimizers set.

Numerical examples confirmed superior efficiency and convergence characteristics as compared with the existing random search and pure exploration methods.
The proposed method is simple, intuitive, and flexible enough to allow an end-user to exploit the structure inherent in the optimization problem at hand, and also exhibits attractive empirical performance.

Now that it has been shown that adaptive search algorithms inspired by game-theoretic learning schemes can be used to solve discrete stochastic optimization problems efficiently, it will be interesting to devise and evaluate adaptive search schemes of better-response type similar to the regret-matching learning rule of Chapter 2. This will be the subject of Chapter 5.

4.10 Proofs of Results

4.10.1 Proof of Theorem 4.2

Before proceeding with the proof, we shall recall the definition of tightness.

Definition 4.4 (Tightness). Let Z[n] be an $\mathbb{R}^r$-valued random vector. We say the sequence {Z[n]} is tight if for each η > 0, there exists a compact set $K_\eta$ such that $P(Z[n] \in K_\eta) \ge 1 - \eta$ for all n.

The definitions of weak convergence (see Definition 4.2) and tightness extend to random elements in more general metric spaces. On a complete separable metric space, tightness is equivalent to relative compactness, which is known as Prohorov’s Theorem [46]. By virtue of this theorem, we can extract convergent subsequences when tightness is verified. We thus start by proving the tightness of the interpolated process $x^\mu(\cdot)$.

Recall the discrete-time iterates x[n] in (4.30). In view of the boundedness of the objective function, for any $0 < T_1 < \infty$ and $n \le T_1/\mu$, {x[n]} is bounded w.p.1. As a result, for any ϑ > 0,

$$ \sup_{n \le T_1/\mu} \mathbb{E}\,\|x[n]\|^{\vartheta} < \infty \tag{4.53} $$

where in the above and hereafter ‖·‖ denotes the Euclidean norm. We will also use t/µ to denote the integer part of t/µ for each t > 0. Next, considering the interpolated process $x^\mu(\cdot)$, defined in (4.32), and the recursion (4.30), for any t, u > 0, δ > 0, and u < δ, it can be verified that

$$ x^\mu(t+u) - x^\mu(t) = \mu \sum_{k=t/\mu}^{(t+u)/\mu-1} \left[ A\big(s[k], \hat\phi[k], f_k(s[k])\big) - x[k] \right] + \sum_{k=t/\mu}^{(t+u)/\mu-1} \begin{bmatrix} F(\theta[k]) - F(\theta[k+1]) \\ \mathbf{0}_{S+1} \end{bmatrix} \tag{4.54} $$

where A(·, ·, ·) is defined in (4.31).
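Consequently, using the parallelogram law,

$$ \mathbb{E}^\mu_t \left\| x^\mu(t+u) - x^\mu(t) \right\|^2 \le 2\,\mathbb{E}^\mu_t \left\| \mu \sum_{k=t/\mu}^{(t+u)/\mu-1} \left[ A\big(s[k], \hat\phi[k], f_k(s[k])\big) - x[k] \right] \right\|^2 + 2\,\mathbb{E}^\mu_t \left\| \sum_{k=t/\mu}^{(t+u)/\mu-1} \left[ F(\theta[k]) - F(\theta[k+1]) \right] \right\|^2 \tag{4.55} $$

where $\mathbb{E}^\mu_t$ denotes the conditional expectation given the σ-algebra generated by the µ-dependent past data up to time t. By virtue of the tightness criteria [187, Theorem 3, p. 47] or [192, Chapter 7], it suffices to verify that

$$ \lim_{\delta\to 0}\, \limsup_{\mu\to 0}\, \mathbb{E}\left\{ \sup_{0\le u\le\delta} \mathbb{E}^\mu_t \left\| x^\mu(t+u) - x^\mu(t) \right\|^2 \right\} = 0. \tag{4.56} $$

Concerning the first term on the r.h.s. of (4.55), by boundedness of the objective function, we obtain

$$ \begin{aligned} \mathbb{E}^\mu_t \left\| \mu \sum_{k=t/\mu}^{(t+u)/\mu-1} \left[ A\big(s[k], \hat\phi[k], f_k(s[k])\big) - x[k] \right] \right\|^2 &= \mu^2 \sum_{\tau=t/\mu}^{(t+u)/\mu-1}\ \sum_{\kappa=t/\mu}^{(t+u)/\mu-1} \mathbb{E}^\mu_t \Big\{ \left[ A\big(s[\tau], \hat\phi[\tau], f_\tau(s[\tau])\big) - x[\tau] \right]^{\top} \left[ A\big(s[\kappa], \hat\phi[\kappa], f_\kappa(s[\kappa])\big) - x[\kappa] \right] \Big\} \\ &\le K\mu^2 \left( \frac{t+u}{\mu} - \frac{t}{\mu} \right)^2 = O(u^2). \end{aligned} \tag{4.57} $$

We then concentrate on the second term on the r.h.s. of (4.55). Note that, for sufficiently small positive µ, if Q is irreducible, then so is I + εQ. Thus, for sufficiently large k,

$$ \left\| (I + \varepsilon Q)^k - \mathbf{1}\nu^\varepsilon \right\|_M \le \lambda_c^k, \quad \text{for some } 0 < \lambda_c < 1 $$

where $\nu^\varepsilon$ denotes the row vector of stationary distribution associated with the transition matrix I + εQ, $\mathbf{1}$ denotes the column vector of ones, and $\|\cdot\|_M$ represents any matrix norm. The essential feature involved in the second term in (4.55) is the difference of the transition probability matrices of the form

$$ (I + \varepsilon Q)^{k-(t/\mu)} - (I + \varepsilon Q)^{k+1-(t/\mu)}. $$

However, it can be seen that

$$ (I + \varepsilon Q)^{k-(t/\mu)} - (I + \varepsilon Q)^{k+1-(t/\mu)} = -\varepsilon Q \left[ (I + \varepsilon Q)^{k-(t/\mu)} - \mathbf{1}_\Theta \nu^\varepsilon \right]. $$

This in turn implies that

$$ \mathbb{E}^\mu_t \left\| \sum_{k=t/\mu}^{(t+u)/\mu-1} \begin{bmatrix} F(\theta[k]) - F(\theta[k+1]) \\ \mathbf{0}_{S+1} \end{bmatrix} \right\|^2 \le K\,O(\varepsilon) \sum_{k=t/\mu}^{(t+u)/\mu-1} \lambda_c^k \le O(\varepsilon) \sum_{k=1}^{\infty} \lambda_c^k = O(\mu). \tag{4.58} $$

Note that, in the last equality, we used Assumption 4.2. Finally, combining (4.57) and (4.58), the tightness criterion (4.56) is verified. Therefore, $x^\mu(\cdot)$ is tight in $D([0,\infty) : \mathbb{R}^{2S+1})$, which denotes the space of functions that are defined in [0,∞), taking values in $\mathbb{R}^{2S+1}$, and are right continuous and have left limits (càdlàg functions) with Skorohod topology (see [192, p.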
228]). In view of [290, Proposition 4.4], $\theta^\mu(\cdot)$ is also tight, and $\theta^\mu(\cdot)$ converges weakly to θ(·) such that θ(·) is a continuous time Markov chain with generator Q given in Assumption 2.1. As a result, the pair $(x^\mu(\cdot), \theta^\mu(\cdot))$ is tight in $D([0,\infty) : \mathbb{R}^{2S+1} \times \mathcal{Q})$.

Using Prohorov’s theorem [192], one can extract a convergent subsequence. For notational simplicity, we still denote the subsequence by $x^\mu(\cdot)$, and its limit by x(·). By the Skorohod representation theorem [192], $x^\mu(\cdot) \to x(\cdot)$ w.p.1, and the convergence is uniform on any compact interval. We now proceed to characterize the limit x(·) using martingale averaging methods.

First, we demonstrate that the last term in (4.54) contributes nothing to the limit ODE. We aim to show

$$ \lim_{\mu\to 0} \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big)\, \mathbb{E}^\mu_t \sum_{k=t/\mu}^{(t+u)/\mu-1} \begin{bmatrix} F(\theta[k]) - F(\theta[k+1]) \\ \mathbf{0}_{S+1} \end{bmatrix} = 0. $$

This directly follows from an argument similar to the one used in (4.58).

To obtain the desired limit, it will be proved that the limit (x(·), θ(·)) is the solution of the martingale problem with operator L defined below: For all $i \in \mathcal{Q}$,

$$ \begin{aligned} \mathcal{L}y(x, i) &= \left[ \nabla_x y(x, i) \right]^{\top} \left[ G(x, i) - x \right] + Qy(x, \cdot)(i), \\ Qy(x, \cdot)(i) &= \sum_{j\in\mathcal{Q}} q_{ij}\, y(x, j), \end{aligned} \tag{4.59} $$

and, for each $i \in \mathcal{Q}$, $y(\cdot, i) : \mathbb{R}^{2S+1} \to \mathbb{R}$ with $y(\cdot, i) \in C_0^1$ ($C^1$ function with compact support). Further, $\nabla_x y(x, i)$ denotes the gradient of y(x, i) with respect to x, and G(·, ·) is defined in (4.34). Using an argument similar to [291, Lemma 7.18], one can show that the martingale problem associated with the operator L has a unique solution. Thus, it remains to prove that the limit (x(·), θ(·)) is the solution of the martingale problem.
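To this end, it suffices to show that, for any arbitrary positive integer $\kappa_0$, and for any t, u > 0, $0 < t_\iota \le t$ for all $\iota \le \kappa_0$, and any bounded continuous function h(·, i) for all $i \in \mathcal{Q}$,

$$ \mathbb{E}\, h\big(x(t_\iota), \theta(t_\iota) : \iota \le \kappa_0\big) \left[ y\big(x(t+u), \theta(t+u)\big) - y\big(x(t), \theta(t)\big) - \int_t^{t+u} \mathcal{L}y\big(x(v), \theta(v)\big)\, dv \right] = 0. \tag{4.60} $$

To verify (4.60), we work with $(x^\mu(\cdot), \theta^\mu(\cdot))$ and prove that the above equation holds as µ → 0. By the weak convergence of $(x^\mu(\cdot), \theta^\mu(\cdot))$ to (x(·), θ(·)) and Skorohod representation, it can be shown that

$$ \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \left[ y\big(x^\mu(t+u), \theta^\mu(t+u)\big) - y\big(x^\mu(t), \theta^\mu(t)\big) \right] \to \mathbb{E}\, h\big(x(t_\iota), \theta(t_\iota) : \iota \le \kappa_0\big) \left[ y\big(x(t+u), \theta(t+u)\big) - y\big(x(t), \theta(t)\big) \right]. $$

Now, choose a sequence of integers $\{n_\mu\}$ such that $n_\mu \to \infty$ as µ → 0, but $\delta_\mu = \mu n_\mu \to 0$, and partition [t, t + u] into subintervals of length $\delta_\mu$. Then,

$$ \begin{aligned} y\big(x^\mu(t+u), \theta^\mu(t+u)\big) - y\big(x^\mu(t), \theta^\mu(t)\big) &= \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \left[ y\big(x[\ell n_\mu + n_\mu], \theta[\ell n_\mu + n_\mu]\big) - y\big(x[\ell n_\mu], \theta[\ell n_\mu]\big) \right] \\ &= \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \left[ y\big(x[\ell n_\mu + n_\mu], \theta[\ell n_\mu + n_\mu]\big) - y\big(x[\ell n_\mu], \theta[\ell n_\mu + n_\mu]\big) \right] \\ &\quad + \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \left[ y\big(x[\ell n_\mu], \theta[\ell n_\mu + n_\mu]\big) - y\big(x[\ell n_\mu], \theta[\ell n_\mu]\big) \right] \end{aligned} \tag{4.61} $$

where $\sum_{\ell:\,\ell\delta_\mu = t}^{t+u}$ denotes the sum over ℓ in the range $t \le \ell\delta_\mu \le t + u$.

First, we consider the second term on the r.h.s. of (4.61):

$$ \begin{aligned} &\lim_{\mu\to 0} \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \left[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \left[ y\big(x[\ell n_\mu], \theta[\ell n_\mu + n_\mu]\big) - y\big(x[\ell n_\mu], \theta[\ell n_\mu]\big) \right] \right] \\ &\quad = \lim_{\mu\to 0} \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \left[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u}\ \sum_{i_0=1}^{\Theta}\ \sum_{j_0=1}^{\Theta}\ \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} \left[ y\big(x[\ell n_\mu], j_0\big) P\big(\theta[k+1] = j_0 \,\big|\, \theta[k] = i_0\big) - y\big(x[\ell n_\mu], i_0\big) \right] I\big(\theta[k] = i_0\big) \right] \\ &\quad = \mathbb{E}\, h\big(x(t_\iota), \theta(t_\iota) : \iota \le \kappa_0\big) \left[ \int_t^{t+u} Qy\big(x(v), \theta(v)\big)\, dv \right]. \end{aligned} \tag{4.62} $$

For the first term on the r.h.s.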
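of (4.61), we obtain

$$ \begin{aligned} &\lim_{\mu\to 0} \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \left[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \left[ y\big(x[\ell n_\mu + n_\mu], \theta[\ell n_\mu + n_\mu]\big) - y\big(x[\ell n_\mu], \theta[\ell n_\mu + n_\mu]\big) \right] \right] \\ &= \lim_{\mu\to 0} \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \left[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \left[ y\big(x[\ell n_\mu + n_\mu], \theta[\ell n_\mu]\big) - y\big(x[\ell n_\mu], \theta[\ell n_\mu]\big) \right] \right] \\ &= \lim_{\mu\to 0} \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \left[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \Big( \nabla_{\hat\phi} y^{\top} \big[ \hat\phi[\ell n_\mu + n_\mu] - \hat\phi[\ell n_\mu] \big] + \nabla_r y \big[ r[\ell n_\mu + n_\mu] - r[\ell n_\mu] \big] + \nabla_z y^{\top} \big[ z[\ell n_\mu + n_\mu] - z[\ell n_\mu] \big] \Big) \right] \\ &= \lim_{\mu\to 0} \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \Bigg[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \delta_\mu \Bigg( \nabla_{\hat\phi} y^{\top} \Bigg[ \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} \Big[ g\big(s[k], \hat\phi[k] + F(\theta[\ell n_\mu]), f_k(s[k])\big) - F(\theta[\ell n_\mu]) \Big] - \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} \hat\phi[k] \Bigg] \\ &\qquad + \nabla_r y \Bigg[ \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} \Big[ f_k(s[k]) - F_{\min}(\theta[\ell n_\mu]) \Big] - \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} r[k] \Bigg] + \nabla_z y^{\top} \Bigg[ \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} e_{s[k]} - \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} z[k] \Bigg] \Bigg) \Bigg] \end{aligned} \tag{4.63} $$

where $\nabla_x y$ denotes the gradient column vector with respect to vector x. For notational simplicity, we use $\nabla_{\hat\phi} y$, $\nabla_r y$, and $\nabla_z y$ to denote $\nabla_{\hat\phi} y(x[\ell n_\mu], \theta[\ell n_\mu])$, $\nabla_r y(x[\ell n_\mu], \theta[\ell n_\mu])$, and $\nabla_z y(x[\ell n_\mu], \theta[\ell n_\mu])$, respectively. The rest of the proof is divided into three steps, each concerning one of the three terms in (4.63).

Step 1. We start by looking at

$$ \begin{aligned} &\lim_{\mu\to 0} \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \Bigg[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_{\hat\phi} y^{\top} \Bigg[ \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} \Big[ g\big(s[k], \hat\phi[k] + F(\theta[\ell n_\mu]), f_k(s[k])\big) - F(\theta[\ell n_\mu]) \Big] \Bigg] \Bigg] \\ &= \lim_{\mu\to 0} \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \Bigg[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_{\hat\phi} y^{\top} \Bigg( -F(\theta[\ell n_\mu]) + \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} \sum_{\check\theta=1}^{\Theta} \Bigg[ \sum_{\theta=1}^{\Theta} \mathbb{E}_{\ell n_\mu} \Big\{ g\big(s[k], \hat\phi[k] + F(\theta[\ell n_\mu]), f_k(s[k], \theta)\big) \Big\} \cdot \mathbb{E}_{\ell n_\mu} \Big\{ I\big(\theta[k] = \theta \,\big|\, \theta[\ell n_\mu] = \check\theta\big) \Big\} \Bigg] \Bigg) \Bigg] \end{aligned} \tag{4.64} $$

where $f_k(\cdot, \theta)$ is defined in Assumption 4.1, and concentrate on the term involving the Markov chain θ[k]. Note that for large k with $\ell n_\mu \le k \le \ell n_\mu + n_\mu$ and $k - \ell n_\mu \to \infty$, by [290, Proposition 4.4], for some $\hat k_0 > 0$,

$$ (I + \varepsilon Q)^{k - \ell n_\mu} = Z\big((k - \ell n_\mu)\varepsilon\big) + O\Big(\varepsilon + \exp\big(-\hat k_0 (k - \ell n_\mu)\big)\Big), \qquad \frac{dZ(t)}{dt} = Z(t)\,Q, \quad Z(0) = I. $$

For $\ell n_\mu \le k \le \ell n_\mu + n_\mu$, ε = O(µ) yields that $(k - \ell n_\mu)\varepsilon \to 0$ as µ → 0. For such k, $Z((k - \ell n_\mu)\varepsilon) \to I$.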
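Therefore, by the boundedness of g(·, ·), it follows that, as µ → 0,

$$ \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} \Big\| \mathbb{E}_{\ell n_\mu}\, g\big(s[k], \hat\phi[k] + F(\theta[\ell n_\mu]), f_k(s[k], \theta)\big) \Big\| \times \Big| \mathbb{E}_{\ell n_\mu} \big[ I(\theta[k] = \theta) \,\big|\, I(\theta[\ell n_\mu] = \check\theta) \big] - I(\theta[\ell n_\mu] = \check\theta) \Big| \to 0. $$

Therefore,

$$ \begin{aligned} &\lim_{\mu\to 0} \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \Bigg[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_{\hat\phi} y^{\top} \Bigg( \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} \Big[ g\big(s[k], \hat\phi[k] + F(\theta[\ell n_\mu]), f_k(s[k])\big) - F(\theta[\ell n_\mu]) \Big] \Bigg) \Bigg] \\ &= \lim_{\mu\to 0} \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \Bigg[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_{\hat\phi} y^{\top} \Bigg( \sum_{\check\theta=1}^{\Theta} I(\theta[\ell n_\mu] = \check\theta) \Bigg[ -F(\check\theta) + \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} \mathbb{E}_{\ell n_\mu} \Big\{ g\big(s[k], \hat\phi[k] + F(\check\theta), f_k(s[k], \check\theta)\big) \Big\} \Bigg] \Bigg) \Bigg]. \end{aligned} \tag{4.65} $$

It is more convenient to work with the individual elements of g(·, ·, ·). Substituting for the i-th element from (4.16) in (4.65) results in

$$ \begin{aligned} &\lim_{\mu\to 0} \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \Bigg[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_{\hat\phi} y^{\top} \Bigg( \sum_{\check\theta=1}^{\Theta} I(\theta[\ell n_\mu] = \check\theta) \Bigg[ -F(i, \check\theta) + \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} \mathbb{E}_{\ell n_\mu} \Bigg\{ \frac{f_k(i, \check\theta)}{b_i^\gamma\big(\hat\phi[k] + F(\check\theta)\big)} \cdot I(s[k] = i) \Bigg\} \Bigg] \Bigg) \Bigg] \\ &= \lim_{\mu\to 0} \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \Bigg[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_{\hat\phi} y^{\top} \Bigg( \sum_{\check\theta=1}^{\Theta} \Bigg[ -F(i, \check\theta) + \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} \mathbb{E}_{\ell n_\mu} \big\{ f_k(i, \check\theta) \big\} \Bigg] I(\theta[\ell n_\mu] = \check\theta) \Bigg) \Bigg]. \end{aligned} \tag{4.66} $$

In (4.66), we used

$$ \mathbb{E}_{\ell n_\mu} \{ I(s[k] = i) \} = b_i^\gamma(\phi[\ell n_\mu]) $$

since s[k] is chosen according to the smooth best-response strategy $b^\gamma(\phi[k])$; see Step 1 in Algorithm 4.1. Recall that

$$ \phi[\ell n_\mu] = \hat\phi[\ell n_\mu] + F(\theta[\ell n_\mu]). $$

However, in the last line of (4.65), $\theta[\ell n_\mu] = \check\theta$ is held fixed. Therefore, $\phi[\ell n_\mu] = \hat\phi[\ell n_\mu] + F(\check\theta)$. Note that $f_k(i, \check\theta)$ is still time-dependent due to the presence of noise in the simulation data. Note further that $\theta[\ell n_\mu] = \theta^\mu(\mu\ell n_\mu)$. In light of Assumptions 4.1 and 4.2, by the weak convergence of $\theta^\mu(\cdot)$ to θ(·), the Skorohod representation, and using $\mu\ell n_\mu \to v$, it can be shown for the second term in (4.66) that, as µ → 0,

$$ \sum_{\check\theta=1}^{\Theta} \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} \mathbb{E}_{\ell n_\mu} \big\{ f_k(i, \check\theta) \big\}\, I\big(\theta^\mu(\mu\ell n_\mu) = \check\theta\big) \to \sum_{\check\theta=1}^{\Theta} F(i, \check\theta)\, I\big(\theta(v) = \check\theta\big) = F(i, \theta(v)) \quad \text{in probability.} \tag{4.67} $$

Using a similar argument for the first term in (4.66) yields

$$ \lim_{\mu\to 0} \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \Bigg[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_{\hat\phi} y^{\top} \Bigg( \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} \Big[ g\big(s[k], \hat\phi[k] + F(\theta[\ell n_\mu]), f_k(s[k])\big) - F(\theta[\ell n_\mu]) \Big] \Bigg) \Bigg] = \mathbf{0}_S. $$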
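(4.68)

By using the technique of stochastic approximation (see, e.g., [192, Chapter 8]), it can be shown that

$$ \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \Bigg[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_{\hat\phi} y^{\top} \Bigg[ \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} \hat\phi[k] \Bigg] \Bigg] \to \mathbb{E}\, h\big(x(t_\iota), \theta(t_\iota) : \iota \le \kappa_0\big) \left[ \int_t^{t+u} \left[ \nabla_{\hat\phi} y\big(x(v), \theta(v)\big) \right]^{\top} \hat\phi(v)\, dv \right] \quad \text{as } \mu \to 0. \tag{4.69} $$

Step 2. Next, we concentrate on the second term in (4.63). By virtue of the boundedness of $f_k(s, \theta)$ (see Assumption 4.1), and using a similar argument as in Step 1,

$$ \begin{aligned} &\lim_{\mu\to 0} \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \Bigg[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \frac{\delta_\mu}{n_\mu} \nabla_r y \Bigg( \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} \sum_{\check\theta=1}^{\Theta} \Bigg[ \sum_{i=1}^{S} \mathbb{E}_{\ell n_\mu} \big\{ f_k(i, \check\theta)\, I(s[k] = i) \big\} - F_{\min}(\check\theta) \Bigg] I(\theta[\ell n_\mu] = \check\theta) \Bigg) \Bigg] \\ &= \lim_{\mu\to 0} \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \Bigg[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \frac{\delta_\mu}{n_\mu} \nabla_r y \Bigg( \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} \sum_{\check\theta=1}^{\Theta} \Bigg[ \sum_{i=1}^{S} b_i^\gamma(\phi[\ell n_\mu])\, \mathbb{E}_{\ell n_\mu} \big\{ f_k(i, \check\theta) \big\} - F_{\min}(\check\theta) \Bigg] I(\theta[\ell n_\mu] = \check\theta) \Bigg) \Bigg] \\ &= \lim_{\mu\to 0} \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \Bigg[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_r y \Bigg( \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} \sum_{\check\theta=1}^{\Theta} I\big(\theta^\mu(\mu\ell n_\mu) = \check\theta\big) \Bigg[ \sum_{i=1}^{S} b_i^\gamma\big(\phi^\mu(\mu\ell n_\mu)\big)\, \mathbb{E}_{\ell n_\mu} \big\{ f_k(i, \check\theta) \big\} - F_{\min}(\check\theta) \Bigg] \Bigg) \Bigg]. \end{aligned} \tag{4.70} $$

Here, we used $\mathbb{E}_{\ell n_\mu}\{ I(s[k] = i) \} = b_i^\gamma(\phi[\ell n_\mu])$ as in Step 1. Recall from (4.28) that

$$ \phi[k] = \hat\phi[k] + F(\theta[k]). $$

By weak convergence of $\theta^\mu(\cdot)$ to θ(·), the Skorohod representation, and using $\mu\ell n_\mu \to v$ and Assumptions 4.1 and 4.2, it can then be shown that

$$ \sum_{\check\theta=1}^{\Theta} \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} b_i^\gamma\big(\phi^\mu(\mu\ell n_\mu)\big)\, \mathbb{E}_{\ell n_\mu} \big\{ f_k(i, \check\theta) \big\}\, I\big(\theta^\mu(\mu\ell n_\mu) = \check\theta\big) \to b_i^\gamma\big(\hat\phi(v) + F(\theta(v))\big)\, F(i, \theta(v)) \quad \text{in probability as } \mu \to 0. \tag{4.71} $$

Using a similar argument for the second term in (4.70), we conclude that, as µ → 0,

$$ \begin{aligned} &\mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \Bigg[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_r y \Bigg( \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} \Big[ f_k(s[k]) - F_{\min}(\theta[\ell n_\mu]) \Big] \Bigg) \Bigg] \\ &\quad \to \mathbb{E}\, h\big(x(t_\iota), \theta(t_\iota) : \iota \le \kappa_0\big) \left[ \int_t^{t+u} \nabla_r y\big(x(v), \theta(v)\big) \Big( b^\gamma\big(\hat\phi(v) + F(\theta(v))\big) \cdot F(\theta(v)) - F_{\min}(\theta(v)) \Big)\, dv \right]. \end{aligned} \tag{4.72} $$

Finally, similar to (4.69),

$$ \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \Bigg[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_r y \Bigg( \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} r[k] \Bigg) \Bigg] \to \mathbb{E}\, h\big(x(t_\iota), \theta(t_\iota) : \iota \le \kappa_0\big) \left[ \int_t^{t+u} \nabla_r y\big(x(v), \theta(v)\big)\, r(v)\, dv \right] \quad \text{as } \mu \to 0. \tag{4.73} $$

Step 3. Next, we concentrate on the last term in (4.63).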
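Using similar arguments as in Steps 1 and 2,

$$ \begin{aligned} &\lim_{\mu\to 0} \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \Bigg[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_z y^{\top} \Bigg[ \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} \sum_{i=1}^{S} e_i \cdot \mathbb{E}_{\ell n_\mu} \{ I(s[k] = i) \} \Bigg] \Bigg] \\ &= \lim_{\mu\to 0} \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \Bigg[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_z y^{\top} \Bigg[ \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} \sum_{i=1}^{S} b_i^\gamma(\phi[\ell n_\mu]) \cdot e_i \Bigg] \Bigg] \\ &= \lim_{\mu\to 0} \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \Bigg[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_z y^{\top} \Bigg[ \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} \sum_{\check\theta=1}^{\Theta} I(\theta[\ell n_\mu] = \check\theta) \Bigg[ \sum_{i=1}^{S} b_i^\gamma\big(\hat\phi[\ell n_\mu] + F(\check\theta)\big) \cdot e_i \Bigg] \Bigg] \Bigg]. \end{aligned} \tag{4.74} $$

Noting that $\theta[\ell n_\mu] = \theta^\mu(\mu\ell n_\mu)$, by the weak convergence of $\theta^\mu(\cdot)$ to θ(·), the Skorohod representation, and using $\mu\ell n_\mu \to v$, it can then be shown that

$$ \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} \sum_{\check\theta=1}^{\Theta} I\big(\theta^\mu(\mu\ell n_\mu) = \check\theta\big) \Bigg[ \sum_{i=1}^{S} b_i^\gamma\big(\hat\phi^\mu(\mu\ell n_\mu) + F(\check\theta)\big) \cdot e_i \Bigg] \to b^\gamma\big(\hat\phi(v) + F(\theta(v))\big) \quad \text{in probability as } \mu \to 0. \tag{4.75} $$

Therefore, as µ → 0,

$$ \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \Bigg[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_z y^{\top} \Bigg[ \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} e_{s[k]} \Bigg] \Bigg] \to \mathbb{E}\, h\big(x(t_\iota), \theta(t_\iota) : \iota \le \kappa_0\big) \left[ \int_t^{t+u} \nabla_z y^{\top}\big(x(v), \theta(v)\big)\, b^\gamma\big(\hat\phi(v) + F(\theta(v))\big)\, dv \right]. \tag{4.76} $$

Finally, similar to (4.69),

$$ \mathbb{E}\, h\big(x^\mu(t_\iota), \theta^\mu(t_\iota) : \iota \le \kappa_0\big) \Bigg[ \sum_{\ell:\,\ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_z y^{\top} \Bigg[ \frac{1}{n_\mu} \sum_{k=\ell n_\mu}^{\ell n_\mu + n_\mu - 1} z[k] \Bigg] \Bigg] \to \mathbb{E}\, h\big(x(t_\iota), \theta(t_\iota) : \iota \le \kappa_0\big) \left[ \int_t^{t+u} \nabla_z y^{\top}\big(x(v), \theta(v)\big)\, z(v)\, dv \right] \quad \text{as } \mu \to 0. \tag{4.77} $$

Combining Steps 1 to 3 concludes the proof.

4.10.2 Proof of Theorem 4.3

Define the Lyapunov function:

$$ V_\theta(r) = r^2 $$

where the subscript θ emphasizes that the switching signal θ(t) is held fixed at θ in (4.36). Taking the time derivative, and applying (4.36), we obtain

$$ \frac{d}{dt} V_\theta(r) = 2r \cdot \left[ b^\gamma\big(F(\theta)\big) \cdot F(\theta) - F_{\min}(\theta) - r \right]. $$

Since the objective function value at various states is bounded for each $\theta \in \mathcal{Q}$,

$$ \frac{d}{dt} V_\theta(r) \le 2r \cdot \left[ C(\gamma, \theta) - r \right] \tag{4.78} $$

for some constant C(γ, θ). Recall the smooth best-response sampling strategy $b^\gamma(\cdot)$ in Definition 4.1. The parameter γ simply determines the magnitude of perturbations applied to the objective function. It is then clear that C(γ, θ) is monotonically increasing in γ.

In view of (4.78), for each η > 0, $\hat\gamma$ can be chosen small enough such that, if $\gamma \le \hat\gamma$ and r ≥ η,

$$ 2C(\gamma, \theta) \le \eta, \quad \text{hence,} \quad \frac{d}{dt} V_\theta(r) \le -V_\theta(r). $$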
(4.79)

Therefore, each deterministic subsystem is globally asymptotically stable and, for $\gamma \le \hat\gamma$,

$$ \lim_{t\to\infty} \mathrm{dist}\big[ r(t), \mathbb{R}_{[0,\eta)} \big] = 0. $$

Next, we use the above Lyapunov function argument to extend [67, Corollary 12] to prove global asymptotic stability w.p.1 of the Markovian switched limit dynamical systems. Before proceeding with the theorem, we recall the definition of class $\mathcal{K}_\infty$ functions.

Definition 4.5. A continuous function $a : \mathbb{R}_+ \to \mathbb{R}_+$ is a class $\mathcal{K}_\infty$ function if: (i) it is strictly increasing, (ii) a(0) = 0, and (iii) a(x) → ∞ if x → ∞.

Theorem 4.4 ([67, Corollary 12]). Consider the switching system (4.37) in Definition 4.3, where θ(t) is the state of a continuous time Markov chain with generator Q. Define

$$ q := \max_{\theta\in\mathcal{Q}} |q_{\theta\theta}|, \quad \text{and} \quad \tilde q := \max_{\theta,\theta'\in\mathcal{Q}} q_{\theta\theta'}. $$

Suppose there exist continuously differentiable functions $V_\theta : \mathbb{R}^r \to \mathbb{R}_+$, $\theta \in \mathcal{Q}$, two class $\mathcal{K}_\infty$ functions $a_1$, $a_2$, and a real number v > 1 such that the following hold:

1. $a_1(\mathrm{dist}[Y, H]) \le V_\theta(Y) \le a_2(\mathrm{dist}[Y, H])$, $\forall Y \in \mathbb{R}^r$, $\theta \in \mathcal{Q}$,
2. $\frac{\partial V_\theta}{\partial Y} f_\theta(Y) \le -\lambda V_\theta(Y)$, $\forall Y \in \mathbb{R}^r$, $\forall \theta \in \mathcal{Q}$,
3. $V_\theta(Y) \le v V_{\theta'}(Y)$, $\forall Y \in \mathbb{R}^r$, $\theta, \theta' \in \mathcal{Q}$,
4. $(\lambda + \tilde q)/q > v$.

Then, the regime-switching system (4.37) is globally asymptotically stable almost surely.

The quadratic Lyapunov function (4.78) satisfies hypothesis 2 in Theorem 4.4; see (4.79). Further, since the Lyapunov functions are the same for all subsystems $\theta \in \mathcal{Q}$, existence of v > 1 in hypothesis 3 is automatically guaranteed. Hypothesis 4 simply ensures that the switching signal θ(t) is slow enough. Given that λ = 1 in hypothesis 2, it remains to ensure that the generator Q of the Markov chain θ(t) satisfies $(1 + \tilde q)/q > 1$. This is also satisfied since, by (2.30), $|q_{\theta\theta'}| \le 1$ for all $\theta, \theta' \in \mathcal{Q}$.

4.10.3 Proof of Corollary 4.1

We only give an outline of the proof, which essentially follows from Theorems 4.2 and 4.3. Define

$$ \hat y^\mu(\cdot) = y^\mu(\cdot + t_\mu). $$

Then, it can be shown that $\hat y^\mu(\cdot)$ is tight. For any $T_1 < \infty$, take a weakly convergent subsequence of

$$ \big\{ \hat y^\mu(\cdot),\ \hat y^\mu(\cdot - T_1) \big\}, $$

and denote the limit by $\big( \hat y(\cdot), \hat y_{T_1}(\cdot) \big)$. Note that $\hat y(0) = \hat y_{T_1}(T_1)$.
The value of $\hat y_{T_1}(0)$ may be unknown, but the set of all possible values of $\hat y_{T_1}(0)$ (over all $T_1$ and convergent subsequences) belongs to a tight set. Using this and Theorems 4.2 and 4.3, for any ϱ > 0, there exists a $T_\varrho < \infty$ such that for all $T_1 > T_\varrho$,

$$ \mathrm{dist}\big[ \hat y_{T_1}(T_1), \mathcal{Y}_\eta \big] \ge 1 - \varrho. $$

This implies that

$$ \mathrm{dist}\big[ \hat y(0), \mathcal{Y}_\eta \big] \ge 1 - \varrho $$

and the desired result follows.

5 Regret-Based Adaptive Search for Discrete Stochastic Optimization

“True optimization is the revolutionary contribution of modern research to decision processes.”
- George B. Dantzig

5.1 Introduction

Similar to Chapter 4, this chapter also considers simulation-based discrete optimization problems that undergo random unpredictable changes. The simulation experiments performed for evaluating the objective function are allowed to provide correlated information, which is more realistic in many systems of practical interest. The global optimum jumps as the parameters underlying the problem, such as the profile of the stochastic events or the objective function, randomly evolve with time. The time varying parameters are assumed to be unobservable with unknown dynamics. This problem can be interpreted as tracking the optimal operating configuration of a stochastic system subject to uncontrollable time inhomogeneity. Tracking such non-stationary optima lies at the very heart of stochastic approximation methods. Details on simulation-based discrete optimization problems and the related challenges can be found in Section 4.1. We merely recall that there are two main difficulties involved in solving such problems: (i) the optimization problem itself is NP-hard, and (ii) experimenting on a trial feasible solution can be too computationally expensive to be repeated numerous times.

The sampling-based adaptive search scheme presented in Chapter 4 relies on a sampling strategy that is of best-response type.
That is, it keeps sampling the candidate which is thus far conceived to guarantee the highest “performance.” This is an appealing feature in simulation-based discrete stochastic optimization problems as one aims to localize the globally optimum candidate with as little computational expense as possible. Inspired by the regret-matching adaptive learning algorithm of Chapter 2, this chapter proposes a regret-based adaptive search algorithm that is of better-response type. That is, it keeps sampling and evaluating all those candidates that have been found promising based on the limited simulation data collected thus far. The proposed method belongs to the class of randomized search schemes with fixed neighborhood structure.

The regret-based adaptive search algorithm proceeds as follows: At each time, a sample is taken from the search space according to a potential-based sampling strategy which relies on a simple and intuitive regret measure. The system performance is then simulated at the sampled configuration, and fed into a stochastic approximation algorithm to update the regrets and, hence, the sampling strategy. The appeal of sampling-based methods is that they often approximate well, with a relatively small number of samples, problems with a large number of scenarios; see [55, 199] for numerical reports on sampling-based methods.

5.1.1 Main Results

A typical method for analyzing an algorithm that aims at tracking non-stationary optima is to postulate a hyper-model for the underlying time variations [40]. This hyper-model may represent the dynamics of the profile of the stochastic events in a system or changes in the performance measure (or both). Here, the hyper-model is assumed to be a discrete-time finite-state Markov chain, which perfectly captures the dynamics of the jumping global optimum when the feasible set is integer-ordered. It is well known that, if the hyper-model changes too drastically, there is no chance the time-varying optimum can be tracked.
On the other hand, if the hyper-model evolves too slowly, its variation can be ignored. This chapter, similar to Chapter 4, studies the non-trivial case where the hyper-model evolves on the same timescale as the adaptive search algorithm. This is in contrast to most existing algorithms that are designed to locate stationary global optimum.

The main results in this chapter are summarized below:

1. A regret-based adaptive search scheme of better-response type that is inspired by the regret-matching learning rule in Chapter 2. The proposed algorithm assumes no functional properties such as sub-modularity, symmetry, or exchangeability on the objective function, allows temporally correlated sequences of simulation data, and can as well be deployed in static discrete stochastic optimization problems.

2. Performance analysis of the regret-based adaptive search algorithm. We prove that, if the evolution of the parameters underlying the discrete stochastic optimization problem occurs on the same timescale as the updates of the regret-based adaptive search algorithm, it can properly track the global optimum by showing that the worst case regret can be made and kept arbitrarily small infinitely often. Further, if the time variations occur on a slower timescale than the adaptation rate of the proposed algorithm, the most frequently sampled element of the search space tracks the global optimum as it jumps over time. Indeed, the proportion of experiments completed on a non-optimal candidate solution is inversely proportional to how far its associated objective value is from the global optimum.

3. Numerical evaluation. We use two examples to evaluate and compare the performance of the proposed scheme. The first example is the same as that in Section 4.8, and concerns a production firm that aims to learn the order size that occurs with maximum probability. The second example is a discrete-event simulation optimization for a classic (s, S) inventory management policy.
Both examples verify faster convergence and improved efficiency as compared with the random search method in [9, 290], the simulated annealing algorithm in [5], and the upper confidence bound schemes of [22, 115].

Both analytical results and numerical evaluations are somewhat surprising as they show that, by properly choosing the potential function in the sampling strategy, one can obtain performance characteristics with better-response type adaptive search methods that are similar to those obtained via adaptive search schemes of best-response type in Chapter 4.

We use the well-known ordinary differential equation (ODE) approach [50, 192] to analyze the tracking capability of the proposed algorithm. First, by a combined use of weak convergence methods [192] and treatment on Markov switched systems [291, 293], it is shown that the limit system associated with the discrete-time iterates of the algorithm is a switched ODE modulated by a continuous-time Markov chain. This is in contrast to the standard treatment of stochastic approximations, where the limiting dynamics converge to a deterministic ODE. Next, using Lyapunov function methods [66, 197] for switched systems, it is proved that the limit system is globally asymptotically stable almost surely. This in turn establishes that the limit points of the switched ODE and the discrete-time iterates of the adaptive search algorithm coincide. Finally, the tracking and efficiency results are established via characterizing the set of global attractors of the limit system.

5.1.2 Chapter Organization

The rest of this chapter is organized as follows: Section 5.2 reviews the regime switching discrete stochastic optimization problem introduced in Section 4.3. Section 5.3 then presents the regret-based adaptive search algorithm and elaborates on the distinct properties that differentiate it from the adaptive search algorithm proposed in Section 4.4. The main theorem of this chapter entailing the tracking and efficiency properties is given in Section 5.4.
A step-by-step proof of the main results is summarized in Section 5.5. The case of slowly evolving discrete stochastic optimization problem is analyzed in Section 5.6. Finally, numerical examples are provided in Section 5.7 followed by the concluding remarks in Section 5.8. The detailed proofs are relegated to Section 5.9 for clarity of presentation.

5.2 Non-stationary Discrete Stochastic Optimization

This section briefly recalls the non-stationary discrete stochastic optimization problem, introduced in Section 4.3, and spells out the main assumption made on the sequence of simulation data.

A non-stationary discrete stochastic optimization problem is of the form:

$$ \min_{s\in\mathcal{S}} \mathbb{E}\big\{ F\big(s, x_n(s, \theta[n]), \theta[n]\big) \big\} \tag{5.1} $$

where $\mathcal{S} = \{1, 2, \ldots, S\}$ is the finite integer-ordered set of candidate solutions. The integer-valued process θ[n] takes values in the set $\mathcal{Q} = \{1, \ldots, \Theta\}$, and captures evolution of the parameters underlying the problem, e.g., profile of stochastic events or objective function. It is assumed that θ[n] is unobservable with unknown dynamics. The random process $x_n(s, \theta)$ represents the stochastic events occurring in the system, and its distribution can vary both across the set of candidate solutions and the state-space of the underlying time variations. Finally, $F : \mathcal{S} \times \mathbb{R}^r \times \mathcal{Q} \to \mathbb{R}$ is the bounded performance function of interest for a stochastic system. Note that the set of optimizers of (5.1), denoted by $\mathcal{M}(\theta[n])$, is non-stationary and evolves following the sample path of θ[n].
One is, therefore, interested in optimizing the expected performance of a stochastic system that undergoes random unpredictable changes over time. However, since no distributional assumption is made on $x_n(s, \theta[n])$, the expectation cannot be evaluated analytically.

In the sequel, we consider a simulation-based variant of the above discrete stochastic optimization problem, where the explicit relation between the decision variables and the objective function is unknown, and simulation is used to obtain realizations of the objective function values, denoted by
\[ f_n(s, \theta[n]) := F(s, x_n(s, \theta[n]), \theta[n]). \tag{5.2} \]
Therefore, the non-stationary discrete stochastic optimization problem in (5.1) becomes
\[ \min_{s \in \mathcal{S}} E\{ f_n(s, \theta[n]) \}. \tag{5.3} \]
Note that each stochastic process $f_n(s, \theta[n])$ in (4.7) is no longer stationary even though $f_n(s, i)$ is stationary for each $i \in \mathcal{Q}$. Intuitively, in lieu of one sequence, we have $\Theta$ sequences $\{f_n(s, i) : i \in \mathcal{Q}\}$ for each feasible decision $s \in \mathcal{S}$. One may imagine that there is a random environment that is dictated by the process $\theta[n]$. If $\theta[n]$ takes a value, say, $i$, at time $n$, then $f_n(s, i)$ will be the sequence of data output by simulation.

The following assumption is made on the sequence of data obtained from the simulation experiments.

Assumption 5.1. Let $E_\ell$ denote the conditional expectation given $\mathcal{F}_\ell$, the $\sigma$-algebra generated by $\{f_n(s, i), \theta[n] : i \in \mathcal{Q}, s \in \mathcal{S}, n < \ell\}$. Then, for each $s \in \mathcal{S}$ and for all $\theta \in \mathcal{Q}$:
(i) $\{f_n(s, \theta)\}$ is a bounded, stationary, and real-valued sequence of random variables.
(ii) For any $\ell \ge 0$, there exists $F(s, \theta)$ such that
\[ \frac{1}{N} \sum_{n=\ell}^{N+\ell-1} E_\ell\{ f_n(s, \theta) \} \to F(s, \theta) \quad \text{in probability as } N \to \infty. \tag{5.4} \]

The above condition allows correlation in the simulation data, which is more realistic in practice. The result follows from the mixing condition with appropriate mixing rates. Examples of such processes include: sequences of i.i.d.
random variables with bounded variance, martingale difference sequences with finite second moments, moving average processes driven by a martingale difference sequence, mixing sequences in which remote past and distant future are asymptotically independent, and certain non-stationary sequences such as a function of a Markov chain, etc. (cf. [155, 286]).

Since the sequence $\{\theta[n]\}$ cannot be observed, we suppress the dependence in (5.3) on $\theta[n]$, and keep using $f_n(s)$ in the rest of this chapter to denote the objective function value at feasible solution $s$ realized at time $n$.

5.3 Regret-Based Adaptive Randomized Search

This section introduces a class of stochastic approximation algorithms that, relying on potential-based sampling strategies [39, 142, 145], prescribes how to sample from the search space so as to efficiently learn and track the non-stationary global optimum. To this end, Section 5.3.1 defines the class of potential-based sampling strategies and elaborates on its distinct properties. Section 5.3.2 then presents the proposed regret-based adaptive search algorithm.

5.3.1 Potential-Based Sampling Strategy

Recall from Section 4.4.1 that a randomized search algorithm repeatedly takes samples from the search space according to a randomized strategy, which is simply a probability distribution over the search space. These samples are then evaluated via real-time simulation, and the results are used to revise the sampling strategy and the estimate of the global optimum. In this section, we introduce a new class of sampling strategies, namely, the potential-based sampling strategy, which is inspired by learning algorithms in games [139, 145] and relies on a simple and intuitive regret measure. This regret measure differs from that introduced in Section 4.4.1, which quantified the tracking capability of the adaptive search scheme in localizing the global optimum.
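As a rough illustration of the randomized search loop just recalled, the following Python sketch samples a candidate from the current strategy, simulates it, and revises the strategy. The names `randomized_search`, `simulate`, and `update_strategy` are hypothetical placeholders that do not appear in the thesis, and the uniform initial strategy is an assumption; the sketch only fixes the order of operations and deliberately leaves the update rule open.

```python
import numpy as np


def randomized_search(simulate, update_strategy, num_candidates, num_iters, seed=0):
    """Generic randomized search skeleton (illustrative only, not Algorithm 5.1).

    simulate(s, rng) returns one noisy objective value for candidate s.
    update_strategy(strategy, s, value) returns the revised probability vector.
    """
    rng = np.random.default_rng(seed)
    # Assumption: start from the uniform distribution over the search space.
    strategy = np.full(num_candidates, 1.0 / num_candidates)
    counts = np.zeros(num_candidates)
    for _ in range(num_iters):
        s = rng.choice(num_candidates, p=strategy)       # sample a candidate
        value = simulate(s, rng)                         # one simulation experiment
        strategy = update_strategy(strategy, s, value)   # revise the strategy
        counts[s] += 1
    frequencies = counts / num_iters
    # The most frequently sampled candidate serves as the optimum estimate.
    return strategy, frequencies, int(np.argmax(frequencies))
```

Any rule that maps the current strategy, the sampled candidate, and the simulated value to a new probability vector can be plugged in as `update_strategy`; the potential-based strategy defined next is one such rule.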
Let
\[ \mathbb{R}^S_- = \{ x \in \mathbb{R}^S : x < 0 \} \]
represent the negative orthant in the $S$-dimensional Euclidean space, and let $\mathbb{R}_+$ denote the set of non-negative reals. Let further $x^+ = \max\{x, 0\}$, where the maximum is taken element-wise, let $\|\cdot\|_1$ and $\|\cdot\|$ denote the $L_1$ and $L_2$ norms, and let $\mathbf{1}_S$ represent a vector of size $S$ with all elements being equal to one. The potential-based sampling strategy is then defined as follows:

Definition 5.1. Let $P : \mathbb{R}^S \to \mathbb{R}_+$ be a potential function with the following properties:
1. $P$ is continuously differentiable and $P(x) \to \infty$ as $\|x^+\|_2 \to \infty$;
2. $P(x) = 0$ if and only if $x \in \mathbb{R}^S_-$;
3. $\nabla P(x) \ge 0$;
4. $\nabla P(x) \cdot x \ge C \|\nabla P(x)\|_1 \|x^+\|_1 > 0$ for all $x \notin \mathbb{R}^S_-$ and some constant $C > 0$.
Given a regret vector
\[ r = (r_1, \ldots, r_S) \in \mathbb{R}^S \]
the potential-based sampling strategy is then defined as
\[ p(r) = (1 - \delta) \frac{\nabla P(r)}{\sum_{i=1}^{S} \nabla_i P(r)} + \frac{\delta}{S} \mathbf{1}_S, \quad \text{where } \nabla_i P(r) = \frac{\partial P(r)}{\partial r_i}, \quad 0 < \delta < 1. \tag{5.5} \]

We shall now proceed to formally define the regret vector and describe its update mechanism. The regret is defined as the opportunity loss by a column vector
\[ r[n] = \mathrm{col}(r_1[n], \ldots, r_S[n]) \in \mathbb{R}^S. \]
Each element $r_i[n]$ evaluates the increase in average performance were one to replace the sequence of past (up to time $n$) samples by a fixed element $i$ from the search space. Let $s[\tau]$ denote the feasible solution sampled at time $\tau$ according to the potential-based sampling strategy $p(r[\tau])$. Then, each element $r_i[n]$ is formally defined by
\[ r_i[n] = (1 - \mu)^{n-1} \left[ f_1(s[1]) - \frac{f_1(s[1])}{p_i(r[1])} \cdot I(s[1] = i) \right] + \mu \sum_{\tau=2}^{n} (1 - \mu)^{n-\tau} \left[ f_\tau(s[\tau]) - \frac{f_\tau(s[\tau])}{p_i(r[\tau])} \cdot I(s[\tau] = i) \right] \tag{5.6} \]
where $I(X)$ denotes the indicator function: $I(X) = 1$ if statement $X$ is true, and $I(X) = 0$ otherwise. The normalization factor $1/p_i(r[\tau])$, intuitively speaking, makes the length of periods that each element $i$ has been sampled comparable to the total time elapsed. The discount factor $0 < \mu \ll 1$ is further a small parameter that places more weight on more recent simulations and
is necessary as the algorithm is deemed to track time-varying global optima. It can be easily verified that $r_i[n]$ in (5.6) can be updated recursively via
\[ r_i[n+1] = r_i[n] + \mu \left[ f_n(s[n]) - \frac{f_n(s[n])}{p_i(r[n])} \cdot I(s[n] = i) - r_i[n] \right]. \tag{5.7} \]
The regrets are updated relying only on the simulation data $f_n(s[n])$; therefore, neither knowledge of the objective function nor the realizations of $\{\theta[n]\}$ are required. By the above construction, a positive regret $r_i[n] > 0$ implies the opportunity to gain by more frequently sampling the candidate solution $i$ in the future.

The potential-based sampling strategy (5.5) can now be interpreted as follows: With probability $1 - \delta$, the sample is taken according to probabilities proportional to regrets. With the remaining probability $\delta$, the sample is chosen according to a uniform distribution on the search space. The strategy (5.5) exhibits better-response behavior as it places positive probability on all feasible solutions with positive regrets. The more positive the regret associated with a feasible solution, the higher will be the probability allocated to sampling that element in the future. Surprisingly, this better-response behavior is an appealing feature since it circumvents the discontinuity inherent in the strategies of best-response type, where a small change in the regrets may lead to an abrupt change in behavior. Such discontinuities complicate the convergence analysis. The exploration, represented by the parameter $\delta$, is further essential as the algorithm continuously learns the profile of the stochastic events and its evolution through the simulation experiments.

In light of the above discussion, the potential-based sampling strategy in (5.5) differs from the smooth best-response sampling strategy in Section 4.4.1 in three aspects:

1. The potential-based sampling strategy places positive probability on all plausible candidates and, hence, is of better-reply type, whereas the best-response strategy keeps sampling the most plausible candidate given the limited conception of the system performance at different candidates and, hence, is of best-response type.
2. The smooth best-response strategy exhibits exploration by adding random values to the beliefs. In contrast, the potential-based strategy implements exploration by uniformly randomizing among all candidate solutions with a small probability.
3. The smooth best-response strategy circumvents the discontinuity inherent in strategies of best-response type, which are perhaps the most desirable strategies, by introducing random disturbances to the beliefs. The potential-based strategy, in contrast, avoids such discontinuities (which complicate the convergence analysis) by constructing a better-response sampling strategy.

Example 5.1. Let $\|\cdot\|_p$ denote the $L_p$-norm. A typical example for the potential function is
\[ P(r) = \left( \|r^+\|_p \right)^p = \sum_{i=1}^{S} (r_i^+)^p, \quad 1 < p < \infty \]
which results in the following potential-based sampling strategy
\[ p(r[n]) = \big( p_1(r[n]), \ldots, p_S(r[n]) \big), \quad \text{where } p_i(r[n]) = (1 - \delta) \frac{(r_i^+[n])^{p-1}}{\sum_{j=1}^{S} (r_j^+[n])^{p-1}} + \frac{\delta}{S}. \tag{5.8} \]
For $p = 2$, such a strategy is widely known as unconditional modified-regret-matching [139, 143] in the context of learning in games. It further draws analogies with Blackwell's approachability framework [1, 47].

5.3.2 Adaptive Search Algorithm

Although different in the nature of the sampling prescription, the regret-based adaptive search algorithm is similar to the smooth best-response adaptive search in that both algorithms simply implement an adaptive sampling scheme. Relying on the regrets experienced so far, the potential-based strategy prescribes how to sample from the search space. Once the sample is taken, the objective function is evaluated via real-time simulation.
Finally, the new simulation data is fed into a stochastic approximation algorithm to update the regrets and, accordingly, the sampling strategy. Similar to Algorithm 4.1, the regret-based adaptive search algorithm relies only on the simulation data and requires minimal computational resources per iteration: It needs only one simulation experiment per iteration as compared to two in [9], or $S$ (the size of the search space) in [235, Chapter 5.3]. Yet, as evidenced by the numerical evaluations in Section 5.7, it achieves performance gains in terms of the tracking speed.

Let $\mathbf{1}_S$ and $\mathbf{0}_S$ represent column vectors of ones and zeros of size $S$, respectively. The regret-based adaptive search algorithm is summarized below in Algorithm 5.1.

Algorithm 5.1 Regret-Based Adaptive Search
Initialization: Choose a potential function $P$ according to the conditions in Definition 5.1. Set the exploration parameter $0 < \delta < 1$ and the adaptation rate $0 < \mu \ll 1$. Initialize $p[0] = \frac{1}{S} \cdot \mathbf{1}_S$ and $r[0] = \mathbf{0}_S$.
Step 1: Sampling. Select $s[n] \sim p(r[n])$:
\[ p(r[n]) = (1 - \delta) \frac{\nabla P(r)|_{r = r[n]}}{\sum_{i=1}^{S} \nabla_i P(r)|_{r = r[n]}} + \frac{\delta}{S} \mathbf{1}_S, \quad \text{where } \nabla_i P(r) = \frac{\partial P(r)}{\partial r_i}. \tag{5.9} \]
Step 2: Evaluation. Perform a simulation experiment to obtain $f_n(s[n])$.
Step 3: Regret Update.
\[ r[n+1] = r[n] + \mu \big[ g(s[n], r[n], f_n(s[n])) - r[n] \big], \quad \text{where } g = \mathrm{col}(g_1, \ldots, g_S), \quad g_i(s[n], r[n], f_n(s[n])) = f_n(s[n]) - \frac{f_n(i)}{p_i(r[n])} \cdot I(s[n] = i). \tag{5.10} \]
Recursion. Set $n \leftarrow n + 1$ and go to Step 1.

Remark 5.1. The choice of the step-size $\mu$ is of critical importance. Ideally, one would want $\mu$ to be small when the belief is close to the vector of true objective values, and large otherwise. Choosing the best $\mu$ is, however, not straightforward as it depends on the dynamics of the time-varying parameters (which are unknown). One enticing solution is to use a time-varying step-size $\mu[n]$, and an algorithm which recursively updates it as the underlying parameters evolve. The updating algorithm for $\mu[n]$ will in fact be a separate stochastic approximation algorithm which works in parallel with the adaptive search algorithm. Such adaptive step-size stochastic approximation algorithms are studied in [40, 191] and [192, Section 3.2], and are out of the scope of this chapter.

Remark 5.2. We shall emphasize that Algorithm 5.1 assumes no knowledge of the hyper-model $\theta[n]$ nor its dynamics. The dynamics of $\theta[n]$ are indeed used only in the tracking analysis in Section 5.5. If $\theta[n]$ were observed, one could form and update regret vectors $r^\theta[n]$ independently for each $\theta \in \mathcal{Q}$, and use $p(r^{\theta'}[n])$ to select $s[n]$ once the system switched to $\theta'$.

5.4 Main Result: Tracking Non-Stationary Global Optima

This section is devoted to characterizing asymptotic properties of the regret-based adaptive search algorithm. Before proceeding, it has to be emphasized that the regret measure in this chapter differs from that in Chapter 4. The regret in this chapter is a vector that compares the (discounted) average realized objective function for a particular sequence of samples with that which would have been obtained if a fixed candidate had been chosen for all past periods. The regret measure in Chapter 4, in contrast, compares the performance of an algorithm, selecting among several candidates, to the performance of the best of those alternatives. The regret measure in Chapter 4 is thus the worst case regret. The two regret measures, however, relate to each other in that the worst case regret $r[n]$ in Chapter 4 is the maximizing element of the regret vector $r[n]$.

A typical method for analyzing the performance of an adaptive algorithm is to postulate a hyper-model for the underlying time variations [40]. Here, similar to Chapter 4, we assume that $\theta[n]$ is a discrete-time Markov chain with infrequent jumps. For completeness, we repeat the following assumption from Section 4.5.2 that characterizes the Markov hyper-model.

Assumption 5.2. Let $\{\theta[n]\}$ be a discrete-time Markov chain with finite state space $\mathcal{Q} =$
$\{1, 2, \ldots, \Theta\}$ and transition probability matrix
\[ P^\varepsilon = I_\Theta + \varepsilon Q. \tag{5.11} \]
Here, $\varepsilon > 0$ is a small parameter, $I_\Theta$ denotes the $\Theta \times \Theta$ identity matrix, and $Q = [q_{ij}]$ is irreducible and represents the generator of a continuous-time Markov chain satisfying
\[ q_{ij} \ge 0 \ \text{for } i \ne j, \qquad |q_{ij}| \le 1 \ \text{for all } i, j \in \mathcal{Q}, \qquad Q \mathbf{1}_\Theta = \mathbf{0}_\Theta. \tag{5.12} \]

The interested reader is referred to Section 4.5.2 for further details and discussion on the above assumption. The small step-size $\mu$ in (5.10) defines how fast the regret vector is updated. The impact of the switching rate of the hyper-model $\theta[n]$ on the convergence properties of Algorithm 5.1 is captured by the relationship between $\mu$ and $\varepsilon$ in the transition probability matrix (5.11). This chapter focuses on the case where the time-varying parameters underlying the discrete stochastic optimization problem evolve on the same timescale as that determined by the step-size $\mu$ of Algorithm 5.1. This is formalized by the following assumption.

Assumption 5.3. $\varepsilon = O(\mu)$ in the transition probability matrix $P^\varepsilon$ in (5.11).

Similar to Chapter 4, we use the ordinary differential equation (ODE) method [50, 188, 192] to analyze the tracking capability of Algorithm 5.1. Thus, in lieu of working with the discrete iterates directly, we examine the piecewise constant interpolation of the iterates $r[n]$, which is defined as follows:
\[ r^\mu(t) = r[n] \quad \text{for } t \in [n\mu, (n+1)\mu). \tag{5.13} \]
This will enable us to get a limit ODE that represents the asymptotic dynamics of the discrete-time iterates as $\mu \to 0$. Then, asymptotic stability of the set of global attractors of the limit ODE yields the desired tracking result. These details are, however, relegated to a later section which provides the detailed proofs.

Let $x^+ := \max\{x, 0\}$, where the maximum is taken element-wise. The following theorem presents the main result and implies that the sequence of samples generated via Algorithm 5.1 most frequently visits the global optimum and tracks its time variations.

Theorem 5.1.
Suppose Assumptions 5.1–5.3 hold. Let $\{t_\mu\}$ be any sequence of real numbers satisfying $t_\mu \to \infty$ as $\mu \to 0$. Then, for any $\eta > 0$, there exists $\delta(\eta) > 0$ such that, if $0 < \delta \le \delta(\eta)$ in (5.9), then $(r^\mu(\cdot + t_\mu) - \eta \mathbf{1}_S)^+$ converges in probability to 0 as $\mu \to 0$. That is, for any $\beta > 0$:
\[ \lim_{\mu \to 0} P\left( \left\| (r^\mu(\cdot + t_\mu) - \eta \mathbf{1}_S)^+ \right\| > \beta \right) = 0 \]
where $\|\cdot\|$ denotes the Euclidean norm.

The above theorem evidences the tracking capability of Algorithm 5.1 by showing that
\[ \max_{i \in \mathcal{S}} r_i[n] \le \eta \quad \text{infinitely often as } \mu \to 0, \ n \to \infty. \]
The maximizing element of the regret vector $r[n]$ simply gives the difference between the (discounted) average performance realized via employing Algorithm 5.1 and the global minimum; therefore, it quantifies the tracking capability of the proposed potential-based sampling strategy. Clearly, ensuring a smaller worst case regret requires sampling the global optimizer more frequently. Put differently, the higher the difference between the objective value of a feasible solution and the global optimum, the lower must be the empirical frequency of sampling that solution to maintain the regret below a certain small value. Here, $r^\mu(\cdot + t_\mu)$ looks at the asymptotic behavior of $r[n]$. The requirement $t_\mu \to \infty$ as $\mu \to 0$ means that we look at $r[n]$ for a small $\mu$ and large $n$ with $\mu n \to \infty$. For a small $\mu$, $r[n]$ eventually spends nearly all of its time (with an arbitrarily high probability) in an $\alpha$-neighborhood of the set $\{r \in \mathbb{R}^S : (r)^+ \le \eta \mathbf{1}_S\}$, where $\alpha \to 0$ as $\mu \to 0$. Note that, when $\mu$ is small, $r[n]$ may escape from such an $\alpha$-neighborhood. However, if such an escape ever occurs, it will be a "large deviations" phenomenon, that is, a rare event. The order of the escape time (if it ever occurs) is often of the form $\exp(c/\mu)$ for some $c > 0$; see [192, Section 7.2] for details.
Therefore, given a fixed $\eta$, the larger the difference $F(i, \theta[n]) - F_{\min}(\theta[n])$, the lower will be the frequency of sampling and simulating candidate $i$.

Note that, similar to Algorithm 4.1, to allow adaptivity to the time variations, Algorithm 5.1 samples each non-optimal solution with small but positive probability. Thus, one would not expect the sequence of samples to eventually spend all its time in the global optima set before it undergoes a jump. Instead, the sampling strategy implemented by Algorithm 5.1 ensures that the empirical frequency of sampling non-optimal elements stays very low.

Remark 5.3. In the traditional analysis of adaptive filtering algorithms, one typically works with a time-varying parameter whose trajectories change slowly but continuously. In comparison, here, we allow the parameters underlying the stochastic system to undergo jump changes with possibly large jump sizes. If such jumps occur on a faster timescale (an order of magnitude faster) than the adaptation rate $\mu$ of (5.10), then Algorithm 5.1 will only be able to find the system configuration that optimizes the performance of the "mean system" dictated by the stationary distribution of the integer-valued process $\theta[n]$.

5.5 Analysis of the Regret-Based Adaptive Search Algorithm

This section is devoted to both asymptotic and non-asymptotic analysis of the regret-based adaptive search algorithm, and is organized into three subsections: Section 5.5.1 characterizes the limit system associated with the discrete-time iterates of Algorithm 5.1. Next, Section 5.5.2 proves that such a limit system is globally asymptotically stable with probability one and characterizes its global attractors. The analysis up to this point considers $\mu$ small and $n$ large, but $\mu n$ remains bounded. Finally, to obtain the result presented in Theorem 5.1, we let $\mu n$ go to infinity and conclude asymptotic stability of the interpolated process associated with the regret measure in Section 5.5.3.
For better readability, details for each step of the proof are relegated to Section 5.9. Certain technical details are also passed over; however, adequate citations are provided for the interested reader.

5.5.1 Limit System Characterization as a Switching ODE

The first step involves using weak convergence methods to derive the limit dynamical system associated with the iterates of the regret vector $r[n]$. Recall the definition of weak convergence from Definition 4.2. Further, recall the interpolated process $r^\mu(\cdot)$ associated with the regret vector iterates from (5.13), and similarly define
\[ \theta^\mu(t) = \theta[n] \quad \text{for } t \in [n\mu, (n+1)\mu). \]
Let further
\[ \mathbf{F}(\theta) = \big( F(1, \theta), \ldots, F(S, \theta) \big) \tag{5.14} \]
where $F(s, \theta)$ is defined in (5.4). The following theorem characterizes the limit dynamical system representing the stochastic approximation iterates $r[n]$ as a Markovian switched ODE.

Theorem 5.2. Suppose Assumptions 5.1–5.3 hold. Then, as $\mu \to 0$, the interpolated pair process $(r^\mu(\cdot), \theta^\mu(\cdot))$ converges weakly to $(r(\cdot), \theta(\cdot))$ that is a solution of
\[ \frac{dr}{dt} = \big[ p(r) \cdot \mathbf{F}(\theta(t)) \big] \mathbf{1}_S - \mathbf{F}(\theta(t)) - r \tag{5.15} \]
where $p(r)$ is defined in (5.5), and $\theta(t)$ denotes a continuous-time Markov chain with generator $Q$ as in Assumption 5.2.

Proof. See Section 5.9.

The above theorem asserts that the asymptotic behavior of Algorithm 5.1 can be captured by an ODE modulated by a continuous-time Markov chain $\theta(t)$. At any given instant, the Markov chain dictates which regime the system belongs to, and the system then follows the corresponding ODE until the modulating Markov chain jumps into a new state. This explains the regime switching nature of the system under consideration.

5.5.2 Stability Analysis of the Limit System

We next proceed to analyze stability and characterize the set of global attractors of the limit dynamical system (5.15). It will be shown that the limit system guarantees an asymptotic worst
case regret of at most $\eta$.

We break down the stability analysis into two steps: First, we examine the stability of the deterministic ODE (5.15) when $\theta(t) = \theta$ is held fixed. The set of global attractors is shown to comprise
\[ \mathcal{R}_\eta = \{ r \in \mathbb{R}^S : \|r^+\|_1 \le \eta \}, \quad \text{where } r^+ = \max\{r, 0\} \tag{5.16} \]
for all $\theta \in \mathcal{Q}$. The slow switching condition then allows us to apply the method of multiple Lyapunov functions [197, Chapter 3] to analyze stability of the switching system. Before proceeding to the theorem, recall almost sure asymptotic stability from Definition 4.3.

Theorem 5.3. Consider the limit Markovian switched dynamical system in (5.15). Let $r(0) = r_0$ and $\theta(0) = \theta_0$. For any $\eta > 0$, there exists $\delta(\eta) > 0$ such that, if $0 < \delta \le \delta(\eta)$ in (5.9), the following results hold:
1. If $\theta(t) = \theta$ is held fixed, $\mathcal{R}_\eta$ is globally asymptotically stable for the deterministic limit system for all $\theta \in \mathcal{Q}$. Furthermore,
\[ \lim_{t \to \infty} \mathrm{dist}[r(t), \mathcal{R}_\eta] = 0 \tag{5.17} \]
where $\mathrm{dist}[\cdot, \cdot]$ denotes the usual distance function.
2. For the Markovian switched system, the set $\mathcal{R}_\eta \times \mathcal{Q}$ is globally asymptotically stable almost surely. In particular,
a) $P\big( \forall \epsilon > 0, \ \exists \gamma(\epsilon) > 0 \ \text{s.t.} \ \mathrm{dist}[r_0, \mathcal{R}_\eta] \le \gamma(\epsilon) \Rightarrow \sup_{t \ge 0} \mathrm{dist}[r(t), \mathcal{R}_\eta] < \epsilon \big) = 1$
b) $P\big( \forall \gamma, \epsilon' > 0, \ \exists T(\gamma, \epsilon') \ge 0 \ \text{s.t.} \ \mathrm{dist}[r_0, \mathcal{R}_\eta] \le \gamma \Rightarrow \sup_{t \ge T(\gamma, \epsilon')} \mathrm{dist}[r(t), \mathcal{R}_\eta] < \epsilon' \big) = 1$.

Proof. See Section 5.9.

The above theorem states that the set of global attractors of the Markovian switched ODE (5.15) is the same as that for all deterministic ODEs, obtained by holding $\theta(t) = \theta$, and guarantees a worst case regret of at most $\eta$. This sets the stage for Section 5.5.3, which combines the above result with Theorem 5.2 to conclude the desired tracking result in Theorem 5.1.

5.5.3 Asymptotic Stability of the Interpolated Processes

This section completes the proof of Theorem 5.1 by looking at the asymptotic stability of the interpolated process $r^\mu(\cdot)$. In Theorem 5.2, we considered $\mu$ small and $n$ large, but $\mu n$ remained bounded.
This gives a limit switched ODE for the sequence of interest as $\mu \to 0$. Next, we study asymptotic stability and establish that the limit points of the switched ODE and the stochastic approximation algorithm coincide as $t \to \infty$. We thus consider the case where $\mu \to 0$ and $n \to \infty$, but now with $\mu n \to \infty$. Nevertheless, instead of considering a two-stage limit by first letting $\mu \to 0$ and then $t \to \infty$, we study $r^\mu(t + t_\mu)$ and require $t_\mu \to \infty$ as $\mu \to 0$. The following corollary concerns asymptotic stability of the interpolated process, and asserts that the sum of the positive elements of $r^\mu(\cdot)$ will asymptotically be less than $\eta$.

Corollary 5.1. Let $\{t_\mu\}$ be any sequence of real numbers satisfying $t_\mu \to \infty$ as $\mu \to 0$. Assume $\{r[n] : \mu > 0, n < \infty\}$ is tight or bounded in probability. Then, for each $\eta > 0$, there exists $\delta(\eta) > 0$ such that if $0 < \delta \le \delta(\eta)$ in (5.9),
\[ r^\mu(\cdot + t_\mu) \to \mathcal{R}_\eta \quad \text{in probability as } \mu \to 0. \]

Proof. See Section 5.9.

5.6 The Case of Slow Random Time Variations

This section is devoted to the analysis of the adaptive search scheme when the random evolution of the parameters underlying the discrete stochastic optimization problem occurs on a timescale that is much slower than the adaptation rate of the adaptive search algorithm with step-size $\mu$. More precisely, we replace Assumption 5.3 with the following assumption.

Assumption 5.4. $0 < \varepsilon \ll \mu$ in the transition probability matrix $P^\varepsilon$ in Assumption 5.2.

Before proceeding further, we briefly recall the empirical sampling distribution $z[n]$ from Section 4.6, which quantifies efficiency of the regret-based adaptive search algorithm [235]. It is defined by a vector
\[ z[n] = (z_1[n], \ldots, z_S[n]) \in \mathbb{R}^S \]
where each element $z_i[n]$ gives the (discounted) frequency of sampling and simulating candidate solution $i$ up to time $n$. It is updated recursively via
\[ z[n+1] = z[n] + \mu \big[ e_{s[n]} - z[n] \big] \tag{5.18} \]
where $e_i \in \mathbb{R}^S$ denotes the unit vector with the $i$-th element being equal to one.
Further, $0 < \mu \ll 1$ is a small parameter that introduces an exponential forgetting of the past sampling frequencies and facilitates tracking the evolution of the global optimum. It is the same as the step-size of the stochastic approximation recursion used for updating the regret vector in (5.10). The empirical sampling distribution $z[n]$ has two functions:
1. At each time $n$, the global optimum is estimated by the index of the maximizing component of $z[n]$.
2. The maximizing component itself then quantifies the efficiency of the potential-based sampling scheme.

Assumption 5.4 introduces a different timescale to Algorithm 5.1, which leads to an asymptotic behavior that is fundamentally different compared with the case $\varepsilon = O(\mu)$ that we analyzed in Section 5.5. Under Assumption 5.4, $\theta(\cdot)$ is the slow component and $r(\cdot)$ is the fast transient in (5.15). It is well known that, in such two-timescale systems, the slow component is quasi-static (it remains almost constant) while analyzing the behavior of the fast timescale. The weak convergence argument then shows that $(r^\mu(\cdot), \theta^\mu(\cdot))$ converges weakly to $(r(\cdot), \theta)$ as $\mu \to 0$ such that the limit $r(\cdot)$ is a solution to
\[ \frac{dr}{dt} = \big[ p(r) \cdot \mathbf{F}(\theta) \big] \mathbf{1}_S - \mathbf{F}(\theta) - r \tag{5.19} \]
and $p(r)$ is defined in (5.5). The first part in Theorem 5.3 shows that the set $\mathcal{R}_\eta$, defined in (5.16), is globally asymptotically stable for all $\theta \in \mathcal{Q}$. The tracking result in Theorem 5.1 then follows directly from Corollary 5.1.

Subsequently, we use this result to obtain an efficiency argument for the regret-based adaptive search scheme in Algorithm 5.1. Define the interpolated process associated with the empirical sampling distribution as
\[ z^\mu(t) = z[n] \quad \text{for } t \in [n\mu, (n+1)\mu). \]
Let
\[ F_{\min}(\theta) := \min_{i \in \mathcal{S}} F(i, \theta) \tag{5.20} \]
where $F(i, \theta)$ is defined in (5.4). Let further
\[ \Upsilon_\eta(\theta) = \Big\{ \pi \in \Delta\mathcal{S} ; \ \sum_{i \in \mathcal{S}} \pi_i \big[ F(i, \theta) - F_{\min}(\theta) \big] \le \eta \Big\} \tag{5.21} \]
denote the set of all sampling distributions that guarantee a worst case regret of at most $\eta$ when the hyper-model $\theta[n] = \theta$ is held fixed.
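The recursion (5.18) and the membership condition in (5.21) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `update_sampling_frequency` and `in_upsilon` are hypothetical, candidates are indexed from 0, and the vector `F` of true objective values $F(i, \theta)$ is taken as known, which is precisely what the adaptive search algorithm does not have access to. The check also verifies that the argument lies on the probability simplex.

```python
import numpy as np


def update_sampling_frequency(z, s, mu):
    """One step of (5.18): z[n+1] = z[n] + mu * (e_{s[n]} - z[n])."""
    e = np.zeros_like(z, dtype=float)
    e[s] = 1.0  # unit vector selecting the sampled candidate
    return z + mu * (e - z)


def in_upsilon(pi, F, eta, tol=1e-12):
    """Check whether pi satisfies the condition in (5.21) for known values F."""
    pi = np.asarray(pi, dtype=float)
    F = np.asarray(F, dtype=float)
    # pi must be a probability vector (non-negative, summing to one).
    on_simplex = bool(np.all(pi >= -tol)) and abs(pi.sum() - 1.0) <= 1e-9
    # Expected excess cost over the global minimum under pi.
    excess = float(pi @ (F - F.min()))
    return on_simplex and excess <= eta


z = update_sampling_frequency(np.array([0.5, 0.5]), s=0, mu=0.1)
print(z)                                             # weight shifts toward index 0
print(in_upsilon([0.9, 0.1], [1.0, 2.0], eta=0.2))   # excess 0.1, within eta
print(in_upsilon([0.5, 0.5], [1.0, 2.0], eta=0.2))   # excess 0.5, exceeds eta
```

The update preserves the sum of `z`, so a vector that starts on the simplex stays there, and a distribution concentrating more mass on the minimizer has a smaller excess cost.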
In (5.21),
\[ \Delta\mathcal{S} = \Big\{ \pi \in \mathbb{R}^S ; \ \pi_i \ge 0, \ \sum_{i \in \mathcal{S}} \pi_i = 1 \Big\} \]
denotes the set of all possible sampling strategies. To obtain the desired efficiency result, we show that $z^\mu(t)$ converges to the set $\Upsilon_\eta(\theta)$ as $t \to \infty$.

In contrast to the assumption made throughout this chapter, suppose the realizations of the stochastic input $x_n(s, \theta[n])$ can be observed at each time $n$. Since the hyper-model is unobservable, we suppress it and use $x_n(s)$ to denote $x_n(s, \theta[n])$. Assumption 5.2 implies that $\{x_n(s)\}$ is stationary when $\theta[n] = \theta$ is held fixed. Suppose further that the explicit relation between the objective function, the decision variable, and the stochastic elements in the system is known at each time $n$. Let $F_n(\cdot, \cdot) = F(\cdot, \cdot, \theta[n])$. Accordingly, suppose $F_n(\cdot, \cdot)$ is provided at each time $n$. To update regrets, one can then employ a stochastic approximation algorithm of the form
\[ \hat{r}[n+1] = \hat{r}[n] + \mu \big[ \hat{g}_n(s[n], x_n(s[n])) - \hat{r}[n] \big], \quad \text{where } \hat{g}_n = \mathrm{col}(\hat{g}_{1,n}, \ldots, \hat{g}_{S,n}), \quad \hat{g}_{i,n} = F_n(s[n], x_n(s[n])) - F_n(i, x_n(i)) \, I(s[n] = i). \tag{5.22} \]
The samples $s[n]$ are accordingly taken using the potential-based strategy $p(\hat{r}[n])$. Following the same approach as in the proof of Theorem 5.2, one can then show that the limit system associated with the piecewise constant interpolated process $\hat{r}^\mu(t)$, defined as
\[ \hat{r}^\mu(t) = \hat{r}[n] \quad \text{for } t \in [n\mu, (n+1)\mu), \]
is the same as (5.15), and follows
\[ \frac{d\hat{r}}{dt} = \big[ p(\hat{r}) \cdot \mathbf{F}(\theta(t)) \big] \mathbf{1}_S - \mathbf{F}(\theta(t)) - \hat{r}. \tag{5.23} \]
Comparing (5.23) with (5.19), we observe that $r[n]$ and $\hat{r}[n]$ are stochastic approximations of the same ODE. This property allows us to invoke the following theorem, which proves that the potential-based strategy induces the same asymptotic behavior for the processes $r[n]$ and $\hat{r}[n]$.

Theorem 5.4. Using the same potential-based strategy $p(r[n])$, the limit sets of the processes $\{r[n]\}$, defined in (5.10), and $\{\hat{r}[n]\}$, defined in (5.22), coincide.

Proof.
See Section 5.9.

In view of (5.6) and Assumption 5.1, for any $\theta \in \mathcal{Q}$, we have
\[ \hat{r}^\mu_i(t) = (1 - \mu)^{n-1} \big[ F(s[1], x_1(s[1]), \theta) - F(i, x_1(i), \theta) \cdot I(s[1] = i) \big] + \mu \sum_{2 \le \tau \le n} (1 - \mu)^{n-\tau} \big[ F(s[\tau], x_\tau(s[\tau]), \theta) - F(i, x_\tau(i), \theta) \cdot I(s[\tau] = i) \big] \to \sum_{j=1}^{S} z^\mu_j(t) \big[ F(j, \theta) - F(i, \theta) \big] \quad \text{as } t \to \infty. \tag{5.24} \]
On any convergent subsequence $\{z[n']\}_{n' \ge 0} \to \pi(\theta)$, with slight abuse of notation, let
\[ z^\mu(t) = z[n'] \quad \text{and} \quad r^\mu(t) = r[n'] \quad \text{for } t \in [n'\mu, n'\mu + \mu). \]
This, together with (5.24), yields
\[ \max_{i \in \mathcal{S}} \hat{r}^\mu_i(\cdot + t_\mu) \to \sum_{i=1}^{S} \pi_i(\theta) \big[ F(i, \theta) - F_{\min}(\theta) \big] \tag{5.25} \]
since $t_\mu \to \infty$ as $\mu \to 0$. Finally, comparing (5.25) with (5.21) concludes that, for each $\theta \in \mathcal{Q}$,
\[ z^\mu(\cdot + t_\mu) \to \Upsilon_\eta(\theta) \quad \text{if and only if} \quad \max_i r^\mu_i(\cdot + t_\mu) \le \eta \quad \text{as } \mu \to 0. \]
This in turn implies that the maximizing element of the empirical sampling distribution $z^\mu(\cdot + t_\mu)$ represents the global optimum. The following corollary summarizes this result.

Corollary 5.2. Denote the most frequently sampled feasible decision by
\[ s_{\max}[n] = \operatorname*{argmax}_{i \in \mathcal{S}} z_i[n] \]
where $z_i[n]$ is the $i$-th component of $z[n]$, defined in (5.18). Further, define the continuous-time interpolated sequence
\[ s^\mu_{\max}(t) = s_{\max}[n] \quad \text{for } t \in [n\mu, (n+1)\mu). \]
Then, under Assumptions 5.1, 5.2 and 5.4, and for sufficiently small $\eta$, there exists $\delta(\eta)$ such that if $0 < \delta < \delta(\eta)$ in (5.9), $s^\mu_{\max}(\cdot + t_\mu)$ converges in probability to the global optima set $\mathcal{M}(\theta)$ for each $\theta \in \mathcal{Q}$. That is, for any $\beta > 0$,
\[ \lim_{\mu \to 0} P\big( \mathrm{dist}[s^\mu_{\max}(\cdot + t_\mu), \mathcal{M}(\theta)] \ge \beta \big) = 0 \tag{5.26} \]
where $\mathrm{dist}[\cdot, \cdot]$ denotes the usual distance function.

The above corollary simply asserts that, for a small $\mu$ and for any $\theta \in \mathcal{Q}$, the most frequently sampled element, with a high probability, eventually spends nearly all its time in the global optima set.

5.7 Numerical Study

This section illustrates the performance of the proposed algorithm using two examples: Section 5.7.1 concerns finding the demand size for a particular product which occurs with the highest probability, and is borrowed from [9, 11]. Section 5.7.2 considers a classical $(s, S)$ inventory restocking policy problem [182, 236].
Before proceeding with these examples, we describe three existing schemes in the literature with which we wish to compare our results. These algorithms are proved to be globally convergent under no structural assumptions on the objective function.

1. Random Search (RS) [9, 290]. RS is a modified hill-climbing algorithm, and is summarized in Algorithm 4.2 in Section 4.8. For a static discrete stochastic optimization, the constant step-size $\mu$ in (4.47) is replaced with a decreasing step-size $\mu[n] = 1/n$, which gives the RS algorithm in [9].

2. Upper Confidence Bound (UCB) [22, 115]. The UCB algorithm belongs to the family of "follow the perturbed leader" algorithms, and is summarized in Algorithm 4.3. For a static discrete stochastic optimization problem, we set $\mu = 1$ in (4.48), which gives the UCB1 algorithm in [22]; otherwise, the discount factor $\mu$ has to be chosen in the interval $(0, 1)$.

3. Simulated Annealing (SA) [5]. SA is almost identical to RS with the exception that occasional downhill moves are allowed. The neighborhood structure is such that every element in the search space is a neighbor of all others, and the annealing temperature $T$ is fixed. The SA algorithm is summarized below in Algorithm 5.2. For a static discrete stochastic optimization, the constant step-size $\mu$ in (5.27) is replaced with a decreasing step-size $\mu[n] = 1/n$, which gives the SA algorithm in [5].

Algorithm 5.2 Simulated Annealing (SA) [5]
Initialization. Select $s[0] \in \mathcal{S}$. Set $\pi_i[0] = 1$ if $i = s[0]$, and 0 otherwise. Set $s^*[0] = s[0]$ and $n = 1$.
Step 1: Random search. Sample a feasible solution $s'[n]$ uniformly from the set $\mathcal{S} \setminus \{s[n-1]\}$.
Step 2: Evaluation and Acceptance.
Simulate to obtain fn(s[n−1]) and fn(s′[n]), and sets[n] ={s′[n], if U ≤ exp(− [fn(s′[n])− fn(s[n])]+ /T),s[n− 1], otherwise,where U ∼ Uniform (0, 1).Step 3: Occupation Probability Update.pi[n] = pi[n− 1] + µ[es[n] − pi[n− 1]], 0 < µ < 1 (5.27)where ei is a unit vector with the i-th element being equal to one.Step 4: Global Optimum Estimate.s∗[n] ∈ argmaxi∈Spii[n].Let n← n+ 1 and go to Step 1.In what follows, we refer to the potential-based strategy presented in Algorithm 5.1 as PS.1715.7. Numerical Study5.7.1 Finding Demand with Maximum ProbabilitySuppose that the demand d[n] for a particular product has a Poisson distribution with parameterλ:d[n] ∼ f(s;λ) = λs exp(−λ)s! . (5.28)The sequence {d[n]} is assumed to be i.i.d. for the comparison purposes—since the RS, SA,and UCB methods assume independence of observations. The objective is then to solve thefollowing discrete stochastic optimization problem:argmins∈S={0,1,...,S}−E {I(d[n] = s)} . (5.29)This problem is equivalent toargmaxs∈{0,1,...,S}[f(s;λ) = λs exp(−λ)s!](5.30)which can be solved analytically. This enables us to check the estimate of the global optimumprovided by the algorithms with the true global optimum.We start with a static discrete stochastic optimization problem, and consider two cases forthe rate parameter λ:(i) λ = 1. The set of global optimizers will be M = {0, 1};(ii) λ = 10. The set of global optimizers will be M = {9, 10}.Figure 5.1 illustrates the true occurrence probabilities for different demand sizes. For each case,we further study the effect of the size of search space on the performance of the algorithmsby considering two instances: (i) S = 10, and (ii) S = 100. We use the potential-basedstrategy (5.8) with p = 5 for the PS scheme. 
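The static demand problem and the SA baseline described above can be sketched as follows. This is a minimal illustration rather than the implementation used for the reported results: the function names, the Knuth-style Poisson sampler, and the choice to evaluate the cost $f_n(s) = -I(d[n] = s)$ at both the current and the candidate solution with a single demand draw per iteration are assumptions made here for the sake of a self-contained example.

```python
import math
import random


def poisson_pmf(s, lam):
    """Poisson probability f(s; lam) = lam^s exp(-lam) / s!, as in (5.28)."""
    return lam ** s * math.exp(-lam) / math.factorial(s)


def true_optima(lam, S):
    """Maximizers of f(s; lam) over {0, ..., S}; ties kept up to rounding error."""
    pmf = [poisson_pmf(s, lam) for s in range(S + 1)]
    best = max(pmf)
    return [s for s, p in enumerate(pmf) if math.isclose(p, best, rel_tol=1e-9)]


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulated_annealing(lam, S, n_iter, T=1.0, seed=0):
    """Static SA sketch; returns (occupation probabilities, optimum estimate)."""
    rng = random.Random(seed)
    s_prev = rng.randrange(S + 1)          # s[0], chosen at random
    pi = [0.0] * (S + 1)
    pi[s_prev] = 1.0
    for n in range(1, n_iter + 1):
        # Step 1: uniform candidate from the search space minus s[n-1].
        cand = rng.randrange(S)
        if cand >= s_prev:
            cand += 1
        # Step 2: one simulated demand (assumption: shared by both evaluations).
        d = sample_poisson(lam, rng)
        f_prev = -1.0 if d == s_prev else 0.0
        f_cand = -1.0 if d == cand else 0.0
        if rng.random() <= math.exp(-max(f_cand - f_prev, 0.0) / T):
            s_prev = cand
        # Step 3: occupation probability update with decreasing step-size 1/n.
        mu = 1.0 / n
        pi = [p + mu * ((1.0 if i == s_prev else 0.0) - p)
              for i, p in enumerate(pi)]
    # Step 4: the most frequently visited candidate is the estimate.
    s_star = max(range(S + 1), key=lambda i: pi[i])
    return pi, s_star
```

Calling `true_optima(1.0, 10)` and `true_optima(10.0, 100)` recovers the two optimizer sets stated in cases (i) and (ii), which gives a reference against which the estimate returned by `simulated_annealing` can be compared.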
Since the problem is static, one can use the results of [143] to show that if the exploration factor $\delta$ in (5.9) decreases to zero sufficiently slowly, the sequence $\{s[n]\}$ converges almost surely to the stationary global optima set, denoted by $\mathcal{M}$. More precisely, we consider the following modifications to Algorithm 5.1:
• The constant step-size $\mu$ in (5.10) is replaced by the decreasing step-size $\mu[n] = \frac{1}{n+1}$.
• The exploration factor $\delta$ in (5.9) is replaced by $\frac{1}{n^{\alpha}}$, where $0 < \alpha < 1$.
By the above construction, the sequence $\{s[n]\}$ will eventually become reducible with singleton communicating class $\mathcal{M}$. That is, $\{s[n]\}$ eventually spends all its time in $\mathcal{M}$. This is in contrast to Algorithm 5.1 in the regime switching setting. In this example, we set $\alpha = 0.2$ and $\delta = 0.5$. Finally, we set $T = 1$ in the SA method, and $B = \mu = 1$ and $\xi = 0.5$ in the UCB scheme.

To give a fair comparison, all schemes are given the same simulation budget as that of the PS algorithm, i.e., one simulation per iteration. All algorithms start at a randomly chosen element $s[0]$, and move towards $\mathcal{M}$. The speed of convergence, however, depends on the rate parameter $\lambda$ and the search space size $S + 1$ in all methods. Close scrutiny of Table 5.1 reveals that, in all four algorithms, the speed of convergence decreases when either $S$ or $\lambda$ (or both) increases. However, the effect of increasing $\lambda$ is more substantial since the objective function values of the worst and best elements become closer when $\lambda = 10$. Given an equal number of simulations, a higher percentage of runs in which a particular method has converged to the true global optimum indicates convergence at a faster rate. Table 5.1 confirms that the PS algorithm ensures faster convergence in each case. This is further highlighted in Figure 5.3, where the sample paths of the global optimum estimate are plotted versus the number of simulation experiments.

[Figure 5.1: True objective function values $F(s, \theta)$. Horizontal axis: order size $s$; vertical axis: occurrence probability $F(s, \theta)$; curves: $\theta = 1$ ($\lambda = 1$) and $\theta = 2$ ($\lambda = 10$).]

[Figure 5.2: Proportion of simulations performed on the global optimizers set ($\lambda = 1$, $S = 100$). Horizontal axis: number of simulations $n$; vertical axis: proportion of simulations in the global optimum set; methods: UCB, PS, RS, SA.]

Table 5.1: Percentage of independent runs of algorithms that converged to the global optimum set in $n$ iterations.

(a) λ = 1

                 S = 10                    S = 100
      n     PS    SA    RS   UCB      PS    SA    RS   UCB
     10     48    42    39    86       9     9     6    43
     50     80    59    72    90      27    17    18    79
    100     96    69    82    95      45    21    29    83
    500    100    94    96   100      85    35    66    89
   1000    100    99   100   100      91    50    80    91
   5000    100   100   100   100     100    88    96    99
  10000    100   100   100   100     100    99   100   100

(b) λ = 10

                 S = 10                    S = 100
      n     PS    SA    RS   UCB      PS    SA    RS   UCB
    100     45    17    30    41      16     5     9    13
    500     64    25    43    58      29     7    21    25
   1000     72    30    59    74      34    10    26    30
   5000     87    46    75    86      59    15    44    44
  10000     94    57    84    94      70    24    49    59
  20000    100    64    88   100      82    29    61    74
  50000    100    72    95   100      96    41    65    81

Figure 5.2 illustrates the proportion of simulations performed on the global optimum when $\lambda = 1$ and $S = 100$, which reflects the efficiency of the algorithms. As can be seen, the SA and RS algorithms spend a large proportion of simulations on non-optimal solutions as they randomize among all feasible solutions (except the previously sampled candidate) at each iteration. Further, the UCB algorithm switches to its exploitation phase after a longer period of exploration as compared with the PS algorithm.
Figure 5.2 thus verifies that, among the four schemes, the PS algorithm strikes the best balance between exploration of the search space and exploitation of the simulation data.

Next, we consider the case where $\lambda(\theta[n])$ jumps following the sample path of the Markov chain $\{\theta[n]\}$ with state space $\mathcal{Q} = \{1, 2\}$ and transition matrix
\[
P^{\varepsilon} = I + \varepsilon Q, \qquad
Q = \begin{bmatrix} -0.5 & 0.5 \\ 0.5 & -0.5 \end{bmatrix}.
\tag{5.31}
\]
The regime switching discrete stochastic optimization problem that we aim to solve is then (5.29), where $S = 10$, and
\[
d[n] \sim f(s; \lambda(\theta[n])) = \frac{[\lambda(\theta[n])]^{s} \exp(-\lambda(\theta[n]))}{s!}, \qquad \lambda(1) = 1, \ \text{and} \ \lambda(2) = 10.
\tag{5.32}
\]
Accordingly, the sets of global optimizers are $\mathcal{M}(1) = \{0, 1\}$ and $\mathcal{M}(2) = \{9, 10\}$, respectively. We use the same values as before for all algorithms, with the exception that all decreasing step-sizes are now replaced with the constant step-size $\mu = 0.01$, and $\delta = 0.1$ in the PS algorithm. Figure 5.4 illustrates the tracking properties of all schemes by plotting the proportion of time that the estimate of the global optimum does not match the true optimum for different values of $\varepsilon$ in (5.31). Recall that $\varepsilon$ determines the speed of evolution of the hyper-model. Each point on the graph is an average over 100 independent runs of $10^6$ iterations of the algorithms. As expected, the global optimum estimate spends more time in the global optimum for all methods as the speed of time variations decreases. The PS algorithm, however, has proven to be more responsive.

[Figure 5.3: Sample path of the global optimum estimate ($\lambda = 10$, $S = 100$). Horizontal axis: number of simulations $n$; methods: UCB, PS, RS, SA.]
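The slowly switching hyper-model in (5.31) and (5.32) can be simulated with a few lines of code. The sketch below is illustrative only: the function names, the convention of starting the chain in state 1, and the seed are assumptions introduced here and are not taken from the experiments reported in this section.

```python
import random

# Generator Q from (5.31) and the regime-dependent Poisson rates from (5.32).
Q_GENERATOR = [[-0.5, 0.5],
               [0.5, -0.5]]
RATE = {1: 1.0, 2: 10.0}


def transition_matrix(eps, Q):
    """Return P^eps = I + eps * Q for a generator Q given as a list of rows."""
    m = len(Q)
    return [[(1.0 if i == j else 0.0) + eps * Q[i][j] for j in range(m)]
            for i in range(m)]


def simulate_regimes(eps, Q, n_steps, seed=0):
    """Sample a path theta[0], ..., theta[n_steps - 1] with states 1, ..., m."""
    rng = random.Random(seed)
    P = transition_matrix(eps, Q)
    theta = 0                      # index 0 stands for state 1 (assumed start)
    path = []
    for _ in range(n_steps):
        path.append(theta + 1)
        # Inverse-CDF draw of the next state from row theta of P.
        u, acc = rng.random(), 0.0
        for j, p_j in enumerate(P[theta]):
            acc += p_j
            if u < acc:
                theta = j
                break
    return path
```

With this transition matrix, the chain leaves its current state with probability $0.5\varepsilon$ at each step, so the expected sojourn time in a regime is $2/\varepsilon$ iterations; for example, $\varepsilon = 0.01$ gives 200 iterations on average between jumps, and the demand rate at step $n$ is `RATE[path[n]]`.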
[Figure 5.4: Proportion of the time that the global optimum estimate does not match the true global optimum after $n = 10^6$ iterations versus the switching speed of the global optimizers set ($S = 10$). Horizontal axis: speed of Markovian switching $\varepsilon$; vertical axis: % of simulations out of the global optimum set; methods: UCB, PS, RS, SA; annotation: speed of Markovian switching $\varepsilon$ matches algorithm step-size $\mu$.]

5.7.2 (s, S) Inventory Policy

The (s, S) policy is a classic control policy for inventory systems with periodic review; see, e.g., [296]. At the beginning of each period, the inventory level of some discrete commodity is reviewed. An (s, S) ordering policy then specifies that, if the inventory position (inventory on-hand plus orders outstanding minus backlogs) falls below $s$ units, an order is placed to raise it to $S > s$ units; otherwise, no order is placed. Figure 5.5 illustrates the (s, S) inventory policy. Here, to avoid confusion with the notation used so far, we denote $s$ and $S$ by $\underline{S}$ and $\overline{S}$, respectively. Let $x[n]$ and $d[n]$ denote the inventory position and the demand in period $n$, respectively. Suppose further that $d[n]$ is Poisson distributed with mean $\lambda$. We then seek to solve
\[
\min_{(\underline{S}, \overline{S}) \in \mathcal{S}} F(\underline{S}, \overline{S})
\tag{5.33}
\]
where
\[
F(\underline{S}, \overline{S}) = \lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} f_n(\underline{S}, \overline{S})
\]
is the average long term cost per period, and the (random) cost incurred in period $n$ is given by
\[
f_n(\underline{S}, \overline{S}) = \big(K + c(\overline{S} - x[n])\big) \cdot I(x[n] < \underline{S}) + h \cdot \max\{0, x[n]\} - p \cdot \min\{0, x[n]\}.
\]
Here, $K$ is the setup cost for placing an order, $c$ is the ordering cost per unit, $h$ is the holding cost per period per unit of inventory, and $p$ is the shortage cost per period per unit of inventory. In this example, we set $\lambda = 25$, $K = 32$, $c = 3$, $h = 1$, and $p = 5$. The set of feasible solutions is further defined as
\[
\mathcal{S} = \big\{ (\underline{S}, \overline{S}) \,;\ \underline{S}, \overline{S} \in \mathbb{N},\ 20 \le \underline{S} \le 60,\ 40 \le \overline{S} \le 80,\ \overline{S} - \underline{S} \ge 15 \big\}.
\]
The search space comprises 1051 candidate solutions.
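The size of the search space and the per-period cost can be checked with a short script. The sketch below is only an illustration: the names `feasible_policies`, `period_cost`, `lo` (the reorder point) and `hi` (the order-up-to level) are introduced here for convenience, while the cost parameters are the values quoted in the text.

```python
# Cost parameters quoted in the text: setup, unit ordering, holding, shortage.
K, C, H, P = 32.0, 3.0, 1.0, 5.0


def feasible_policies():
    """Enumerate (lo, hi) with 20 <= lo <= 60, 40 <= hi <= 80 and hi - lo >= 15."""
    return [(lo, hi)
            for lo in range(20, 61)
            for hi in range(40, 81)
            if hi - lo >= 15]


def period_cost(x, lo, hi):
    """Cost of one review period for inventory position x under policy (lo, hi)."""
    # An order up to level hi is placed only when x falls below the reorder point.
    order = (K + C * (hi - x)) if x < lo else 0.0
    # Holding cost applies to positive stock, shortage cost to backlog (x < 0).
    return order + H * max(0, x) - P * min(0, x)


if __name__ == "__main__":
    print(len(feasible_policies()))   # size of the search space
    print(period_cost(10, 20, 53))    # order placed, positive stock
    print(period_cost(-4, 20, 53))    # order placed, backlog of 4 units
```

Enumerating the constraints in this way reproduces the 1051 candidates quoted above. The dynamics of $x[n]$ (delivery lead time, demand arrival) are deliberately left out, since they depend on modeling details not spelled out in this passage.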
Given the feasibility constraints above, one can solve them to obtain the feasible candidates, and then solve the unconstrained problem. In the numerical evaluations provided below, we instead choose to use the technique of discrete Lagrange multipliers to solve the constrained problem [279]. The true objective function values are shown in Figure 5.6. The global optimum is (20, 53) with the expected cost per period of 111.1265.

We use $P(x) = \sum_{i=1}^{S} \exp\big(10 \cdot (x_i^{+})^{50}\big) - S$ in Definition 5.1, which gives the potential-based strategy
\[
p(r[n]) = \big(p_1(r[n]), \ldots, p_S(r[n])\big), \quad \text{where} \quad
p_i(r[n]) = (1 - \delta)\,\frac{(r_i^{+})^{49} \exp\big(10 \cdot (r_i^{+})^{50}\big)}{\sum_{j=1}^{S} (r_j^{+})^{49} \exp\big(10 \cdot (r_j^{+})^{50}\big)} + \frac{\delta}{S}.
\]
The parameters of the other schemes are chosen as in Section 5.7.1. To reduce the initial condition bias, the average cost per period in each iteration is computed after the first 100 review periods and averaged over the next 30 periods. Table 5.2a provides data from the numerical evaluation of the algorithms in a static setting (no underlying time variations). The leftmost columns of Table 5.2a confirm faster convergence of the PS algorithm. The rightmost columns illustrate the efficiency of the algorithms by comparing the proportion of simulations expended on the global optima. The SA and RS schemes spend less time simulating the global optimum as the number of iterations increases. This is mainly due to the nature of the acceptance rules (Step 2 in Algorithms 4.2 and 5.2): Each time, a sample is chosen uniformly from the set of candidate solutions excluding the previous sample. If the previous sample happens to be the global optimum, the following simulations will roughly (ignoring the stochastic nature of simulation data) all be expended on non-optimal candidates.
This prevents all schemes except the PS method from being deployable as an online control mechanism for inventory management.

Table 5.2b considers the $(\underline{S}, \overline{S})$ inventory problem in a non-stationary setting, where the demand distribution jumps between Poisson processes with means $\lambda(1) = 25$ and $\lambda(2) = 10$ according to the Markov chain $\{\theta[n]\}$ described in Section 5.7.1. The global minimum respectively jumps between the globally optimum policies (20, 53) and (20, 40). We run the constant step-size variants of all algorithms with $\mu = 0.01$, and set $\delta = 0.15$ in the PS scheme. Table 5.2b gives the percentage of times that the estimate of the global optimum matches the jumping true global optimum after $n = 10^6$ iterations for different values of $\varepsilon$. Recall that $\varepsilon$ captures the speed of time variations of the hyper-model.

[Figure 5.5: $(\underline{S}, \overline{S})$ inventory problem. The dash-dotted line depicts the case where orders are delivered immediately. Horizontal axis: time; vertical axis: inventory level; annotations: order size, $\overline{S}$, $\underline{S}$, replenishment lead time, undershoot, review period.]

[Figure 5.6: Steady-state expected cost per period.]

Table 5.2: Convergence and efficiency evaluation of algorithms in the $(\underline{S}, \overline{S})$ inventory problem.

(a) Static Setting

           % of Runs Converged           % of Simulations Performed
            to Global Optimum                on Global Optimum
     n     PS     SA     RS    UCB       PS      SA      RS     UCB
   500   17.1   16.2   18.3      0      0.1   0.210   0.209       0
   700   25.6   20.4   23.6    1.3      0.2   0.140   0.141   0.091
   900   41.9   31.5   35.7    8.2      0.3   0.112   0.114   0.011
  1100   67.7   45.1   48.2   55.9      1.9   0.096   0.091   0.011
  1500   80.4   55.9   56.6   69.1     17.9   0.067   0.067   0.013
  2000   98.5   75.6   68.7   87.6     34.8   0.051   0.050   0.015
  3500    100   85.2   84.1   99.1     58.2   0.030   0.028   0.017

(b) Non-Stationary Setting

               % of Iterations in Which s_max[n] = M(θ[n])
        ε        PS     SA     RS    UCB
  2.5 × 10^-3   94.3   70.8   79.2   87.7
  5.0 × 10^-3   93.5   69.4   78.4   87.1
  7.5 × 10^-3   92.9   68.1   77.1   86.6
  1.0 × 10^-2   92.3   67.9   76.7   83.3
  2.5 × 10^-2   91.2   57.1   65.4   85.1
  5.0 × 10^-2   90.0   54.7   62.8   79.8
  7.5 × 10^-2   88.6   51.3   58.5   76.2
The data in Table 5.2b have been averaged over 100 independent runs of the algorithms. As seen in Figure 5.4, the global optimum estimate spends less time in the true global optimum in all methods as $\varepsilon$ increases. The PS algorithm, however, has proven to be more responsive by providing the true global optimum most frequently.

5.8 Closing Remarks

Inspired by regret-matching learning rules in game theory, this chapter presented a regret-based adaptive search algorithm for simulation-based discrete stochastic optimization problems that undergo random unpredictable changes. The salient features of the proposed scheme include:

1. It assumes no functional properties such as sub-modularity, symmetry, or exchangeability on the objective function.

2. It allows temporal correlation of the data obtained from simulation experiments, which is more realistic in practice.

3. It can properly track a non-stationary global optimum when it evolves on the same timescale as the updates of the adaptive search algorithm.

Numerical examples confirmed superior efficiency and convergence characteristics as compared with the existing simulated annealing, random search and pure exploration methods.

The regret-based search scheme relied on a potential-based sampling strategy that was of better-response type. This is in contrast to the best-response type sampling strategy of the smooth best-response adaptive search scheme presented in Chapter 4. One naturally suspects that algorithms of best-response type must always outperform better-response type schemes. Both the analysis and the numerical evaluations in this chapter showed that, somewhat surprisingly, by properly choosing the potential function in the sampling strategy, one could obtain tracking and efficiency characteristics that are quite similar to those of the best-response type adaptive search scheme of Chapter 4.

5.9 Proofs of Results

5.9.1 Proof of Theorem 5.2

Recall the definitions of tightness and weak convergence (see Definitions 4.2 and 4.4).
The definitions of tightness and weak convergence extend to random elements in more general metric spaces. On a complete separable metric space, tightness is equivalent to relative compactness, which is known as Prohorov's theorem (see [46]). By virtue of this theorem, one can extract convergent subsequences when tightness is verified. We thus start with proving tightness. Subsequently, a martingale problem formulation is used to establish the desired weak limit. In what follows, we use $D([0,\infty) : B)$ to denote the space of functions that are defined in $[0,\infty)$ taking values in $B$, and are right continuous and have left limits (càdlàg functions) with Skorohod topology (see [192, p. 228]).

1. Tightness of the interpolated process $r^{\mu}(\cdot)$. Consider the discrete-time regret iterates $r[n]$ in (5.10). In view of the boundedness of the objective function (see Assumption 5.1), for any $0 < T_1 < \infty$ and $n \le T_1/\mu$, $\{r[n]\}$ is bounded w.p.1. As a result, for any $\vartheta > 0$,
\[
\sup_{n \le T_1/\mu} E\,\|r[n]\|^{\vartheta} < \infty
\]
where in the above and hereafter $\|\cdot\|$ denotes the Euclidean norm and $t/\mu$ is understood to be the integer part of $t/\mu$ for any $t > 0$. Next, considering the interpolated process $r^{\mu}(\cdot)$, defined in (5.13), and the recursion (5.10), for any $s, t > 0$, $\nu > 0$, and $s < \nu$, it can be verified that
\[
r^{\mu}(t+s) - r^{\mu}(t) = \mu \sum_{\tau = t/\mu}^{(t+s)/\mu - 1} \big[ g(s[\tau], r[\tau], f_\tau(s[\tau])) - r[\tau] \big].
\]
Consequently,
\[
\begin{aligned}
E^{\mu}_t \|r^{\mu}(t+s) - r^{\mu}(t)\|^2
&= \mu^2 \sum_{\tau = t/\mu}^{(t+s)/\mu - 1} \sum_{\kappa = t/\mu}^{(t+s)/\mu - 1} E^{\mu}_t \big[ g(s[\tau], r[\tau], f_\tau(s[\tau])) - r[\tau] \big]^{\top} \big[ g(s[\kappa], r[\kappa], f_\kappa(s[\kappa])) - r[\kappa] \big] \\
&\le K \mu^2 \left( \frac{t+s}{\mu} - \frac{t}{\mu} \right)^2 = O(s^2)
\end{aligned}
\tag{5.34}
\]
where $E^{\mu}_t$ denotes conditional expectation given the $\sigma$-algebra generated by the $\mu$-dependent past data up to time $t$. By virtue of the tightness criteria [187, Theorem 3, p. 47] or [192, Chapter 7], it suffices to verify
\[
\lim_{\nu \to 0} \limsup_{\mu \to 0} \Big\{ E \Big[ \sup_{0 \le u \le \nu} E^{\mu}_t \|r^{\mu}(t+u) - r^{\mu}(t)\|^2 \Big] \Big\} = 0.
\]
This directly follows from (5.34); hence, $r^{\mu}(\cdot)$ is tight in $D([0,\infty) : \mathbb{R}^S)$.
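The step from (5.34) to the tightness criterion can be spelled out as follows. This is an illustrative sketch: it assumes that every inner product of increments in the double sum is bounded by one constant $K$ (which is what the boundedness of the iterates provides) and it ignores integer-part rounding of the summation limits.

```latex
\[
\mu^2 \sum_{\tau = t/\mu}^{(t+s)/\mu - 1} \; \sum_{\kappa = t/\mu}^{(t+s)/\mu - 1} K
  = K \mu^2 \left( \frac{s}{\mu} \right)^2
  = K s^2 ,
\]
\[
\text{so that} \quad
\sup_{0 \le u \le \nu} E^{\mu}_t \, \| r^{\mu}(t+u) - r^{\mu}(t) \|^2 \le K \nu^2
\;\longrightarrow\; 0
\quad \text{as } \nu \to 0, \text{ uniformly in } \mu .
\]
```

The bound does not depend on $\mu$, which is why taking $\limsup_{\mu \to 0}$ before letting $\nu \to 0$ causes no difficulty.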
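In view of [290, Proposition 4.4], $\theta^{\mu}(\cdot)$ is also tight and $\theta^{\mu}(\cdot) \Rightarrow \theta(\cdot)$ such that $\theta(\cdot)$ is a continuous-time Markov chain with generator $Q$ (see Assumption 5.2). Therefore, $(r^{\mu}(\cdot), \theta^{\mu}(\cdot))$ is tight in $D([0,\infty) : \mathbb{R}^S \times \mathcal{Q})$.

2. Weak limit via a martingale problem solution. By virtue of Prohorov's theorem (see [192]), one can extract a convergent subsequence. For notational simplicity, we still denote the subsequence by $r^{\mu}(\cdot)$, and denote its limit by $r(\cdot)$. By the Skorohod representation theorem, $r^{\mu}(\cdot) \to r(\cdot)$ w.p.1, and the convergence is uniform on any compact interval. We now proceed to characterize the limit $r(\cdot)$ using martingale averaging methods.

To obtain the desired limit, it will be proved that the limit $(r(\cdot), \theta(\cdot))$ is the solution of the martingale problem with operator $\mathcal{L}$ defined as follows: For all $i \in \mathcal{Q}$,
\[
\begin{aligned}
\mathcal{L} y(r, i) &= [\nabla_r y(r, i)]^{\top} \big[ (p(r) \cdot \mathbf{F}(\theta(t))) \mathbf{1}_S - \mathbf{F}(\theta(t)) - r \big] + Q y(r, \cdot)(i), \\
Q y(r, \cdot)(i) &= \sum_{j \in \mathcal{Q}} q_{ij}\, y(r, j).
\end{aligned}
\tag{5.35}
\]
Here, $y(\cdot, i) : \mathbb{R}^S \to \mathbb{R}$ is continuously differentiable with compact support, and $\nabla_r y(r, i)$ denotes the gradient of $y(r, i)$ with respect to $r$. Using an argument similar to [291, Lemma 7.18], one can show that the martingale problem associated with the operator $\mathcal{L}$ has a unique solution. Thus, it remains to prove that the limit $(r(\cdot), \theta(\cdot))$ is the solution of the martingale problem. To this end, it suffices to show that, for any positive integer $\kappa_0$, and for any $t, u > 0$, $0 < t_\iota \le t$ for all $\iota \le \kappa_0$, and any bounded continuous function $h(\cdot, i)$ for all $i \in \mathcal{Q}$,
\[
E\, h(r(t_\iota), \theta(t_\iota) : \iota \le \kappa_0) \Big[ y(r(t+u), \theta(t+u)) - y(r(t), \theta(t)) - \int_t^{t+u} \mathcal{L} y(r(v), \theta(v))\, dv \Big] = 0.
\]
To verify the above equation, we work with $(r^{\mu}(\cdot), \theta^{\mu}(\cdot))$ and prove that it holds as $\mu \to 0$. By the weak convergence of $(r^{\mu}(\cdot), \theta^{\mu}(\cdot))$ to $(r(\cdot), \theta(\cdot))$ and Skorohod representation, it can be seen that
\[
\begin{aligned}
& E\, h(r^{\mu}(t_\iota), \theta^{\mu}(t_\iota) : \iota \le \kappa_0) \big[ y(r^{\mu}(t+u), \theta^{\mu}(t+u)) - y(r^{\mu}(t), \theta^{\mu}(t)) \big] \\
& \quad \to E\, h(r(t_\iota), \theta(t_\iota) : \iota \le \kappa_0) \big[ y(r(t+u), \theta(t+u)) - y(r(t), \theta(t)) \big].
\end{aligned}
\]
Now, choose a sequence of integers $\{n_\mu\}$ such that $n_\mu \to \infty$ as $\mu \to 0$, but $\delta_\mu = \mu n_\mu \to 0$, and partition $[t, t+u]$ into subintervals of length $\delta_\mu$. Then,
\[
\begin{aligned}
& y(r^{\mu}(t+u), \theta^{\mu}(t+u)) - y(r^{\mu}(t), \theta^{\mu}(t)) \\
& \quad = \sum_{\ell: \ell\delta_\mu = t}^{t+u} \big[ y(r[\ell n_\mu + n_\mu], \theta[\ell n_\mu + n_\mu]) - y(r[\ell n_\mu], \theta[\ell n_\mu]) \big] \\
& \quad = \sum_{\ell: \ell\delta_\mu = t}^{t+u} \big[ y(r[\ell n_\mu + n_\mu], \theta[\ell n_\mu + n_\mu]) - y(r[\ell n_\mu], \theta[\ell n_\mu + n_\mu]) \big] \\
& \qquad + \sum_{\ell: \ell\delta_\mu = t}^{t+u} \big[ y(r[\ell n_\mu], \theta[\ell n_\mu + n_\mu]) - y(r[\ell n_\mu], \theta[\ell n_\mu]) \big]
\end{aligned}
\tag{5.36}
\]
where $\sum_{\ell: \ell\delta_\mu = t}^{t+u}$ denotes the sum over $\ell$ in the range $t \le \ell\delta_\mu \le t+u$.

First, we consider the second term on the r.h.s. of (5.36):
\[
\begin{aligned}
& \lim_{\mu \to 0} E\, h(r^{\mu}(t_\iota), \theta^{\mu}(t_\iota) : \iota \le \kappa_0) \Big[ \sum_{\ell: \ell\delta_\mu = t}^{t+u} \big[ y(r[\ell n_\mu], \theta[\ell n_\mu + n_\mu]) - y(r[\ell n_\mu], \theta[\ell n_\mu]) \big] \Big] \\
& \quad = \lim_{\mu \to 0} E\, h(r^{\mu}(t_\iota), \theta^{\mu}(t_\iota) : \iota \le \kappa_0) \\
& \qquad \times \Big[ \sum_{\ell: \ell\delta_\mu = t}^{t+u} \sum_{i_0 = 1}^{\Theta} \sum_{j_0 = 1}^{\Theta} \sum_{k = \ell n_\mu}^{\ell n_\mu + n_\mu - 1} I(\theta[k] = i_0) \\
& \qquad\quad \times \big( y(r[\ell n_\mu], j_0)\, P(\theta[k+1] = j_0 \mid \theta[k] = i_0) - y(r[\ell n_\mu], i_0) \big) \Big] \\
& \quad = E\, h(r(t_\iota), \theta(t_\iota) : \iota \le \kappa_0) \Big[ \int_t^{t+u} Q y(r(v), \theta(v))\, dv \Big].
\end{aligned}
\tag{5.37}
\]
Next, we concentrate on the first term on the r.h.s. of (5.36):
\[
\begin{aligned}
& \lim_{\mu \to 0} E\, h(r^{\mu}(t_\iota), \theta^{\mu}(t_\iota) : \iota \le \kappa_0) \Big[ \sum_{\ell: \ell\delta_\mu = t}^{t+u} \big[ y(r[\ell n_\mu + n_\mu], \theta[\ell n_\mu + n_\mu]) - y(r[\ell n_\mu], \theta[\ell n_\mu + n_\mu]) \big] \Big] \\
& \quad = \lim_{\mu \to 0} E\, h(r^{\mu}(t_\iota), \theta^{\mu}(t_\iota) : \iota \le \kappa_0) \Big[ \sum_{\ell: \ell\delta_\mu = t}^{t+u} \big[ y(r[\ell n_\mu + n_\mu], \theta[\ell n_\mu]) - y(r[\ell n_\mu], \theta[\ell n_\mu]) \big] \Big] \\
& \quad = \lim_{\mu \to 0} E\, h(r^{\mu}(t_\iota), \theta^{\mu}(t_\iota) : \iota \le \kappa_0) \Big[ \sum_{\ell: \ell\delta_\mu = t}^{t+u} \nabla_r y^{\top} \big( r[\ell n_\mu + n_\mu] - r[\ell n_\mu] \big) \Big] \\
& \quad = \lim_{\mu \to 0} E\, h(r^{\mu}(t_\iota), \theta^{\mu}(t_\iota) : \iota \le \kappa_0) \\
& \qquad \times \Big[ \sum_{\ell: \ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_r y^{\top} \Big( \frac{1}{n_\mu} \sum_{k = \ell n_\mu}^{\ell n_\mu + n_\mu - 1} g(s[k], r[k], f_k(s[k])) - \frac{1}{n_\mu} \sum_{k = \ell n_\mu}^{\ell n_\mu + n_\mu - 1} r[k] \Big) \Big]
\end{aligned}
\tag{5.38}
\]
where $\nabla_r y$ is the gradient column vector with respect to $r$, and $\nabla_r y^{\top}$ is its transpose. For notational simplicity, we use $\nabla_r y$ to denote $\nabla_r y(r[\ell n_\mu], \theta[\ell n_\mu])$.

We start by looking at
\[
\begin{aligned}
& \lim_{\mu \to 0} E\, h(r^{\mu}(t_\iota), \theta^{\mu}(t_\iota) : \iota \le \kappa_0) \Big[ \sum_{\ell: \ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_r y^{\top} \Big[ \frac{1}{n_\mu} \sum_{k = \ell n_\mu}^{\ell n_\mu + n_\mu - 1} g(s[k], r[k], f_k(s[k])) \Big] \Big] \\
& \quad = \lim_{\mu \to 0} E\, h(r^{\mu}(t_\iota), \theta^{\mu}(t_\iota) : \iota \le \kappa_0) \Big[ \sum_{\ell: \ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_r y^{\top} \Big[ \frac{1}{n_\mu} \sum_{k = \ell n_\mu}^{\ell n_\mu + n_\mu - 1} \\
& \qquad \times \Big[ \sum_{\check\theta = 1}^{\Theta} \sum_{\theta = 1}^{\Theta} E_{\ell n_\mu}\{ g(s[k], r[k], f_k(s[k], \theta)) \}\, E_{\ell n_\mu}\{ I(\theta[k] = \theta \mid \theta[\ell n_\mu] = \check\theta) \} \Big] \Big] \Big]
\end{aligned}
\]
where $f_k(s, \theta)$ is defined in (5.2) and Assumption 5.1. For large $k$ with $\ell n_\mu \le k \le \ell n_\mu + n_\mu$ and $k - \ell n_\mu \to \infty$, by [290, Proposition 4.4], for some $\hat{k}_0 > 0$,
\[
\begin{aligned}
(I + \varepsilon Q)^{k - \ell n_\mu} &= Z((k - \ell n_\mu)\varepsilon) + O\big( \varepsilon + \exp(-\hat{k}_0 (k - \ell n_\mu)) \big), \\
\frac{dZ(t)}{dt} &= Z(t) Q, \qquad Z(0) = I.
\end{aligned}
\]
For $\ell n_\mu \le k \le \ell n_\mu + n_\mu$, Assumption 5.3 yields that
\[
(k - \ell n_\mu)\varepsilon \to 0 \quad \text{as } \mu \to 0.
\]
For such $k$,
\[
Z((k - \ell n_\mu)\varepsilon) \to I.
\]
Therefore, by the boundedness of $g(s[k], r[k], f_k(s[k], \theta))$, it follows that, as $\mu \to 0$,
\[
\frac{1}{n_\mu} \sum_{k = \ell n_\mu}^{\ell n_\mu + n_\mu - 1} \big\| E_{\ell n_\mu}\{ g(s[k], r[k], f_k(s[k], \theta)) \} \big\| \times \big| E_{\ell n_\mu}\{ I(\theta[k] = \theta) \mid I(\theta[\ell n_\mu] = \check\theta) \} - I(\theta[\ell n_\mu] = \check\theta) \big| \to 0.
\]
Therefore,
\[
\begin{aligned}
& \lim_{\mu \to 0} E\, h(r^{\mu}(t_\iota), \theta^{\mu}(t_\iota) : \iota \le \kappa_0) \Big[ \sum_{\ell: \ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_r y^{\top} \Big[ \frac{1}{n_\mu} \sum_{k = \ell n_\mu}^{\ell n_\mu + n_\mu - 1} g(s[k], r[k], f_k(s[k])) \Big] \Big] \\
& \quad = \lim_{\mu \to 0} E\, h(r^{\mu}(t_\iota), \theta^{\mu}(t_\iota) : \iota \le \kappa_0) \Big[ \sum_{\ell: \ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_r y^{\top} \Big[ \sum_{\check\theta = 1}^{\Theta} I(\theta[\ell n_\mu] = \check\theta) \\
& \qquad \times \Big( \frac{1}{n_\mu} \sum_{k = \ell n_\mu}^{\ell n_\mu + n_\mu - 1} E_{\ell n_\mu}\{ g(s[k], r[k], f_k(s[k], \check\theta)) \} \Big) \Big] \Big].
\end{aligned}
\]
It is more convenient to work with the individual elements of $g$. Substituting for the $i$-th element from (5.10) results in
\[
\begin{aligned}
& \lim_{\mu \to 0} E\, h(r^{\mu}(t_\iota), \theta^{\mu}(t_\iota) : \iota \le \kappa_0) \\
& \quad \times \Big[ \sum_{\ell: \ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_r y^{\top} \Big[ \sum_{\check\theta = 1}^{\Theta} I(\theta[\ell n_\mu] = \check\theta) \\
& \qquad \times \Big( \frac{1}{n_\mu} \sum_{k = \ell n_\mu}^{\ell n_\mu + n_\mu - 1} E_{\ell n_\mu}\Big\{ f_k(s[k], \check\theta) - \frac{f_k(i, \check\theta)}{p_i(r[k])}\, I(s[k] = i) \Big\} \Big) \Big] \Big] \\
& = \lim_{\mu \to 0} E\, h(r^{\mu}(t_\iota), \theta^{\mu}(t_\iota) : \iota \le \kappa_0) \\
& \quad \times \Big[ \sum_{\ell: \ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_r y^{\top} \Big[ \sum_{\check\theta = 1}^{\Theta} I(\theta[\ell n_\mu] = \check\theta) \\
& \qquad \times \Big( \frac{1}{n_\mu} \sum_{k = \ell n_\mu}^{\ell n_\mu + n_\mu - 1} E_{\ell n_\mu}\Big\{ \sum_{j=1}^{S} p_j(r[\ell n_\mu]) f_k(j, \check\theta) - f_k(i, \check\theta) \Big\} \Big) \Big] \Big].
\end{aligned}
\]
Here, we used
\[
E_{\ell n_\mu}\{ I(s[k] = i) \} = p_i(r[\ell n_\mu])
\]
since $s[k]$ is chosen according to the potential-based sampling strategy $p(r[k])$ (see Step 1 in Algorithm 5.1). Note that $f_k(i, \check\theta)$ is still time-dependent due to the presence of noise in the simulation data. Note further that $\theta[\ell n_\mu] = \theta^{\mu}(\mu \ell n_\mu)$.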
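In light of Assumptions 5.1 and 5.2, by the weak convergence of $\theta^{\mu}(\cdot)$ to $\theta(\cdot)$, the Skorohod representation, and using $\mu \ell n_\mu \to v$, it can then be shown that, as $\mu \to 0$,
\[
\sum_{\check\theta = 1}^{\Theta} \frac{1}{n_\mu} \sum_{k = \ell n_\mu}^{\ell n_\mu + n_\mu - 1} E_{\ell n_\mu}\{ f_k(i, \check\theta) \}\, I(\theta^{\mu}(\mu \ell n_\mu) = \check\theta)
\to \sum_{\check\theta = 1}^{\Theta} F(i, \check\theta)\, I(\theta(v) = \check\theta) = F(i, \theta(v)) \quad \text{in probability.}
\]
Combining the last two equations, and using $r[\ell n_\mu] = r^{\mu}(\mu \ell n_\mu)$, yields
\[
\begin{aligned}
& \lim_{\mu \to 0} E\, h(r^{\mu}(t_\iota), \theta^{\mu}(t_\iota) : \iota \le \kappa_0) \Big[ \sum_{\ell: \ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_r y^{\top} \Big( \sum_{\check\theta = 1}^{\Theta} I(\theta[\ell n_\mu] = \check\theta) \\
& \qquad \times \Big[ \frac{1}{n_\mu} \sum_{k = \ell n_\mu}^{\ell n_\mu + n_\mu - 1} E_{\ell n_\mu}\{ g(s[k], r[k], f_k(s[k], \check\theta)) \} \Big] \Big) \Big] \\
& \quad = E\, h(r(t_\iota), \theta(t_\iota) : \iota \le \kappa_0) \Big[ \int_t^{t+u} [\nabla_r y(r(v), \theta(v))]^{\top} \big( [p(r(v)) \cdot \mathbf{F}(\theta(v))] \mathbf{1}_S - \mathbf{F}(\theta(v)) \big)\, dv \Big]
\end{aligned}
\tag{5.39}
\]
where $\mathbf{F}(\theta)$ is defined in (5.14), and $p(r)$ is the potential-based sampling strategy in Definition 5.1.

Next, we consider the second term in (5.38). By using the technique of stochastic approximation (see, e.g., [192, Chapter 8]), it can be shown that
\[
\begin{aligned}
& \lim_{\mu \to 0} E\, h(r^{\mu}(t_\iota), \theta^{\mu}(t_\iota) : \iota \le \kappa_0) \Big[ \sum_{\ell: \ell\delta_\mu = t}^{t+u} \delta_\mu \nabla_r y^{\top} \Big[ \frac{1}{n_\mu} \sum_{k = \ell n_\mu}^{\ell n_\mu + n_\mu - 1} r[k] \Big] \Big] \\
& \quad = E\, h(r(t_\iota), \theta(t_\iota) : \iota \le \kappa_0) \Big[ \int_t^{t+u} [\nabla_r y(r(v), \theta(v))]^{\top} r(v)\, dv \Big].
\end{aligned}
\tag{5.40}
\]
Finally, combining (5.39) and (5.40) completes the proof.

5.9.2 Proof of Theorem 5.3

The first part of the proof shows that the deterministic system (5.15), when $\theta(t) = \theta$ is held fixed, is globally asymptotically stable, and $\mathcal{R}^{\eta}$ constitutes its global attracting set.

Recall class $\mathcal{K}_\infty$ functions from Definition 4.5. Define the Lyapunov function
\[
V_\theta(r) = \alpha(P_\theta(r)), \quad \text{where } \alpha \in \mathcal{K}_\infty, \text{ and } \alpha'(x) > 0 \text{ for all } x \in \mathbb{R}_+.
\tag{5.41}
\]
Here, $\mathbb{R}_+$ denotes the set of non-negative reals, $P_\theta(r)$ is the potential function in Definition 5.1, and the subscript $\theta$ emphasizes that $\theta(t) = \theta$ is held fixed.

Taking the time-derivative of the potential function $P_\theta(r)$ in Definition 5.1 and applying (5.15) yields
\[
\frac{d}{dt} P_\theta(r) = \nabla P_\theta(r) \cdot \frac{dr}{dt} = \nabla P_\theta(r) \cdot \big[ (p(r) \cdot \mathbf{F}(\theta)) \mathbf{1}_S - \mathbf{F}(\theta) \big] - \nabla P_\theta(r) \cdot r
\tag{5.42}
\]
where $x \cdot y$ denotes the inner product of vectors $x$ and $y$. Note in (5.42) that $\theta(t) = \theta$ is held fixed. We first concentrate on the first term in (5.42).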
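Rearranging (5.5) in Definition 5.1, we obtain:
\[
\nabla P_\theta(r) = \|\nabla P_\theta(r)\|_1 \Big[ \frac{1}{1-\delta}\, p(r) - \frac{\delta}{(1-\delta) S}\, \mathbf{1}_S \Big].
\]
Subsequently,
\[
\begin{aligned}
\nabla P_\theta(r) \cdot \big[ (p(r) \cdot \mathbf{F}(\theta)) \mathbf{1}_S - \mathbf{F}(\theta) \big]
&= \|\nabla P_\theta(r)\|_1 \Big[ \frac{1}{1-\delta}\, p(r) - \frac{\delta}{(1-\delta) S}\, \mathbf{1}_S \Big] \cdot \big[ (p(r) \cdot \mathbf{F}(\theta)) \mathbf{1}_S - \mathbf{F}(\theta) \big] \\
&= \frac{\|\nabla P_\theta(r)\|_1}{1-\delta}\, p(r) \cdot \big[ (p(r) \cdot \mathbf{F}(\theta)) \mathbf{1}_S - \mathbf{F}(\theta) \big] \\
&\quad - \frac{\delta\, \|\nabla P_\theta(r)\|_1}{(1-\delta) S}\, \mathbf{1}_S \cdot \big[ (p(r) \cdot \mathbf{F}(\theta)) \mathbf{1}_S - \mathbf{F}(\theta) \big] \\
&\le \frac{\delta \beta(\theta)}{(1-\delta) S}\, \|\nabla P_\theta(r)\|_1
\end{aligned}
\tag{5.43}
\]
for some constant $\beta(\theta) \ge 0$. In (5.43), we used
\[
p(r) \cdot \big[ (p(r) \cdot \mathbf{F}(\theta)) \mathbf{1}_S - \mathbf{F}(\theta) \big] = 0
\]
and the fact that the range of
\[
(p(r) \cdot \mathbf{F}(\theta)) \mathbf{1}_S - \mathbf{F}(\theta)
\]
is bounded (see Assumption 5.1). Concerning the second term on the r.h.s. of (5.42), in view of condition 4 in Definition 5.1,
\[
\nabla P_\theta(r) \cdot r \ge C\, \|\nabla P_\theta(r)\|_1\, \|r^{+}\|_1.
\]
Combining this with (5.43) then yields
\[
\frac{d}{dt} P_\theta(r) \le \|\nabla P_\theta(r)\|_1 \Big[ \frac{\delta \beta(\theta)}{(1-\delta) S} - C \|r^{+}\|_1 \Big].
\]
For each $\eta > 0$, $\delta(\eta)$ can be chosen small enough (for instance, such that $\frac{\delta(\eta) \beta(\theta)}{(1-\delta(\eta)) S} \le C\eta/2$) so that, if $0 < \delta \le \delta(\eta)$, then on $\|r^{+}\|_1 > \eta$,
\[
\frac{d}{dt} P_\theta(r) \le -\|\nabla P_\theta(r)\|_1\, C\eta/2 = -\psi_\theta(r).
\tag{5.44}
\]
In view of conditions 3 and 4 in Definition 5.1, $\psi_\theta(r) > 0$ on $\|r^{+}\|_1 > \eta$.

Since the potential function is positive definite and radially unbounded on $\mathbb{R}^S - \mathbb{R}^S_-$, there exist $\mathcal{K}_\infty$ functions $\alpha_1, \alpha_2$ such that
\[
\alpha_1\big( \operatorname{dist}[r, \mathbb{R}^S_-] \big) \le P_\theta(r) \le \alpha_2\big( \operatorname{dist}[r, \mathbb{R}^S_-] \big).
\tag{5.45}
\]
The result of [176, Lemma 18] can also be extended to show existence of $\gamma \in \mathcal{K}_\infty$ and $\varphi \in \mathcal{L}$ such that
\[
\psi_\theta(r) \ge \gamma\big( \operatorname{dist}[r, \mathbb{R}^S_-] \big)\, \varphi\big( \operatorname{dist}[r, \mathbb{R}^S_-] \big).
\tag{5.46}
\]
A function $\varphi : \mathbb{R}_+ \to \mathbb{R}_+$ belongs to class $\mathcal{L}$ if it is continuous, strictly decreasing, and $\lim_{x \to \infty} \varphi(x) = 0$. Since $\alpha_1, \alpha_2 \in \mathcal{K}_\infty$, they are both invertible. Accordingly, define
\[
\hat\gamma = \gamma \circ \alpha_2^{-1} \in \mathcal{K}_\infty, \quad \text{and} \quad \hat\varphi = \varphi \circ \alpha_1^{-1} \in \mathcal{L}
\]
where $\circ$ denotes function composition, and $\hat\psi : \mathbb{R}_+ \to \mathbb{R}_+$ such that
\[
\hat\psi(x) = \hat\gamma(x)\, \hat\varphi(x).
\]
Combining (5.44)–(5.46) yields
\[
\frac{d}{dt} P_\theta(r) \le -\gamma\big( \alpha_2^{-1}(P_\theta(r)) \big)\, \varphi\big( \alpha_1^{-1}(P_\theta(r)) \big) \le -\hat\psi(P_\theta(r)).
\tag{5.47}
\]
Now, taking the time-derivative of the Lyapunov function (5.41) and combining with (5.47), we obtain
\[
\frac{d}{dt} V_\theta(r) = \alpha'(P_\theta(r)) \Big( \nabla P_\theta(r) \cdot \frac{dr}{dt} \Big) \le -\alpha'(P_\theta(r))\, \hat\psi(P_\theta(r)) < 0
\tag{5.48}
\]
for all $r \in \mathbb{R}^S$ such that $\|r^{+}\| > \eta$. The following lemma is needed for the last step of the proof.

Lemma 5.1 ([177, Lemma 5.4]).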
For each continuous, positive definite function $\rho : \mathbb{R}_+ \to \mathbb{R}_+$, there exists $\alpha \in \mathcal{K}_\infty$ that is locally Lipschitz on its domain, continuously differentiable on $(0, \infty)$, and
\[
\alpha(x) \le \rho(x)\, \alpha'(x) \quad \text{for all } x \in \mathbb{R}_+.
\]
Applying this lemma to the r.h.s. in (5.48) yields
\[
\frac{d}{dt} V_\theta(r) \le -\alpha(P_\theta(r)) = -V_\theta(r)
\tag{5.49}
\]
for all $r \in \mathbb{R}^S$ such that $\|r^{+}\| > \eta$. Therefore, the deterministic limit system is globally asymptotically stable and, for $0 < \delta \le \delta(\eta)$:
\[
\lim_{t \to \infty} d(r, \mathcal{R}^{\eta}) = 0.
\]
Next, using the above Lyapunov function, we prove global asymptotic stability w.p.1 of the Markovian switched limit dynamical system. Recall Theorem 4.4 from Section 4.10 that extends the results of [67, Corollary 12] to the setting of this chapter. In view of conditions 1 and 2 in Definition 5.1, hypothesis 1 in Theorem 4.4 is satisfied. Since $\alpha \in \mathcal{K}_\infty$, existence of $\alpha_1, \alpha_2 \in \mathcal{K}_\infty$ that satisfy hypothesis 2 is further guaranteed using the same argument as in (5.45). In light of (5.49), hypothesis 3 is also satisfied with $\lambda = 1$. Since the Lyapunov functions are the same for all deterministic subsystems, existence of $v > 1$ in hypothesis 4 is also automatically guaranteed. Finally, hypothesis 5 simply ensures that the switching signal $\theta(t)$ is slow enough. Therefore, it remains to ensure that the generator $Q$ of the Markov chain $\theta(t)$ satisfies $1 + \tilde{q} > q$. This is satisfied since $|q_{\theta\theta'}| \le 1$ for all $\theta, \theta' \in \mathcal{Q}$, as specified in (5.12). This concludes that, for each $\eta > 0$, there exists a $\delta(\eta) > 0$ such that if $0 < \delta < \delta(\eta)$ in (5.5), the Markovian switched system (5.15) is globally asymptotically stable almost surely, and $\mathcal{R}^{\eta}$ constitutes its global attracting set for all states $\theta \in \mathcal{Q}$.

5.9.3 Proof of Corollary 5.1

We only give an outline of the proof, which essentially follows from Theorems 5.2 and 5.3. Define
\[
\hat{r}^{\mu}(\cdot) = r^{\mu}(\cdot + t_\mu).
\]
It can be shown that $\hat{r}^{\mu}(\cdot)$ is also tight since $r^{\mu}(\cdot)$ is tight. For any $T < \infty$, take a weakly convergent subsequence of
\[
\{ \hat{r}^{\mu}(\cdot), \hat{r}^{\mu}(\cdot - T) \}
\]
and denote its limit by
\[
(\hat{r}(\cdot), \hat{r}_T(\cdot)).
\]
Note that $\hat{r}(0) = \hat{r}_T(T)$.
The value of $\hat{r}_T(0)$ may be unknown, but the set of all possible values of $\hat{r}_T(0)$ (over all $T$ and convergent subsequences) belongs to a tight set. Using this and Theorems 5.2 and 5.3, for any $\varrho > 0$, there exists a $T_\varrho < \infty$ such that for all $T > T_\varrho$,
\[
\operatorname{dist}[\hat{r}_T(T), \mathcal{R}^{\eta}] \ge 1 - \varrho.
\]
This implies that
\[
\operatorname{dist}[\hat{r}(0), \mathcal{R}^{\eta}] \ge 1 - \varrho
\]
and the desired result follows.

5.9.4 Proof of Theorem 5.4

The proof is based on [39, Theorem 7.3]. Consider the coupled system $(r(t), \hat{r}(t))$, where both components use the same potential-based sampling strategy $p(r(t))$. This is in contrast to the limit system associated with (5.22), which uses $p(\hat{r}(t))$. Define
\[
D(t) = \|\hat{r}(t) - r(t)\|.
\]
Then, for all $0 \le \lambda \le 1$,
\[
\begin{aligned}
D(t + \lambda) &= \Big\| \hat{r}(t) + \lambda \frac{d}{dt} \hat{r}(t) - r(t) - \lambda \frac{d}{dt} r(t) + o(\lambda) \Big\| \\
&= \Big\| (1 - \lambda) [\hat{r}(t) - r(t)] + \lambda \Big[ \frac{d}{dt} (\hat{r}(t) - r(t)) + \hat{r}(t) - r(t) \Big] + o(\lambda) \Big\| \\
&\le (1 - \lambda) D(t) + \lambda \Big\| \frac{d}{dt} (\hat{r}(t) - r(t)) + \hat{r}(t) - r(t) \Big\| + o(\lambda).
\end{aligned}
\]
Recall from Section 5.6 that both $\hat{r}[n]$ and $r[n]$ form discrete-time stochastic approximation iterates of the same ODE (5.15). Substituting from (5.15) for both $dr/dt$ and $d\hat{r}/dt$, the second term on the r.h.s. vanishes. Therefore,
\[
D(t + \lambda) \le (1 - \lambda) D(t) + o(\lambda).
\]
Consequently,
\[
\frac{d}{dt} D(t) \le -D(t)
\]
for almost every $t$, from which it follows:
\[
D(-t) \exp(-t) \ge D(0), \quad \text{for all } t \ge 0.
\]
Note that $D(t)$ is bounded since $r(t)$ and $\hat{r}(t)$ are both bounded. Finally, since
\[
\lim_{t \to \infty} D(-t) \exp(-t) = 0, \quad \text{and} \quad D(t) \ge 0
\]
we obtain
\[
D(t) = D(0) = 0
\]
and the desired result follows.

6 Conclusions

The unifying theme of this thesis was to devise and study algorithms to learn and track the optimal decision in both interactive and non-interactive, uncertain, and possibly changing environments. The study of strategic decision making in an interactive environment is the subject of game theory, and was discussed at length in Part I. The problem of finding the optimal decision in non-interactive environments, however, is a mathematical discrete stochastic optimization problem, and was the subject of Part II.
This chapter completes the study by summarizing the results and their implications. Section 6.1 summarizes the major results and lessons of Chapters 2 and 3. Section 6.2 summarizes the contributions of Chapters 4 and 5. We conclude this chapter with a discussion of potential future work in Section 6.3.

6.1 Summary of Findings in Part I

The emergence of social networking services such as Twitter, Google+, and Facebook has heavily affected the way individuals share information within interested communities. This has led to new information patterns that facilitate the resolution of uncertainties in many decision problems that individuals face on a daily basis. One can argue that, intuitively, this has to accelerate both the process of learning the optimal decision, and coordination among individuals' optimal decisions, which is essential in view of the recent fast and unpredictable changes in trends.

Chapter 2 developed a non-cooperative game-theoretic framework that incorporates these new information patterns into the game model, and focused on the correlated equilibrium as the solution concept for the proposed game model. We stated in Section 1.1.1 that correlated equilibrium is a generalization (superset) of the Nash equilibrium. Another point of view is that a correlated equilibrium is a refinement, in the sense that it is the Nash equilibrium of a game augmented by communication. Although these seemingly conflicting interpretations present an interesting puzzle, the practical consequence is that communication in a game allows more flexibility of operation, and facilitates coordination, while leaving intact the fundamental freedom of choice of its players.

In view of this game-theoretic model, and building upon the regret-matching procedures, Chapter 2 then presented an adaptive rule-of-thumb learning procedure that takes into account the excess information the decision maker acquires by subscribing to social groups. It was shown that, if the decision maker individually follows the proposed algorithm, her regret after sufficient repeated plays of the game can be made and kept arbitrarily small as the game evolves. Further, if all decision makers follow the proposed scheme, they can be coordinated (in a distributed fashion) in such a way that their global behavior mimics a correlated $\epsilon$-equilibrium. These results are quite remarkable as they were obtained assuming that the unobserved random evolution of the parameters underlying the game occurs on the same timescale as the adaptation speed of the algorithm. Therefore, the emerging information patterns support the fast and unpredictable changes in trends if the individual properly exploits them in the decision making process.

These results show that sophisticated rational global behavior can be obtained as the consequence of boundedly rational agents following simple adaptive learning strategies and interacting with each other. Despite its deceptive simplicity and the bounded rationality of agents, the proposed adaptive learning strategy leads to a behavior that is similar to that obtained from fully rational considerations, namely, correlated equilibrium. Further, the more information agents disclose to others about their past decisions, the faster the coordination among agents that leads to such sophisticated global behavior.

Limitations. Although the adaptive learning algorithm of Chapter 2 ensures small worst-case regret for individual decision makers under (almost) no structural assumption and even under random unpredictable evolution of the game, the result concerning convergence to the correlated equilibria set requires all decision makers to follow the same learning strategy.
This may bequite restrictive in certain scenarios where agents cannot be programmed to make decisionsfollowing the same procedure.It is also natural to ask whether it is possible to use variants of the adaptive learningalgorithm to guarantee convergence to and tracking of Nash equilibria. The answer seems tobe negative. The reason for the failure turns out to be quite instructive as summarized in thefollowing extract form [145]:“[The reason for failure] is not due to some aspect of the adaptive property, butrather to one aspect of the simplicity property: namely, that in the strategies underconsideration players do not make use of the payoffs of the other players. Whilethe actions of the other players may be observable, their objectives and reasons forplaying those actions are not known (and at best can only be inferred). This simpleinformational restriction—which we call uncoupledness—is the key property thatmakes Nash equilibria very hard if not impossible to reach.”With electronics embedded everywhere, it is unavoidable that such devices encounter oneanother as they perform their core functions. The prevalence of these interactions is com-pounded by the recent trend of incorporating wireless communications into devices to enhance1926.1. Summary of Findings in Part Itheir functionality. Indeed, the wireless environment is inherently far-reaching, linking togetherdevices that may be separated by large physical distances. This opens up opportunity to enablethese devices to coordinate themselves by communicating through these wireless links.Motivated by the self-configuration benefit of the game-theoretic methods in Chapter 2 andthe prevalence of wireless-enabled electronics, Chapter 3 provided design guidelines as to howone can employ these methods for autonomous energy-aware activation of wireless-enabled sen-sors in an adaptive network. 
The nodes were equipped with a diffusion LMS adaptive filter, which comprised an LMS-type filter for the parameter estimation task and a diffusion protocol to enable nodes to cooperate. The diffusion strategy exploits the wireless communication functionality of nodes to exchange estimates. Each node then combines the received estimates by means of a pre-specified combiner function. The diffusion LMS adaptive filter has recently attracted attention in the signal processing community due to several advantages, including scalability and a stabilizing effect on the network, improved estimation performance, and savings in bandwidth, processing, and energy resources.

Chapter 3 first reformulated the diffusion LMS adaptive filter as a classical stochastic approximation algorithm. This new formulation had the advantage of providing a simpler derivation of the known stability and consistency characteristics via the powerful ODE method, and helped to combine the diffusion LMS with the autonomous activation mechanism. Then, a non-cooperative game formulation was provided for the activation control problem, which captured both the spatial-temporal correlation among sensors’ measurements and network connectivity for successful diffusion of estimates across the network. A novel two-timescale stochastic approximation algorithm was subsequently proposed that combined the diffusion LMS (slow timescale) with a game-theoretic activation functionality (fast timescale) that, based on the adaptive learning algorithm of Chapter 2, prescribed when the sensor should enter the sleep mode.
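
The adapt and combine steps of a diffusion LMS filter, as described above, can be illustrated with a minimal sketch. It is not the algorithm of Chapter 3: it assumes an adapt-then-combine ordering, a uniform-averaging combiner, a fixed four-node ring, and a constant step size, and it omits the game-theoretic activation mechanism entirely. All function and variable names (`diffusion_lms_step`, `run_demo`, `neighbors`, `mu`) are hypothetical and chosen for illustration only.

```python
import random


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def diffusion_lms_step(estimates, neighbors, regressors, measurements, mu):
    """One adapt-then-combine iteration over all nodes (illustrative assumption).

    estimates[k]    : current local estimate at node k (list of floats)
    neighbors[k]    : indices of nodes whose estimates node k receives (includes k)
    regressors[k]   : regressor vector observed by node k at this step
    measurements[k] : scalar measurement observed by node k at this step
    mu              : LMS step size (assumed small and positive)
    """
    # Adapt: each node runs a local LMS update using only its own data.
    intermediate = []
    for w, u, d in zip(estimates, regressors, measurements):
        error = d - dot(u, w)
        intermediate.append([wi + mu * error * ui for wi, ui in zip(w, u)])

    # Combine: each node averages the intermediate estimates in its neighborhood.
    # Uniform weights are one possible choice of pre-specified combiner.
    combined = []
    for k in range(len(estimates)):
        nbrs = neighbors[k]
        dim = len(intermediate[k])
        combined.append(
            [sum(intermediate[l][i] for l in nbrs) / len(nbrs) for i in range(dim)]
        )
    return combined


def run_demo(num_steps=3000, mu=0.05, noise_std=0.1, seed=0):
    """Simulate a 4-node ring estimating a fixed 2-dimensional parameter."""
    rng = random.Random(seed)
    true_param = [1.0, -0.5]  # hypothetical parameter to be estimated
    neighbors = [[3, 0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 0]]
    estimates = [[0.0, 0.0] for _ in neighbors]
    for _ in range(num_steps):
        regressors = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in neighbors]
        measurements = [
            dot(u, true_param) + rng.gauss(0, noise_std) for u in regressors
        ]
        estimates = diffusion_lms_step(
            estimates, neighbors, regressors, measurements, mu
        )
    return true_param, estimates
```

Calling `run_demo()` returns the true parameter together with the four local estimates, which should all lie close to it after the simulated steps. In an energy-aware variant, a node in sleep mode would presumably skip the adapt step or the exchange of estimates; how exactly that is done is specified in Chapter 3 and is not reproduced here.
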
The convergence analysis showed that, if each sensor individually follows the proposed energy-aware diffusion LMS algorithm, the local estimates converge to the true parameter across the network, yet the global activation behavior along the way tracks the evolving set of correlated equilibria of the underlying activation control game.

Existing game-theoretic treatments in sensor networks tend to focus on activities such as communication and routing of data to obtain savings in energy. The proposed energy-aware activation mechanism is quite remarka
Item Metadata
Title | Reinforcement learning in non-stationary games |
Creator | Namvar Gharehshiran, Omid |
Publisher | University of British Columbia |
Date Issued | 2015 |
Description | The unifying theme of this thesis is the design and analysis of adaptive procedures that are aimed at learning the optimal decision in the presence of uncertainty. The first part is devoted to strategic decision making involving multiple individuals with conflicting interests. This is the subject of non-cooperative game theory. The proliferation of social networks has led to new ways of sharing information. Individuals subscribe to social groups, in which their experiences are shared. These new information patterns facilitate the resolution of uncertainties. We present an adaptive learning algorithm that exploits these new patterns. Despite its deceptive simplicity, if followed by all individuals, the emergent global behavior resembles that obtained from fully rational considerations, namely, correlated equilibrium. Further, it responds to the random unpredictable changes in the environment by properly tracking the evolving correlated equilibria set. Numerical evaluations verify that these new information patterns can lead to improved adaptability of individuals and, hence, faster convergence to correlated equilibrium. Motivated by the self-configuration feature of the game-theoretic design and the prevalence of wireless-enabled electronics, the proposed adaptive learning procedure is then employed to devise an energy-aware activation mechanism for wireless-enabled sensors which are assigned a parameter estimation task. The proposed game-theoretic model trades off sensors' contribution to the estimation task and the associated energy costs. The second part considers the problem of a single decision maker who seeks the optimal choice in the presence of uncertainty. This problem is mathematically formulated as a discrete stochastic optimization. In many real-life systems, due to the unexplained randomness and complexity involved, there typically exists no explicit relation between the performance measure of interest and the decision variables. In such cases, computer simulations are used as models of real systems to evaluate output responses. We present two simulation-based adaptive search schemes and show that, by following these schemes, the global optimum can be properly tracked as it undergoes random unpredictable jumps over time. Further, most of the simulation effort is exhausted on the global optimizer. Numerical evaluations verify faster convergence and improved efficiency as compared with existing random search, simulated annealing, and upper confidence bound methods. |
Genre | Thesis/Dissertation |
Type | Text |
Language | eng |
Date Available | 2015-01-29 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivs 2.5 Canada |
DOI | 10.14288/1.0135671 |
URI | http://hdl.handle.net/2429/51993 |
Degree | Doctor of Philosophy - PhD |
Program | Electrical and Computer Engineering |
Affiliation | Applied Science, Faculty of; Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2015-05 |
Campus | UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/2.5/ca/ |
Aggregated Source Repository | DSpace |