Reinforcement Learning for Data Scheduling in Internet of Things (IoT) Networks

by

Hootan Rashtian

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Electrical and Computer Engineering)

The University of British Columbia
(Vancouver)

August 2020

© Hootan Rashtian, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Reinforcement Learning for Data Scheduling in Internet of Things (IoT) Networks

submitted by Hootan Rashtian in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering.

Examining Committee:

Sathish Gopalakrishnan, Electrical and Computer Engineering
Supervisor

Vijay Bhargava, Electrical and Computer Engineering
Supervisory Committee Member

Lutz Hans-Joachim Lampe, Electrical and Computer Engineering
University Examiner

Bhushan Gopaluni, Chemical and Biological Engineering
University Examiner

Louis Almeida, Electrical and Computer Engineering, University of Porto
External Examiner

Abstract

I investigate data prioritization and scheduling problems in Internet of Things (IOT) networks that encompass large volumes of data. The criteria for prioritizing data depend on multiple aspects, such as preserving the importance and timeliness of data messages in environments with different levels of complexity. I explore three representative problems within the landscape of data prioritization and scheduling.

First, I study the problem of scheduling the polling of data from sensors when it is not possible to gather all data at a processing centre. I present a centralized mechanism for choosing sensors to gather data from at each polling epoch. Our mechanism prioritizes sensors using information about the data generation rate, the expected value of the data, and its time sensitivity.
Our work relates to the restless bandit model in a continuous state space, unlike many other such models. The contribution is to derive an index policy and to show, through a quantitative study where event arrivals follow a hyper-exponential distribution, that it can be useful even when not optimal.

Second, I study the problem of balancing timeliness and criticality when gathering data from multiple sources using a hierarchical approach. A central decision-maker decides which local hubs to allocate bandwidth to, and the local hubs have to prioritize the sensors' messages. An optimal policy requires global knowledge of the messages at each local hub and is hence impractical. I propose a reinforcement-learning approach that accounts for both requirements. Evaluation results show that the proposed policy outperforms all other policies in the experiments except for the impractical optimal policy.

Finally, I consider the problem of handling the timeliness-criticality trade-off when gathering data from multiple sources in complex environments. In such environments, there exist dependencies among sensors that lead to patterns in the data that are hard to capture. Motivated by the success of the Asynchronous Advantage Actor-Critic (A3C) approach, I modify A3C by embedding Long Short-Term Memory (LSTM) to improve performance when vanilla A3C cannot capture patterns in the data. I show the effectiveness of the proposed solution based on results in multiple scenarios.

Lay Summary

When we have numerous sources of data, we cannot collect data from every source. Our ability to gather data is bounded by fundamental constraints such as bandwidth. Consequently, we need to prioritize data sources based on criticality and timeliness requirements. We explore data prioritization questions in the Internet of Things setting, wherein we may have to manage data collection from many sensors in the presence of requirements and restrictions in the network environment.
The solutions we present here build upon recent work in the area of reinforcement learning.

Preface

All of the work presented henceforth (Chapters 2-4) was conducted in the Real-Time and Dependable Computing Laboratory (RADICAL) at the University of British Columbia, Point Grey campus. I developed the algorithms, worked on proofs of correctness, and carried out the numerical evaluations described in this dissertation. My supervisor, Sathish Gopalakrishnan, worked closely with me on sharpening the problem formulation and clarifying the research context. Chapter 2 includes to-be-published work co-authored by Bader Alahmad.

• Chapter 2 is under peer review, following revisions.

I developed, implemented and evaluated the method. Bader Alahmad helped with verification of the proofs and proofreading of the work. Professor Gopalakrishnan provided feedback and suggestions for improving the formulation, the methodology and the evaluation.

• Chapter 3 has been published as H. Rashtian and S. Gopalakrishnan, "Balancing Message Criticality and Timeliness in IoT Networks" in IEEE Access, vol. 7, pp. 145738-145745.

I developed, implemented and evaluated the method. Professor Gopalakrishnan provided feedback and suggestions for improving the formulation, the methodology and the evaluation.

• Chapter 4 has been published as H. Rashtian and S. Gopalakrishnan, "Using Deep Reinforcement Learning to Improve Sensor Selection in the Internet of Things" in IEEE Access, vol. 8, pp. 95208-95222.

I developed, implemented and evaluated the method. Professor Gopalakrishnan provided feedback and suggestions for improving the methodology and the evaluation.

Contents

Abstract . . . iii
Lay Summary . . . v
Preface . . . vi
Contents . . . viii
List of Tables . . .
xii
List of Figures . . . xiv
Glossary . . . xxvii
Acknowledgments . . . xxviii

1 Introduction . . . 1
1.1 Broad Research Agenda . . . 2
1.2 Problem Statement . . . 4
1.3 Contributions . . . 5
1.3.1 Polling IoT sensors with time-sensitive data . . . 5
1.3.2 Balancing criticality and deadline in IoT networks . . . 6
1.3.3 Handling the trade-off of criticality vs. timeliness in complex environments . . . 7

2 Polling Sensors with Time-Sensitive Data: Restless Bandits Revisited . . . 9
2.1 Introduction . . . 10
2.2 Model . . . 15
2.3 Indexability, or the Existence of a Priority-Driven Policy . . . 19
2.4 Computing the Index/Priority . . . 26
2.5 Adaptive Estimation of Accrued Utility . . . 32
2.6 Explicit Analysis of Stochastic Arrivals . . . 34
2.7 Numerical Evaluation and Analysis . . . 44
2.7.1 Identical bandwidth costs . . . 49
2.7.2 Varied bandwidth costs . . . 49
2.7.3 IPv vs. IPf . . . 51
2.7.4 Selecting arms using indices: top-k arms vs. knapsack packing . . . 52
2.7.5 Insight from numerical evaluations . . . 54
2.8 Related Work . . . 55
2.9 Conclusions . . .
62

3 Balancing Message Criticality and Timeliness in IoT Networks: A Q-Learning Approach . . . 66
3.1 Introduction . . . 67
3.2 System Model . . . 68
3.3 Optimal Offline Policy . . . 70
3.4 Using Reinforcement Learning in a Decentralized Policy . . . 75
3.4.1 At Local Hubs . . . 76
3.4.2 At the Central Hub . . . 78
3.5 Alternative Policies at the Central Hub . . . 79
3.6 Quantitative Evaluation . . . 80
3.6.1 Experimental Parameters . . . 81
3.6.2 Evaluation Scenarios . . . 82
3.6.3 Results . . . 85
3.7 Related Work . . . 87
3.8 Conclusions and Future Work . . . 90

4 Handling the Message Criticality vs. Timeliness Tradeoff in Complex IoT Environments: A Deep RL Approach . . . 92
4.1 Introduction . . . 93
4.1.1 Context . . . 98
4.1.2 Contributions . . . 100
4.2 System Model . . . 101
4.2.1 Performance Metric . . . 102
4.2.2 Problem Statement . . . 103
4.3 Background . . . 104
4.3.1 Overview of Deep Reinforcement Learning . . . 104
4.3.2 A3C Network . . . 109
4.3.3 Proposed Approach . . . 114
4.3.4 Environment . . . 119
4.4 Quantitative Evaluation . . . 119
4.4.1 Evaluation Scenarios . . .
120
4.4.2 Experimental setup . . . 122
4.4.3 Results . . . 123
4.5 Related Work . . . 127
4.6 Conclusions and future work . . . 128

5 Conclusions & Future Work . . . 132
5.1 Limitations . . . 134
5.2 Potential Future Directions . . . 136

Bibliography . . . 138

A Supplementary material for Chapter 2 . . . 147

B Supplementary results of Chapter 4 . . . 149
B.1 Experiments results for 4 sensors . . . 149
B.2 Experiments results of 16 sensors . . . 152
B.3 Table of results summary . . . 155

List of Tables

Table 2.1 Comparison of algorithms in the cases of fixed and varied bandwidth costs. The table shows the percentage of simulation runs wherein the index policy outperforms other algorithms, as well as the average performance advantage. . . . 48

Table 3.1 The table shows the average Missed Criticality Ratio for the policies in each of the scenarios. As shown in the green-coloured cells, the proposed policy (VWP) consistently performs as the second-best policy after the optimal policy. . . . 87

Table 4.1 Example of sensor selection in a 4-step time window, when the dependency cannot be captured. In this case, there is a temporal dependency in the arrival of data packets. The available data packet at sensor i is shown as S_i = (cr, d). As the table shows, the total accumulated criticality at the end of time step t = 4 is 8. . . . 96

Table 4.2 Example of sensor selection in a 4-step time window, when the dependency can be captured.
In this case, there is a temporal dependency in the arrival of data packets. The available data packet at sensor i is shown as S_i = (cr, d). As the table shows, the total accumulated criticality at the end of time step t = 4 is 11. . . . 96

Table 4.3 The table summarizes the results for the case of 8 sensors. It reports the average Criticality-weighted Deadline Miss Ratio for the policies in each of the scenarios. As shown in the green-coloured cells, the proposed policy (m A3C) consistently performs as the best policy. Even when it is not the best (i.e., Scenario II), it still performs reasonably well, with only one percent of difference (concerning ρ) compared to the best result. Also, the red-coloured cells represent the worst performance (by the greedy policies) across all the scenarios. . . . 131

Table B.1 The summary of all experiments concerning the Criticality-weighted Deadline Miss Ratio. . . . 155

List of Figures

Figure 2.1 The flowchart of the sensor polling process. . . . 31

Figure 2.2 Performance of different policies with identical bandwidth costs: γ = 1 for all sensors, where #Bandits = 4, Bandwidth Limit = 2. The proposed index policy typically outperforms greedy and round-robin sensor selection. The Y-axis is the Average Reward / Greedy Reward (hence the Greedy policy always has the value of 1) and the X-axis represents the Workload Intensity that we define as ∑_{i∈M} ν_i × µ_arrival × β_i. According to the definition of the workload intensity, it increases with an increase in data value (ν_i), an increase in data arrival (µ_arrival) or an increase in decay rate (β_i) of messages. . . . 50

Figure 2.3 Performance of different policies with identical bandwidth costs: γ = 1 for all sensors, where #Bandits = 8, Bandwidth Limit = 4. The proposed index policy typically outperforms greedy and round-robin sensor selection.
The Y-axis is the Average Reward / Greedy Reward (hence the Greedy policy always has the value of 1) and the X-axis represents the Workload Intensity that we define as ∑_{i∈M} ν_i × µ_arrival × β_i. According to the definition of the workload intensity, it increases with an increase in data value (ν_i), an increase in data arrival (µ_arrival) or an increase in decay rate (β_i) of messages. . . . 51

Figure 2.4 Performance of different policies with identical bandwidth costs: γ = 1 for all sensors, where #Bandits = 16, Bandwidth Limit = 8. The proposed index policy typically outperforms greedy and round-robin sensor selection. The Y-axis is the Average Reward / Greedy Reward (hence the Greedy policy always has the value of 1) and the X-axis represents the Workload Intensity that we define as ∑_{i∈M} ν_i × µ_arrival × β_i. According to the definition of the workload intensity, it increases with an increase in data value (ν_i), an increase in data arrival (µ_arrival) or an increase in decay rate (β_i) of messages. . . . 52

Figure 2.5 Performance of different policies with identical bandwidth costs: γ = 1 for all sensors, where #Bandits = 32, Bandwidth Limit = 16. The proposed index policy typically outperforms greedy and round-robin sensor selection. The Y-axis is the Average Reward / Greedy Reward (hence the Greedy policy always has the value of 1) and the X-axis represents the Workload Intensity that we define as ∑_{i∈M} ν_i × µ_arrival × β_i. According to the definition of the workload intensity, it increases with an increase in data value (ν_i), an increase in data arrival (µ_arrival) or an increase in decay rate (β_i) of messages. . . . 53

Figure 2.6 Performance of different policies with randomly selected bandwidth costs, where #Bandits = 4, Bandwidth Limit = 2. The index policy tends to outperform the greedy as well as the round-robin policy, except for a few cases when round-robin selection has a small advantage.
The Y-axis is the Average Reward / Greedy Reward (hence the Greedy policy always has the value of 1) and the X-axis represents the Workload Intensity that we define as ∑_{i∈M} ν_i × µ_arrival × β_i. According to the definition of the workload intensity, it increases with an increase in data value (ν_i), an increase in data arrival (µ_arrival) or an increase in decay rate (β_i) of messages. . . . 54

Figure 2.7 Performance of different policies with randomly selected bandwidth costs, where #Bandits = 8, Bandwidth Limit = 4. The index policy tends to outperform the greedy as well as the round-robin policy, except for a few cases when round-robin selection has a small advantage. The Y-axis is the Average Reward / Greedy Reward (hence the Greedy policy always has the value of 1) and the X-axis represents the Workload Intensity that we define as ∑_{i∈M} ν_i × µ_arrival × β_i. According to the definition of the workload intensity, it increases with an increase in data value (ν_i), an increase in data arrival (µ_arrival) or an increase in decay rate (β_i) of messages. . . . 55

Figure 2.8 Performance of different policies with randomly selected bandwidth costs, where #Bandits = 16, Bandwidth Limit = 8. The index policy tends to outperform the greedy as well as the round-robin policy, except for a few cases when round-robin selection has a small advantage. The Y-axis is the Average Reward / Greedy Reward (hence the Greedy policy always has the value of 1) and the X-axis represents the Workload Intensity that we define as ∑_{i∈M} ν_i × µ_arrival × β_i. According to the definition of the workload intensity, it increases with an increase in data value (ν_i), an increase in data arrival (µ_arrival) or an increase in decay rate (β_i) of messages. . . . 56

Figure 2.9 Performance of different policies with randomly selected bandwidth costs, where #Bandits = 32, Bandwidth Limit = 16.
The index policy tends to outperform the greedy as well as the round-robin policy, except for a few cases when round-robin selection has a small advantage. The Y-axis is the Average Reward / Greedy Reward (hence the Greedy policy always has the value of 1) and the X-axis represents the Workload Intensity that we define as ∑_{i∈M} ν_i × µ_arrival × β_i. According to the definition of the workload intensity, it increases with an increase in data value (ν_i), an increase in data arrival (µ_arrival) or an increase in decay rate (β_i) of messages. . . . 57

Figure 2.10 Comparison of IPv and IPf in terms of total accrued rewards with identical bandwidth. IPv shows a consistent advantage with respect to IPf in all simulation setups. . . . 58

Figure 2.11 Comparison of IPv and IPf in terms of total accrued rewards with varied bandwidth. IPv shows a consistent advantage with respect to IPf in all simulation setups. . . . 59

Figure 2.12 Comparing top-k selection with knapsack packing for the index policy: we notice no significant difference between the two approaches, so the top-k approach suffices. The Y-axis is the Average Reward / Greedy Reward (hence the Greedy policy always has the value of 1) and the X-axis represents the Workload Intensity that we define as ∑_{i∈M} ν_i × µ_arrival × β_i. According to the definition of the workload intensity, it increases with an increase in data value (ν_i), an increase in data arrival (µ_arrival) or an increase in decay rate (β_i) of messages. . . . 60

Figure 3.1 The hierarchical model, where unit-length messages with deadlines (d_i) and criticalities (κ_i) arrive at local hubs and are then transmitted to the central hub (based on a policy). . . . 69

Figure 3.2 The details of processing messages at local hubs. First, sensors detect events and send messages to local hubs. Second, messages are routed to appropriate queues based on the priority assigned to each message.
Messages in the highest-priority queue are emptied first, before messages in queues with a lower priority level. . . . 75

Figure 3.3 We use four evaluation scenarios based on varying the distribution of messages in terms of criticality and deadline values. Scenario V, not in the figure, is where the deadline and criticality of each message are chosen uniformly at random from a range of values. . . . 81

Figure 3.4 Range of criticality-weighted miss ratio (ρ) values for Scenario I. The box plots indicate the range of miss ratio values using the same data reported in Table 3.1. The proposed reinforcement learning approach (VWP) is surpassed only by the offline optimal policy. . . . 82

Figure 3.5 Range of criticality-weighted miss ratio (ρ) values for Scenario II. The box plots indicate the range of miss ratio values using the same data reported in Table 3.1. The proposed reinforcement learning approach (VWP) is surpassed only by the offline optimal policy. . . . 83

Figure 3.6 Range of criticality-weighted miss ratio (ρ) values for Scenario III. The box plots indicate the range of miss ratio values using the same data reported in Table 3.1. The proposed reinforcement learning approach (VWP) is surpassed only by the offline optimal policy. . . . 84

Figure 3.7 Range of criticality-weighted miss ratio (ρ) values for Scenario IV. The box plots indicate the range of miss ratio values using the same data reported in Table 3.1. The proposed reinforcement learning approach (VWP) is surpassed only by the offline optimal policy. . . . 85

Figure 3.8 Range of criticality-weighted miss ratio (ρ) values for Scenario IV. The box plots indicate the range of miss ratio values using the same data reported in Table 3.1.
The proposed reinforcement learning approach (VWP) is surpassed only by the offline optimal policy. . . . 86

Figure 4.1 The system model, where the central unit selects one sensor at a time to transmit data. Sensors hold a message with a deadline (d_i) and criticality (κ_i). The model makes no assumptions about the arrival rate of events sensed by the sensors. . . . 102

Figure 4.2 An example of A3C's limitations, where its performance degrades in complex scenarios with 8 sensors. The Y-axis is the Criticality-weighted Deadline Miss Ratio and the X-axis represents the Workload Intensity that we define as ∑_{i∈M} Cr_i × λ_arrival / d_i. According to the definition of the workload intensity, it increases with an increase in criticality (Cr_i), an increase in data arrival (λ_arrival) or a decrease in deadline (d_i) of messages. . . . 112

Figure 4.3 The proposed A3C-based network with embedded memory (i.e., an LSTM layer). . . . 116

Figure 4.4 An example graph of the reward function for some parameter choices (ι = 1, α = 2, β = 4, and κ and d randomly selected from the interval (1, 5)). The peaks correspond to the cases where no penalty is incurred (i.e., I = 0), whereas the troughs correspond to the cases with a penalty (i.e., I = 1). . . . 118

Figure 4.5 Range of criticality-weighted deadline miss ratio (ρ) values with respect to the workload intensity for scenario 1 with 8 sensors. The plot indicates the range of criticality-weighted deadline miss ratio values using the same data reported in Table 4.3. The proposed approach (m A3C) performs competitively compared to the vanilla A3C. . . . 123

Figure 4.6 Range of criticality-weighted deadline miss ratio (ρ) values with respect to the workload intensity for scenario 2 with 8 sensors.
The plot indicates the range of criticality-weighted deadline miss ratio values using the same data reported in Table 4.3. The proposed approach (m A3C) performs competitively compared to the vanilla A3C. . . . 124

Figure 4.7 Range of criticality-weighted deadline miss ratio (ρ) values with respect to the workload intensity for scenario 3 with 8 sensors. The plot indicates the range of criticality-weighted deadline miss ratio values using the same data reported in Table 4.3. The proposed approach (m A3C) outperforms the other policies. . . . 126

Figure 4.8 Range of criticality-weighted deadline miss ratio (ρ) values with respect to the workload intensity for scenario 4 with 8 sensors. The plot indicates the range of criticality-weighted deadline miss ratio values using the same data reported in Table 4.3. The proposed approach (m A3C) outperforms the other policies. . . . 126

Figure B.1 Range of criticality-weighted deadline miss ratio (ρ) values with respect to the workload intensity for scenario 1 with 4 sensors. The plot indicates the range of criticality-weighted deadline miss ratio values using the same data reported in Table B.1. The proposed approach (m A3C) outperforms the vanilla A3C. . . . 150

Figure B.2 Range of criticality-weighted deadline miss ratio (ρ) values with respect to the workload intensity for scenario 2 with 4 sensors. The plot indicates the range of criticality-weighted deadline miss ratio values using the same data reported in Table B.1. The proposed approach (m A3C) outperforms the vanilla A3C. . . . 150

Figure B.3 Range of criticality-weighted deadline miss ratio (ρ) values with respect to the workload intensity for scenario 3 with 4 sensors. The plot indicates the range of criticality-weighted deadline miss ratio values using the same data reported in Table B.1. The proposed approach (m A3C) outperforms the vanilla A3C. . . .
151

Figure B.4 Range of criticality-weighted deadline miss ratio (ρ) values with respect to the workload intensity for scenario 4 with 4 sensors. The plot indicates the range of criticality-weighted deadline miss ratio values using the same data reported in Table B.1. The proposed approach (m A3C) outperforms the vanilla A3C. . . . 151

Figure B.5 Range of criticality-weighted deadline miss ratio (ρ) values with respect to the workload intensity for scenario 1 with 16 sensors. The plot indicates the range of criticality-weighted deadline miss ratio values using the same data reported in Table B.1. The proposed approach (m A3C) performs competitively compared to the vanilla A3C. . . . 152

Figure B.6 Range of criticality-weighted deadline miss ratio (ρ) values with respect to the workload intensity for scenario 2 with 16 sensors. The plot indicates the range of criticality-weighted deadline miss ratio values using the same data reported in Table B.1. The proposed approach (m A3C) outperforms the vanilla A3C. . . . 153

Figure B.7 Range of criticality-weighted deadline miss ratio (ρ) values with respect to the workload intensity for scenario 3 with 16 sensors. The plot indicates the range of criticality-weighted deadline miss ratio values using the same data reported in Table B.1. The proposed approach (m A3C) outperforms the vanilla A3C. . . . 153

Figure B.8 Range of criticality-weighted deadline miss ratio (ρ) values with respect to the workload intensity for scenario 4 with 16 sensors. The plot indicates the range of criticality-weighted deadline miss ratio values using the same data reported in Table B.1. The proposed approach (m A3C) outperforms the vanilla A3C. . . .
154

Glossary

IOT Internet of Things
A3C Asynchronous Advantage Actor-Critic
LSTM Long Short-Term Memory
VWP Value-Weighted Policy
DG Deadline-Greedy
CG Criticality-Greedy
CDG Criticality Density Greedy
RA Random
GRA Global Random

Acknowledgments

I want to express my most profound appreciation to my parents, brother, and my wife, for always standing by me. I love you!

I would also like to acknowledge the support of my supervisor, Professor Sathish Gopalakrishnan, who believed in me and encouraged me to gain new industrial and academic experiences. The shortest of conversations and interactions with him were a learning experience for me.

Also, there are many friends of mine with whom I enjoyed exchanging ideas.

Finally, I would like to thank the Natural Sciences and Engineering Research Council of Canada (NSERC) and the University of British Columbia for their financial support.

Chapter 1

Introduction

The amount of data being generated – by sensors and by humans – and shared online through the Internet has reached a volume that is exceedingly challenging to manage and work with. IBM reported that more than 2.5 quintillion bytes of data are being generated each day [2]. An example of a rapidly growing segment of the cyber-physical world that will add to the data volume is the Internet of Things (IOT) [3]. Whether data are generated for direct human consumption or for preliminary processing by other computational devices, the amount of data being generated will stress computational infrastructure (e.g., networks and processing nodes) and compete for human attention.

As another example of this enormous generation of data, Twitter reported that its users add 350,000 tweets each minute [5]. Similarly, there are 680,000 new Facebook posts [1] and 100 hours worth of videos are uploaded to YouTube [6] each minute.

Computational approaches to prioritize data are essential to coping with the generation of such data.
Such methods will allow system operators (such as IoT network administrators) to allocate suitable resources to content that needs them (e.g., replication levels to manage content that is frequently requested). For individuals who consume the data/content, these methods will help prioritize content and draw attention to relevant items. Algorithms for prioritizing data will prevent users from being overwhelmed by the data deluge. Such methods will also have to incorporate the time-varying nature of data: only some items may be effectively useful over the long run, but capturing short-term usefulness is also essential. A notion of "value" also captures the criticality of some data. In some cases, criticality may be established ahead of time, but it is also the case that we may have to infer criticality in certain situations.

1.1 Broad Research Agenda

The goal of the work we performed was to prioritize data streams effectively and quickly. Quick detection is essential because some content may lose value after a short period; to be effective, we need to make timely decisions. An example in this context may be data from IoT sensors. For example, traffic intensity information at an intersection is valuable when someone is planning their travel route, and any data obtained late is not of much value. On a longer timescale and in a different context, we see the same effect with data (i.e., videos) on playing a game like Pokemon Go. Tutorial videos on how to play the game may have high viewership when the game is released but may see diminished interest after gameplay has been assimilated into popular culture [4].

Discerning when a data source is important/critical is difficult because there are no apparent cues in a general setting. One will need to use certain aspects of the content – the metadata – as features that help in the identification process. In the IoT context, despite the large volume of data, less metadata may be available.
However, we will still need to prioritize data flows when resources (e.g., storage, bandwidth, and power) are limited. This property imposes an additional challenge compared to systems (e.g., social media) in which social engagement features (e.g., likes and comments) can be used as metadata.

Broadly, this thesis explores three problems related to prioritizing data streams in different IoT settings. These settings can be articulated, at least partially, along three dimensions: (i) resource limitations; (ii) data properties (e.g., criticality and timeliness); and (iii) environment complexity (e.g., whether data sources such as sensors are correlated in time or space).

The first problem is about collecting valuable data in an environment that is constrained by resources (i.e., a bandwidth limit) and that contains time-sensitive data from sensors, with relatively simple environment dynamics. The second problem is very much about balancing the data properties of criticality and timeliness while dealing with some limitations in less complex settings. In the third problem, we considered a case that encompasses resource limitations and data properties in a complex environment, where data sources (e.g., sensors) may be correlated in time or space.
Indeed, these cases can be further explored in scenarios with different configurations concerning resource limitations, properties associated with sensed data, and environmental complexity.

1.2 Problem Statement

Given the research agenda explained in Section 1.1, our enquiries can be summarized by the following question:

What techniques can we use to prioritize data given the existing resource constraints and properties in data-rich environments such as IoT networks?

While this research question shapes the theme of the research throughout the thesis, we answer three specific questions that help us answer the more general question stated above.

1.3 Contributions

The contributions of this dissertation are three-fold:

1.3.1 Polling IoT sensors with time-sensitive data

When we have to periodically poll sensors (from a large set), we present a mechanism for determining which sensors to gather data from at each polling epoch. Our sensor polling mechanism prioritizes sensors using information about the data generation rate, the expected value of the data, as well as its time sensitivity.

Our problem formulation and its solution relate to the restless bandit model for sequential decision making (Section 2.2). The problem we study bears similarity to the work by Kleinberg [31], but the fact that a sensor's state may grow with time (as events are observed) and decay (as the data becomes stale) does not permit the same analysis as in the case of recharging bandits. Prior work has focused on the state space of the arms being discrete, but, as we shall see, in our formulation the state space is continuous (albeit closed), which requires that we establish the existence (Section 2.3) and suitability of priority/index policies [50].
We proved that techniques similar to those for restless bandits with a discrete state space can be used because of particular characteristics of the underlying problem (Section 2.7). We then showed that our approach can be advantageous even when not optimal, through an extensive quantitative study where event arrivals follow a hyper-exponential distribution (Section 2.7).

1.3.2 Balancing criticality and deadline in IoT networks

When sensors are organized hierarchically, we present a reinforcement learning approach for message scheduling in a system with many nodes that helps us strike a balance between timeliness and criticality (Section 3.4). In a two-level hierarchical architecture, devices that generate data transmit them to a local hub. A central decision-maker then has to decide which local hubs to allocate bandwidth to, and the local hubs have to prioritize the messages they transmit when allowed to do so. We proved that an optimal policy does exist for this problem (Section 3.3). However, such a policy would require global knowledge of the messages at each local hub, rendering such a scheme impractical.

Despite similarities to job scheduling in distributed environments [36, 37] with respect to timing constraints, our work is distinguished by its focus on balancing criticality and timeliness in a distributed, IoT-like setup. Also, there has been a large amount of optimization research on job scheduling in distributed computing environments [12, 28, 56, 63, 72], which is not directly applicable to the message scheduling problem that we addressed in hierarchical networks.

We evaluated our solution using a criticality-weighted deadline miss ratio as the performance metric (Section 3.6). The performance analysis was done by simulating the behaviour of the proposed policy, as well as that of several natural policies, under a wide range of system conditions.
The results show that the proposed policy outperforms all the other policies (except for the optimal but impractical policy) under the range of system conditions studied, and that in many cases it performs close (3% to 12% lower performance, depending on the condition) to the optimal policy.

1.3.3 Handling the trade-off of criticality vs. timeliness in complex environments

When data sources may be correlated in space and time, we suggest a deep reinforcement learning solution to the problem of handling timeliness and criticality trade-offs (Section 4.3.3). Such correlation patterns are difficult to utilize without enhancing existing schemes (Section 4.4.1).

While our work is similar to a body of research that addresses understanding patterns in complex networks [14, 33, 71], it differs from that work both in terms of the system model (i.e., centralized) and the solution approach (i.e., a memory-based DRL approach).

We chose Asynchronous Advantage Actor-Critic (A3C) as the underlying model for our proposed solution. We first mapped vanilla A3C onto our problem to compare its performance, in terms of criticality-weighted deadline miss ratio, to the considered baselines in multiple scenarios. Then, we modified the A3C network by embedding Long Short-Term Memory (LSTM) to improve performance in cases where vanilla A3C could not capture patterns in the data. Simulation results (Section 4.4.3) show that the modified A3C reduces the criticality-weighted deadline miss ratio from 0.3 to 0.19. Moreover, the results indicate that the proposed solution also performs well in less complex cases, which makes it a generalized solution.

Chapter 2

Polling Sensors with Time-Sensitive Data: Restless Bandits Revisited

Summary. In a sensor-rich Internet of Things environment, we may be unable to gather all data at a processing centre at the rate at which the data is generated.
The rate of data collection from a sensor may be limited by available bandwidth/cost (or energy considerations), especially if one were to use cellular networks for such systems. In this context, we present a mechanism for determining which sensors to gather data from at each polling epoch. Our sensor polling mechanism prioritizes sensors using information about the data generation rate, the expected value of the data, as well as its time sensitivity. Our problem formulation and its solution relate to the restless bandit model for sequential decision making. Whereas existing methods for the restless bandit model are not directly applicable because the state space is continuous and not discrete, we prove that similar techniques can be used because of special characteristics of the underlying problem. We then show that our approach can be very effective even when not optimal, through an extensive quantitative study where event arrivals follow a hyper-exponential distribution.

2.1 Introduction

Sensor systems are a significant component of the Internet of Things (IoT). Dense sensor deployments are a rich source of data about the physical world and are used to inform decisions in a variety of applications; agriculture, transportation, and emergency response are a few such applications. Another source of data is certain types of online social networks/services, such as Twitter, which also contain feeds that influence decision making in the types of applications that also rely on physical sensing. Privacy in data collection can also be an important concern in applications that use data from sensor deployments.

We consider a scheduling problem that arises in the context of detecting events across multiple locations in a system. Imagine that we would like to know if an event was observed at some location. We would like to know of this event relatively quickly (delay sensitivity).
However, there are many locations to monitor, making it infeasible to poll each location at each decision epoch. One possible approach to this problem would be to set up triggers at each location for every object or event of interest and be informed when the event occurs. A downside to this approach, one that we consider, is that we may have to inform each location of the events of interest, potentially compromising the privacy of the events of interest (privacy sensitivity). We, therefore, study a model where a central entity polls an observation point (location) and obtains all observations since the previous polling request to that point. This central entity may then identify the events of interest from the obtained set of observations and take suitable action. We also use the term sensors to refer to these observation points, and we will use these terms interchangeably for the rest of this chapter.

From the design perspective, with the above-mentioned central entity, we can assume that we have a centralized data service that can provide subscription services to applications. The centralized entity may not have the most accurate knowledge of sensor data (and thus, the value of the data). However, one benefit of performing centralized scheduling is the ability to enforce a global bandwidth constraint. ("Centralization" may still use distributed computing techniques, but we can think of a single entity that plans the data collection and data warehousing.) The central entity has limited bandwidth and can only poll some sensors at each decision epoch. The question that then arises is: which sensors should be polled?
We may want to prioritize sensors that have not been polled recently and, at the same time, account for the fact that some sensors may consistently provide more valuable data. Another advantage of this model is that sensor deployment costs may be amortized over multiple applications, and simple application programming interfaces (APIs) allow many applications to be developed easily. This model has also been referred to as sensing-as-a-service [45]. In the sensing-as-a-service model, as more sensors join the network, it becomes increasingly challenging to select appropriate sensors such that the services mentioned above (e.g., APIs) remain efficiently available.

Our focus is on the abstraction of polling sensors where we have some notions of delay sensitivity and privacy sensitivity (or other limitations) that prevent a push-driven architecture. If the value derived from different events differs, then a simple round-robin polling policy may not be suitable. We study how a single entity can act as an efficient scheduler for sensor data collection when detecting incidents/events in IoT platforms. The growth in sensor deployments and the volumes of data generated [45] necessitates careful scheduling of data collection from sensors. Such scheduling is essential due to potential constraints such as the available communication bandwidth (which can be overwhelmed by an approach that simply queries all sensors periodically) or the direct cost of accessing the underlying network. One possible design strategy for such systems would be to utilize the cellular network infrastructure, which can be expensive, although simple to implement and manage.

We propose a centralized approach to periodically collecting sensor data. To manage the process of polling sensors for data, we consider two issues related to the sensors and the data they collect:

1. Data Value: Not all data is equal in a data-rich world.
The data stream from one sensor may be deemed more important than the stream from a different sensor. Such a judgment of value may result from factors such as sensor location and sensor type. The value can be quantified, for instance, by the number of API queries that use the data in the sensing-as-a-service model.

2. Time Sensitivity: Some data may be important for long-range statistics, but some data may be needed soon after the associated observation. Further, data may lose its value as the time from the associated observation increases.

Using the context that we have provided so far, we propose a policy that prioritizes sensors at each polling (data collection) epoch so as to respect the bandwidth (or cost) constraints and to maximize the long-term average value obtained. In modelling the underlying problem (Section 2.2), we treat the sensors as offering time-varying rewards based on when we poll them. The reward from a sensor is the "value" seen by the sensing service. This reward depends on the data that the sensor recorded and on when it recorded the data.

Our problem model is related to the restless bandit model for sequential decision making [69], due to Whittle. The central difference between our formulation and the classic restless bandit formulation is that the state space for our problem is continuous, whereas it is discrete in Whittle's formulation.
It is this difference that requires the rigorous treatment we present. Our main contributions are as follows:

• Establishing that the polling problem can be solved using a dynamic program that has a unique solution (Section 2.3);
• Deriving a simple dynamic priority, or index, policy that allows us to approximate the solution to the [intractable] dynamic program (Sections 2.4 and 2.6);
• Identifying an adaptive and improved index policy when event arrivals at sensors are modeled using hyper-exponential distributions to capture a wider range of operating conditions (Section 2.5);
• Demonstrating the effectiveness of the index policy using numerical evaluations (Section 2.7).

Organization. We start by presenting the related work (Section 2.8) to support and position our contributions. We then discuss the model that captures the sensor scheduling problem (Section 2.2) before presenting the main technical results highlighted in the list of contributions (Sections 2.4, 2.5, 2.6 and 2.7). The last section summarizes our findings and discusses extensions to this work (Section 2.9).

2.2 Model

Consider a sensor deployment consisting of $N$ sensors $S_1, \ldots, S_N$. Let $[N] = \{1, \ldots, N\}$. The data sensed by sensor $i \in [N]$ has initial value $\nu_i$, which is a non-negative random variable with known finite mean $\bar{\nu}_i < \infty$, and this value decreases exponentially with rate $\beta_i > 0$. This means that data collected $t$ units of time after it was initially sensed has (random) value $\nu_i e^{-\beta_i t}$. We assume that $\nu_1, \ldots, \nu_N$ are i.i.d. We also assume that data is collected periodically with period $P > 0$.

Moon et al. have presented the approach of assigning value to gathered data based on its use in their work on a learning framework for improving search results [41].

We assume that the successive events that a sensor may detect are such that the inter-arrival times are governed by a hyper-exponential distribution, which is a mixture of exponential distributions.
This assumption about inter-arrival times captures a wide range of operating conditions because hyper-exponential distributions include exponential distributions and can approximate heavy-tailed distributions. Sensor $S_i$, $i \in [N]$, senses new events from the environment at rate $\mu_{i,j} > 0$ with probability $q_j$; that is, events are sensed, or equivalently, events arrive at the sensors, according to a hyper-exponential process that is a mixture of $M$ exponential distributions. At each collection period, we select the sensors to poll and thus collect the sensed event data.

Our next step is to describe the expected utility/value that we can obtain when we poll a sensor. We define the expected utility/value from polling sensor $S_i$ as the average value over the average discounted time spanning one period; that is,

\[
\upsilon_i = \sum_{j=1}^{M} q_j \mu_{i,j} \int_0^P \bar{\nu}_i e^{-\beta_i t}\, dt = \frac{\sum_{j=1}^{M} q_j \mu_{i,j}}{\beta_i}\, \bar{\nu}_i \left(1 - e^{-\beta_i P}\right). \tag{2.1}
\]

We will use this estimate of the utility accrued in a period in our initial discussion. Later (Section 2.5), we will show that we can obtain improved estimates using a property of the hyper-exponential distribution.

Each sensor may have sensed multiple events between two sampling instants. Let $\Upsilon_i(t)$ be the utility accumulated at sensor $S_i$ at time $t$. This is also the state of $S_i$ at time $t$. Let $a_i(t)$ be the action taken for sensor $S_i$ at time $t$, which we define as

\[
a_i(t) = \begin{cases} 1, & \text{poll} \\ 0, & \text{idle (do not poll).} \end{cases}
\]

The evolution of the state of each sensor depends on the polling action. There are two cases. For convenience, let $\alpha_i = e^{-\beta_i P}$. First, if $a_i(t) = 0$, then no reward is obtained; that is, $r_i(t) = r_i(\Upsilon_i(t), a_i(t)) = 0$, and the state changes at the next polling period according to

\[
\Upsilon_i(t+P) = \alpha_i \Upsilon_i(t) + \upsilon_i. \tag{2.2}
\]

Second, if $a_i(t) = 1$, then the reward obtained is the accumulated utility up to time $t$; that is, $r_i(t) = r_i(\Upsilon_i(t), a_i(t)) = \Upsilon_i(t)$. In this case, the state is "reset" to its initial value: $\Upsilon_i(t+P) = \upsilon_i$.

Note that the controlled state process $\Upsilon = \{\Upsilon(t) : t \geq 0\}$ depends on the sequence of actions.
Moreover, $\Upsilon$ is a deterministic process; this is because (1) we are working with the expectations of the random elements involved in the definition of the state, and (2) we are considering deterministic decisions.

In this model, our objective is to maximize the average reward over the infinite horizon:

\[
\liminf_{j \to \infty} \sum_{i=1}^{N} \frac{1}{j} \sum_{m=0}^{j} r_i\bigl(\Upsilon_i(mP), a_i(mP)\bigr).
\]

There is also a "bandwidth" constraint defined as

\[
\limsup_{j \to \infty} \sum_{i=1}^{N} \frac{1}{j} \sum_{m=0}^{j} \gamma_i\, a_i(mP) = B,
\]

where $\gamma_i > 0$ is the "bandwidth" required by sensor $S_i$, and $B$ is the given average bandwidth that is available for the overall expected sensor polling activity.

We assume that sensor $S_i$ requires bandwidth $\gamma_i$ irrespective of the number of observations it reports when polled. We could treat $\gamma_i$ as a random variable, but the long-term behaviour can be approximated using the mean $\bar{\gamma}_i$. We will assume a deterministic $\gamma_i$ for the rest of the chapter.

Our formulation is related to the restless multi-armed bandit framework [69], but the difference in our formulation is that the state space is a closed domain in $\mathbb{R}^N$. The general analysis in this setting requires establishing some important results.

In the specific case where $\gamma_i = 1$ for all $i \in [N]$, our problem reduces to the original restless multi-armed bandit setting, and $B$ becomes the expected number of sensors that need to be polled at every decision epoch.

We shall next examine index policies, wherein the global problem with $N$ sensors (arms) is decomposed into $N$ single-sensor problems. In what follows, we shall use the terms arm and sensor interchangeably. When referring to an arm, we will use the term pulling for the sensor polling activity.

In the first part of our discussion (Sections 2.3 and 2.4), we will assume a fluid-flow approximation of the stochastic event arrival process, which triggers observations at sensors. Event arrivals over a time window are averaged, and each observation at a sensor has the same value, although observations at different sensors may have different values.
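As an illustrative sketch (not part of the original analysis), the per-sensor state recursion (2.2) and the poll/idle reward structure can be simulated as follows; the function name and all parameter values are ours and purely hypothetical:

```python
import math

def simulate_sensor(beta, upsilon, poll_schedule, P=1.0):
    """Simulate one sensor under (2.2): decay factor alpha = exp(-beta * P),
    accrual upsilon per period, reset on poll; returns the total reward."""
    alpha = math.exp(-beta * P)
    state = upsilon                  # initial state
    reward = 0.0
    for poll in poll_schedule:       # one decision per period P
        if poll:                     # a(t) = 1: collect the accumulated utility
            reward += state
            state = upsilon          # state resets to its initial value
        else:                        # a(t) = 0: no reward, state accrues and decays
            state = alpha * state + upsilon
    return reward

# Polling every period collects exactly upsilon each time.
print(simulate_sensor(beta=0.5, upsilon=1.0, poll_schedule=[1, 1, 1, 1]))
```

Letting the state accrue for a period before polling yields $\alpha \upsilon + \upsilon$ instead of $\upsilon$, which is the trade-off the index policy in Section 2.4 exploits.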
Later (Section 2.6) we will relax this assumption and study the truly stochastic behaviour of the system, where events arrive at discrete time points and the values of observations may differ at a sensor, but the mean value of a sensor observation is known.

2.3 Indexability, or the Existence of a Priority-Driven Policy

We consider the average reward problem for $S_i$ using the theory of Lagrange multipliers. To this end, denote by $\Upsilon_i$ the state space of sensor $S_i$, and let $v_i(s_i, a_i)$ be the immediate reward that $S_i$ receives when the state is $s_i \in \Upsilon_i$ and the action taken is $a_i \in \{0, 1\}$. That is,

\[
v_i(s_i, a_i) := s_i a_i + \lambda \gamma_i (1 - a_i), \tag{2.3}
\]

where $\lambda$ is a Lagrange multiplier. One may interpret $\lambda$ as a subsidy allocated to sensor $i$ so as to make idling (not polling) more attractive.

Using $\rho$ as the discount factor, we can define the discounted reward over an infinite horizon as

\[
\sum_{t=0}^{\infty} \rho^t v_i\bigl(\Upsilon_i(t), a_i(t)\bigr). \tag{2.4}
\]

The value function associated with (2.4) is

\[
V_\rho(s_i) := \sup_{\{a_i(t)\}} \left\{ \sum_{t=0}^{\infty} \rho^t v_i\bigl(\Upsilon_i(t), a_i(t)\bigr) : \Upsilon_i(0) = s_0 \right\}.
\]

The dynamic programming equation may now be written as

\[
V_\rho(s_i) = \max\bigl\{ \lambda \gamma_i + \rho V_\rho(\alpha_i s_i + \upsilon_i),\; s_i + \rho V_\rho(\upsilon_i) \bigr\}. \tag{2.5}
\]

Lemma 1. The solution $V_\rho$ to the dynamic program defined by (2.5) is (i) unique and bounded, (ii) continuous over $\rho \in (0,1)$ (in the Lipschitz sense), and (iii) monotonically increasing and convex.

Proof of Lemma 1

Proof. To show (i), we refer to the theory of discrete-time Markov chains (DTMCs), where having a unique, bounded, continuous solution for $V_\rho$ is standard [57]. For (ii), consider $s_i, s'_i \in \Upsilon_i$ with $s'_i \neq s_i$, and consider processes $\{\Upsilon_i(t)\}$ and $\{\Upsilon'_i(t)\}$ with initial conditions (states) $s_i$ and $s'_i$, respectively. Both processes are controlled by the same actions $\{a(t)\}$. Denote by $T$ the first time instant at which the action is to poll the sensor; that is, $T = \inf\{t \geq 0 : a(t) = 1\}$. Since the state is reset when polling the sensor, it follows that $\Upsilon_i(t) = \Upsilon'_i(t)$ for all $t > T$ (since the action sequence is the same for both processes), and thus $\bigl[v_i(\Upsilon'_i(t), a(t)) - v_i(\Upsilon_i(t), a(t))\bigr] = 0$ for all $t > T$.
On the other hand, the action at each $t < T$ is to keep the sensor idle, and by (2.3), $v_i(\Upsilon'_i(t), a(t)) = v_i(\Upsilon_i(t), a(t)) = \lambda \gamma_i$ (independently of the state); therefore $\bigl[v_i(\Upsilon'_i(t), a(t)) - v_i(\Upsilon_i(t), a(t))\bigr] = 0$ for all $t < T$. Thus, we have

\[
V_\rho(s'_i) - V_\rho(s_i) = \sum_{t=0}^{\infty} \rho^t \bigl[v_i(\Upsilon'_i(t), a(t)) - v_i(\Upsilon_i(t), a(t))\bigr] = \rho^T \bigl(\Upsilon'_i(T) - \Upsilon_i(T)\bigr).
\]

At $t = T$, the states of processes $\{\Upsilon_i(t)\}$ and $\{\Upsilon'_i(t)\}$ will have evolved according to (2.2), which for any integer $k \geq 0$ and initial state $s_i$ gives

\[
\Upsilon_i(kP) = \bigl(1 + \alpha_i + \cdots + \alpha_i^k\bigr) s_i = \left[\frac{1 - \alpha_i^{k+1}}{1 - \alpha_i}\right] s_i.
\]

Observing that $T$ is an integer multiple of $P$, we have

\[
\Upsilon_i(T) = \Upsilon_i\bigl((T/P)P\bigr) = \left[\frac{1 - \alpha_i^{(T/P)+1}}{1 - \alpha_i}\right] s_i.
\]

Thus, we have

\[
V_\rho(s'_i) - V_\rho(s_i) = \rho^T \left[\frac{1 - \alpha_i^{(T/P)+1}}{1 - \alpha_i}\right] (s'_i - s_i).
\]

Interchanging the roles of $s'_i$ and $s_i$, we obtain a symmetric inequality. Thus

\[
\bigl|V_\rho(s_i) - V_\rho(s'_i)\bigr| \leq \rho^T \left[\frac{1 - \alpha_i^{(T/P)+1}}{1 - \alpha_i}\right] \bigl|s_i - s'_i\bigr|.
\]

To prove (iii), take $s_i, s'_i \in \Upsilon_i$ with $s'_i > s_i$. Consider processes $\{\Upsilon_i(t)\}$ and $\{\Upsilon'_i(t)\}$ generated by a common action sequence $\{a(t)\}$, where the two processes differ only in the initial state. One can readily verify that $\Upsilon'_i(t) \geq \Upsilon_i(t)$ for all $t$. Thus

\[
\sum_{t=0}^{\infty} \rho^t v_i\bigl(\Upsilon'_i(t), a(t)\bigr) \geq \sum_{t=0}^{\infty} \rho^t v_i\bigl(\Upsilon_i(t), a(t)\bigr).
\]

The claimed monotonicity then follows by taking the supremum of both sides of the latter inequality over all valid action sets.

To establish convexity, let $V_{\rho,T}(s_i)$ be the finite-horizon discounted value:

\[
V_{\rho,T}(s_i) = \sup_{\{a(t)\}} \left\{ \sum_{t=0}^{T} \rho^t v_i\bigl(\Upsilon_i(t), a(t)\bigr) : \Upsilon_i(0) = s_0 \right\}.
\]

This satisfies the following dynamic programming equation:

\[
V_{\rho,T}(s_i) = \max\bigl\{ \lambda \gamma_i + \rho V_{\rho,T-1}(\alpha_i s_i + \upsilon_i),\; s_i + \rho V_{\rho,T-1}(\upsilon_i) \bigr\}
\]

for all $T \geq 1$, with $V_{\rho,0}(s_i) = s_i$. We can establish the convexity of $V_{\rho,T}$ by induction on $T$. Since pointwise limits preserve convexity, the fact that $V_\rho(s_i) = \lim_{T \to \infty} V_{\rho,T}(s_i)$ implies that $V_\rho$ is also convex.

As discussed earlier, the reward is discounted over time. Now, let $\tilde{V}_\rho(s) = V_\rho(s) - V_\rho(\upsilon)$, $s \in \Upsilon_i$. We can deduce that $\tilde{V}_\rho(s)$ also satisfies Lemma 1. Furthermore, $(1-\rho)V_\rho(\upsilon)$ is bounded. Using the Bolzano-Weierstrass theorem [51] and the Arzelà-Ascoli theorem [10], we may pick a subsequence along which $\bigl(\tilde{V}_\rho(\cdot),\, (1-\rho)V_\rho(\upsilon)\bigr)$ converges to $(V, \xi)$.
From (2.5) we have

\[
\tilde{V}_\rho(s) + (1-\rho)V_\rho(\upsilon) = \max\bigl\{ \gamma_i\lambda + \rho \tilde{V}_\rho(\alpha_i s + \upsilon),\; s \bigr\}. \tag{2.6}
\]

As $\rho \to 1$ along an appropriate subsequence, (2.6) becomes

\[
V(s) + \xi = \max\bigl\{ \gamma_i\lambda + V(\alpha_i s + \upsilon),\; s \bigr\}, \tag{2.7}
\]

which can be written as

\[
V(s) + \xi = \max_{a \in \{0,1\}} \bigl\{ a s + (1-a)\bigl(\gamma_i\lambda + V(\alpha_i s + \upsilon_i)\bigr) \bigr\}. \tag{2.8}
\]

Now that we have derived the dynamic programming equation, we want to show that the value function increases monotonically, is convex, and that $V(\upsilon) = 0$. These conditions are easily verified since pointwise limits preserve convexity and monotonicity.

We also want to show that the maximum in (2.8) is achieved by the optimal action and, correspondingly, that $\xi$ is the optimal reward. To establish this, consider the following argument. Let $a^*(s)$ be the action that maximizes $\bigl[a s + (1-a)\bigl(\gamma_i\lambda + V(\alpha_i s + \upsilon_i)\bigr)\bigr]$. If multiple actions maximize the foregoing function, then we can pick one of those actions arbitrarily. Under the policy $\{a(t) = a^*(s(t)) : t \geq 0\}$,

\[
V\bigl(\Upsilon(t)\bigr) + \xi = v\bigl(\Upsilon(t), a(t)\bigr) + V\bigl(\Upsilon(t+1)\bigr).
\]

Now, if we consider the average value of both sides over time, we get

\[
\underbrace{\xi + \frac{1}{T}\sum_{t=0}^{T} V\bigl(\Upsilon(t)\bigr)}_{L} = \underbrace{\frac{1}{T}\sum_{t=0}^{T} \bigl[v\bigl(\Upsilon(t), a(t)\bigr) + V\bigl(\Upsilon(t+1)\bigr)\bigr]}_{R}. \tag{2.9}
\]

As $T \to \infty$, $\xi$ is the average reward under the chosen control policy. For any other action set, we will have $L \geq R$ in (2.9), and hence $\xi$ is greater than or equal to the average reward under a different set of actions. This implies the optimality of $\xi$.

Let us now define the set of states in which we do not poll a sensor, as well as the set of states in which we do poll a sensor:

\[
D^c = \{s \in S : \gamma\lambda + V(\alpha s + \upsilon) > s\},
\qquad
D = \{s \in S : \gamma\lambda + V(\alpha s + \upsilon) \leq s\}.
\]

If $t_0$ is the time at which this sensor is first polled and $t_0 < \infty$ ($t_0 = \infty$ is the "never poll" case), then by using the optimal policy and iterating $t_0$ times with the optimal value function in (2.7), we may write the dynamic programming equation as

\[
V(s) = (\gamma\lambda - \xi)\, t_0 + \left[\alpha^{t_0} s + \left(\frac{1 - \alpha^{t_0}}{1 - \alpha}\right)\upsilon - \xi\right].
\]

If we use a different policy that is not optimal, then we will have

\[
V(s) \geq (\gamma\lambda - \xi)\, t_0 + \left[\alpha^{t_0} s + \left(\frac{1 - \alpha^{t_0}}{1 - \alpha}\right)\upsilon - \xi\right].
\]

Consequently, we can write $V(s)$ as

\[
V(s) = \max\left\{ (\gamma\lambda - \xi)\, t_0 + \left[\alpha^{t_0} s + \left(\frac{1 - \alpha^{t_0}}{1 - \alpha}\right)\upsilon - \xi\right] \right\},
\]

where we maximize the reward over all action sequences.
This implies that equation (2.7) has a unique solution.

2.4 Computing the Index/Priority

We now show that an index policy (or dynamic priority policy) exists for the problem. (This is akin to Whittle's approach for the classic restless bandit problem.) To establish the existence of an index policy, we utilize the following facts:

• The value function is monotone;
• The value function is convex;
• The mapping from $s$ to $s - V(\alpha s + \upsilon)$ is concave.

Consequently, as $\lambda$ is varied from $-\infty$ to $+\infty$, the set $D$ grows in a monotone fashion from the empty set to the entire state space $S$.

First, we show that some corner cases can be ignored.

1. Suppose $\upsilon^* \in D$; that is, the optimal action at $\upsilon^*$ is not to poll the sensor, and the related cost is $\gamma\lambda$. Then $\xi = \gamma\lambda$ and the optimal strategy is to not poll the sensor in any state. Thus, $D = [\upsilon, \upsilon^*]$ and $D^c = \emptyset$. The index would then be calculated as

\[
\lambda \geq \lambda_\upsilon = \max_{s \in [\upsilon, \upsilon^*]} \bigl(s - V(\alpha s + \upsilon)\bigr)/\gamma.
\]

2. If $\upsilon \in D^c$, we have $0 + \xi = \upsilon + 0 \Rightarrow \xi = \upsilon$. This means that it is optimal to poll the sensor when the reward is $\upsilon$. Also, $D^c = [\upsilon, \upsilon^*]$ and $D = \emptyset$. In this case $\lambda$ must obey the following inequality:

\[
\lambda \leq \lambda_l := \min_{s \in [\upsilon, \upsilon^*]} \bigl(s - V(\alpha s + \upsilon)\bigr)/\gamma.
\]

What we have now is that the deterministic control policies $a(t) = 0$ and $a(t) = 1$ have cost $\gamma\lambda$ and $\upsilon$, respectively, and $\xi$ must then satisfy two conditions:

• $\xi \geq \min(\gamma\lambda, \upsilon)$, and
• $\xi \geq \min(\gamma\lambda, \upsilon)$ when $\lambda \in (\lambda_l, \lambda_\upsilon)$, where $\lambda_l$ and $\lambda_\upsilon$ are the lower and upper bounds, respectively. Also, $D$ and $D^c$ are non-empty.

There is also some $\upsilon^+ \in (\upsilon, \upsilon^*)$ at which polling and not polling the sensor are equally good. In this case, $\upsilon^+$ increases with $\lambda$. We can obtain $g(x)$ as the inverse of this function; $g(x)$ increases with $x \in (\upsilon, \upsilon^+)$. $g(x)$ is, in essence, the value of $\lambda$ at which polling and not polling the sensor are both suitable decisions.

Lemma 2. The sets $[\upsilon, \upsilon^+)$ and $(\upsilon^+, \upsilon^*]$ correspond to $D$ and $D^c$ for some $\upsilon^+ \in [\upsilon, \upsilon^*]$.

Proof of Lemma 2

Proof. Since $V$ is convex, either

1. for some $\upsilon_2 > \upsilon_1$, $D = [\upsilon, \upsilon_1) \cup (\upsilon_2, \upsilon^*]$, or

2.
for some $\upsilon^+$, $D = [\upsilon, \upsilon^+)$ and $D^c = (\upsilon^+, \upsilon^*]$.

But at $s = \upsilon^*$, the optimal action is to poll the sensor. Thus, $\upsilon^* \in D^c$, and hence only the second condition can hold.

Corollary 1. The function mapping $s$ to $V(\alpha s + \upsilon)$ is monotonically non-decreasing over $[\upsilon, \upsilon^*]$.

Let $\lambda = g(s)$ for some $s \in (\upsilon, \upsilon^*)$. Every time we poll the sensor, the corresponding state resets to $\upsilon$. The optimal policy is then periodic: do not poll the sensor until the state enters $D^c$, and then poll it. Finite perturbations in initial conditions do not impact long-term behaviour, so we assume without loss of generality that $s(0) = \upsilon$ (i.e., the initial state). Define $\tau(s) = \min\{t : \Upsilon(t) \in D^c\}$, where $s$ is the initial state. Then

\[
\Upsilon\bigl(\tau(s)\bigr) = \bigl(1 - \alpha^{\tau(s)}\bigr)\upsilon^*
\;\Rightarrow\;
\tau(s) = \left\lceil \log^+_\alpha \left(1 - \frac{s}{\upsilon^*}\right) \right\rceil,
\]

where

\[
\log^+_\alpha(s) = \begin{cases} \log_\alpha(s), & s > 0 \\ 0, & \text{otherwise.} \end{cases}
\]

In the long run, the overall average cost converges to the average cost over one polling period. Thus,

\[
\xi = \frac{\lambda\gamma\bigl(\tau(s) - 1\bigr) + \Upsilon\bigl(\tau(s)\bigr)}{\tau(s)}. \tag{2.10}
\]

Theorem 1. The index of sensor $S_i$ is

\[
g_i(s) = \frac{1}{\gamma_i}\left[\tau_i(s)\bigl((1-\alpha_i)s - \upsilon_i\bigr) + \left(\frac{1 - \alpha_i^{\tau_i(s)}}{1 - \alpha_i}\right)\upsilon_i\right],
\]

where

\[
\tau_i(s) = \left\lceil \log^+_{\alpha_i} \left(\frac{\upsilon_i - (1-\alpha_i)s}{\upsilon_i}\right) \right\rceil.
\]

Note. We introduced the subscript $i$ so that we can have independent indices for different sensors. Also, once a sensor has been polled, the states $\{\Upsilon_i(t)\}$ are discrete thereafter, with jumps at every time step; the states depend solely on $\upsilon_i$ and $\alpha_i$. For a sensor that is never polled, the states can be restricted to discrete values depending on $\Upsilon_i(0)$. If we restrict attention to such discretized states, the index reduces to

\[
g_i(s) = \frac{1}{\gamma_i}\bigl[\tau_i(s)\bigl((1-\alpha_i)s - \upsilon_i\bigr) + s\bigr].
\]

Proof. For state $s \in D^c$, we obtain $V(s) = s - \xi$ using (2.7). Also, using Lemma 2, for $s' = \alpha s + \upsilon \in D^c$, we can obtain $V(s') = s' - \xi$. Combining these results with (2.7) and the definition of an index for restless bandits, we have

\[
g_i(s) = \frac{(1-\alpha_i)s - \upsilon_i + \tilde{\xi}_i(s)}{\gamma_i}, \tag{2.11}
\]

where $\tilde{\xi}_i$, by (2.10), is the optimal policy cost when $\lambda_i = g_i(s)$.

Using (2.10) yields

\[
\tilde{\xi}_i = \frac{1}{\tau_i(s)}\left\{\gamma_i\, g_i(s)\bigl(\tau_i(s) - 1\bigr) + \bigl(1 - \alpha_i^{\tau_i(s)}\bigr)\upsilon_i\right\}, \tag{2.12}
\]

where $\tau_i(s) = \left\lceil \log^+_{\alpha_i} \left(\frac{\upsilon_i - (1-\alpha_i)s}{\upsilon_i}\right) \right\rceil$.

Using (2.12) in (2.11), we can solve a linear equation for $g_i(s)$.
Thus:

\[
g_i(s) = \frac{1}{\gamma_i}\left[\tau_i(s)\bigl((1-\alpha_i)s - \upsilon_i\bigr) + \left(\frac{1 - \alpha_i^{\tau_i(s)}}{1 - \alpha_i}\right)\upsilon_i\right].
\]

Index Policy. At each integer multiple of the period, poll the sensors with the highest indices until the bandwidth constraint would be violated. That is, at a decision epoch, calculate the indices $g_1(s_1), \ldots, g_N(s_N)$, where $s_i$ is the state of sensor $i \in [N]$ at the decision epoch, sort the sensors in non-increasing order of their indices, and activate the sensors with the highest indices until the sum of the sensor bandwidths $\gamma_i$ exceeds the total available bandwidth $B$.

Figure 2.1: The flowchart of the sensor polling process.

Figure 2.1 depicts a step-by-step flowchart of the sensor polling process in our proposed approach. Note that an alternative approach is to treat the decision at each epoch as a knapsack problem [17] after we have computed the indices but, as we discuss later, this approach does not result in significant benefits for the extra work involved.

Computational Complexity. The Whittle-like index priority calculation in the previous section is easy to compute and implement, with only a linear increase in space and time complexity in the number of sources. Note that, following Whittle's approach, we decouple our problem into $N$ sub-problems (one per sensor), each of which involves a constant-time update. Therefore, the worst-case complexity of calculating the indices is linear ($\Theta(N)$) in the number of sensors. Once we have the indices at each time step, the index policy requires that the indices be sorted ($\Theta(N \log N)$ in the worst case) so that it can select the best sensors. However, we need not perform a complete sort at each epoch.
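As an illustration, the index computation from Theorem 1 and the greedy selection step might be sketched as follows. This is only a sketch under our own naming conventions; the helper functions, parameter tuples, and numeric values are hypothetical and not from the thesis:

```python
import math

def tau(s, alpha, upsilon):
    """Periods tau_i(s), via the ceil(log^+_alpha(...)) expression in Theorem 1."""
    x = (upsilon - (1 - alpha) * s) / upsilon
    if x <= 0:              # log^+ is defined as 0 for non-positive arguments
        return 0
    return math.ceil(math.log(x, alpha))

def index(s, alpha, upsilon, gamma):
    """Whittle-like index g_i(s) from Theorem 1."""
    t = tau(s, alpha, upsilon)
    return (t * ((1 - alpha) * s - upsilon)
            + (1 - alpha ** t) / (1 - alpha) * upsilon) / gamma

def select_sensors(states, params, budget):
    """Greedy index policy: sort sensors by index (non-increasing) and
    activate them until the bandwidth budget would be exceeded."""
    order = sorted(range(len(states)),
                   key=lambda i: index(states[i], *params[i]),
                   reverse=True)
    chosen, used = [], 0.0
    for i in order:
        gamma = params[i][2]        # params[i] = (alpha_i, upsilon_i, gamma_i)
        if used + gamma <= budget:
            chosen.append(i)
            used += gamma
    return chosen
```

The per-sensor index update is constant-time, matching the $\Theta(N)$ computation and $\Theta(N \log N)$ sorting costs noted above.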
There is a semi-periodic behaviour that we can exploit, and this allows us to select the top few sensors more efficiently in practice, often in sub-linear time.

2.5 Adaptive Estimation of Accrued Utility

In our initial analysis, we modelled the utility accrued at sensor $S_i$ during each period using (2.1), which we reproduce here for reference:

\[
\upsilon_i = \sum_{j=1}^{M} q_j \mu_{i,j} \int_0^P \bar{\nu}_i e^{-\beta_i t}\, dt = \frac{\sum_{j=1}^{M} q_j \mu_{i,j}}{\beta_i}\, \bar{\nu}_i \left(1 - e^{-\beta_i P}\right).
\]

An insight that we can use to refine this estimate is as follows: suppose a sensor has not observed an event for $t$ time units; what is the probability that this sensor will not observe any event after $t + P$ time units? When events arrive with separations that are hyper-exponentially distributed, we can show, rather easily, that not seeing an event allows us to model the future with a modified hyper-exponential distribution. Let us suppose that $t'$ is the inter-arrival time between two events:

\[
P(t' > t+P \mid t' > t) = \frac{P(t' > t+P)}{P(t' > t)} = \frac{\sum_{i=1}^{M} p_i e^{-\lambda_i (t+P)}}{\sum_{j=1}^{M} p_j e^{-\lambda_j t}} = \sum_{i=1}^{M} q_i e^{-\lambda_i P},
\]

where $q_i = p_i e^{-\lambda_i t} / \sum_{j=1}^{M} p_j e^{-\lambda_j t}$.

The derivation above illustrates that when we do not see arrivals under a hyper-exponential distribution, we can make predictions using a modified hyper-exponential distribution.

We can use this insight as follows:

• If we poll a sensor and do not find any useful data, then we can modify the arrival distribution: we change our estimate of the expected utility in the next period according to the modified hyper-exponential distribution.
• If we do obtain useful data when we poll a sensor, we make our next estimate using the original hyper-exponential distribution associated with that sensor.

We also note that if we have multiple consecutive periods in which a sensor does not produce useful data, then we can keep shifting the associated distribution based on the number of periods that have elapsed with no event.

One can comfortably accommodate this observation in the analysis we have shown so far.
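The reweighting in the derivation above can be sketched directly; the function names below are ours, and the mixture parameters are purely illustrative:

```python
import math

def reweight(p, lam, t):
    """Modified mixture weights after observing no event for t time units:
    q_i = p_i * exp(-lam_i * t) / sum_j p_j * exp(-lam_j * t)."""
    w = [pi * math.exp(-li * t) for pi, li in zip(p, lam)]
    total = sum(w)
    return [wi / total for wi in w]

def survival_next(p, lam, t, P):
    """P(t' > t + P | t' > t) for a hyper-exponential inter-arrival time:
    reweight the mixture, then evaluate its survival function at P."""
    q = reweight(p, lam, t)
    return sum(qi * math.exp(-li * P) for qi, li in zip(q, lam))
```

As the idle time $t$ grows, the weight shifts toward the slower (smaller $\lambda_i$) components, which is exactly why the expected utility estimate for the next period should be revised downward; for $M = 1$ the distribution is exponential and, by memorylessness, the reweighting changes nothing.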
Therefore, we can derive an alternative index policy using this adaptive approach; we denote this policy IPv in our numerical study (Section 2.7) and compare it to the original policy, which we denote IPf.

2.6 Explicit Analysis of Stochastic Arrivals

We now consider the case in which the true discrete-event stochastic process governs observations at the sensors. In this case, the values of observations at a sensor may differ (and are unknown ahead of time), but the mean data value at each sensor is known.

Let $\{t^i_k\}$ represent the times at which sensor $i$ records a new observation. The values of these observations are represented by $\{\nu^i_k\}$; the index $k$ here is the observation count.

We assume that a sensor-dependent Poisson process governs the arrivals of observations. Further, we assume that the observation values $\{\nu^i_k\}$ are independent and identically distributed for sensor $i$.

The value accumulated at source $i$ during the $j$th epoch (between sampling instants $j-1$ and $j$) is

\[
\upsilon_i(j) := \sum_{t^i_k : (j-1)P \leq t^i_k < jP} \nu^i_k\, e^{-\mu_i (jP - t^i_k)}.
\]

The state of the system at $t = (j+1)P$ is then

\[
\Upsilon_i(j+1) = \begin{cases} \alpha_i \Upsilon_i(j) + \upsilon_i(j+1), & \text{if not polled} \\ \upsilon_i(j+1), & \text{if polled.} \end{cases}
\]

The average expected reward can be defined as

\[
\limsup_{t \to \infty} \sum_{i=1}^{N} \frac{1}{t} \sum_{m=0}^{t} \mathbb{E}\bigl[r\bigl(\Upsilon_i(m), a_i(m)\bigr)\bigr],
\]

and we want to maximize this function subject to the cost/bandwidth constraint

\[
\limsup_{t \to \infty} \frac{1}{t} \sum_{i=1}^{N} \gamma_i\, \mathbb{E}[a_i(t)] = B.
\]

For the immediate discussion, we drop the index $i$; we can focus on any one sensor and reintroduce $i$ later, as needed.

\[
V_\rho(s) := \sup_{\{a(t)\},\, \Upsilon(0) = s} \mathbb{E}\left[\sum_{t=0}^{\infty} \rho^t v\bigl(\Upsilon(t), a(t)\bigr)\right]
\]

is the discounted value function, satisfying

\[
V_\rho(s) = \max\left\{ \gamma\lambda + \rho \int V_\rho(\alpha s + \upsilon)\, \varphi(d\upsilon),\; s + \rho \int V_\rho(\upsilon)\, \varphi(d\upsilon) \right\}. \tag{2.13}
\]

In the expressions above, $\varphi$ represents the distribution that governs $\upsilon(t)$ for all $t$.

Lemma 3. The solution to (2.13) has the same properties as the solution to (2.5):

1. It is unique and bounded;
2. It is continuous over $\rho \in (0,1)$ in the Lipschitz sense;
3. It increases monotonically and is convex.

Proof of Lemma 3

Proof.
Claim (i) can be derived from the standard theory of DTMCs, as stated before.

Claim (ii): Let Υ(t) and Υ′(t) be defined with identical processes for when the sensor records data and for the control (polled/not polled), with a(·) being optimal for Υ(·). Υ and Υ′ differ only in their initial states, which are s and s′, respectively. Then,

V_ρ(s′) − V_ρ(s) ≤ E[∑_{t=0}^{∞} ρ^t {V(Υ′(t), a(t)) − V(Υ(t), a(t))}]
               ≤ E[(1 − (αρ)^{t−}) / (1 − αρ)] (s′ − s),

where t− := min{t ≥ 0 : Υ(t) = Υ′(t)}; t− represents the instant at which we first poll this sensor. We can show that the value function is continuous as we did earlier.

To show claim (iii), we consider the processes {Υ(t)} and {Υ′(t)} generated by a common action sequence a(t). Similar to the proof of Lemma 2, we consider the supremum over all action sets, but after taking expectations on both sides. This establishes monotonicity. Convexity follows in a fashion similar to the deterministic case.

Now, the dynamic program considering average costs can be written as

V(s) + ξ = max{ γλ + ∫ V(αs + υ) ϕ(dυ),  s };

we can use the discounting approach from earlier and also make V(·) unique by forcing ∫ V(υ) ϕ(dυ) = 0.

We can show that V(·) is monotone and convex using pointwise limits. When f(·) is convex, so is s ↦ ∫ f(bs + y) ω(dy), for all b and probability measures ω on R.

We get

E[V(Υ(t))] + ξ = E[r(Υ(t), a(t))] + E[V(Υ(t+1))].

Using the above equation, we can then establish that:

a⋆(s) ∈ argmax_a { as + (1 − a)(γλ + ∫ V(αs + υ) ϕ(dυ)) },

with s ∈ S, using a line of reasoning similar to the deterministic case.

Then, through iteration, we can show that V is the unique solution to the dynamic programming equation by representing V(s) as follows. For passive s:

V(s) = max E_s[(γλ − ξ(λ))θ + α^θ s + ∑_{t=1}^{θ} α^{θ−t} υ(t)] − ξ(λ)   (2.14)
     = max E_s[(γλ − ξ(λ))θ + α^θ s + (1 − α^θ) υ⋆] − ξ(λ),   (2.15)

where θ is the time at which the sensor is polled for the first time. We take the maximum over all valid sequences of actions {υ(t)}, with ῡ := E[υ(j)] and υ⋆ := ῡ/(1 − α).

For active s, we have

V(s) = s − ξ(λ).   (2.16)

The second equality for V(s) when s is passive is a consequence of the Optional Stopping Theorem (due to Doob) [38].

Now, considering only the interesting cases where ξ(λ) > min(λγ, ῡ), the r.h.s. of (2.15) will be less than s − ξ(λ) for s > υ⋆. This observation implies that V(s) is smaller than the r.h.s. of (2.16), which suggests that s must have been active. Reasoning as we did in the deterministic (fluid-flow approximation) situation, we can take the state space to be [ῡ, υ⋆].

We can show that the optimal policy for polling sensors is a threshold policy exactly as in the deterministic case, with the appropriate change of definition for the sets D and Dc:

D  := {s ∈ S : λγ + ∫ V(αs + υ) ϕ(dυ) > s},
Dc := {s ∈ S : λγ + ∫ V(αs + υ) ϕ(dυ) ≤ s}.

Lemma 4. There is an index policy that solves the sensor selection problem with stochastic observations.

Proof of Lemma 4

Proof. Consider V(s) = max E_s[(γλ − ξ(λ))θ + α^θ s + (1 − α^θ) υ⋆] − ξ(λ). The maximum is over every valid policy and, consequently, over every threshold policy too. Pick some threshold, and let the initial condition be s ∈ [ῡ, υ⋆]. Now, consider a process that uses the threshold policy. Then θ is a random variable that does not depend on λ. Since ξ′(λ) < γ, we can infer that the expression on the r.h.s. monotonically increases with λ. This property holds when we maximize over all threshold-based policies. Let s⋆(λ) be the optimal threshold with λ as the Lagrange multiplier (or subsidy) for passivity.

Now, (2.15) and (2.16) will continue to hold for s = s⋆(λ).

Define F(λ, s) as

F(λ, s) := max E_s[(γλ − ξ(λ))θ + α^θ s + (1 − α^θ) υ⋆],  ∀s ∈ D,
F(λ, s) := s,  ∀s ∈ Dc.

Here the maximization is over all threshold policies. F(λ, s) is a convex, increasing function of s because V is convex and increasing in s. F(λ, s) also increases with λ. s⋆(λ) is a fixed point of F(λ, ·). The best action at s = ῡ is to be passive because ξ(λ) > ῡ; hence F(λ, ῡ) > ῡ, ∀λ.

Next, we note that it is optimal to be active when s = υ⋆. This gives us F(λ, υ⋆) = υ⋆.
The convex curve s ↦ F(λ, s) intersects the line y = s at precisely one point in [ῡ, υ⋆], and the intersection is at s⋆(λ), by definition. This point increases with λ because F(·, ·) does.

Hence we conclude that an index policy must exist.

Define τ as the first time ≥ 1 at which the state of an arm enters Dc, with Υ(0) = υ(0). This is the next polling time after 0. Then

Υ(τ) = ∑_{t=1}^{τ} α^{τ−t} υ(t).

We can restrict our attention to stationary Markovian policies because of the underlying dynamic programming formulation.

Let λ = g(s) be the index value at s. With λ = g(s), write τ(s) = E[τ], and let ξ(s) denote the optimal average reward. Using the standard theory for renewal-reward processes [62], we have

ξ(s) = [λγ(τ(s) − 1) + E[Υ(τ(s))]] / τ(s).

The definition of g(s) is that it is the subsidy needed to make an arm passive at s. Then we can note that s ∈ Dc, and the best action would be to poll the sensor. Using the definition of g(s), s = γ g(s) + E[V(αs + υ(1))], and therefore

γ g(s) = s − E[V(αs + υ(1))]
       = s − ∫ ϕ(dυ) E_{αs+υ}[(γ g(s) − ξ(s))θ + α^θ s + ∑_{t=0}^{θ} α^{θ−t} υ(t) − ξ(s)]
       = s − (γ g(s) − ξ(s)) ∫ ϕ(dυ) E_{αs+υ}[θ] − s ∫ ϕ(dυ) E_{αs+υ}[α^θ]
           − ∫ ϕ(dυ) E_{αs+υ}[∑_{t=0}^{θ} α^{θ−t} υ(t)] + ξ(s).

From the equation above, we can solve for g(s) by observing that all the expectations are computable. To solve for g(s), we could adopt the following computational procedure. Fix the threshold policy to use ŝ as the threshold. V should satisfy the following conditions:

V(s) = γλ − ξ + ∫ V(αs + υ) ϕ(dυ),  s < ŝ,   (2.17)
V(s) = s − ξ,  s > ŝ,   (2.18)
V(ῡ) = 0.   (2.19)

The index g(ŝ) must satisfy g(ŝ) = λ and

γλ + ∫ V(αŝ + υ) ϕ(dυ) − ŝ = 0,

which is derived from (2.18). We can write V(s), the unique solution to (2.17), (2.18), and (2.19), to make an explicit connection to λ. We then learn g(ŝ) through a series of stochastic approximations:

g_{m+1} = g_m − b(m)[γ g_m + V(g_m, αŝ + υ_m) − ŝ],

where V(g, ·) denotes the solution to (2.17)–(2.19) with λ = g, and the iteration drives g_m towards making (2.18) hold.

In this procedure, we assume that {υ_m} are random variables that are independent and identically distributed according to the distribution ϕ.
This assumption leads to significant computational demands, but these can be reduced by relative value iteration algorithms in which g_m is time-dependent [46].

Asymptotically, g_m converges to the index. This computational procedure is needed for each ŝ, but one could select a finite set of ŝ values and then interpolate as a further approximation strategy.

Computational challenges with the index policy: When we consider the fully stochastic nature of the sensing process, the computational complexity of the index policy is high, and this approach, even though it is sub-optimal, may not be practical for large sensor deployments. Consequently, we can use the index policy derived under the fluid-flow approximation as a replacement. We applied the index policy that we first derived to the fully stochastic scenario. Then, we compared our proposed index policy with the two other heuristics (greedy sensor selection and round-robin) that we have in our numerical evaluations (Section 2.7) and found that the first index policy does outperform the other heuristics. We therefore believe that the fluid-flow approximation preserves some of the essential problem characteristics and can yield satisfactory policies. We do not include this set of numerical evaluations in the next section because the results are similar to the other results we report.

2.7 Numerical Evaluation and Analysis

The index policies we propose are sub-optimal; the sub-optimality results from the decomposition of the stochastic dynamic program into separate problems, one per arm. On the other hand, we find that the index policies perform well in comparison to some other conceivable policies.
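For concreteness, the greedy and round-robin heuristics used as baselines in the numerical evaluations can be sketched as follows. This is our own minimal rendering, assuming per-sensor bandwidth costs and a total bandwidth budget per epoch; it is not the thesis's simulation code.

```python
def greedy_select(values, costs, budget):
    """Greedy baseline: poll sensors in decreasing order of currently
    accumulated value, skipping any sensor whose cost would exceed the
    remaining bandwidth budget."""
    chosen, spent = [], 0.0
    for i in sorted(range(len(values)), key=lambda i: -values[i]):
        if spent + costs[i] <= budget:
            chosen.append(i)
            spent += costs[i]
    return chosen

def round_robin_select(n, costs, budget, start):
    """Round-robin baseline: take sensors in cyclic order beginning at
    `start` until the bandwidth budget is exhausted, ignoring values."""
    chosen, spent = [], 0.0
    for k in range(n):
        i = (start + k) % n
        if spent + costs[i] <= budget:
            chosen.append(i)
            spent += costs[i]
        else:
            break
    return chosen
```

In a simulation loop, `start` would advance each epoch so that every sensor is eventually visited; the greedy rule needs no such state because it re-ranks sensors at every epoch.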
We compare the two proposed policies (labelled IPv and IPf) with two other policies:

• Greedy (GD): The greedy sensor selection strategy chooses the sensors that have the highest value at the time of polling (epoch) until it reaches the bandwidth limit.

• Round-robin (RR): The round-robin strategy selects sensors in turn until it reaches the bandwidth limit.

Our discussion here is restricted to the fluid-flow approximation (Section 2.4) since applying the four policies (IPv, IPf, GD, RR) to the completely stochastic model yields similar results. For our numerical evaluations, we simulated the problem environment in Python 2.7 on a PC with 8 GB of RAM and an Intel Core i5 CPU.

The greedy strategy chooses sensors based on their current accumulated reward (accumulated for each sensor since its last selection). The round-robin strategy selects sensors in turn and ignores accumulated values and other factors.

For the index policies (IPv and IPf), in general, we selected the arms with the highest indices for as long as the bandwidth constraint was not exceeded. However, the hyper-exponential distribution varies in IPv (unlike remaining fixed in IPf) based on the occurrence of events in each period of time (as discussed in Section 2.5).

To compare IPv and IPf with GD and RR, we carry out two types of experiments. In the first set of experiments, the bandwidth cost of each sensor/arm was kept identical (i.e., fixed); in the second, we selected these costs at random (i.e., variable).

The other parameters in these experiments are chosen as follows:

• The value of a data item (ν) at the sensor is selected from an exponential distribution with parameter 1.0.
Note that when we use the fluid-flow approximation, all observations at a particular sensor provide the same value ν_i at sensor i; we therefore use the described distribution to select this value, and the value of data may differ from one sensor to another.

• The rate at which the value of a data item decays (β) is chosen from a uniform distribution over [0.01, 0.99]. A different decay rate is selected for each sensor.

• The rate at which events arrive (µ) in each of the two distributions (in the hyper-exponential) is chosen from a uniform distribution over [0.01, 25]. Again, a different arrival rate is chosen for each sensor.

Workload intensity: We define a workload intensity metric as

∑_{i∈M} ν_i × µ_arrival × β_i,   (2.20)

where M is the set of all messages arriving in a simulation run, µ_arrival is the arrival rate of events, and ν_i and β_i are the value and decay rate of each message, respectively. We consider the decay rate as an approximation of the deadline of each message. We calculate this metric for each simulation run over the total number of time steps; it captures, to some extent, how demanding the workload is.
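The metric in (2.20) is a direct sum over the message set; a one-line transcription, assuming each message is represented as a (value, arrival_rate, decay_rate) triple:

```python
def workload_intensity(messages):
    """Workload intensity (2.20): sum over all messages of
    value * arrival_rate * decay_rate, where a higher decay rate
    stands in for a tighter deadline."""
    return sum(v * mu * beta for (v, mu, beta) in messages)
```

Because the metric grows with any of the three factors, two simulation runs with very different parameter mixes can land at the same intensity, which is what makes it a convenient single X-axis for the plots.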
Therefore, we report the performance of the policies in terms of the "average reward" (on the Y-axis) against the "workload intensity" (on the X-axis) when presenting the simulation results.

Table 2.1: Comparison of algorithms in the cases of fixed and varied bandwidth costs. The table shows the percentage of simulation runs wherein the index policy outperforms the other algorithms, as well as the average performance advantage.

                        Simulations when IPv/IPf win        Average performance advantage
#Bandits  Bandwidth   IPv/RR  IPf/RR  IPv/GD  IPf/GD     IPv/RR  IPf/RR  IPv/GD  IPf/GD

Bandits with fixed bandwidth costs:
4         2            98%     75%    100%    100%       12.1%    6.0%   15.6%   11.5%
8         4            99%     85%    100%    100%       14.2%    9.1%   17.1%   12.3%
16        8            99%     92%    100%    100%       15.6%   10.8%   17.3%   12.4%
32        16          100%     98%    100%    100%       16.1%   11.8%   16.8%   12.4%

Bandits with varied bandwidth costs:
4         2            89%     76%     97%     97%       11.0%    7.0%   22.1%   18.0%
8         4            99%     89%    100%    100%       14.5%   10.2%   23.9%   20.1%
16        8            99%     90%    100%    100%       16.2%   11.5%   25.7%   21.5%
32        16           99%     93%    100%    100%       15.9%   11.3%   26.1%   21.7%

2.7.1 Identical bandwidth costs

First, we consider the simple case of equal sensor polling costs (γ = 1 for all sensors).

We start with four bandit arms and a bandwidth limit of two (M = 2), which implies that we can pull two arms at any given epoch. We increase the number of bandits to 32, with a bandwidth limit of 16, to observe the performance of our algorithms at a slightly larger scale. We ran 1000 Monte Carlo trials for each policy and calculated the average reward attained by each strategy.

The experiments indicate that IPv and IPf dominantly outperform the other two algorithms (Figures 2.2, 2.3, 2.4, and 2.5). To be more specific, IPv and IPf always outperformed GD. In comparison with RR, both policies performed better in a majority of cases: IPv outperformed RR in almost all cases (98% to 100%) with a performance margin of 12.1% to 16.1%, and IPf outperformed RR in most cases (75% to 98%) with a performance margin of 6% to 11.8%.
In both comparisons, when RR did outperform IP, the performance difference was only 0.5%.

Figure 2.2: Performance of different policies with identical bandwidth costs (γ = 1 for all sensors), #Bandits = 4, Bandwidth Limit = 2. The proposed index policy typically outperforms greedy and round-robin sensor selection. The Y-axis is the Average Reward / Greedy Reward (hence the greedy policy always has the value 1) and the X-axis represents the Workload Intensity that we define as ∑_{i∈M} ν_i × µ_arrival × β_i; by this definition, the workload intensity increases with an increase in data value (ν_i), data arrival rate (µ_arrival), or decay rate (β_i) of messages.

Figure 2.3: As Figure 2.2, with #Bandits = 8 and Bandwidth Limit = 4.

Figure 2.4: As Figure 2.2, with #Bandits = 16 and Bandwidth Limit = 8.

Figure 2.5: As Figure 2.2, with #Bandits = 32 and Bandwidth Limit = 16.

2.7.2 Varied bandwidth costs

To further evaluate our index-based approach, we considered the case of different bandwidth costs among sensors/arms. We randomly assigned a γ value to each sensor/arm, selected from a uniform distribution between 0 and M/2, where M is the bandwidth limit. We did not see differences in performance relative to the earlier set of experiments with fixed bandwidth costs. Again, IPv and IPf almost always outperformed GD, with a higher performance margin than in the case of fixed bandwidth costs, as shown in Table 2.1. In comparison with RR, both policies performed better in the majority of cases: IPv outperformed RR 89% to 99% of the time with a performance margin of 11% to 15.9%, and IPf outperformed RR 76% to 93% of the time with a performance margin of 7% to 11.3%. In both comparisons, when RR did outperform IP, the performance difference was less than 1.5%. The performance details of IPv and IPf relative to the other policies are also tabulated (second half of Table 2.1).

Figure 2.6: Performance of different policies with randomly selected bandwidth costs, #Bandits = 4, Bandwidth Limit = 2. The index policy tends to outperform the greedy as well as the round-robin policy, except for a few cases where round-robin selection has a small advantage. Axes as in Figure 2.2.

Figure 2.7: As Figure 2.6, with #Bandits = 8 and Bandwidth Limit = 4.

2.7.3 IPv vs. IPf

Besides comparing our proposed policies to the other policies, we compared IPv and IPf in terms of total rewards accrued over simulation runs. As implied by the results in Sections 2.7.1 and 2.7.2, IPv outperformed IPf in all simulation setups (as shown in Figures 2.10 and 2.11). This observation suggests that IPv is the dominant policy.

2.7.4 Selecting arms using indices: top-k arms vs. knapsack packing

The index policy prioritizes the arms/sensors to be polled at an epoch. We can either select the arms with the highest indices until we exhaust the bandwidth, or treat the problem as a knapsack [17]. We believed that there might be some gain in selecting arms by solving the knapsack problem (even though it is an NP-hard problem). However, the numerical results suggest that the more straightforward approach of selecting the arms with the highest indices performs well, and the more elaborate approach seems unnecessary. Note that we use IPv for this comparison.

2.7.5 Insight from numerical evaluations

The main message that we want to communicate from our numerical evaluation is this: the index policy usually does better than the other policies, and when it does underperform another policy, the difference in performance is rather small. These results suggest that IPv is a policy that we can apply consistently and expect reasonable performance from.

2.8 Related Work

We now position our work relative to prior research. Our work is closely connected with the work on multi-armed bandits and, in particular, the model of restless bandits.

Figure 2.8: As Figure 2.6, with #Bandits = 16 and Bandwidth Limit = 8.

The model of restless bandits was presented by Whittle [69]. In the original formulation, there are many arms and pulling an arm results in a reward. The underlying probability distribution for rewards is unknown and may change with time (hence the restlessness). The goal is to identify a policy that maximizes the long-term average reward. This model generalizes the multi-armed bandit model that was initially studied by Gittins and receives a detailed presentation in a more recent monograph [21].
The solution approach is to define a priority/index computation that is efficient and helps determine the actions to take at each decision epoch.

Figure 2.9: As Figure 2.6, with #Bandits = 32 and Bandwidth Limit = 16.

Most such problems can be tackled using stochastic dynamic programming [50], but such direct approaches suffer from the curse of dimensionality, leading to impractical solutions for a large number of sensors/bandit arms. The index approach is suboptimal (although asymptotically near-optimal) but avoids brute-force dynamic programming by reducing the problem to a set of more straightforward problems, one per arm. Prior work has focused on arms with discrete state spaces but, as we shall see, in our formulation the state space is continuous (albeit closed), which requires that we establish the existence and suitability of priority/index policies.

Figure 2.10: Comparison of IPv and IPf in terms of total accrued rewards with identical bandwidth costs. IPv shows a consistent advantage over IPf in all simulation setups.

Kleinberg and Immorlica have suggested the model of recharging bandits [31], wherein an arm that has not been played accrues rewards over time according to some concave function. The assumption of concavity in how an arm's rewards grow between pulls of the arm allows for a polynomial-time approximation scheme.
The problem we study bears similarity to the work by Kleinberg and Immorlica, but the fact that a sensor's state may grow with time (as events are observed) and decay (as the data becomes stale) does not permit the same analysis as in the case of recharging bandits. We may view the problem that we have presented as one involving recharging-discharging bandits, with the recharge-only model being a particular case. This difference seems to require the scheme we have discussed, because the general restless bandit problem is PSPACE-hard even to approximate [44], and work by Guha, Munagala and Shi [23] presents approximation algorithms for some special cases. The work we present here is more general than the recharging bandits model but not as general as what was studied by Guha et al., and the results we present are relevant and interesting for a specific class of problems.

Figure 2.11: Comparison of IPv and IPf in terms of total accrued rewards with varied bandwidth costs. IPv shows a consistent advantage over IPf in all simulation setups.

Sensor scheduling has been modelled as a restless bandit problem but with different constraints and objectives [42]: to find specific elusive targets using imperfect sensors. Similarly, there has been an effort to use the restless bandit model for sensors with energy-harvesting considerations [27]. Such work did not address the issues of data value and time sensitivity, which also require some analysis of the continuous state space.

Figure 2.12: Comparing top-k selection with knapsack packing for the index policy: we notice no significant difference between the two approaches, so the top-k approach suffices. Axes as in Figure 2.2.

Iannello and Simeone [26] studied the problem of optimally scheduling stochastically generated, independent, and time-sensitive tasks, where a centralized controller assigns, at each time slot, a node to a server to execute a task. The setting of their work is similar to ours in that a centralized decision-making entity is assumed, task inter-arrival times are exponentially distributed, time-sensitivity is an explicit constraint, and the policy derived is a restless multi-armed bandit. One key difference, however, is that we consider continuous-time state dynamics, as opposed to the discrete-time state evolution model that the previous work considers. The consideration of continuous-parameter dynamics, and a continuous state space, poses significant analytical challenges that are otherwise not present in discrete parameter/state-space models. A model in which sensors reinitialize their state after some time steps was explored by Villar [66].

Optimal strategies for obtaining data from sensors with delay/freshness constraints have been studied in the context of a single sensor, to balance energy consumption with data freshness [19]. In these articles, a single sensor node was considered with the goal of maximizing a weighted function that accounted for sensor energy and data freshness.
The problem of choosing which sensors to poll when there are multiple available sensors requires a different approach.

Heuristics for data collection have been explored in the IoT setting [25, 30], but these methods approach the problem from a broader system-building perspective, and the algorithms come without proofs of optimality, approximation ratios, or competitive ratios.

The sensor scheduling problem has been tackled recently in the specific context of networked control systems, with the goal of state estimation. Weerakkody et al., for example, examine the multi-sensor scheduling problem to minimize mean-squared-error estimates [68]. Such work assumes specific knowledge of the underlying system state and its dynamics. Our formulation is relevant in situations where such system dynamics are not clearly defined and the value of sensor data is measured by some exogenous process such as data use. With similar information, Han et al. have also examined the stochastic sensor scheduling problem [24].

Clark et al. have used the general notion of adaptive delayed polling of sensors [16], but not with near-optimal schedules.

2.9 Conclusions

We expect that as the deployment of the Internet of Things progresses, we will need to manage a massive data volume, and not all the data can be gathered in one place for processing. When some centralization is needed for data processing, sensor data selection is the first step in a pipeline of tasks that includes data management and analysis to support large-scale reasoning and decision making in a variety of applications [7]. We have concentrated on this first step alone; the amount of data produced by sensors can be overwhelming, and we need the right strategies for handling this deluge.
Effective filtering of data in the early stages can reduce the pressure on later stages of IoT data processing systems that store and perform computation on the data.

Based on our analysis and evaluation, the index-based approach, which is computationally simpler than stochastic dynamic programming, is difficult to realize precisely when we consider the stochastic model for event arrivals and observation values. On the other hand, a fluid-flow approximation of the stochastic process may be sufficient to yield a simple and effective index policy for deciding which sensors to poll periodically. Although we poll sensors periodically, the set of sensors polled at each epoch may change and, depending on the underlying parameters, we find that there may be no simple structure to how this set of sensors changes from one epoch to the next.

We have chosen to model a polling approach and restricted our attention to the problem of deciding which sensors to poll at each epoch. Such polling systems are easy to implement, simplify system architecture, and can be suitable for satisfying specific data privacy requirements.

To conclude, our focus in this work is on the abstraction of polling sensors where we have some notion of delay sensitivity and limitations (such as the privacy sensitivity discussed in Section 2.1) that prevent a push-driven architecture, and we showed that our proposed approach is effective in such applications.

We have not answered the question of what an ideal polling period is. The answer to this question will also depend on the characteristics of what is being sensed. We could remove the restriction of periodic polling and allow for adaptive polling intervals. This problem needs further study, although we believe that such adaptivity will lead to a more fragile system architecture.

We want to emphasize, as we conclude, that sensors need not be physical sensors but could also represent feeds on services such as Twitter.
The approach we present can be adapted to a variety of applications.

Concerning potential future directions, we envision multiple avenues to explore:

• Can we generalize this work to other distributions that govern arrival times? One idea would be to come up with multiple solutions and switch among them appropriately based on the situation. For instance, one could explore deriving a similar solution (i.e., a priority index) under various assumptions concerning arrival distributions. On top of such solutions, we may be able to select the best solution depending on the most recent trend of arrival times.

• How can we design strategies when at least some of the model (e.g., the distributions) needs to be learnt online? This situation requires that one incorporates parameter estimation as part of the solution. One could investigate the feasibility of modelling the problem as an online learning problem over the environment. In this broad class of problems, agents aim to learn the true value of a parameter, often called the underlying state of the world. The state could represent a product, an opinion, a vote, or a quantity of interest in a sensor network. It is interesting to verify whether such an approach could be applied to our problem with slight modifications of the model. For instance, if we assume each sensor has more resources available (e.g., to communicate with other sensors), we may consider each sensor as an agent. In this case, each agent observes feedback about the underlying state (or the value of sensed data) at each period and communicates with its neighbours to augment its imperfect observations and dynamically learn about the environment. Since each agent aims to minimize its loss, comparing each agent's loss at the end of each polling epoch could be equivalent to a priority policy.

Chapter 3

Balancing Message Criticality and Timeliness in IoT Networks: A Q-Learning Approach

Summary.
We study the problem of balancing timeliness and criticality when gathering data from multiple sources using a two-level hierarchical approach. The devices that generate the data transmit them to a local hub. A central decision-maker then has to decide which local hubs to allocate bandwidth to, and the local hubs have to prioritize the messages they transmit when allowed to do so. Whereas an optimal policy does exist for this problem, such a policy would require global knowledge of the messages at each local hub, rendering such a scheme impractical. We propose a distributed reinforcement-learning-based approach that accounts for both the timeliness requirements and the criticality of messages. We evaluate our solution using a criticality-weighted deadline miss ratio as the performance metric. The performance analysis is done by simulating the behaviour of the proposed policy, as well as that of several natural policies, under a wide range of system conditions. The results show that the proposed policy outperforms all the other policies, except for the optimal but impractical policy, under the range of system conditions studied, and that in many cases it performs close to the optimal policy (3% to 12% lower performance, depending on the condition).

3.1 Introduction

With the proliferation of devices that can gather and transmit data about our world, we are confronted with the challenge of collecting and processing this data. More specifically, many such devices that we may deploy to observe and control aspects of our physical environment will use wireless communication links, and the available bandwidth for data transmission can easily be saturated. To operate within these limits, we can perform a significant amount of data processing on the device that observes the data, and we have to prioritize what data we choose to transmit. Some local processing is always possible, but certain decisions may require input from devices that are scattered quite widely in physical space.
Therefore, the data from many devices will need to be collected at some centralized (or semi-centralized) location for joint processing.

We discuss how we can prioritize and gather data from multiple devices when the data may have different timeliness and criticality requirements. In the system architecture we consider, data is collected in a two-level setup: devices may communicate with a local hub, and local hubs communicate with a global entity. Our approach uses observations of system behaviour and reinforcement learning to identify suitable scheduling decisions. We present the mathematical analysis and simulation-based evaluation of scheduling policies for this problem.

3.2 System Model

We consider a system with a hierarchical network architecture. At the lowest level are IoT devices such as sensors. These devices communicate with a local hub, which is a device that mediates communication with a central hub. We assume that the communication between the IoT devices and the local hub is not bandwidth-limited and that the local hubs use any suitable multiplexing mechanism to send and receive messages from the IoT devices. Local hubs collect messages that need to be transmitted to the central hub. They need to decide on how to prioritize these messages for transmission to the central hub.

We consider messages of unit length (equal-sized messages). We use the following notation to describe a message and its characteristics: message Mi has a relative deadline of di and a criticality value of κi. If a message is created at time t, then the absolute deadline for delivery of this message is t + di.

Figure 3.1: The hierarchical model, where unit-length messages with deadlines (di) and criticalities (κi) arrive at local hubs and are then transmitted to the central hub (based on a policy).

At the local hub, messages are assigned priorities from a fixed set {1, 2, . . .
, P}, with P representing the highest priority level.

At each scheduling epoch, the central hub selects one (or more) local hubs that can use the available bandwidth to transmit messages. A local hub will then transmit messages from its queues, starting with the highest-priority queue that is not empty and moving to lower-priority queues when higher-priority queues are empty.

Performance Metric. In the model that we have described, the system goal is to minimize the criticality-weighted deadline miss ratio, which is defined as follows. Let N be the total number of messages generated during the time interval of interest that also have deadlines within that time interval. Let xi be an indicator variable for whether message Mi missed its deadline or not. The criticality-weighted deadline miss ratio is

ρ := (∑_{i=1}^{N} xi κi) / (∑_{j=1}^{N} κj). (3.1)

With this performance metric, we can then state the problem that we want to solve.

Problem Statement. We want to determine a priority assignment policy for use at the local hubs and a bandwidth allocation policy at the central hub to minimize the criticality-weighted deadline miss ratio in the online scenario. We do not know what messages will arrive until they arrive, and decisions need to be made for each new message.

3.3 Optimal Offline Policy

The problem of message prioritization can be solved efficiently, and optimally, when messages are all of equal length and information about all the messages is available at a central location. This offline and centralized policy is not realistic for two reasons: (i) in practice, messages arrive and need to be prioritized as they arrive, and (ii) the cost of centralizing decision making imposes a high overhead on system operation. Nevertheless, we present the scheduling algorithm for this case because the performance of this approach is an upper bound on the performance that any online and decentralized approach can achieve.
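To make the metric in Equation (3.1) concrete, the following is a minimal sketch of computing the criticality-weighted deadline miss ratio; the function and field names are our own assumptions, not taken from the thesis implementation.

```python
# Sketch: computing the criticality-weighted deadline miss ratio (Eq. 3.1).
# Each message is a (missed, criticality) pair, where `missed` plays the
# role of the indicator x_i.

def criticality_weighted_miss_ratio(messages):
    """messages: iterable of (missed: bool, criticality: float) pairs."""
    total = sum(kappa for _, kappa in messages)
    missed = sum(kappa for x, kappa in messages if x)
    return missed / total if total > 0 else 0.0

# Example: three messages with criticalities 1, 2, 2; only the first misses.
# rho = 1 / 5 = 0.2
```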
We refer to this policy as OP.

Consider an offline setting, where there is a list of messages that need to be scheduled for transmission. Message Mi has criticality value κi, a relative deadline di, and unit length. The goal is to develop a scheduling policy (or algorithm) that schedules/selects messages to be processed while minimizing the metric ρ (Section 3.2).

Formally, we can state the problem as follows:

• Input: (d1, κ1), (d2, κ2), . . . , (dn, κn)

• Output: Schedule S = {S(1), S(2), . . . , S(i), . . . , S(n)}, where |S| ≤ n:
– S(i) = j means that message Mi is scheduled in time slot j;
– any message is scheduled at most once;
– if xi = 1 when Mi misses its deadline and xi = 0 otherwise, then ρ := (∑_{i=1}^{n} xi κi) / (∑_{j=1}^{n} κj) is minimized.

In the offline scenario, minimizing ρ is equivalent to maximizing the sum of criticalities of the messages that meet their deadlines.

The following greedy algorithm (Algorithm 1) solves this problem optimally.

Algorithm 1 Optimal Greedy Policy
1: sort messages in non-increasing order of criticality values: κ1 ≥ . . . ≥ κn
2: for t ← 1 to n do
3:   S(t) ← 0
4: for i ← 1 to n do
5:   if there is any free time slot left before di then
6:     schedule Mi in the latest free slot before di
7:   else
8:     skip message Mi

Theorem 1. Algorithm 1 maximizes the sum of criticalities of messages that meet their deadlines.

Proof. We proceed by induction. Let P(i) denote the claim that the schedule Si produced after the first i iterations is promising, i.e., that it can be extended to an optimal schedule using only messages from {i+1, . . . , n}.

• Base case: To see that P(0) holds, consider any optimal schedule Sopt. Clearly, Sopt extends the empty schedule using only messages from {1, . . . , n}.

• Induction step: Let 0 ≤ i < n and assume P(i) holds; we want to show P(i+1). By assumption, Si can be extended to some optimal schedule Sopt using only messages from {i+1, . . . , n}. Let Si+1 be the result of one more iteration through the loop, where message Mi+1 is considered.
We must prove that Si+1 continues to be promising; that is, the goal is to show that there is an optimal schedule that extends Si+1. We consider the following two cases:

– Case 1: Message Mi+1 cannot be scheduled, so Si+1 = Si. Since Sopt extends Si, we know that Sopt does not schedule message Mi+1. Therefore, Sopt extends Si+1 using only messages from {i+2, . . . , n}.

– Case 2: Message Mi+1 is scheduled by the algorithm, say at time t0 (so Si+1(t0) = i+1, and t0 is the latest free slot in Si that is ≤ di+1). Let P(S) denote the total criticality of the messages scheduled in S.

* Case 2-I: Message Mi+1 occurs in Sopt at some time t1 (where t1 may or may not be equal to t0). Then t1 ≤ t0 (because Sopt extends Si and t0 is as large as possible) and Sopt(t1) = i+1 = Si+1(t0). If t0 = t1, we are finished with this case, since Sopt extends Si+1 using only messages from {i+2, . . . , n}. Otherwise, we have t1 < t0. Say that Sopt(t0) = j ≠ i+1. Form S′opt by interchanging the values in slots t1 and t0 in Sopt. Thus, S′opt(t1) = Sopt(t0) = j, and S′opt(t0) = Sopt(t1) = i+1. The new schedule S′opt is feasible (since if j ≠ 0, we have moved message Mj to an earlier slot), and S′opt extends Si+1 using only messages from {i+2, . . . , n}. We also have P(Sopt) = P(S′opt), and therefore S′opt is also optimal.

* Case 2-II: Message Mi+1 does not occur in Sopt. Define a new schedule S′opt to be the same as Sopt except for time t0, where we define S′opt(t0) = i+1. Then S′opt is feasible and extends Si+1 using only messages from {i+2, . . . , n}. To finish the proof for this case, we must show that S′opt is optimal. If Sopt(t0) = 0, then we have P(S′opt) = P(Sopt) + κi+1 ≥ P(Sopt). Since Sopt is optimal, we must have P(S′opt) = P(Sopt), and S′opt is optimal. So say that Sopt(t0) = j, j > 0, j ≠ i+1. Recall that Sopt extends Si using only messages from {i+1, . . . , n}. So j > i+1, and hence κj ≤ κi+1. We have P(S′opt) = P(Sopt) + κi+1 − κj ≥ P(Sopt).
As above, this implies that S′opt is optimal. In both cases Si+1 is promising, which completes the induction and the proof.

Running Time. The initial sorting can be done in Θ(n log n) time in the worst case, and the remaining two loops take Θ(n) time in the worst case. Therefore, Algorithm 1 runs in Θ(n log n) time in the worst case.

Although this greedy algorithm is an optimal polynomial-time algorithm, it requires complete and centralized information about all messages in the system. This centralization would impose a high overhead, and therefore we seek decentralized solutions to the problem at hand.

Figure 3.2: The details of processing messages at local hubs. First, sensors detect events and send messages to local hubs. Second, messages are routed to the appropriate queues based on the priority assigned to each message. The highest-priority queue is emptied before messages in queues with lower priority levels are transmitted.

3.4 Using Reinforcement Learning in a Decentralized Policy

Given the two-level model described in Section 3.2, the central hub and local hubs have to make decisions on message transmission. Therefore, all hubs need policies for making such decisions. Here we elaborate on a policy that uses reinforcement learning to achieve near-optimal performance.

3.4.1 At Local Hubs

For each message, as shown in Figure 3.2, the local hub has to decide in which queue to place the message. We assume that there are P priority levels {1, 2, . . . , P}, each of which is associated with a queue. Messages in the highest-priority queue, associated with priority level P, are emptied first before messages at priority level P−1 are transmitted, and so on. The local hub needs to decide on an action a ∈ {1, 2, . . . , P} for each message, where a represents the queue the message is assigned to.

The state st of a local hub is the state of each of its queues at time t. We will use n(st) to represent the number of messages queued at time t.
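The queue discipline just described (P priority queues, drained highest-priority first) can be sketched as follows. This is our own minimal illustration, with assumed names, not the thesis code.

```python
from collections import deque

# Sketch of a local hub with P priority queues; messages in the queue for
# priority P are transmitted before any message at priority P-1, and so on.

class LocalHub:
    def __init__(self, num_priorities):
        self.queues = {p: deque() for p in range(1, num_priorities + 1)}

    def enqueue(self, message, priority):
        self.queues[priority].append(message)

    def n(self):
        # n(s_t): total number of queued messages.
        return sum(len(q) for q in self.queues.values())

    def transmit(self):
        # Drain the highest non-empty priority queue first (FIFO within a queue).
        for p in sorted(self.queues, reverse=True):
            if self.queues[p]:
                return self.queues[p].popleft()
        return None

hub = LocalHub(num_priorities=3)
hub.enqueue("m_low", 1)
hub.enqueue("m_high", 3)
print(hub.transmit())  # prints "m_high": priority 3 beats priority 1
```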
A decision made to assign a message to a queue changes the state of the system.

We can represent the value of a specific message M as a function of its deadline and criticality:

vM = (κM)^β (1/dM)^α. (3.2)

The message value is proportional to the criticality: the higher the criticality, the higher the value. The message value is also related to the relative deadline (the shorter the relative deadline, the higher the value).

α and β are parameters that help a system architect achieve a balance between criticality and timeliness. These may be adapted on a per-local-hub basis, but in this discussion, all local hubs use the same settings for these two parameters.

We define the cost of taking action at at time t for a new message M, when the system is in state st, as follows:

c(at, st) = (∑_{M∈st+1} w(M) v(M)) / n(st+1). (3.3)

We use st+1 to indicate the state we reach after taking action at in state st. w(M) is the weight associated with each message, which represents the absolute priority of M at the local hub. For the highest-priority message in the system (the message at the head of the highest-priority non-empty queue), w(M) = n(st) at state st, and w(M) for the lowest-priority message (at the end of the lowest-priority non-empty queue) is 1.

c(at, st) can be interpreted as the (weighted) average cost of a message queued at a local hub after taking action at.

Priority assignment. An incoming message is assigned a priority p∗ based on the immediately perceived cost and the future cost associated with the decision:

p∗ = argmin_{p∈{1,...,P}} E[c(st, a(p)) + γQ(st+1)]. (3.4)

Q(st) is the value function for the local hub in state st, and we use a(p) to denote the action of assigning priority p to the message.

Updating the value function.
As each local hub evolves with the arrival and priority assignment of new messages, the value function (i.e., Q) gets updated:

Q(st) = Q(st) + δ[c(st, at) + γQ(st+1) − Q(st)], (3.5)

where γ is the discounting factor and δ is the learning rate.

3.4.2 At the Central Hub

We assume that at the start of each scheduling epoch, the central hub gathers a snapshot of the state of the local hubs. This snapshot, for local hub L at time t, is the value Q(s^L_t) and the number of queued messages, n(s^L_t) (as explained in Section 3.4.1). This information is relatively small in size compared to the actual messages.

The central hub computes a weight for local hub L as follows:

wL(st) = Q(s^L_t) n(s^L_t) ∑_{Mi∈TL} κi, (3.6)

where TL is the set of messages that L can transmit in one time slot. If m messages can be transmitted in one time slot, then TL is the set of the m highest-priority messages at L. For the rest of this discussion, we assume that a local hub transmits only one message in a time slot, but our work can be extended to situations where multiple messages can be transmitted in the same time slot.

The central hub selects the local hub with the highest weight. We call this policy at the central hub the Value-Weighted Policy (VWP).

3.5 Alternative Policies at the Central Hub

Having described our proposed policy, based on reinforcement learning, we discuss other heuristics that we can use for comparison.

Recall that the performance goal is to reduce the criticality-weighted deadline miss ratio. To this end, we can consider four other decentralized policies. In each of these policies, the local hubs assign priorities to messages using reinforcement learning (Section 3.4.1), but the central hub uses a simple heuristic.

1. Deadline-Greedy (DG): The central hub selects a local hub using the earliest deadline among the highest-priority messages at each local hub. Ties can be broken arbitrarily.

2.
Criticality-Greedy (CG): The central hub selects a local hub using the highest criticality among the highest-priority messages at each local hub. Ties can be broken arbitrarily.

3. Criticality-Density-Greedy (CDG): We define the criticality density of message Mi as κi/di. The central hub selects the local hub that has the message with the highest criticality density.

4. Random (RA): The central hub selects a local hub at random.

A fifth alternative policy that we consider, which is entirely centralized, is the Global Random (GRA) policy. We assume that the central hub has information about all messages at all the local hubs and selects a message to be transmitted at random from a uniform distribution. This policy is impractical because it is centralized, but we mention it here as another policy to compare to our proposed policy.

3.6 Quantitative Evaluation

We evaluate the policies via simulation. We set up a system with eight local hubs and a central hub, and each local hub has three priority queues (q = 3) representing low, medium, and high priority levels. We assume each queue has a capacity of p = 50 messages. We model message arrivals according to a Poisson process with rate λ = 0.15.

We assume that the processing time to calculate the weight (wL, as defined in Section 3.4.2) for each local hub is negligible.

Figure 3.3: We use four evaluation scenarios based on varying the distribution of messages in terms of criticality and deadline values. Scenario V, not in the figure, is one where the deadline and criticality of each message are chosen uniformly at random from a range of values.

3.6.1 Experimental Parameters

There are four parameters involved in the system model. Two parameters are related to the value function (the Q function): the discounting factor (γ) and the learning rate (δ).
The other two are the local-hub balancing parameters for the deadline (α) and the criticality (β).

We assume, in our evaluation, that each local hub uses the same choice of parameter values to balance timeliness and criticality. For all four parameters, we assigned one of three values: low (0.1), medium (0.5), and high (0.9). This choice resulted in 3^4 = 81 possible settings, and we simulated each setting for 1000 time steps.

Figure 3.4: Range of criticality-weighted miss ratio (ρ) values for Scenario I. The box plots indicate the range of miss ratio values using the same data reported in Table 3.1. The proposed reinforcement learning approach (VWP) is surpassed only by the offline optimal policy.

3.6.2 Evaluation Scenarios

We have identified an optimal offline policy (Section 3.3) that provides an upper bound on the performance of any of the decentralized policies. We compare the proposed policy (Sections 3.4.1 and 3.4.2) with the five alternative heuristics (Section 3.5) as well as the optimal offline policy. To understand the performance differences, we consider five different scenarios concerning message deadlines and message criticalities:

• Scenario I: The majority of messages have higher criticality values and shorter deadlines.

• Scenario II: The majority of messages have high criticality values. On the other hand, those messages with lower criticality values have shorter deadlines.

• Scenario III: The majority of the messages have low criticality values. On the other hand, those messages with high criticality values have shorter deadlines.

• Scenario IV: The majority of the messages have low criticality values. On the other hand, those messages with low criticality values have shorter deadlines.

• Scenario V: For each message, the deadline and criticality values are chosen from a uniform distribution. We chose deadlines of messages from the uniform distribution Ud(1, 50), and criticality values from the uniform distribution Uκ(1, 25).

Figure 3.5: Range of criticality-weighted miss ratio (ρ) values for Scenario II. The box plots indicate the range of miss ratio values using the same data reported in Table 3.1. The proposed reinforcement learning approach (VWP) is surpassed only by the offline optimal policy.

Figure 3.6: Range of criticality-weighted miss ratio (ρ) values for Scenario III. The box plots indicate the range of miss ratio values using the same data reported in Table 3.1. The proposed reinforcement learning approach (VWP) is surpassed only by the offline optimal policy.

In these scenarios, we assumed dshort = 25 as the shorter deadline and dlong = 100 as the longer deadline. Similarly, we considered κlow = 1 and κhigh = 2 as the lower and higher criticality values, respectively. In Scenarios I to IV, we used a triangle distribution to generate 75% and 25% of the messages from the specified majority and minority groups, respectively.

Figure 3.7: Range of criticality-weighted miss ratio (ρ) values for Scenario IV. The box plots indicate the range of miss ratio values using the same data reported in Table 3.1. The proposed reinforcement learning approach (VWP) is surpassed only by the offline optimal policy.

3.6.3 Results

Our observations, based on the results from the simulations (Figures 3.4, 3.5, 3.6, 3.7, and 3.8, and Table 3.1), can be broken down across the five scenarios we considered. In each scenario and for each policy, the aggregate results of the 81 settings are considered for the evaluations.

Figure 3.8: Range of criticality-weighted miss ratio (ρ) values for Scenario V. The box plots indicate the range of miss ratio values using the same data reported in Table 3.1. The proposed reinforcement learning approach (VWP) is surpassed only by the offline optimal policy.

Across all scenarios, we find that the proposed policy performs competitively when compared with the optimal performance and outperforms all other policies. The proposed approach achieves within 88% to 97% of the upper bound on performance.

We note that in Scenario II, the GRA and RA policies that pick local hubs uniformly at random perform relatively well because a majority of messages are of high criticality and have extended deadlines, which means that selecting messages at random also leads to reasonable performance.

Table 3.1: The average criticality-weighted miss ratio (ρ) for the policies in each of the scenarios. The proposed policy (VWP) consistently performs as the second-best policy after the optimal policy.

Scenario      OP      VWP     CDG     DG      CG      RA      GRA
Scenario I    0.1046  0.2250  0.2679  0.7699  0.6840  0.5762  0.3239
Scenario II   0.1148  0.1475  0.1810  0.7004  0.8452  0.2972  0.1879
Scenario III  0.1460  0.2263  0.2467  0.6505  0.8094  0.4079  0.2938
Scenario IV   0.1443  0.1865  0.2559  0.8029  0.5943  0.4820  0.2680
Scenario V    0.0413  0.1488  0.2565  0.7171  0.6932  0.3721  0.2725

In Scenario III, CDG has a performance similar to that of our proposed policy. This result may reflect the fact that most low-criticality messages have an extended deadline, so it may be useful to prioritize messages with a high criticality density.

In Scenario IV, our proposed policy is significantly better than most other policies because there are more low-criticality messages with short deadlines; reinforcement learning adapts to this scenario, but the other heuristics do not.

3.7 Related Work

Our research brings together multiple areas, such as reinforcement learning (RL), performance analysis, and job scheduling.
Therefore, this related work section focuses mostly on the area of scheduling and its relation to RL techniques, especially in the performance analysis of distributed systems applications.

In distributed environments such as grids and clouds, job scheduling is considered an NP-hard problem given the problem setup [64]. Therefore, the optimization targets revolve around metrics such as makespan and load balance.

Lucas-Estan and Gozalvez have investigated load balancing for industrial IoT networks, but without specific attention to messages with deadlines and criticalities [36]. Maguluri et al. [37] assumed unknown job sizes with the same criticality levels and optimized throughput using a load balancing/scheduling algorithm. Shi et al. [55] developed an algorithm for provisioning and scheduling jobs under deadline constraints to address unpredictable issues in large-scale scientific computing. Seno et al. have developed soft real-time scheduling algorithms for industrial wireless networks with timeliness constraints alone [54]. Our work and these efforts relate to timing constraints, but we focus on striking a balance between criticality and overall timeliness in a distributed IoT-like setup.

Tordsson et al. proposed a scheduling and resource allocation method for cloud jobs based on particle swarm optimization, taking the scheduling deadline and scheduling budget into account [63]. Classic heuristic optimization techniques, such as parallel genetic algorithms [12, 56] and particle swarm optimization [28, 72], have often been employed to solve job scheduling problems in cloud computing. Although there has been a large amount of research on job scheduling in distributed computing environments, such work does not directly apply to the message scheduling problem in hierarchical networks.

The performance analysis of infrastructure-as-a-service (IaaS) cloud platforms has been studied extensively under various configurations and use cases.
Although the environment that we target is different regarding resource constraints, we share similarities, such as preserving a service level agreement (SLA), which we capture as the criticality-weighted miss ratio (Section 3.2). For instance, Salah et al. proposed an analytic model based on Markov chains to predict the number of cloud instances or VMs needed to satisfy a given SLA performance requirement, such as response time, throughput, or request loss probability [52]. Khazaei et al. proposed a general analytic model for end-to-end performance analysis of a cloud service [29]. They illustrated their approach using an IaaS cloud with three pools of servers: hot, warm, and cold, using service availability and provisioning response delays as the key QoS metrics. The proposed approach reduces the complexity of performance analysis of cloud centers by dividing the overall model into sub-models and then obtaining the overall solution by iterating over the individual sub-model solutions. Our work in this chapter is similar to that of Khazaei et al. [29] in the sense of having a hierarchy such that local hubs partially handle the prioritization of messages.

Reinforcement learning has been used for scheduling in several applications of distributed computing. Peng et al. proposed an effective RL-based scheduler scheme for cloud computing under SLA constraints. Quan et al. [47] developed a two-layered RL method for task offloading. In this method, the first layer is in charge of selecting the appropriate cluster of machines, and the second layer selects a physical machine to execute the task. Chang et al. [13] tackled the data forwarding problem in Underwater Wireless Sensor Networks (UWSN) using an RL-based method that factors in the challenges of timeliness and energy constraints.

The model we have studied is different from the work on mixed-criticality scheduling that has been studied in the context of traditional real-time systems [11].
In mixed-criticality real-time scheduling problems, the goal is to ensure that high-criticality jobs always meet their deadlines; the criticality-weighted deadline miss ratio is not the metric of importance. To the best of our knowledge, ours is the first attempt at applying reinforcement learning to the soft real-time scheduling problem with message deadlines and criticalities, with the aim of minimizing the criticality-weighted deadline miss ratio.

3.8 Conclusions and Future Work

We have demonstrated that a reinforcement learning approach to message scheduling in a hierarchical system with many nodes can help us strike a balance between timeliness and criticality. The approach we have proposed outperforms many other heuristics that one could use in this context. We believe that these ideas can be of value in the context of the Internet of Things, with particular relevance to factory automation and medical systems.

We have tackled this problem in a specific setting where all messages are of equal (or unit) length. We have shown that our solution has near-optimal performance by determining an upper bound using an optimal offline algorithm. The optimal offline algorithm for the problem we studied is a greedy centralized algorithm, which is not practical because the centralization comes with significant overhead.

As a future direction, one would like to understand the problem when messages can be of varying lengths. This change to the problem is significant because the offline problem (when we know all messages ahead of time) becomes NP-hard, and can be solved by a pseudo-polynomial-time dynamic programming formulation.

Chapter 4

Handling the Message Criticality vs. Timeliness Tradeoff in Complex IoT Environments: A Deep RL Approach

Summary. We study the problem of handling the timeliness and criticality trade-off when gathering data from multiple resources in complex environments.
We use the term “complex” for environments where data resources (e.g., sensors) may be correlated in time or space. In IoT environments, where several sensors transmit data packets with varying criticality and timeliness, the rate of data collection could be limited due to associated costs (e.g., bandwidth limitations and energy considerations). Besides, environment complexity regarding data generation could impose additional challenges in balancing criticality and timeliness when gathering data. For instance, when the data packets (in terms of either criticality or timeliness) of two or more sensors are correlated, or there exists temporal dependency among sensors, incorporating such patterns can pose challenges to trivial policies for data gathering. Motivated by the success of the Asynchronous Advantage Actor-Critic (A3C) approach, we first mapped vanilla A3C onto our problem to compare its performance, in terms of the criticality-weighted deadline miss ratio, to the considered baselines in multiple scenarios. We observed degradation of the A3C performance in complex scenarios. Therefore, we modified the A3C network by embedding Long Short-Term Memory (LSTM) to improve performance in cases where vanilla A3C could not capture repeating patterns in data streams. Simulation results show that the modified A3C reduces the criticality-weighted deadline miss ratio from 0.3 to 0.19.

4.1 Introduction

Connected devices are part of IoT environments in which every device talks to other related devices - by gathering and transmitting data - to communicate important/critical sensor data to interested parties in a timely manner for further usage. With the enormous amounts of data being generated, selecting only important/critical/relevant data for timely use remains a challenge. Besides, environment complexity regarding data generation could impose additional challenges in balancing criticality and timeliness when gathering data.
We consider an environment to be “complex” when:

• correlation exists between the arrivals of events for two or more data resources/sensors;

• temporal dependence exists among sensors due to the physical arrangement of devices;

• correlation exists between the (sensors’) data.

For instance, when the data, or the arrivals of data, for two or more sensors are correlated, or there exists temporal dependency among sensors, incorporating such patterns can pose challenges to trivial policies for data gathering. Examples of environments with such attributes include a network of sensors for capturing marine temperature, where there is temporal dependence among data [33], or intelligent buildings, where deployed sensors generate data that is temporally and spatially dependent [14].

Capturing patterns such as temporal dependency or correlation in the data arrivals of sensors could help to collect data more efficiently with respect to the desired goal. For example, consider a network of four sensors (SA, SB, SC, and SD) where, at each time step, we can collect data from only one sensor. The goal is to collect data packets in such a way that we have the highest accumulated criticality value over a time horizon, where each data packet is characterized by a pair of criticality and deadline values (e.g., the data of sensor Si is referred to as (Cr, d)). As an example (shown in Table 4.1 and Table 4.2), imagine that there is a temporal dependency between sensors SA and SB due to their close locations. If the dependency mentioned above is not captured (detailed in Table 4.1), then at time step t we have data from sensors SA and SC, where the data at SC has a higher criticality value and is deemed the better choice for transmission despite having a larger deadline value compared to SA. However, the other data packets will have expired by the time we want to collect data again at step t+1, and the total criticality value collected at the end of t = 4 would be 8 (equivalent to a criticality-weighted deadline miss ratio of (13−8)/13 = 5/13 = 0.38).
Alternatively, as shown in Table 4.2, by knowing and capturing the temporal dependency between sensors SA and SB, one could transmit the data at SA, since a data packet will arrive for SB soon after SA. So, the total amount of criticality for the transmitted data packets from SA and SB can be as large as the criticality of the data at SC (if it were selected instead), and we may still be able to collect the data of SC if it has not yet expired. In this case, the total criticality value collected at the end of t = 4 would be 11 (equivalent to a criticality-weighted deadline miss ratio of (13−11)/13 = 2/13 = 0.15).

Table 4.1: Example of sensor selection in a 4-step time window, when the dependency cannot be captured. In this case, there is a temporal dependency for the arrival of data packets. The available data packet at sensor i is shown as Si = (cr, d). As the table shows, the total accumulated criticality at the end of time step t = 4 is 8.

Time step | Currently available data                            | Selected (sensor) data | Total accumulated criticality
t = 1     | SA: (3,1), SB: (-), SC: (5,5), SD: (-)              | SC: (5,5)              | 5
t = 2     | SA: (-), SB: (3,2), SC: (-), SD: (2,1)              | SB: (3,2)              | 5 + 3 = 8
t = 3     | No available data (all expired and no new arrivals) | -                      | 8
t = 4     | No available data (all expired and no new arrivals) | -                      | 8

Table 4.2: Example of sensor selection in a 4-step time window, when the dependency can be captured. In this case, there is a temporal dependency for the arrival of data packets. The available data packet at sensor i is shown as Si = (cr, d). As the table shows, the total accumulated criticality at the end of time step t = 4 is 11.

Time step | Currently available data                            | Selected (sensor) data | Total accumulated criticality
t = 1     | SA: (3,1), SB: (-), SC: (5,5), SD: (-)              | SA: (3,1)              | 3
t = 2     | SA: (-), SB: (3,2), SC: (5,4), SD: (2,1)            | SB: (3,2)              | 3 + 3 = 6
t = 3     | SA: (-), SB: (-), SC: (5,4), SD: (-)                | SC: (5,3)              | 3 + 3 + 5 = 11
t = 4     | No available data (all expired and no new arrivals) | -                      | 11

We discuss how we can prioritize and gather data from multiple devices when the data may have different timeliness and criticality requirements - especially in complex environments. We use the term “complex” to refer to scenarios where the sensors may be correlated in time or space.
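The bookkeeping in Tables 4.1 and 4.2 can be reproduced with a few lines of arithmetic; this is our own sketch, with the schedules taken directly from the tables.

```python
# Sketch: verifying the accumulated criticality and miss ratios for the
# schedules in Tables 4.1 and 4.2. Criticalities: SA=3, SB=3, SC=5, SD=2.
total_criticality = 3 + 3 + 5 + 2  # = 13, summed over all generated packets

# Without capturing the SA->SB dependency (Table 4.1): collect SC, then SB.
collected_naive = 5 + 3  # = 8
rho_naive = (total_criticality - collected_naive) / total_criticality

# Capturing the dependency (Table 4.2): collect SA, SB, and then SC.
collected_aware = 3 + 3 + 5  # = 11
rho_aware = (total_criticality - collected_aware) / total_criticality

print(round(rho_naive, 2), round(rho_aware, 2))  # 0.38 0.15
```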
Such correlations make sensor polling decisions harder. Given the challenges in capturing dependencies among sensors, and motivated by the success of Deep Reinforcement Learning (DRL) in policy estimation, we explore leveraging DRL techniques to tackle our problem. DRL techniques have been shown to outperform alternative methods (e.g., traditional Q-learning) in handling large state spaces, which is a requirement for deploying large IoT networks. DRL techniques can also handle both continuous and discrete state spaces. We propose an approach based on the Asynchronous Advantage Actor-Critic (A3C) [39] that improves the network structure so that it captures the likely case of temporal dependency and correlation within data streams. We achieve this improvement by adding an LSTM layer that considers several previous states (rather than only one state) to learn recurring patterns within data. Access to previous states amounts to embedding memory into the model. Besides the addition of memory to the model, the A3C itself offers benefits: it outperforms some other RL alternatives (e.g., Q-learning methods based on Q-tables and DQN) with respect to resource requirements and time performance [39]. We provide a formal presentation of the problem as well as a simulation-based evaluation of scheduling policies for this problem.

4.1.1 Context

The Internet of Things (IoT) refers to the vast number of things (i.e., electronics-infused devices) connected to the internet, which act as sensors in their hosting environments, generating massive volumes of data. In such settings, everything from data acquisition to processing and analysis can leverage Machine Learning (ML) techniques to preserve efficiency and performance. As all of these steps (i.e., data acquisition, processing, and analysis) mostly involve decision making of some sort, ML techniques help in making informed decisions by capturing patterns in data and sensor behaviour.
In a sense, the integration of ML into the IoT world would transform simple sensor-actuator devices into ambient intelligent devices. In different application settings, such as factory automation, hospitals, and other environments [48], the notions of message criticality and timeliness co-exist. Imagine the case of factory automation, where sensory messages that require immediate action compete with messages that are critical yet less urgent (i.e., needing no instant action). In such cases, there needs to be a decision-maker that acts sufficiently well with respect to the desired objective function.

While the ultimate goal is to deploy applications that can identify changes such as the one described above in the environment and appropriately adapt to them, sometimes such changes occur in more challenging ways:

• There could be cases where correlation exists between the arrival of events for two or more sensors. For instance, in a factory, temperature and pressure sensors may generate data at about the same time, unlike optical sensors.

• Furthermore, there could be cases where temporal dependence exists among sensors due to the physical arrangement of devices. For instance, imagine two optical sensors (sensor A and sensor B) installed in separate locations in a factory. Sensors A and B may observe the same event, but at different points in time.

• It is more challenging to capture the correlation among sensor data. In the context of factory automation, there could be cases where the criticality/timeliness of messages for some sensors correlates with that of other sensors. This situation may arise because of, for example, the related functionalities of the operating devices.
For example, two temperature sensors installed in the same factory warehouse sense correlated temperature values, whereas their sensed data may not correlate with the temperature data of sensors in other warehouses.

From a system architecture perspective, we consider a network of devices (i.e., sensors) that sense events in the environment. There is a central unit managing the transmission of data from the sensors by selecting one sensor at a time. Sensors communicate their sensed data to the central unit upon selection. The system that we study deals with the challenge of possessing minimal resources (e.g., in terms of energy and memory), a likely setting for many networks [15, 18, 67], which emphasizes the importance of having an effective decision-making policy. For efficiency of energy and memory consumption, sensors hold a limited number of messages. Such an architecture can be extended to create clustered networks with multiple hierarchy levels. For instance, in a more extensive network, each central unit can operate as a relay device, as in the model described by Rashtian et al. [48]. In the scope of our work here, we consider a central unit responsible for the decision during each decision-making epoch.

4.1.2 Contributions

We propose and evaluate a scheduling mechanism for the described architecture, where the environment adds complexity by causing correlations, and arrival dependencies, among the messages that need to be transmitted. We start by exploring the applicability of the A3C method [39], a successful Deep Reinforcement Learning method, and then propose our approach by improving upon it.
The contributions of this work are fourfold:

• Mapping the A3C approach to our problem and establishing a baseline method that shows consistent performance in less complex scenarios.

• Showing that the A3C performs no better than the greedy baselines in complex scenarios.

• Embedding memory into the network of the vanilla A3C to improve performance in complex scenarios with strong dependencies.

• Showing, based on the simulations, that the modification to A3C does not negatively affect performance in the other scenarios, where the vanilla A3C had already performed well. This observation confirms the advantage of the proposed solution over the vanilla A3C in all studied scenarios.

4.2 System Model

We consider a system with a centralized architecture, as shown in Figure 4.1. The IoT devices (i.e., sensors) communicate with a central unit. Sensors sense events in the environment and capture them as messages. The central unit transmits one message at a time from one of the sensors. We consider messages of equal length arriving at the sensors. To limit energy consumption and memory costs, we assume each sensor holds one data message; sensors have enough resources to maintain only one message at a time. We use the following notation to describe a message and its characteristics: message Mi has an applicable deadline di and criticality κi. If a message arrives at time t, then the absolute deadline for the delivery of this message is t + di. Until the message reaches its expiration time, it provides its highest value if selected. If the message is selected after its expiry, it provides a discounted value (as will be discussed with Equation 4.14).

Figure 4.1: The system model, where the central unit selects one sensor at a time to transmit data. Sensors hold a message with deadline (di) and criticality (κi). The model makes no assumptions about the arrival rate of events sensed by the sensors.
If a new message arrives at a sensor that already holds a message, the older message is replaced by the new one. At each scheduling epoch, the central unit selects one sensor to use the available bandwidth to transmit its associated message. No assumptions are made about the event arrival rate.

4.2.1 Performance Metric

In the model that we have described, the system goal is to minimize the criticality-weighted deadline miss ratio, defined as follows. Let N be the total number of messages generated during the time interval of interest that also have deadlines within that time interval. Let x_i be an indicator variable for whether message Mi missed its deadline. The criticality-weighted deadline miss ratio is

ρ := (∑_{i=1}^{N} x_i κ_i) / (∑_{j=1}^{N} κ_j). (4.1)

With this performance metric, we can define the problem that we want to tackle.

4.2.2 Problem Statement

The problem that we want to tackle is to determine an efficient bandwidth allocation policy at the central unit that minimizes the criticality-weighted deadline miss ratio over a finite time interval, as optimal policies may be intractable:

min ρ := (∑_{i=1}^{N} x_i κ_i) / (∑_{j=1}^{N} κ_j). (4.2)

We do not know what messages will arrive until they arrive, and decisions regarding the selection of messages need to be made at each scheduling epoch.

4.3 Background

We have chosen to take advantage of Reinforcement Learning (RL) to tackle our problem. We provide a brief background on RL techniques in general and on A3C, and later discuss our proposed method.

4.3.1 Overview of Deep Reinforcement Learning

In this section, before discussing the A3C network, we first introduce the basic concepts of RL and Deep Learning (DL), on which Deep Reinforcement Learning (DRL) is built.

Reinforcement Learning (RL)

RL is a class of algorithms in machine learning that can achieve optimal control of a Markov Decision Process (MDP) [32, 58, 70]. There are generally two entities in RL: an agent and an environment.
The environment evolves stochastically within a state space over time. The agent operates as the action executor and interacts with the environment. When the agent acts in a particular state, the environment generates a reward or penalty to indicate how good the action was. The agent learns from these responses over time. A policy determines the strategy by which an agent takes an action in a state. The agent's task is to learn from the actions it has already taken so that, in the future, it takes actions that are optimal with respect to the value function V^π(s_0). We define the value function as the expected reward from actions taken by a policy π over a finite time horizon:

V^π(s_0) = E_{τ_{s_0} ∼ π}[R_total(τ_{s_0})], (4.3)

where τ_{s_0} denotes a chain of states visited by adopting policy π starting from s_0, and R_total is the reward accumulated by traversing that sequence of states.

Apart from the value function, another important function is the Q function Q^π(s_0, a_0), which is the expected reward for taking action a_0 in state s_0 and thereafter following policy π. When policy π is the optimal policy π*, the value function and Q function are denoted by V*(s) and Q*(s, a), respectively. Note that V*(s) = max_a Q*(s, a). If the Q functions Q*(s, a), a ∈ A, are given, the optimal policy is easily found as π* = argmax_a Q*(s, a). The Bellman optimality equations usually help in learning the value functions or Q functions. Taking a discounted MDP with discount factor γ as an example, the Bellman optimality equations for the value function and the Q function are

V*(s_t) = max_{a_t} [ r_{t+1} + γ ∑_{s_{t+1}} P(s_{t+1} | s_t, a_t) V*(s_{t+1}) ], (4.4)

and

Q*(s_t, a_t) = r_{t+1} + γ ∑_{s_{t+1}} P(s_{t+1} | s_t, a_t) max_{a_{t+1}} Q*(s_{t+1}, a_{t+1}), (4.5)

respectively. The Bellman equations relate the value/Q functions of the current state to those of the next state.

Usually, a large amount of memory is required to store the value functions and Q functions.
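Before turning to the storage problem, the Bellman optimality equation (4.4) can be made concrete with a small value-iteration sketch. The two-state MDP below is invented purely for illustration:

```python
gamma = 0.9
# Transition model: P[s][a] is a list of (probability, next state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

V = {0: 0.0, 1: 0.0}
for _ in range(500):  # repeatedly apply the Bellman optimality operator (Eq. 4.4)
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s])
         for s in P}

# Staying in state 1 on action 1 earns 2 per step: V*(1) = 2 / (1 - gamma) = 20.
print(round(V[1], 3))  # 20.0
```

Each sweep replaces V(s) by the best one-step lookahead value, exactly the right-hand side of Equation 4.4; the iteration contracts towards V*.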
In some cases, when only small, finite state sets are involved, it is possible to store these functions in tables or arrays; this is called the tabular method. However, in most real-world problems, the state sets are large, sometimes infinite, which makes it impossible to store the value functions or Q functions as tables. Trial-and-error interaction with the environment then becomes a hard way to learn the environment dynamics, owing to formidable computational complexity and storage requirements; even when the dynamics can be learned, doing so consumes massive computing resources. In this case, one can approximate functions of RL, such as Q functions or policy functions, with a smaller set of parameters through the application of DL. The combination of RL and DL results in the more powerful DRL.

Deep Learning (DL)

DL refers to a family of machine learning algorithms that leverage artificial neural networks (ANNs) to learn patterns from large amounts of data. It performs well in tasks such as regression and classification. The weights and biases of every node in a neural network (NN) are the parameters of the NN. Usually, a neural network with two or more hidden layers is called a Deep Neural Network (DNN). Deep learning uses a loss function L(θ) = g(Ŷ(θ), Y), a function of the network output Ŷ(θ) and the desired output Y. The loss function evaluates how well the NN (with the parameters learned so far) models the given data (i.e., Y = f(X)). Depending on the task, various loss functions can be used. For instance, standard regression loss functions include Mean Square Error (MSE), Mean Absolute Error (MAE), and Mean Bias Error (MBE); for classification, loss functions such as cross-entropy loss and Support Vector Machine (SVM) loss perform well.
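As a minimal illustration of the regression losses just listed, here is Mean Square Error on made-up numbers:

```python
def mse(y_hat, y):
    """Mean Square Error between predictions y_hat and targets y."""
    return sum((a - b) ** 2 for a, b in zip(y_hat, y)) / len(y)

# Errors of -0.5, 0.5, and 0.0 give (0.25 + 0.25 + 0) / 3.
print(round(mse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0]), 4))  # 0.1667
```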
Gradient descent methods are then used to update the parameters θ of the NN and consequently minimize the loss function. Given a loss function L(θ), the parameters are updated using its gradient:

∇_θ L(θ) = ∂L(θ)/∂θ. (4.6)

Gradient descent starts from an initial point θ_0. As input data are fed to the NN, the average loss over the input data is calculated and used to minimize L(θ) by taking a step along the descent direction, i.e.,

θ ← θ − α ∇_θ L(θ), (4.7)

where α is a hyper-parameter called the step size, which indicates how fast the parameter values move in the optimal direction. The above process is repeated iteratively, feeding more data to the NN, until convergence.

Deep Reinforcement Learning (DRL)

As discussed earlier, DRL refers to a family of methods that combine RL and DL to approximate value functions, Q functions, or both via a deep NN. In general, DRL approaches fall into two main groups: value-based and policy gradient.

In value-based DRL methods, the states s_t ∈ S or state-action pairs (s_t, a_t) ∈ S × A are the inputs of the NN, while the Q functions Q^π(s_t, a_t) or value functions V^π(s_t) are approximated by the parameters θ of the NN. The NN returns the approximated Q functions or value functions for the input states or state-action pairs. There can be a single output neuron or multiple output neurons. In the former case, the output is either V^π(s_t) or Q^π(s_t, a_t), corresponding to the input s_t or (s_t, a_t). In the latter case, the outputs are the Q functions for state s_t combined with every action, i.e., Q^π(s_t, a_1), ..., Q^π(s_t, a_|A|).

In policy gradient methods, the NN directly approximates a policy as a function of the state, i.e., π_θ(s). The states are inputs to the NN, while the policy π is approximated by the parameters θ of the NN as π_θ.
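Both families of methods rely on the update rule of Equation 4.7. In miniature, on a one-parameter quadratic loss L(θ) = (θ − 3)², whose gradient is 2(θ − 3) (numbers chosen only for illustration):

```python
def grad(theta):
    # dL/dtheta for L(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta, alpha = 0.0, 0.1  # initial point theta_0 and step size alpha
for _ in range(200):
    theta -= alpha * grad(theta)  # theta <- theta - alpha * grad L (Eq. 4.7)

print(round(theta, 6))  # 3.0, the minimizer of the loss
```

With a deep network, the same step is applied to millions of parameters at once, with the gradient supplied by backpropagation.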
In contrast to value-based DRL methods, policy gradient methods learn a direct mapping from state to action, which leads to better convergence properties and higher efficiency in high-dimensional or continuous action spaces [20]. We therefore chose to leverage policy gradient methods for our research problem. Specifically, we focused on A3C as the basis of our solution and improved upon it for our problem.

4.3.2 A3C Network

A3C networks consist of multiple independent agents (i.e., neural networks) with their own weights, which interact with different copies of the environment in parallel. They can therefore explore a significantly larger part of the state-action space in much less time. The agents (or workers) are trained in parallel and periodically update a globally shared neural network that holds the shared parameters. The updates do not happen simultaneously, which is where the asynchronous notion comes from. After each update, the agents reset their parameters to those of the global network and continue their independent exploration and training until they update themselves again.

We now briefly elaborate on the Asynchronous Advantage Actor-Critic (A3C) as the underlying structure in our approach. We define the value function V(s) of a stochastic policy π(s) (which returns a probability distribution over actions) as an expected discounted reward:

V(s) = E_{π(s)}[r + γ V(s′)], (4.8)

where V(s) is the weighted average of r + γ V(s′) over every action that can potentially be taken in state s ∈ S. We also define the action-value function Q(s, a) as

Q(s, a) = r + γ V(s′), (4.9)

where we emphasize that the action is given and there is only one following state s′. We define the advantage function as

A(s, a) = Q(s, a) − V(s). (4.10)

A(s, a) is called the advantage function because it indicates how good it is to take action a in state s compared to the average performance. If the action a is better than average, the advantage function has a positive value.
It gets a negative value when the action is worse than average.

Furthermore, we define ρ as the distribution over states, which indicates the probability of being in each state; ρ_{s_0} and ρ_π denote the distribution of beginning states in the environment and of states under policy π, respectively.

Since the policy π is only a function of the state s, we can approximate it directly. In this case, a neural network (with weights θ) takes a state s and outputs an action probability distribution π_θ. We use π and π_θ interchangeably for the policy parametrized by the network weights θ.

We also want to optimize the policy. We define a metric function J(π) as the average discounted reward that a policy π can accumulate over the possible beginning states s_0:

J(π) = E_{ρ_{s_0}}[V(s_0)]. (4.11)

We use the gradient of J(π) to improve it. The gradient of J(π) is derived in the Policy Gradient Theorem ([59, 61]) and has the following form:

∇_θ J(π) = E_{s ∼ ρ_π, a ∼ π(s)}[A(s, a) · ∇_θ log π(a|s)], (4.12)

where the first factor (i.e., A(s, a)) gives the advantage of taking action a in state s, and the second factor (i.e., ∇_θ log π(a|s)) gives a direction in which the log-probability of taking action a in state s rises. Taken together, Equation 4.12 increases the likelihood of actions that are better than the average performance and decreases the likelihood of actions that are worse. Since it is not feasible to compute the gradient over every state and every action, we use sampling for this computation (as the mean of the samples lies near the expected value).

Figure 4.2: An example of A3C's limitations, where its performance degrades in complex scenarios with 8 sensors. The Y-axis is the criticality-weighted deadline miss ratio, and the X-axis represents the workload intensity, which we define as ∑_{i∈M} Cr_i × λ_arrival / d_i. By this definition, the workload intensity increases with an increase in criticality (Cr_i), an increase in the data arrival rate (λ_arrival), or a decrease in the deadline (d_i) of messages.

The advantage function also needs to be computed. Let us expand the definition as

A(s, a) = Q(s, a) − V(s) = r + γ V(s′) − V(s), (4.13)

where we can see that running an episode with a policy π provides us with an unbiased estimate of Q(s, a). In other words, it is sufficient to know the value of V(s) to compute A(s, a). Therefore, we can also approximate V(s) with a neural network (similar to approximating the action-value in DQN [22]).

Moreover, we can combine the two neural networks for estimating V(s) and π(s) to learn faster and more effectively; separate networks are likely to learn very similar low-level features. Combining the networks also acts as a regularizing element that results in better stability. With the networks combined, our neural network shares all hidden layers and outputs two sets of results: π(s) and V(s). The part that optimizes the policy is the actor, and the part that estimates the value function is the critic; in effect, the critic provides the actor with insights about its actions.

The asynchronous notion in A3C originates from the fact that running a single agent results in gathering highly correlated samples. DQN avoids this issue with a method called experience replay, storing the samples in memory and forming batches by retrieving them at random [8]. A3C instead handles this issue by running multiple agents simultaneously; each agent has its own copy of the environment and uses its samples as they arrive.
The advantage of this approach is twofold: first, it avoids the correlation, as each agent has its own unique experience (i.e., various states and transitions); second, this method requires less memory than the experience replay method.

Limitations of A3C: Given all the properties of the vanilla A3C, we examined its performance (quantified using the criticality-weighted deadline miss ratio) in the presence of temporal dependencies and correlation of sensor data values, as an example of complex environments. Specifically, we experimented with 8 sensors in the setup explained in detail later (Section 4.4.2), where there is temporal dependence in the arrival of data as well as correlation among the data values of two or more sensors, as described later in Scenario 4 of Section 4.4.1.

Figure 4.2 shows the criticality-weighted deadline miss ratio of the policies against the workload intensity, as defined in Equation 4.16. Our numerical evaluation reveals that the performance of A3C is no better than that of some naive greedy policies (Figure 4.2). With this observation, we hypothesized that this turbulence in the A3C performance relates to the lack of any memorization mechanism for capturing recurrent patterns in data streams. We therefore decided to examine our hypothesis by proposing a modified version of A3C.

4.3.3 Proposed Approach

We propose an architecture for determining the policy at the central hub for deciding on data streams in complex environments.
Aligned with the examples provided in Section 4.1.1, we consider the following cases:

• There exists correlation among the arrivals of data for two or more sensors in the system.

• There is temporal dependence in the arrivals of data for two or more sensors in the system.

• There is temporal dependence in the arrivals of data, as well as correlation among the data values, for two or more sensors.

We chose the above cases as a representative set of scenarios for a complex environment, inspired by real-world examples (e.g., in [33, 34]). We acknowledge that this set of cases can be extended to represent more complex environments. We provide more details about each case when we discuss the evaluation (Section 4.4).

We make no assumptions about the similarity of information when there is any data arrival correlation or temporal dependency. That is, in such scenarios, if we collect data from two correlated/dependent sensors at two consecutive times (e.g., at t and t + 1), it does not mean that we have collected “redundant” data. We focus solely on collecting data packets in such a way that we obtain the highest accumulated criticality values over a time horizon. In some environment settings, however, such data arrival correlations/dependence may lead to redundant data.
As mentioned above, the proposed model does not make this assumption and consequently does not consider redundant data/messages.1 As shown in Figure 4.3, we can abstract the network into two parts: input and learning model.

1 As an extension to this work, we can model messages that have related information using a metric such as mutual information; incorporating this metric into our objective will allow us to optimize the system behaviour suitably.

Figure 4.3: The proposed A3C-based network with embedded memory (i.e., an LSTM layer).

At the input layer, the system state is an array of n = n(S) tuples (n(S) = |S|, where S is the set of sensors in the system), each tuple representing the available data packet (criticality, deadline). For the learning model, we use the Asynchronous Advantage Actor-Critic (A3C) algorithm [39], which leverages a deep neural network to learn the policy and value functions while running parallel threads to update the parameters of the network. Regularization with policy entropy improves exploration by limiting premature convergence to a sub-optimal policy [39]. The core of our network contains an LSTM layer (output space = n(S)) followed by a fully connected network (output space = 2 × n(S)) to perform both required estimations. The LSTM layer is introduced to arm the agent with some memory of previous states. We propose this embedding of memory into the model because an agent in a complex environment is likely to encounter recurring patterns in states, and the model can make efficient decisions only if it can distinguish such patterns. Therefore, we chose to embed memory (i.e., by adding the LSTM layer) to enrich the model with this capability. Alternatively, one could argue for embedding memory in the sensors.
However, we believe that such a decision would lead to a more fragile architecture and would negatively impact the scalability of IoT deployments.

As with any reinforcement learning technique, we are concerned with how to take actions in the environment so as to maximize some notion of cumulative reward. It is essential to define a concrete and reasonable optimization goal, i.e., the reward. We define the reward in such a way that if a message is chosen after its expiry, a penalty is incurred:

r(t) = κ^α − I (ι t − d)^β, (4.14)

where α and β are parameters that help a system architect balance criticality and timeliness. These may be adapted on a per-sensor basis, but in this discussion, all sensors use the same settings for these two parameters. t denotes the current time step, and d is the corresponding deadline of the message. ι is the parameter that characterizes the penalty for a message, and it may be adapted per message or per class of messages. For simplicity of exposition, we use the same value of ι for all messages. An example graph of the reward function for some choices of these parameters is shown in Figure 4.4.

Figure 4.4: An example graph of the reward function for some parameter choices (ι = 1, α = 2, β = 4, and κ and d randomly selected from the interval (1, 5)). The peaks correspond to cases where no penalty is incurred (i.e., I = 0), whereas the troughs correspond to cases with a penalty (i.e., I = 1).

In terms of the loss function, we define it as

L = L_π + c_v L_v + c_reg L_reg, (4.15)

where L_π is the loss of the policy, L_v is the value error, and L_reg is a regularization term. The constants c_v and c_reg indicate which parts we want to emphasize.

4.3.4 Environment

We created an environment for a complex IoT setting where the data-packet streams arriving at sensors may have not only different distributions but also temporal dependency and correlation in their arrival times and data-packet values.
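To make Equation 4.14 concrete, the following sketch evaluates the reward with the Figure 4.4 parameter choices (ι = 1, α = 2, β = 4); the convention that the indicator I is 1 only after the deadline has passed is our reading of the text:

```python
def reward(kappa, t, d, iota=1.0, alpha=2.0, beta=4.0):
    """Sketch of the reward in Eq. 4.14: full value kappa**alpha for a
    fresh message, minus a penalty that grows with lateness once expired."""
    I = 1 if t > d else 0  # indicator of expiry (our assumed convention)
    return kappa ** alpha - I * (iota * t - d) ** beta

print(reward(kappa=3, t=2, d=5))  # on time: 3**2 gives 9
print(reward(kappa=3, t=7, d=5))  # 2 steps late: 9 - (7 - 5)**4 gives -7
```

Larger β makes the penalty grow sharply with lateness, while larger α rewards criticality more aggressively, which is the balance the system architect tunes.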
Such properties are known in IoT environments such as intelligent buildings [14], marine environments [33], and many mobile sensing and computing applications [71]. In our environment, as mentioned in Section 4.2, data packets arrive at the sensors, and the central unit chooses a sensor to fetch data from at each polling turn. It then verifies whether the deadline of the chosen sensor's packet is still valid. If so, it calculates a reward without any extra penalty; otherwise, it still generates a reward (like a soft scheduler [48]) but subtracts a penalty to reflect the deadline expiry of the chosen data packet. This iteration continues at each time step until the episode finishes.

4.4 Quantitative Evaluation

Having described the proposed approach based on deep reinforcement learning, we discuss other heuristics that we can compare with our proposal. To recap the performance goal, we want to reduce the criticality-weighted deadline miss ratio. We evaluate the policy from the proposed approach along with three other policies: the vanilla A3C, critical greedy, and deadline greedy. Given the explanations of the A3C (Section 4.3.2) and the proposed approach (Section 4.3.3), we briefly describe the greedy policies:

• Critical greedy: The central unit selects the sensor whose message has the highest criticality value among the messages at the sensors.

• Deadline greedy: The central unit selects the sensor whose message has the earliest deadline among the messages at the sensors.

4.4.1 Evaluation Scenarios

As mentioned above, we have four different approaches (i.e., policies) for selecting a sensor at each step. We compare these policies with respect to their criticality-weighted deadline miss ratio. To understand the performance differences, we consider four different scenarios, each representing a different environment in terms of data arrivals and values. We chose Scenario 1 to study our proposed solution in less complex environments (where there is no dependency among sensors).
Inspired by the scenarios reported in the literature ([14, 33]), we considered Scenarios 2, 3, and 4 to explore the performance of our approach in complex environments and to overcome the A3C limitation discussed earlier and shown in Figure 4.2:

• Scenario 1: The arrival of data to the sensors follows different distributions: a uniform distribution and multiple Poisson distributions, without dependencies.

• Scenario 2: The arrivals of data to half of the sensors are dependent on each other in a pairwise manner; that is, the sensors in each pair are dependent on each other. With eight sensors in the system, two pairs (i.e., half of the sensors) had correlated data arrivals. For example, the arrivals of data for the pair (s1, s2) were correlated, and the same correlation existed between the arrivals of data for the pair (s3, s4). We did not consider any such correlation for the rest of the sensors.

• Scenario 3: The arrivals of data to half of the sensors are temporally dependent on each other in a pairwise manner. Similar to Scenario 2, pairs of sensors among half of the sensors had a temporal dependence on each other. For example, with eight sensors, there was temporal dependence for the pair (s1, s2), and similarly for the pair (s3, s4), while there was no dependence among the other sensors.

• Scenario 4: The arrivals of data to half of the sensors are temporally dependent on each other in a pairwise manner, and the state values (i.e., data values) of those sensors are also dependent on each other. With eight sensors, for the pair (s1, s2), the data arrivals were temporally dependent and the data values were correlated. The same setting held for another pair of sensors (e.g., (s3, s4)), and the rest of the sensors did not have dependencies of any kind.

4.4.2 Experimental Setup

We perform our experiments in all of the above scenarios, with several numbers of sensors (n = 4, 8, and 16).
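The thesis does not spell out its dependency generator, but one hypothetical way to realize the pairwise temporal dependence of Scenarios 2–4 is to let the second sensor of a pair echo the first sensor's arrivals after a fixed lag; the numbers below mirror the setup described next (t_w = 5, corr = 0.9, λ = 0.15):

```python
import random

random.seed(7)

def dependent_pair(horizon, lam=0.15, tw=5, corr=0.9):
    """Arrival times for a temporally dependent sensor pair: with
    probability `corr`, sensor s2 sees an arrival `tw` steps after s1."""
    s1 = [t for t in range(horizon) if random.random() < lam]
    s2 = [t + tw for t in s1 if random.random() < corr]
    return s1, s2

s1, s2 = dependent_pair(10_000)
print(round(len(s2) / len(s1), 2))  # fraction of echoed arrivals, close to 0.9
```

A policy that notices an arrival at s1 can then anticipate an arrival at s2 five steps later, which is exactly the kind of recurring pattern the LSTM layer is meant to pick up.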
We report the results for n = 8 here; the results for the other two cases of 4 and 16 sensors are similar and are available in the Appendices (Section B.1 and Section B.2, respectively). To be consistent, we fixed the values of temporal dependencies and correlations (regarding arrival times or state values) wherever they existed across the scenarios. For temporal dependency, we used tw = 5 to denote the temporal difference, with a correlation value of corr = 0.9 (i.e., a high correlation). This correlation value was also used for the cases where only correlation exists (no temporal dependency). As mentioned in Section 4.4.1, half of the sensors (i.e., four sensors when n = 8) have dependencies in Scenarios 2, 3, and 4, which means that two pairs of sensors have temporal dependency, correlation, or both. We used an arrival rate of λ = 0.15 for these scenarios. For the first scenario, we chose the arrival times for one sensor from a uniform distribution over the time horizon of the study, and those of the other sensors from multiple Poisson processes (λ = [0.05, 0.15, 0.3, 0.35, 0.5, 0.65, 0.85]). For each message, the deadline and criticality values were chosen from uniform distributions: deadlines from Ud(1, 10) and criticality values from Uκ(1, 10). In the case of dependencies among data values (i.e., criticality and deadline values), the chosen values are still within the ranges of the above uniform distributions.

Figure 4.5: Range of criticality-weighted deadline miss ratio (ρ) values with respect to the workload intensity for Scenario 1 with 8 sensors. The plot indicates the range of criticality-weighted deadline miss ratio values using the same data reported in Table 4.3. The proposed approach (m A3C) performs competitively compared to the vanilla A3C.

We ran the simulations with eight threads, as the number of threads directly impacts performance by affecting the quality of the gradient. We tuned the hyperparameters and settled on simulation runs of 250 episodes of 5000 steps each, a minimum batch size of nbatch = 32, and a learning rate of δ = 0.005. We also chose the constants of the loss function as cv = 0.5 and creg = 0.01.

4.4.3 Results

We elaborate on our observations from the simulations. In each scenario, we considered the aggregate results of the 250 episodes for the evaluations.

Figure 4.6: Range of criticality-weighted deadline miss ratio (ρ) values with respect to the workload intensity for Scenario 2 with 8 sensors. The plot indicates the range of criticality-weighted deadline miss ratio values using the same data reported in Table 4.3. The proposed approach (m A3C) performs competitively compared to the vanilla A3C.

Workload intensity: We define a workload intensity metric as

\[
\sum_{i \in M} \frac{Cr_i \times \lambda_{\text{arrival}}}{d_i} \tag{4.16}
\]

where M is the set of all messages arriving in a simulation episode, λarrival is the arrival rate of events, and Cri and di are the criticality and deadline of each message, respectively. We calculate this metric for each simulation episode over its 5000 steps; it captures, to some extent, the workload intensity. We therefore report the performance of the policies in terms of the "criticality-weighted deadline miss ratio" (on the Y-axis) against the "workload intensity" (on the X-axis) when presenting the simulation results.

As depicted in Figures 4.5, 4.6, 4.7, and 4.8, each data point represents the overall "criticality-weighted deadline miss ratio" against the workload-intensity metric (as defined earlier) over the 5000 steps of an episode.
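Computing the metric of Equation 4.16 from an episode's message trace is a one-liner; the sketch below uses our own names and toy numbers rather than data from the experiments.

```python
def workload_intensity(messages, arrival_rate):
    """Equation 4.16: sum over all messages of criticality * arrival rate / deadline."""
    return sum(crit * arrival_rate / deadline for crit, deadline in messages)

# Toy episode: (criticality, deadline) pairs as drawn in the setup above.
episode = [(10, 1), (5, 10), (2, 4)]
rho_x = workload_intensity(episode, arrival_rate=0.15)
# 0.15 * (10/1 + 5/10 + 2/4) = 1.65
```

A short-deadline, high-criticality message dominates the sum, which is why the metric serves as the X-axis proxy for how demanding an episode is.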
Table 4.3 summarizes the results of the experiments for the case of 8 sensors. Across all scenarios, we find that the proposed approach performs well. In two of the scenarios (3 and 4), it outperforms all the others, which supports the idea of adding memory to capture temporal relations. In Scenario 1, it still performs slightly better than the vanilla A3C and outperforms the other two policies. In Scenario 2, it performs competitively compared with the vanilla A3C while still outperforming the two greedy baselines. Table B.1 (Section B.3) provides a more comprehensive set of results, including the experiments for the cases of 4 and 16 sensors.

The observation that the modification to the A3C did not hurt performance in the less complex scenarios suggests an advantage of the proposed solution over the vanilla A3C as a generalized solution.

In addition to generalizability, our proposed solution also offers scalability compared to traditional RL approaches such as tabular Q-learning. As described earlier (Section 4.3.1), with the proposed DRL solution we do not update a Q-table of states and actions. Instead, the states are the input of the NN, which approximates the value function and action values. We therefore avoid the massive consumption of computing resources required to update a Q-table. In this way, it is easier to increase the number of sensors, as doing so only changes the size of the input array of the NN, which requires considerably fewer computing resources than Q-tables that grow exponentially. One possible bottleneck for a large-scale implementation of our approach is the training time, which could increase in environments with more sensors and complexity. This issue, however, could be addressed by leveraging GPUs for computational speedup, which Babaeizadeh et al. [9] showed to be effective in improving A3C performance.

Figure 4.7: Range of criticality-weighted deadline miss ratio (ρ) values with respect to the workload intensity for Scenario 3 with 8 sensors. The plot indicates the range of criticality-weighted deadline miss ratio values using the same data reported in Table 4.3. The proposed approach (m A3C) outperforms the other policies.

Figure 4.8: Range of criticality-weighted deadline miss ratio (ρ) values with respect to the workload intensity for Scenario 4 with 8 sensors. The plot indicates the range of criticality-weighted deadline miss ratio values using the same data reported in Table 4.3. The proposed approach (m A3C) outperforms the other policies.

4.5 Related Work

Our research brings together multiple areas, such as DRL, performance analysis, and job scheduling. The related work discussed here is therefore mostly about how DRL has helped solve scheduling problems, in particular for deployments of sophisticated sensors where capturing data patterns is non-trivial [60, 73].

For the scheduling of distributed tasks in grids and clouds, several optimization works exist around metrics such as lifespan and load balance [35, 55]. However, such works consider either criticality/importance or timeliness, which leaves the study of the trade-off between the two parameters open, as we do here.

In terms of performance analysis, our work shares similarities, such as the notion of the metrics, between what we study as the criticality-weighted deadline miss ratio (Section 4.1) and other metrics such as the one described by Salah et al. [52]. Aside from the performance metrics, different works consider hierarchical models, such as those studied by Khazaei et al. [29] and Rashtian et al. [48]. Our work, though, focuses on the case of a centralized model, a standard model in infrastructure-as-a-service (IaaS) platforms.

On the RL side, our work compares to the body of work done via policy gradient methods. Policy gradient methods comprise a large family of reinforcement learning algorithms.
They have a long history [70], but only recently have they been backed by neural networks and achieved success in high-dimensional cases. The A3C algorithm was published in 2016 and can outperform DQN [43] with a fraction of the time and resources [39]. Although several papers leverage policy gradient methods, our work is distinguished from many of them by both the application and the network structure.

4.6 Conclusions and future work

We have demonstrated that a deep reinforcement learning approach to balancing message criticality and timeliness in a complex IoT environment is effective. We proposed an approach that improves upon the A3C algorithm by embedding memory into the model so that it can capture recurring patterns of data in complex environments. Our solution outperforms the alternative heuristics that one could use in this context. We envision the applicability of such an idea in IoT environments that host temporally dependent and correlated data arrivals. We have tackled this problem in the specific setting where messages have the two essential properties of criticality and timeliness.

We have shown that our solution is effective in four scenarios. The proposed approach outperforms the other policies in the complex scenarios, where we have data arrival correlations, data arrival temporal dependency, and data arrival temporal dependency combined with correlation of data values (concerning criticality and deadline values). Our solution also remains effective in the more straightforward scenario, where the arrival of data to sensors follows different distributions without dependencies. The results in both the complex and non-complex scenarios suggest the generalizability of the proposed solution.

Concerning model scalability, the proposed model's training time may vary depending on the input size (i.e., the number of sensors deployed in the environment). However, we found that decreasing the training time would result in a slightly less optimal policy.
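The conclusion above rests on giving the policy network memory; a single recurrent cell is enough to illustrate the mechanism. The following is a hand-rolled forward pass, not the dissertation's m A3C implementation: all sizes, weight initializations, and names are our own, and training (the A3C loss with cv and creg) is omitted. The hidden state carried across steps is what lets the actor and critic heads react to temporal patterns that a feed-forward network cannot see.

```python
import math
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vadd(*vs):
    return [sum(t) for t in zip(*vs)]

def sigmoid(v):
    return [1 / (1 + math.exp(-x)) for x in v]

def lstm_step(x, h, c, Wx, Wh, b):
    """One LSTM step. Wx/Wh/b are dicts keyed by gate name ('i','f','o','g')."""
    gate = lambda k, act: act(vadd(matvec(Wx[k], x), matvec(Wh[k], h), b[k]))
    i, f, o = gate('i', sigmoid), gate('f', sigmoid), gate('o', sigmoid)
    g = gate('g', lambda v: [math.tanh(z) for z in v])
    c_new = [ff * cc + ii * gg for ff, cc, ii, gg in zip(f, c, i, g)]
    h_new = [oo * math.tanh(cc) for oo, cc in zip(o, c_new)]
    return h_new, c_new

rng = random.Random(0)
n_obs, n_hidden, n_actions = 8, 4, 8   # one observation per sensor (toy sizes)
rand_mat = lambda rows, cols: [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]
Wx = {k: rand_mat(n_hidden, n_obs) for k in 'ifog'}
Wh = {k: rand_mat(n_hidden, n_hidden) for k in 'ifog'}
b = {k: [0.0] * n_hidden for k in 'ifog'}
W_pi = rand_mat(n_actions, n_hidden)   # actor head: which sensor to serve
W_v = rand_mat(1, n_hidden)            # critic head: state-value estimate

h = c = [0.0] * n_hidden
for _ in range(5):                     # feed a short observation sequence
    obs = [rng.gauss(0, 1) for _ in range(n_obs)]
    h, c = lstm_step(obs, h, c, Wx, Wh, b)

logits = matvec(W_pi, h)
zmax = max(logits)
exps = [math.exp(z - zmax) for z in logits]
policy = [e / sum(exps) for e in exps]  # softmax over actions
value = matvec(W_v, h)[0]
```

In practice one would use a deep learning framework's LSTM layer between the shared feature layers and the two heads, exactly as the memory-embedding idea prescribes.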
When choosing between model-based and model-free approaches for deriving a policy (e.g., DRL techniques), one should consider that DRL methods require a large number of samples from the environment to perform well. While such methods provide more general solutions (with minimal assumptions about the environment), under severe limitations on training time, model-based approaches may still be reasonable if a comprehensive understanding of the environment is available.

As future work, we envision multiple avenues to explore. One natural extension of this work is to explore the effectiveness of the approach in more scenarios. This path may lead to modifications of the network architecture. Another possible plan is to compare the proposed approach with solutions based on other DRL algorithms, such as PPO [53]. Such explorations could shed light on the trade-off between convergence and stability across DRL-based approaches. Lastly, another direction for future work is to answer the question: "How should we model information correlation to prevent redundant message collection?" In this case, we may be able to use mutual information as a metric to capture the value of different messages; we could ignore messages with high mutual information relative to messages that have already been scheduled. As our proposed model does not address the collection of redundant data, it would be interesting to explore modifications to the framework that prevent collecting repetitive data while still capturing recurring patterns within the data. We envision that addressing this problem in our proposed framework would require updates to the reward function and the performance metric, but the overall solution structure is likely to remain the same.
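The mutual-information idea above can be made concrete with empirical estimates over discretized message streams. The sketch below is our own illustration (the names and toy streams are assumptions), not part of the proposed framework:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete streams."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

scheduled = [0, 1, 1, 0, 1, 0, 0, 1]   # stream already collected
redundant = list(scheduled)             # perfectly correlated candidate
fresh = [0, 0, 1, 1, 0, 1, 0, 1]        # unrelated candidate

# A scheduler could skip the candidate whose MI with already-scheduled
# messages is high: here MI(scheduled, redundant) = 1.0 bit,
# while MI(scheduled, fresh) = 0.0 bits.
```

A candidate message stream carrying little information beyond what has already been scheduled could then be deprioritized in the reward function.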
Finally, it would be interesting to solve a similar problem when there are two or more classes of tasks (e.g., critical and regular/normal tasks) and the goal is to prioritize critical tasks over the other classes. Such a study would complement the work in this chapter, as the current problem setting does not assume such a clear distinction among tasks and focuses solely on minimizing the criticality-weighted deadline miss ratio metric over a finite horizon.

Average criticality-weighted deadline miss ratio (ρ)

Scenario     m A3C    Vanilla A3C   Criticality Greedy   Deadline Greedy
Scenario 1   0.2885   0.2989        0.3277               0.3201
Scenario 2   0.2073   0.2063        0.3012               0.3398
Scenario 3   0.2077   0.3139        0.3145               0.3056
Scenario 4   0.1938   0.2779        0.3018               0.2942

Table 4.3: The table summarizes the results for the case of 8 sensors. It reports the average criticality-weighted deadline miss ratio for the policies in each of the scenarios. As shown in the green-coloured cells, the proposed policy (m A3C) almost consistently performs as the best policy. Even when it is not the best (i.e., Scenario 2), it still performs reasonably well, within one percent (concerning ρ) of the best result. The red-coloured cells represent the worst performance (by the greedy policies) across all the scenarios.

Chapter 5

Conclusions & Future Work

With the proliferation of data-rich environments such as IoT networks, there is an urgent need for proper mechanisms to manage massive volumes of data. A natural pipeline for the data within an IoT environment has data management as the first step, followed by analysis to support large-scale reasoning and decision making in a variety of applications [7].
However, in such a pipeline one cannot gather all data for further processing, owing to limitations on resources. In this dissertation, we have concentrated on the first step of the pipeline alone; the amount of data produced by sensors can be overwhelming on its own, and we need the right strategies for handling this deluge. Effective filtering of data in the early stages can reduce the pressure on the later stages of IoT data processing systems that store and perform computation on the data.

Studying statistical models of data-rich environments such as IoT networks brings us closer to understanding what prioritizing data means in the context of such systems, and makes clearer how to model and interpret such prioritization. We were successful in providing reasonable policies for the cases that we studied. We anticipate, and hope, that our proposed techniques have implications for how data prioritization should be developed for various systems.

Providing a high quality of service is a challenging endeavour, especially in light of resource constraints, and even more so when environments impose increased complexity on the patterns in data streams. The models that we proposed in the three chapters use performance metrics that enable one to reason about and improve system functionality concerning quality of service in the presence of the challenges above.

Numerous modern computing systems integrate intelligent computational entities, such as machine learning-based approaches, to improve the analysis of data that were already collected. In contrast, we applied reinforcement learning techniques to develop methods by which one can control system performance with no, or sometimes very little, knowledge available about the timeliness and importance (i.e., criticality) of the data.
Moreover, in Chapter 4, we tried to improve on the work of the earlier chapters (Chapters 2 and 3) by making the approach more generic and avoiding strong adaptation to specific system settings, which we believe would lead to a more fragile system.

Before discussing potential future directions, we wish to emphasize that the ultimate goal of our efforts is the development of tools that would assist the system designer/architect in making design decisions related to the data collection aspects of IoT systems. One such tool is an automatic data scheduler (or collector) for systems with minimal knowledge of the arriving data streams, applicable to either a centralized or a decentralized (e.g., hierarchical) system architecture. In these cases, the data scheduler/collector combines sensor-based information with machine learning capabilities to find patterns in the data from sensors. It pulls the relevant information together and passes it to other entities in the system for further analysis, understanding, and insight. Considering the whole pipeline in IoT systems, the focus of this dissertation is on the first part, data collection.

5.1 Limitations

There are limitations to this work that we would like to acknowledge. While we envision the three studied problems as a representative set in terms of system architecture, some of our assumptions are still limited to the specific cases that we studied, and they could affect the overall analysis behind the final solutions.

Although we chose to focus on three properties of data streams and environments (i.e., timeliness, criticality, and environment complexity) as essential factors, there are potentially other factors that could contribute to characterizing IoT systems. Examples are environmental factors that could directly affect the quality of sensor data, or potential biases in data sampling by sensors.
While such factors can be integrated into our proposed solutions, we avoided including too many of them, as we believe such adaptivity leads to more fragile system architectures.

Finally, our solutions were built around performance metrics that rely on the criticality values of the data streams/packets. Another way of envisioning criticality/importance is to infer criticality adaptively from the data itself. For instance, imagine that the data scheduler collects data to serve a classification task via boosting (i.e., using a series of classifiers, rather than one classifier, to decrease bias). In this case, the more critical data packets are the ones on which the previous classifiers have failed (i.e., the harder examples), in the hope that the next classifier will focus on these harder examples, rather than on the criticality values of the data itself. Although following such a notion of criticality may result in missing some critical data packets, it would be beneficial in the long run, as it improves the analytics capabilities (e.g., classification accuracy in this example).

5.2 Potential Future Directions

Besides the future work directions identified in the individual chapters, we would like to point out a few other relevant research directions. In this dissertation, the focus is on IoT networks as an example of data-rich environments. An interesting area of exploration would be the adaptation of data prioritization techniques to a broader range of cases. For instance, bandit algorithms have been useful in systems dealing with unknown stochastic environments, seeking to optimize a long-term reward by learning and exploiting the unknown environment simultaneously. However, such techniques may not be applicable in systems with safety guarantees that have to be met at every single round.
Therefore, it is crucial to develop new bandit algorithms that account for critical safety requirements.

Another exciting direction would be to create a taxonomy of characteristics of data-rich environments (not limited to IoT), which could ultimately shed light on developing a unified framework for data prioritization with minimal changes of settings. The area of model-agnostic machine learning [49] seems promising for achieving this goal.

Another direction that can improve generic solutions is to figure out how to interpret models. In the majority of cases, our understanding of why ML models work is limited to intuitive heuristics and does not translate into a precise interpretation of the process. To overcome this issue, "interpretable machine learning" [40] provides appropriate methods to understand models that are already performing well, and even to improve them.

Being able to interpret generic models makes it possible to pinpoint potential areas for improvement. In this case, one can explore the effectiveness of learning to learn (i.e., "meta-learning" [65]), where the idea is to build self-adaptive learners that improve their bias dynamically through experience by accumulating meta-knowledge.

Bibliography

[1] Facebook statistics, available at http://rob.social5.net/post/how-to-reach-more-customers-without-adding-more-minutes/. → page 2

[2] IBM report, available at https://www-01.ibm.com/software/in/data/bigdata/. → page 1

[3] https://www.wired.com/insights/2014/11/the-internet-of-things-bigger/. → page 1

[4] Pokemon Go engagement trend: http://www.bloomberg.com/news/articles/2016-08-22/these-charts-show-that-pokemon-go-is-already-in-decline. → page 3

[5] Twitter statistics, available at http://www.internetlivestats.com/twitter-statistics/. → page 2

[6] Youtube statistics, available at http://expandedramblings.com/index.php/youtube-statistics/. → page 2

[7] C. C. Aggarwal. Managing and mining sensor data. Springer Science & Business Media, 2013. → pages 62, 132

[8] M. Andrychowicz, F.
Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017. → page 113

[9] M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, and J. Kautz. GA3C: GPU-based A3C for deep reinforcement learning. CoRR abs/1611.06256, 2016. → page 127

[10] P. Billingsley. Weak convergence of measures: Applications in probability. SIAM, 1971. → page 23

[11] A. Burns and R. I. Davis. A survey of research into mixed criticality systems. ACM Computing Surveys, 50(6), January 2018. → page 90

[12] I. Casas, J. Taheri, R. Ranjan, L. Wang, and A. Y. Zomaya. GA-ETI: An enhanced genetic algorithm for the scheduling of scientific workflows in cloud environments. Journal of Computational Science, 26:318–331, 2018. → pages 6, 88

[13] H. Chang, J. Feng, and C. Duan. Reinforcement learning-based data forwarding in underwater wireless sensor networks with passive mobility. Sensors, 19(2):256, 2019. → page 90

[14] M. Charfi, Y. Gripay, and J.-M. Petit. Spatio-temporal functional dependencies for sensor data streams. In International Symposium on Spatial and Temporal Databases, pages 182–199. Springer, 2017. → pages 7, 94, 119, 120

[15] H. Cho and J. Jeong. Implementation and performance analysis of power and cost-reduced OPC UA gateway for industrial IoT platforms. In 2018 28th International Telecommunication Networks and Applications Conference (ITNAC), pages 1–3. IEEE, 2018. → page 100

[16] L. L. Clark, C. E. DeArdo, and B. A. Shaffer. Adaptive delayed polling of sensors. US Patent 4,598,363, 1986. → page 62

[17] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Third Edition. McGraw-Hill Science/Engineering/Math, 2009. ISBN 0262033844. → pages 31, 53

[18] D. G. Costa. Visual sensors hardware platforms: A review. IEEE Sensors Journal, 2019. → page 100

[19] D. Dhar, S. Gopalakrishnan, and K. Rostamzadeh.
Optimal sensing using query arrival distributions. In Proceedings of the ACM International Symposium on Design and Analysis of Intelligent Vehicular Networks and Applications, pages 39–46, 2012. → page 61

[20] V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, J. Pineau, et al. An introduction to deep reinforcement learning. Foundations and Trends® in Machine Learning, 11(3-4):219–354, 2018. → page 109

[21] J. Gittins, K. Glazebrook, and R. Weber. Multi-armed Bandit Allocation Indices. John Wiley and Sons, 2011. → page 56

[22] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pages 2829–2838, 2016. → page 112

[23] S. Guha, K. Munagala, and P. Shi. Approximation algorithms for restless bandit problems. Journal of the ACM, 58(1), Dec. 2010. → page 59

[24] D. Han, J. Wu, Y. Mo, and L. Xie. On stochastic sensor network scheduling for multiple processes. arXiv CoRR, abs/1611.08222, 2016. → page 62

[25] S. Heo, S. Song, J. Kim, and H. Kim. RT-IFTTT: Real-time IoT framework with trigger condition-aware flexible polling intervals. In IEEE Real-Time Systems Symposium (RTSS), Dec. 2017. → page 61

[26] F. Iannello and O. Simeone. On the optimal scheduling of independent, symmetric and time-sensitive tasks. IEEE Transactions on Automatic Control, 58(9):2421–2425, Sept 2013. ISSN 0018-9286. doi:10.1109/TAC.2013.2258791. → page 60

[27] F. Iannello, O. Simeone, and U. Spagnolini. Optimality of myopic scheduling and Whittle indexability for energy harvesting sensors. In Annual Conference on Information Sciences and Systems (CISS), pages 1–6, 2012. → page 59

[28] B. Jana and J. Poray. A hybrid task scheduling approach based on genetic algorithm and particle swarm optimization technique in cloud environment. In Intelligent Engineering Informatics, pages 607–614. Springer, 2018. → pages 6, 89

[29] H. Khazaei, J. Misic, and V. B. Misic. A fine-grained performance model of cloud computing centers.
IEEE Transactions on Parallel and Distributed Systems, 24(11):2138–2147, 2012. → pages 89, 127

[30] J. E. Kim, T. Abdelzaher, A. Bar-Noy, and R. Hobbs. Sporadic decision-centric data scheduling with normally-off sensors. In IEEE Real-Time Systems Symposium (RTSS), Dec. 2016. → page 61

[31] R. Kleinberg and N. Immorlica. Recharging bandits. In IEEE Annual Symposium on Foundations of Computer Science (FOCS), Oct. 2018. → pages 5, 58

[32] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 1008–1014, 2000. → page 104

[33] J. Liu, T. Zhang, G. Han, and Y. Gou. TD-LSTM: Temporal dependence-based LSTM networks for marine temperature prediction. Sensors, 18(11):3797, 2018. → pages 7, 94, 115, 119, 120

[34] M. Á. López Medina, M. Espinilla, C. Paggeti, and J. Medina Quero. Activity recognition for IoT devices using fuzzy spatio-temporal features as environmental sensor fusion. Sensors, 19(16):3512, 2019. → page 115

[35] M. C. Lucas-Estañ and J. Gozalvez. Load balancing for reliable self-organizing industrial IoT networks. IEEE Transactions on Industrial Informatics, 2019. → page 127

[36] M. C. Lucas-Estan and J. Gozalvez. Load balancing for reliable self-organizing industrial IoT networks. IEEE Transactions on Industrial Informatics, 2019. → pages 6, 88

[37] S. T. Maguluri and R. Srikant. Scheduling jobs with unknown duration in clouds. IEEE/ACM Transactions on Networking, 22(6):1938–1951, 2013. → pages 6, 88

[38] P. A. Meyer. Probability and potentials, volume 357. Blaisdell Waltham, 1966. → page 39

[39] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016. → pages 97, 100, 116, 128

[40] C. Molnar. Interpretable machine learning. Lulu.com, 2019. → page 137

[41] T. Moon, W. Chu, L. Li, Z. Zhang, and Y. Chang.
An online learning framework for refining recency search results with user click feedback. ACM Transactions on Information Systems, 30(4), November 2012. → page 15

[42] J. Niño-Mora and S. S. Villar. Sensor scheduling for hunting elusive hiding targets via Whittle's restless bandit index policy. In Proceedings of the International Conference on Network Games, Control and Optimization, pages 1–8, 2011. → page 59

[43] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pages 4026–4034, 2016. → page 128

[44] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of optimal queuing network control. Mathematics of Operations Research, 24(2):293–305, May 1999. → page 59

[45] C. Perera, A. Zaslavsky, P. Christen, and D. Georgakopoulos. Sensing as a service model for smart cities supported by internet of things. Trans. Emerg. Telecommun. Technol., 25(1):81–93, Jan. 2014. ISSN 2161-3915. doi:10.1002/ett.2704. URL http://dx.doi.org/10.1002/ett.2704. → page 12

[46] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, New York, 5 edition, 2005. → page 43

[47] L. Quan, Z. Wang, and F. Ren. A novel two-layered reinforcement learning for task offloading with tradeoff between physical machine utilization rate and delay. Future Internet, 10(7):60, 2018. → page 90

[48] H. Rashtian and S. Gopalakrishnan. Balancing message criticality and timeliness in IoT networks. IEEE Access, 7:145738–145745, 2019. → pages 98, 100, 119, 127

[49] M. T. Ribeiro, S. Singh, and C. Guestrin. Model-agnostic interpretability of machine learning. arXiv preprint arXiv:1606.05386, 2016. → page 136

[50] S. M. Ross. Introduction to stochastic dynamic programming. Academic Press, 2014. → pages 5, 57

[51] H. L. Royden and P. Fitzpatrick. Real analysis, volume 198. Macmillan New York, 1988. → page 23

[52] K. Salah, K. Elbadawi, and R. Boutaba.
An analytical model for estimating cloud resources of elastic services. Journal of Network and Systems Management, 24(2):285–308, 2016. → pages 89, 127

[53] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. → page 130

[54] L. Seno, G. Cena, A. Valenzano, and C. Zunino. Bandwidth management for soft real-time control applications in industrial wireless networks. IEEE Transactions on Industrial Informatics, 13(5):2484–2495, October 2017. → page 88

[55] J. Shi, J. Luo, F. Dong, and J. Zhang. A budget and deadline aware scientific workflow resource provisioning and scheduling mechanism for cloud. In International Conference on Computer Supported Cooperative Work in Design, pages 672–677. IEEE, 2014. → pages 88, 127

[56] H. Y. Shishido, J. C. Estrella, C. F. M. Toledo, and M. S. Arantes. Genetic-based algorithms applied to a workflow scheduling algorithm with security and deadline constraints in clouds. Computers & Electrical Engineering, 69:378–394, 2018. → pages 6, 88

[57] W. J. Stewart. Introduction to the numerical solutions of Markov chains. Princeton Univ. Press, 1994. → page 20

[58] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, 2018. → page 104

[59] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000. → page 111

[60] A. A. Szpiro, P. D. Sampson, L. Sheppard, T. Lumley, S. D. Adar, and J. D. Kaufman. Predicting intra-urban variation in air pollution concentrations with complex spatio-temporal dependencies. Environmetrics, 21(6):606–631, 2010. → page 127

[61] P. S. Thomas and E. Brunskill. Policy gradient methods for reinforcement learning with function approximation and action-dependent baselines. arXiv preprint arXiv:1706.06643, 2017. → page 111

[62] H. C. Tijms.
Stochastic models: an algorithmic approach, volume 303. John Wiley & Sons Inc, 1994. → page 41

[63] J. Tordsson, R. S. Montero, R. Moreno-Vozmediano, and I. M. Llorente. Cloud brokering mechanisms for optimized placement of virtual machines across multiple providers. Future Generation Computer Systems, 28(2):358–367, 2012. → pages 6, 88

[64] D. C. Vanderster, N. J. Dimopoulos, R. Parra-Hernandez, and R. J. Sobie. Resource allocation on computational grids using a utility model and the knapsack problem. Future Generation Computer Systems, 25(1):35–50, 2009. → page 88

[65] R. Vilalta and Y. Drissi. A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2):77–95, 2002. → page 137

[66] S. S. Villar. Indexability and optimal index policies for a class of reinitialising restless bandits. Probability in the Engineering and Informational Sciences, 30(1):1–23, Jan. 2016. → page 61

[67] A. A. Visheratin, M. Melnik, and D. Nasonov. Workflow scheduling algorithms for hard-deadline constrained cloud environments. Procedia Computer Science, 80:2098–2106, 2016. → page 100

[68] S. Weerakkody, Y. Mo, B. Sinopoli, D. Han, and L. Shi. Multi-sensor scheduling for state estimation with event-based, stochastic triggers. IEEE Transactions on Automatic Control, 61(9):2695–2701, September 2016. → page 62

[69] P. Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25:287–298, 1988. ISSN 00219002. URL http://www.jstor.org/stable/3214163. → pages 14, 18, 55

[70] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992. → pages 104, 128

[71] S. Yao, S. Hu, Y. Zhao, A. Zhang, and T. Abdelzaher. DeepSense: A unified deep learning framework for time-series mobile sensing data processing. In Proceedings of the 26th International Conference on World Wide Web, pages 351–360. International World Wide Web Conferences Steering Committee, 2017. → pages 7, 119

[72] S.
Zhan and H. Huo. Improved PSO-based task scheduling algorithm in cloud computing. Journal of Information & Computational Science, 9(13):3821–3829, 2012. → pages 6, 89

[73] J. Zhang, Y. Zheng, and D. Qi. Deep spatio-temporal residual networks for citywide crowd flows prediction. In Thirty-First AAAI Conference on Artificial Intelligence, 2017. → page 127

Appendix A

Supplementary material for Chapter 2

The following table summarizes the symbols used in Chapter 2.

Symbol    Description
Si        ith sensor
νi        The value associated with data from Si
β         Rate of data value decay
P         Polling period
µ         Rate at which new events occur in the environment
q         Probability of sensing a new event
M         Number of exponential distributions in a hyper-exponential process
ϒi(t)     Utility accrued at sensor Si at time t
ai(t)     Action taken at time t concerning sensor Si
γi        Bandwidth required by sensor Si
B         Given bandwidth limit for all sensors
ρ         Discount factor for reward calculation
Vρ(Si)    Value function associated with choosing sensor Si
ξ         Average reward for the chosen control policy

Appendix B

Supplementary results of Chapter 4

Besides the experiment results reported for the case of 8 sensors in Chapter 4, we provide the extended experiment results for the cases of 4 and 16 sensors, as well as the complete table of results.

B.1 Experiment results for 4 sensors

In this section, we present the experiment results for the case of 4 sensors:

Figure B.1: Range of criticality-weighted deadline miss ratio (ρ) values with respect to the workload intensity for scenario 1 with 4 sensors. The plot indicates the range of criticality-weighted deadline miss ratio values using the same data reported in Table B.1. The proposed approach (m A3C) outperforms the vanilla A3C.

Figure B.2: Range of criticality-weighted deadline miss ratio (ρ) values with respect to the workload intensity for scenario 2 with 4 sensors. The plot indicates the range of criticality-weighted deadline miss ratio values using the same data reported in Table B.1.
Theproposed approach (m A3C) outperforms the vanilla A3C.150Figure B.3: Range of criticality-weighted deadline miss ratio (ρ) val-ues concerning the workload intensity for scenario 3 with 4 sen-sors. The plot indicates the range of criticality-weighted deadlinemiss ratio values using the same data reported in Table B.1. Theproposed approach (m A3C) outperforms the vanilla A3C.Figure B.4: Range of criticality-weighted deadline miss ratio (ρ) val-ues concerning the workload intensity for scenario 4 with 4 sen-sors. The plot indicates the range of criticality-weighted deadlinemiss ratio values using the same data reported in Table B.1. Theproposed approach (m A3C) outperforms the vanilla A3C.151B.2 Experiments results of 16 sensorsIn this section, we present the experiment results for the case of having 16sensors:Figure B.5: Range of criticality-weighted deadline miss ratio (ρ) val-ues concerning the workload intensity for scenario 1 with 16sensors. The plot indicates the range of criticality-weighteddeadline miss ratio values using the same data reported in Ta-ble B.1. The proposed approach (m A3C) performs competi-tively compared to the vanilla A3C.152Figure B.6: Range of criticality-weighted deadline miss ratio (ρ) val-ues concerning the workload intensity for scenario 2 with 16sensors. The plot indicates the range of criticality-weighteddeadline miss ratio values using the same data reported in Ta-ble B.1. The proposed approach (m A3C) outperforms thevanilla A3C.Figure B.7: Range of criticality-weighted deadline miss ratio (ρ) val-ues concerning the workload intensity for scenario 3 with 16sensors. The plot indicates the range of criticality-weighteddeadline miss ratio values using the same data reported in Ta-ble B.1. The proposed approach (m A3C) outperforms thevanilla A3C.153Figure B.8: Range of criticality-weighted deadline miss ratio (ρ) val-ues concerning the workload intensity for scenario 4 with 16sensors. 
The plot uses the same data reported in Table B.1. The proposed approach (m A3C) outperforms the vanilla A3C.

B.3 Table of results summary

The following table summarizes the results for the cases of 4, 8, and 16 sensors. It reports the average criticality-weighted deadline miss ratio (ρ) for each policy in each scenario. As shown in the green-coloured cells, the proposed policy (m A3C) is almost consistently the best policy. Even when it is not the best, it still performs reasonably well, within one percentage point (in terms of ρ) of the best result. The red-coloured cells mark the worst performance among the policies, which is predominantly shared between the two greedy policies.

Average criticality-weighted deadline miss ratio (ρ)

#Sensors  Scenario    m A3C   Vanilla A3C  Criticality Greedy  Deadline Greedy
4         Scenario 1  0.2436  0.2493       0.2806              0.2744
          Scenario 2  0.1643  0.1683       0.2616              0.2934
          Scenario 3  0.1710  0.2586       0.2749              0.2674
          Scenario 4  0.1585  0.2306       0.2636              0.2572
8         Scenario 1  0.2885  0.2989       0.3277              0.3201
          Scenario 2  0.2073  0.2063       0.3012              0.3398
          Scenario 3  0.2077  0.3139       0.3145              0.3056
          Scenario 4  0.1938  0.2779       0.3018              0.2942
16        Scenario 1  0.3234  0.3335       0.3550              0.3494
          Scenario 2  0.2222  0.2249       0.3367              0.3805
          Scenario 3  0.2254  0.3482       0.3542              0.3458
          Scenario 4  0.2084  0.3076       0.3431              0.3350

Table B.1: The summary of all experiments concerning the criticality-weighted deadline miss ratio.
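The metric reported throughout Appendix B can be made concrete with a small sketch. This assumes ρ is the total criticality of messages that missed their deadlines divided by the total criticality of all messages; the precise definition is given in Chapter 4, and the field names used below are illustrative, not the thesis's notation.

```python
def criticality_weighted_miss_ratio(messages):
    """Criticality-weighted deadline miss ratio (rho), under the assumed
    definition: sum of criticalities of messages that finished after their
    deadline, divided by the total criticality of all messages."""
    total = sum(m["criticality"] for m in messages)
    if total == 0:
        return 0.0  # no messages: nothing could miss a deadline
    missed = sum(m["criticality"] for m in messages
                 if m["finish_time"] > m["deadline"])
    return missed / total

# Illustrative workload: one critical message misses its deadline.
msgs = [
    {"criticality": 3, "finish_time": 5, "deadline": 4},   # missed
    {"criticality": 1, "finish_time": 2, "deadline": 6},   # met
    {"criticality": 2, "finish_time": 9, "deadline": 10},  # met
]
print(criticality_weighted_miss_ratio(msgs))  # 3 / 6 = 0.5
```

Under this reading, missing a single high-criticality message hurts ρ more than missing several low-criticality ones, which is why the two greedy policies in Table B.1, each optimizing only one of criticality or deadlines, tend to perform worst.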
Item metadata (UBC Open Collections record)

Title: Reinforcement learning for data scheduling in internet of things (IoT) networks
Creator: Rashtian, Hootan
Publisher: University of British Columbia
Date Issued: 2020
Language: English
Degree: Doctor of Philosophy - PhD
Program: Electrical and Computer Engineering
Degree Grantor: University of British Columbia
Graduation Date: 2020-11
DOI: 10.14288/1.0394147
URI: http://hdl.handle.net/2429/75833
Rights: Attribution-NonCommercial-NoDerivatives 4.0 International (http://creativecommons.org/licenses/by-nc-nd/4.0/)