A Constrained MDP-based Vertical Handoff Decision Algorithm for Wireless Networks by Chi Sun B.Sc., Electrical Engineering, University of Alberta, Canada, 2006 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES (Electrical and Computer Engineering) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) July 2008 c© Chi Sun, 2008 ii Abstract The 4th generation wireless communication systems aim to provide users with the con- venience of seamless roaming among heterogeneous wireless access networks. To achieve this goal, the support of vertical handoff is important in mobility management. This thesis focuses on the vertical handoff decision algorithm, which determines the crite- ria under which vertical handoff should be performed. The problem is formulated as a constrained Markov decision process. The objective is to maximize the expected to- tal reward of a connection subject to the expected total access cost constraint. In our model, a benefit function is used to assess the quality of the connection, and a penalty function is used to model the signaling incurred and call dropping. The user’s velocity and location information are also considered when making the handoff decisions. The policy iteration and Q-learning algorithms are employed to determine the optimal policy. Structural results on the optimal vertical handoff policy are derived by using the concept of supermodularity. We show that the optimal policy is a threshold policy in bandwidth, delay, and velocity. Numerical results show that our proposed vertical handoff decision algorithm outperforms other decision schemes in a wide range of conditions such as vari- ations on connection duration, user’s velocity, user’s budget, traffic type, signaling cost, Abstract iii and monetary access cost. iv Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii List of Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x List of Symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2 Motivations and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.4 Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Contents v 2 CMDP Formulation of the Vertical Handoff Decision Problem . . . . 17 2.1 CMDP Formulation and System Model . . . . . . . . . . . . . . . . . . . 17 2.1.1 States, Actions, and Transition Probabilities . . . . . . . . . . . . 20 2.1.2 Rewards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.1.3 Costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.2 CMDP Optimality Equations . . . . . . . . . . . . . . . . . . . . . . . . 30 2.3 Monotone Optimal Vertical Handoff Policy . . . . . . . . . . . . . . . . . 35 2.3.1 Threshold Structure of Unconstrained MDP . . . . . . . . . . . . 36 2.3.2 Threshold Structure of Constrained MDP . . . . . . . . . . . . . 40 3 Numerical Results and Discussions . . . . . . . . . . . . . . . . . . . . . 43 3.1 Comparison between Different Vertical Handoff Decision Algorithms . . . 43 3.1.1 Results for CBR Voice Traffic over UDP . . . . . . . . . . . . . . 46 3.1.2 Results for FTP Data Traffic over TCP . . . . . . . . . . . . . . . 52 3.2 Structure of the Optimal Policy . . . . . . . . . . . . . . . . . . . . . . . 55 4 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 A Proofs of Lemma 1 and Theorems 1, 2, and 3 . . . . . . . . . . . . . . . 67 A.1 Proof of Lemma 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 Contents vi A.2 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 A.3 Proof of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 A.4 Proof of Theorem 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 vii List of Tables 3.1 Summary of simulation parameters. . . . . . . . . . . . . . . . . . . . . . 44 viii List of Figures 2.1 Coverage area of different networks. . . . . . . . . . . . . . . . . . . . . . 20 3.1 Expected total reward under different discount factor λ for CBR traffic. . 47 3.2 Expected total reward under different mean of user’s velocity µ for CBR traffic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.3 Expected total reward under different user’s budget on expected total ac- cess cost Cmax for CBR traffic. . . . . . . . . . . . . . . . . . . . . . . . . 49 3.4 Expected total reward under different switching cost Ki,a for CBR traffic. 50 3.5 Expected total reward under different access cost of the cellular network C1 for CBR traffic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.6 Expected total reward under different mean of user’s velocity µ for FTP traffic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.7 Expected total reward under different user’s budget on expected total ac- cess cost Cmax for FTP traffic. . . . . . . . . . . . . . . . . . . . . . . . . 53 3.8 Expected total reward under different access cost of the cellular network C1 for FTP traffic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 List of Figures ix 3.9 Structure of the unconstrained MDP optimal vertical handoff policy on the available bandwidth when s = [i = 3, b1 = 1, d1 = 3, d2 = 1, d3 = 3, v = 2, l = 3]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.10 Structure of the unconstrained MDP optimal vertical handoff policy on the packet delay when s = [i = 2, b1 = 1, b2 = 4, b3 = 2, d3 = 3, v = 1, l = 3]. 56 3.11 Structure of the unconstrained MDP optimal vertical handoff policy on the mean of user’s velocity when s = [i = 3, b1 = 3, d1 = 3, b2 = 4, d2 = 1, d3 = 3, l = 3]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 xList of Acronyms 3GPP 3rd Generation Partnership Project 4G 4th Generation ANN Artificial Neural Network CBR Constant Bit Rate CDMA Code Division Multiple Access CMDP Constrained Markov Decision Process CTMDP Continuous-time Markov Decision Process FC Feature Collector FNQD Fuzzy Logic-based Normalized Quantitative Decision FTP File Transfer Protocol GPRS General Packet Radio Service GSM Global System for Mobile Communications List of Acronyms xi HDM Handoff Decision Manager HHO Horizontal Handoff MADM Multiple Attribute Decision Making MDP Markov Decision Process MIH Media Independent Handover MT Mobile Terminal NHM Network Handling Manager PDA Personal Digital Assistant PEV Performance Evaluation Value PIA Policy Iteration Algorithm PR Predicted RSS QoS Quality of Service RSS Received Signal Strength SAW Simple Additive Weighting SINR Signal to Interference and Noise Ratio SMDP Semi-Markov Decision Process List of Acronyms xii SNR Signal to Noise Ratio TCP Transmission Control Protocol TOPSIS Technique for Order Preference by Similarity to Ideal Solu- tion UDP User Datagram Protocol UMTS Universal Mobile Telecommunications System VHDF Vertical Handoff Decision Function VHM Vertical Handoff Manager VHO Vertical Handoff WiMAX Worldwide Interoperability for Microwave Access WLAN Wireless Local Area Network xiii List of Symbols | · | Cardinality of a set × Cartesian product As Action set when the current state is s B Set of available bandwidth bmmax Maximum bandwidth available in network m Cmax User’s monetary budget Cpi(s) Expected discounted total access cost give policy pi and ini- tial state s c(s, a) Cost function given current state s and chosen action a D Set of packet delay values dmmax Maximum packet delay when using network m E Expectation List of Symbols xiv f(s, a) Total benefit function given current state s and chosen action a fb(s, a) Bandwidth benefit function given current state s and chosen action a fd(s, a) Delay benefit function given current state s and chosen action a g(s, a) Total penalty function given current state s and chosen ac- tion a gdrop(s, a) Call dropping penalty function given current state s and cho- sen action a gswitch(s, a) Switching cost penalty function given current state s and chosen action a Ki,a Normalized switching cost from network i to a L Set of location types lmax Maximum number of location types M Set of available network ID N Connection termination time P ( ) Transition probability function List of Symbols xv Ql Total area of location type l Q̂l Effective area of location type l Qβ(s, a) State-action reward function given Lagrange multiplier β, current state s, and chosen action a r(s, a) Reward function given current state s and chosen action a S State space T Set of decision epochs V Set of possible velocity values Vmax Maximum velocity threshold Vmin Minimum velocity threshold vmax Maximum possible velocity vβ(s) Value function given Lagrange multiplier β and initial state s vpi(s) Expected discounted total reward given policy pi and initial state s α Memory level β Lagrange multiplier List of Symbols xvi δ Stationary policy δt Decision rule at time t ζ Gaussian process κ User’s risky index λ Discount factor µ Mean value of velocity Π Set of all possible policies pi Policy pi∗ Optimal policy ρl User density of location type l σ Standard deviation of velocity φ Weight given to the switching cost factor ω Weight given to the available bandwidth factor xvii Acknowledgements First of all, I would like to express my deepest gratitude to my supervisor, Dr. Vincent Wong, for his patient guidance, constant encouragement, and excellent advice throughout my graduate study. Without his invaluable help, this work would not be possible. A special thanks goes to my colleague, Enrique Stevens-Navarro, who provided me with precious assistance during the completion of this thesis. I am also thankful to all my colleagues in Dr. Wong’s group: Yuxia Lin, Amir Hamed Mohsenian-Rad, Vahid Shah- Mansouri, Mani Malekesmaeili, Keivan Ronasi, Man Hon Cheung, and Joo-Han Song, as well as other friends in the data communications group, for sharing their experiences and knowledge during the time of my study. Finally, I take this opportunity to express my profound gratitude to my beloved parents and grandparents for their understanding, patience, support, and endless love during my study in Canada. To them I dedicate this thesis. 1Chapter 1 Introduction Various wireless technologies have evolved rapidly during the past decade. Mobile devices and gadgets (e.g., cellular phones, personal digital assistants (PDAs), laptops) supported by these technologies are becoming more popular and important in people’s everyday life. Wireless local area networks (WLANs), Worldwide Interoperability for Microwave Access (WiMAX), and cellular networks are three paradigms of such technologies in the present wireless realm. WLANs, which are based on the IEEE 802.11 standards, are able to provide services with data rate up to 11 Mbps (802.11b) or 54 Mbps (802.11a/g) at a relatively low access and deployment cost. Moreover, 802.11n promises to offer data rate higher than 100Mbps [1]. However, the coverage area of WLANs is typically less than 100 meters, making them only suitable for hotspot regions such as hotels, libraries, airports, and coffee shops. On the other hand, WiMAX technology can provide broadband wireless services to users in a variety of ways. This technology is based on the IEEE 802.16 standards, in which the 802.16e-2005 (also called mobile WiMAX) introduces the support of mobility. The WiMAXmobile network deployment is expected to provide up to 15Mbps of capacity within a typical cell radius of up to 3 km [2]. Chapter 1. Introduction 2 The typical supported service data rates of cellular networks such as GSM (Global System for Mobile Communications), GPRS (General Packet Radio Service), UMTS (Universal Mobile Telecommunications System), or CDMA2000 (Code Division Multiple Access 2000) are within the range of a few kbps to several Mbps. The cost of accessing and deploying cellular networks is much higher than that of the WLANs. Driven by the complementary characteristics of these wireless technologies (e.g., high- rate, low-cost, small coverage area of WLANs versus low-rate, high-cost, large coverage area of cellular networks), there is a trend to interconnect these access networks into an integrated system. The inspiration behind it, which is also the goal of the 4th Gener- ation (4G) wireless communication systems, is to utilize these different wireless access technologies in order to provide multimedia services to users on an anytime, anywhere basis. Currently, the standardization bodies such as the 3rd Generation Partnership Project (3GPP) [3], 3GPP2 [4], and the IEEE 802.21 Media Independent Handover (MIH) work- ing group [5] are working towards this vision. In 4G communication systems, users will have a variety of choices on the selection of wireless networks to send and/or receive their data. They can either choose to use UMTS to benefit from good quality of ser- vice (QoS), WiMAX to achieve a high data rate, or WLAN to enjoy a moderate access cost. As a result, the users in the 4G communication systems should be able to switch to whichever wireless network they want to use at any time, in a seamless manner. In other words, seamless mobility must be properly managed to achieve the goal of the 4G Chapter 1. Introduction 3 wireless systems. Consider the following scenario. Imagine that you are at a coffee shop, waiting for your friends Amy and Bob to give you a ride to school. You start to check email using your PDA while waiting for them, using the coffee shop’s WLAN. You find out that the bidding of an eBay item that you are interested in is going to end soon, so you go to the eBay website and start to bid on it. In the meantime, your friends arrive and you hop on their car. When the car moves away from the coffee shop, your PDA switches to the UMTS cellular network automatically, keeping you connected to the eBay website. After you have bought the item, Amy picks up your PDA and starts to watch a drama on YouTube, which she left unfinished before she went out. When the car arrives at the campus, the PDA detects the campus-wide WLAN and then switches to it automatically. Due to the higher bandwidth available in WLANs, the video that Amy is watching can now be played in a higher resolution. In the above scenario, in order to maintain the continuity of the sessions (e.g., eBay webpage, video from YouTube), the switching between different networks (e.g., coffee shop’s WLAN to UMTS, UMTS to campus-wide WLAN) is important. This kind of switching is known as vertical handoff, as opposed to horizontal handoff, which is the switching between cells of the same network (e.g., different cells in the UMTS network). In other words, vertical handoff is responsible for service continuity when a connection needs to migrate across heterogeneous wireless access networks, and is also referred as intersystem handoff, while horizontal handoff is responsible for service continuity when Chapter 1. Introduction 4 a connection needs to migrate across homogeneous wireless access networks, and is also referred as intrasystem handoff. Vertical handoff generally involves three phases [6], [7]: system discovery, vertical handoff decision, and vertical handoff execution. During the system discovery phase, the mobile terminal (MT) with multiple radio interfaces receives advertised information from different wireless access networks. Those information may include the access costs and current QoS parameters for different services. Since the QoS provided by each network may change with time, the system discovery phase should be invoked periodically. In the vertical handoff decision phase, the MT determines whether the current con- nection should continue to use the same network or be switched to another one. The decision is based on the information that it received during the system discovery phase, as well as the conditions of its current state (e.g., MT’s current location, velocity, and battery status). In the vertical handoff execution phase, the connections are seamlessly migrated from the existing network to another. This process involves authentication, authorization, and also the transfer of context information. Various vertical handoff execution schemes have been proposed in the literature (e.g., [8], [9], and [10]) with the aim of reducing the time and complexity spent during the vertical handoff procedure. However, the investigation on the vertical handoff execution is beyond the scope of this thesis. Chapter 1. Introduction 5 1.1 Related Work Vertical handoff decision algorithms provide guidance on how to choose a target network when multiple choices are present. In other words, these algorithms specify the aspects that need to be considered when making the vertical handoff decisions. We now sum- marize some of the recent work on vertical handoff decision algorithms in heterogeneous wireless networks. In [11], a vertical handoff decision algorithm based on fuzzy logic-based normalized quantitative decision (FNQD) algorithm is proposed. Network parameters such as the current received signal strength (RSS), predicted RSS (PR), user’s velocity, and the available bandwidth are considered when making the decision. The MT samples the RSS periodically. With the sampled RSSs, the PR can be obtained by the forward differen- tial prediction algorithm. This algorithm also has a pre-decision (PD) module which is applied before the handoff decision module, so that unnecessary data can be filtered out to provide an accurate handoff trigger. Specifically, the MT’s velocity and the PR are considered as PD metrics to filter data for the FNQD. There are three sub-procedures in the FNQD module: fuzzification, normalization, and quantitative decision. First, the input parameters (i.e., current RSS, PR, and user’s velocity) are calculated, and fed into the fuzzification procedure. After this, the FNQD normalizes these parameters to obtain their normalized evaluation values. Then, the FNQD determines a performance evalua- tion value (PEV) for each candidate network. Finally, the handoff decision algorithm is as follows: the greater the PEV of a certain network is, the higher the probability that Chapter 1. Introduction 6 this network becomes the target network to switch to. In [12], a middleware solution called Vertical Handoff Manager (VHM) is implemented to address the vertical handoff decision problem. The architecture of the VHM consists of three components: Network Handling Manager (NHM), Feature Collector (FC), and Artificial Neural Networks (ANN) selector. The NHM is mainly responsible for handling interactivity between VHM and the available wireless networks, as well as between VHM and wireless devices. NHM handles requests made by wireless devices when initiating a handoff, and contacts the available wireless networks to begin the feature collection process. The FC module retrieves fea- tures that are used as attributes to feed into the neural network. These features can be divided into three categories: network-specific, user-specific, and device-specific. The network-specific features include the network’s access cost, security, transmission range, and data rate. The user-specific features are the user’s preferences, such as cost, security level, and power consumption. The device-specific features provide information about the mobile device, and are used for compatibility issues and other technical information necessary when making a handoff decision. After all the features are collected by the FC module, the ANN is utilized to find the best match between what the user wants and what is available across these wireless networks. Once a match is found, the VHM will initiate a message exchange between the mobile device and the chosen wireless network to execute the vertical handoff procedure. In [13], a handoff decision manager (HDM) is proposed for network selection. The Chapter 1. Introduction 7 proposed HDM algorithm considers different factors such as the candidate network’s cost, coverage, data rate, and traffic load. First, the HDM considers the availability of lower overlay networks in order to decide whether to perform downward vertical handoff (e.g., from UMTS to WLAN). If there are available lower overlay networks, HDM calculates the utilities of these candidates and then compares these utilities with that of the current serving network. If one of the candidates has a consistently better utility, then the HDM decides to make a downward vertical handoff to that network. Secondly, if handoff is not invoked, then the algorithm will consider the RSSs and use the conventional 2-level threshold scheme to decide whether to perform horizontal handoff. Finally, if no handoff has been triggered yet, then the HDM will consider the availability of upper overlay networks for upward vertical handoff (e.g. from WLAN to UMTS). Since the aim of an upward vertical handoff is to maintain the connectivity rather than improving QoS, this decision should be made quickly. As a result, the HDM uses RSS as an indication. If the RSS of the current network decays consistently, then the HDM will decide to switch to an upper overlay network. In [14], the vertical handoff decision is made based on the cost function of each can- didate network. A cost function consists of three aspects such as the access network capacity, signalling cost, and the load balancing factor. The access network capacity is calculated based on parameters such as upstream and downstream bandwidth, power consumption, and charging rate. Moreover, the algorithm gives priority to horizontal handoff by including the signalling cost of vertical handoff into the cost function. Fi- Chapter 1. Introduction 8 nally, the local balancing factor is added into the cost function in order to distribute the traffic over different access networks in a manner such that the total resources of the heterogeneous wireless access networks can be used more efficiently. Given the cost function of each access network, the handoff procedure is as follows. First, it creates a set of candidates by choosing the access networks whose RSSs are higher than the predefined threshold. Next, it classifies the candidate networks by their cost functions. Finally, the chosen network is the one with the least cost function value. In [15], the vertical handoff decision algorithm is built upon three conditions: the QoS parameters, the call dropping probability, and the number of ping-pong events. First, the minimum SNR threshold is determined, for which QoS parameters such as the maximum bit error rate, the maximum data block delay, and the minimum data rate should all be satisfied. Then, the minimum downlink and uplink SNR threshold values are determined to jointly guarantee the QoS requirements for all active downlink and uplink multi-service connections of the MT. Secondly, the handoff entry threshold is derived by the average number of ping-pong events during the entry to the WLAN system. The handoff exit threshold is derived by the call dropping probability. Furthermore, the WLAN is selected as the preferred network for the MT. The goal of this handoff decision algorithm is to maximize the time during which the MT is served by the WLAN, while satisfying the QoS requirements as well as the maximum call dropping probability and the maximum average number of ping-pong events constraints. In [16], the vertical handoff decision is formulated as a fuzzy multiple attribute decision Chapter 1. Introduction 9 making (MADM) problem. Two MADM methods are proposed: SAW (Simple Additive Weighting) and TOPSIS (Technique for Order Preference by Similarity to Ideal Solution). In SAW, the overall score of a candidate network is determined by the weighted sum of all attribute values. This scheme is widely used due to its ease of understanding and simplicity in implementation. For example, in [17], a distributed vertical handoff decision scheme is proposed using the SAW algorithm. The scheme includes the dropping probability, bandwidth, and cost as the handoff parameters in order to make a decision. In TOPSIS, the idea is based on the principle that the chosen candidate should have the shortest distance from the ideal solution and the farthest distance from the negative- ideal solution. As a result, the selected candidate network is the one which is the closest to the ideal network, where the property of the ideal network is obtained by using the best values for each metric considered. In [18], a vertical handoff decision algorithm based on fuzzy logic is presented. This scheme uses the following parameters as the decision metric: RSS, SINR, cost, bit error rate, latency, and data transmission rate. In [19], a vertical handoff decision scheme based on ELECTRE is proposed. ELECTRE is an MADM algorithm, which performs pair-wise comparisons among the alternatives. However, ELECTRE assumes a monotonic utility and does not provide a complete ranking of all the alternatives. Since a complete ranking is necessary to find the top ranking candidate network, Bari et al. modified ELECTRE so that it is able to provide complete ranking of alternatives even in scenarios where the utility of some attributes are non-monotonic, and used the adapted ELECTRE to make Chapter 1. Introduction 10 vertical handoff decision. This modification allows the application of the algorithm in scenarios where the decision maker can choose the alternative that has the attributes closest to a reference set of attribute values. The attributes considered in [19] include the bandwidth, delay, packet jitter, packet loss, utilization, and network cost. In [20], a vertical handoff decision algorithm which uses the received SINR from vari- ous access networks as the handoff criterion is proposed. Since the maximum achievable data rate is a function of the SINR, this SINR-based vertical handoff method is appli- cable. The received SINR from the cellular network is converted to the equivalent SINR required to achieve the same data rate in WLAN, and then compared with the actual received SINR from WLAN. With both SINR being considered, vertical handoff is trig- gered when the user can obtain higher equivalent SINR from another access network. It means that given the receiver end SINR measurements of both WLAN and the cellular network, the handoff mechanism has the knowledge of the estimated maximum possible receiving data rate a user can obtain from either WLAN or the cellular network. This gives the vertical handoff mechanism the ability to make handoff decisions with multime- dia QoS considerations, such as to offer the user maximum downlink throughput from the integrated network, or to guarantee the minimum user required data rate during vertical handoff. In [21], a handoff management system which includes several modules and procedures is proposed to make handoff decisions in heterogeneous wireless networks. Some of these modules collect the link-layer and network-layer information which is useful for handoff Chapter 1. Introduction 11 management, and other modules use this information to decide the appropriate time to initiate the handoff and execute the handoff procedure. When the handoff management system starts, it activates the network environment detection module and the traffic mea- surement module. The network environment detection module gathers the information about the available access networks and their performances. For example, it determines how many WLANs are currently available and their signal strengths. Meanwhile, the traffic measurement module determines the type of the current application and its QoS requirement. For instance, it considers whether the current application is a real-time application, and determines its throughput and delay requirements. If there are multiple networks available and the current access network cannot satisfy the QoS requirement of the running application, the handoff decision module will be invoked. It determines the destination network based on the sojourn time of the MT in the candidate networks and these networks’ QoS estimation, including RSS, channel utilization, and link delay/jitter. Based on the output of the handoff decision module, the system will choose to either enter a handoff routine or keep the current connection. In [22], a vertical handoff decision algorithm based on RSS-based handoff trigger scheme is implemented. By applying the autoregressive integrated moving average model, the future RSS values can be predicted. The handoff decision can then be made according to these RSS predictions. In [23], a vertical handoff decision function (VHDF) is proposed for handoff decision. The VHDF is used to measure the improvement gained by switching to a particular network. It considers the following factors: cost of services, security, power Chapter 1. Introduction 12 consumption, network condition, and network performance. As the MT roams across different networks, VHDF is evaluated for all accessible networks. The network with the highest calculated VHDF value is the most desirable network for the users to switch to based on their specified preferences. In [24], a utility-based network selection strategy is presented. It captures the trade- offs between the users’ preferences and their vertical handoff decisions. Several utility functions are examined which explore different users’ attitudes on their current applica- tions. For example, users who are willing to pay more money are more likely to obtain better services. In [25], a vertical handoff decision algorithm based on dynamic programming is pro- posed. Since the enhancement of user’s satisfaction by a vertical handoff depends on the user’s sojourn time in the wireless network (e.g., WLAN), the algorithm takes the user’s location and mobility information into consideration. Two models are proposed to capture the user’s mobility behavior. The open area model assumes that the MT’s velocity is independent on its location, while the roads model assumes that the users are moving along roads, in which the velocity can be considered as a scalar rather than a vector. The user’s velocity and moving pattern are also considered in the vertical handoff deci- sion algorithms in [26] and [27], where the work in [26] is an adaptive multi-criteria vertical handoff decision algorithm based on fuzzy-logic, and the work in [27] is a movement-aware vertical handoff decision algorithm between WLAN and the WiMAX network. Moreover, Chapter 1. Introduction 13 in [28], a framework is proposed to evaluate different vertical handoff decision algorithms, in which the MT’s mobility is modeled by a Markov chain. In [29], a Markov decision process (MDP) approach for vertical handoff decision mak- ing problem is proposed. This MDP approach takes into account multiple factors such as user’s preference, network situation, device capability, and the effects of the environment. In [30], a vertical handoff decision algorithm is proposed. The problem is formulated as an MDP, and the model considers the available bandwidth and delay of the candidate networks. The model in this thesis is an extension of the one proposed in [30], such that it not only considers the QoS of the candidate networks, but also takes the user’s mobility and location information into account. Moreover, this model addresses a practical issue that users have monetary budgets for their connections. 1.2 Motivations and Objectives Although there have been various vertical handoff decision algorithms proposed in the literature, most of them only make decisions based on the current system state (e.g., con- sider only current QoS of the networks and current MT’s conditions). Handoff decisions should also consider the probabilistic outcomes of the future system states as a result of the current decision. Some work (e.g., [25], [29], and [30]) follows this approach; however those algorithms do not take the user’s monetary budget into consideration. In our work, the vertical handoff decision algorithm considers the following aspects: Chapter 1. Introduction 14 1. The state of the environment. This includes the available bandwidth, delay, switch- ing cost, and access cost information of the overlaying networks. 2. The state of the user. This includes the user’s velocity and location information. 3. The preference of the user. 4. The current condition of the system as well as its future possible evolutions. 5. User’s monetary budget. For example, a user may agree to spend at most $3 for a multimedia session with an average duration of 30 minutes. Considering these aspects, we propose a vertical handoff decision algorithm for wire- less networks based on the constrained MDP (CMDP) model. The first three aspects are embedded into the system model, the fourth aspect is realized by the MDP formulation, and the last aspect is viewed as the constraint of the MDP problem. In our proposed CMDP model, a benefit function is used to capture the advantages, which are assessed in terms of bandwidth and delay that a user can gain by choosing an action. The actions available to the user are related to its location. In the meantime, there is a penalty function that represents the potential drawbacks associated with this action. Specifi- cally, the penalty includes the signaling cost incurred and the call dropping probability of taking this action. A reward function is then used to represent the reward that a user can obtain by choosing an action, considering the associated benefit and penalty. An access cost function is used to capture the access cost that a user needs to spend for using a specific network, and the constraint of the problem is that the user has a Chapter 1. Introduction 15 budget on the total access cost for the duration of each connection. The objective of the problem is to determine the policy which maximizes the expected total reward per connection, while the expected total access cost associated with this connection does not exceed the user’s budget for it. In other words, the goal of this problem is to determine the method of choosing networks such that the user can enjoy the best possible QoS during the connection while he/she will not be charged more than his/her budget for this connection. 1.3 Contributions The main contributions of this thesis are as follows [31]: • Our CMDP-based vertical handoff decision algorithm takes into account the re- sources available in different networks (e.g., QoS, switching costs, access costs), and the MT’s information (e.g., location, velocity). A benefit function is used to model the available bandwidth and delay of the connection. A penalty function is used to model the signaling incurred and the call dropping probability. An access cost function is used to capture the access cost of using a specific network. • We determine the optimal vertical handoff policy for decision making via the use of policy iteration and Q-learning algorithms. • We derive structural results regarding the optimal vertical handoff policy, and show that the optimal policy is a threshold policy in available bandwidth, delay, and Chapter 1. Introduction 16 velocity. • We evaluate the performance of our proposed algorithm under different criteria. Numerical results show that our vertical handoff decision algorithm outperforms other decision schemes (e.g., SAW [16], ELECTRE [19]) in a wide range of con- ditions such as variations on connection duration, user’s velocity, user’s budget, traffic type, signaling cost, and monetary access cost. 1.4 Structure of the Thesis The rest of the thesis is organized as follows. The system model and the CMDP formu- lation are described in Chapter 2, in which the structure of the optimal vertical handoff policy is also investigated. The numerical results and discussions are presented in Chapter 3. Conclusions and future work are given in Chapter 4. 17 Chapter 2 CMDP Formulation of the Vertical Handoff Decision Problem In this chapter, we explain how to formulate the vertical handoff decision problem as a CMDP model in Section 2.1. In Section 2.2, we describe how to obtain the optimal solution. In Section 2.3, we investigate the structure of the optimal policy, and show that the optimal policy has structural results on some state variables. 2.1 CMDP Formulation and System Model First, let’s take a look at what an MDP model is [32]. People make decisions every day. Simple examples include whether to upgrade your car with snow tires in the winter, or whether to eat breakfast when you know that you are going for a lunch buffet. These decisions have not only immediate impacts, but also long-term consequences. One can- not make decisions only based on the current conditions, because your current decision will also affect the decisions in the future. If the decision maker does not consider the relationship between current and future decisions and the consequences, he/she may not Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 18 achieve the best overall performance. As an example, in a marathon race, a player decides to sprint at the beginning will quickly lose energy hence results in a poor finish. The MDP is a model for sequential decision making under uncertainty. It takes into account both the outcomes of the current decisions and the future decision making opportunities. The key elements of the MDP model are as follows [32]: 1. A set of decision epochs. 2. A set of system states. 3. A set of available actions. 4. A set of state and action dependent immediate rewards. 5. A set of state and action dependent probabilities. At each decision epoch (i.e., time), the system state provides the decision maker with all necessary information for choosing an action from the set of available actions in that particular state. After the decision maker makes a choice on the action in that state, two things will happen. First, the decision maker receives an immediate reward. Second, the system evolves to a new (and possibly different) state at the next decision epoch, according to some state transition probabilities. Both the immediate reward and the state transition probabilities depend on the state and the chosen action. As the process evolves through time, the decision maker receives a sequence of rewards. A decision rule prescribes a procedure to choose an action in each state at a specified decision epoch. A policy specifies the decision rules for all decision epochs, and thus can Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 19 be viewed as a sequence of decision rules. The goal of the sequential decision problem is to implement a policy such that the sum of the total rewards is maximized. A CMDP model is an MDP model with constraints [33, 34, 35]. In this thesis, we only consider the case where there is only one constraint. In the CMDP model, after choosing an action at a decision epoch, the decision maker receives both an immediate reward and an immediate cost. The goal is to maximize the sum of the expected total rewards while the sum of the expected cost is less than a predefined value. We now describe how the vertical handoff decision problem can be formulated as a finite state, infinite horizon CMDP. A CMDP model can be characterized by six elements: decision epochs, states, actions, transition probabilities, rewards, and costs [33]. At each decision epoch, the MT has to choose an action (i.e., selects a network) based on the current system state (e.g., QoS that can be provided by each candidate network, velocity and location of the MT). With this state and action, the system then evolves to a new state according to a transition probability function. This new state lasts for a period of time until the next decision epoch comes, and then the MT makes a new decision again (i.e., selects a network again). For any action that the MT chooses at each state, there is a reward and a cost associated with it. The goal of each MT is to maximize the expected total reward that it can obtain during the connection, subject to a constraint on its expected total access cost. Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 20 UMTS WLAN WLAN WiMAX WLAN Figure 2.1: Coverage area of different networks. 2.1.1 States, Actions, and Transition Probabilities We represent the decision epochs by T = {1, 2, . . . , N}, where the random variable N indicates the time that the connection terminates. We denote the state space of the MT by S, and we only consider finite number of states that the system can possibly be in. The state of the system contains information such as the current network that the MT connects to, the available bandwidth and delay that the candidate networks can offer, and the velocity and location information of the MT. Specifically, the state space can be expressed as follows: S =M×B1 ×D1 × · · · × B|M| ×D|M| × V × L, where × denotes the Cartesian product, M represents the set of available network ID that the MT can connect to. Bm and Dm denote the set of available bandwidth and delay Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 21 of network m ∈M, respectively. V denotes the set of possible velocity values of the MT, and L denotes the set of location types that the MT can possibly reside in. Since a finite countable state space is being considered in this thesis, the bandwidth and delay can be quantized into multiples of unit bandwidth and unit delay, respectively [33]. Specifically, the set of available bandwidth of network m is: Bm = {1, 2, . . . , bmmax} , m ∈M, where bmmax denotes the maximum bandwidth available to a connection from network m. For example, the unit bandwidth of WLAN and the UMTS network can be 500 kbps and 16 kbps, respectively. Similarly, the set of packet delay of network m is: Dm = {1, 2, . . . , dmmax} , m ∈M, where dmmax denotes the maximum delay provided to a connection by network m. For example, the unit delay of WLAN and the UMTS network can be 50 ms and 20 ms, respectively. The velocity of the MT is also quantized as multiples of unit velocity. The set of possible velocity values is: V = {1, 2, . . . , vmax} , where vmax denotes the maximum velocity that an MT can travel at. For example, the unit of velocity can be 10 km/h. Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 22 For the set of location types that the MT can possibly reside in, we have: L = {1, 2, . . . , lmax} , where lmax denotes the total number of different location types in the area of interest. Location types are differentiated by the number of networks they are covered by. If the MT is within an area of location type l, it implies that there are l kinds of wireless access networks that the MT can choose from. For instance, if an MT is in the location area covered by both WiMAX and UMTS networks, this place falls into the category of location type 2. However, if an MT is in the location area covered by only an UMTS network, this place is a location type 1 area. Let vector s = [i, b1, d1, . . . , b|M|, d|M|, v, l] denote the current state of the system, where i denotes the current network used by the connection, bm and dm denote the current bandwidth and delay of network m, respectively, v denotes the current velocity of the MT, and l denotes the current location type that the MT resides in. At each decision epoch, based on the current state s, the MT chooses an action a ∈ As, where the action set As ⊂ M consists of the ID of the networks that the MT can potentially switch to. Given the current state is s, if the chosen action is a, the probability that the next state becomes s′ = [j, b′1, d ′ 1, . . . , b ′ |M|, d ′ |M|, v ′, l′] is: P [s′ | s, a] =  P [v′ | v]P [l′ | l]∏m∈M P [b′m, d′m | bm, dm], j = a, 0, j 6= a, (2.1) where P [v′ | v] is the transition probability of the MT’s velocity, P [l′ | l] is the transition probability of the MT’s location type, and P [b′m, d ′ m | bm, dm] is the joint transition Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 23 probability of the bandwidth and delay of network m. We explain how we obtain these transition probabilities as follows. The transition probability of the MT’s velocity is obtained based on the Gauss-Markov mobility model from [36]. Since a mobile user usually travels with a destination in mind, and the change of a mobile user’s velocity within a short period of time is limited due to physical restriction, the MT’s future location and velocity are likely to be correlated with its past and current location and velocity. As a result, the memoryless nature of the random-walk model is not suitable to represent the MT’s behavior. Furthermore, the fluid-flow model that represents vehicular traffic is also not suitable to capture the pedestrian behavior of mobile users. In the Gauss-Markov mobility model, an MT’s velocity is assumed to be correlated in time and can be modeled by a discrete Gauss-Markov random process. The following recursive realization is used to calculate the transition probability of the MT’s velocity: v′ = αv + (1− α)µ+ σ √ 1− α2ζ, (2.2) where v is the MT’s velocity at the current decision epoch, v′ is the MT’s velocity at the next decision epoch, α is the memory level (i.e., 0 ≤ α ≤ 1), µ and σ are the mean and standard deviation of v, respectively, and ζ is an uncorrelated Gaussian process with zero mean and unit variance. By varying v and counting the number of different outcomes of v′ according to (2.2), the MT’s velocity transition probability matrix (i.e., P [v′ | v]) can be determined. For the transition probability of the MT’s location type, we assume that a wireless Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 24 access network which has a smaller coverage area (e.g., WLAN) always lies within another network that has a larger coverage area (e.g., WiMAX). Although this assumption may not hold for the cases when |M| is large, it is still reasonable as long as the number of different network types does not exceed three, which is a typical case in today’s wireless communication systems. We define location type l ∈ L as follows. Location type 1 is the area covered only by the UMTS network, location type 2 is the area covered by UMTS and WiMAX, but not WLAN, and location type 3 is the area covered by all three networks (i.e., UMTS, WiMAX, and WLAN). Let Θl denote the total area of location type l and ρl denote the user density of location type l. The effective area of location type l is defined as: Θ̂l = Θl ρl, l ∈ L. (2.3) In practice, the user density in areas covered by different networks (e.g., WLAN and the UMTS network) are usually not the same [37], [38]. For example, the area covered by both WLAN and the UMTS network usually has more active connections than the area only covered by the UMTS network. As a result, the density index of each location type is considered in order to achieve a more realistic model. We assume that an MT currently at location type l can only move to its neighboring location types (i.e., either l + 1 or l − 1) or stay at l at the next decision epoch. This is because the duration of each decision epoch is too short for the MT to traverse more Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 25 than one location type area1. Thus, the probability that an MT’s next location type is l′ given its current location type is l is assumed to be proportional to the effective area of l′. Specifically, the transition probability of an MT’s location type is defined as follows: P [l′ | l] =  Θ̂l′∑ ξ=l,l+1 Θ̂ξ , for l = 1, l′ = 1, 2, Θ̂l′∑ ξ=l−1,l,l+1 Θ̂ξ , for l = 2, . . . , lmax − 1, l′ = l − 1, l, l + 1, Θ̂l′∑ ξ=l−1,l Θ̂ξ , for l = lmax, l ′ = lmax − 1, lmax. (2.4) For the joint transition probabilities of the bandwidth and delay of each network, we use the following approach to estimate them. For the cellular network, the values of bandwidth and delay are assumed to be guaranteed for the duration of the connection: P [b′1, d ′ 1 | b1, d1] =  1, if b′1 = b1, d ′ 1 = d1, 0, otherwise. For WiMAX and WLAN, we estimate such probabilities in a simulation-based manner. In ns-2 simulator [39], typical IEEE 802.16 WiMAX [40] and IEEE 802.11b WLAN are simulated in which the users arrive and depart from the networks with an average Poisson rate of 0.2 users per second. The resulting available bandwidth and delay are rounded according to the predefined units, and then the counting of transitions among states is performed to estimate the state transition probabilities of WiMAX and WLAN (i.e., P [b′2, d ′ 2 | b2, d2] and P [b′3, d′3 | b3, d3], respectively). 1The time between two successive decision epochs is on the order of seconds. Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 26 2.1.2 Rewards When an MT chooses an action a in state s = [i, b1, d1, . . . , b|M|, d|M|, v, l], it receives an immediate reward r(s, a). This reward function is composed of a benefit function and a penalty function, which are explained in detail below. For the benefit function of the MT, two aspects are considered: bandwidth and delay. Let the bandwidth benefit function represent the benefit that an MT can gain (in terms of bandwidth) by selecting action a in state s (recall i denotes the ID of the current network): fb(s, a) =  ba−bi max k∈M {bk−bi} , if ba > bi, 0, if ba = bi, − ba−bi min k∈M {bk−bi} , if ba < bi. (2.5) The benefit is being assessed as follows. Given that the MT is currently connecting to network i, if the action a leads to a network with a higher bandwidth, then the benefit function value is represented by a fraction, in which the numerator is the MT’s actual increase of bandwidth by choosing action a in state s, and the denominator is the MT’s maximum possible increase of bandwidth. As a result, the benefit function value is a positive number between 0 and 1. Similarly, if the action leads to a network with a lower bandwidth, the benefit function value becomes a negative number between -1 and 0. Finally, if the MT chooses to remain at the same network, then the benefit function value is 0. Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 27 Similarly, a delay benefit function is used to represent the benefit that an MT can gain (in terms of delay) by choosing action a in state s: fd(s, a) =  di−da max k∈M {di−dk} , if da < di, 0, if da = di, − di−da min k∈M {di−dk} , if da > di. (2.6) Given that the MT is currently connecting to network i, if the action a leads to a network with a lower delay, then the benefit function value is represented by a fraction, in which the numerator is the MT’s actual decrease of delay by choosing action a in state s, and the denominator is the MT’s maximum possible decrease of delay. As a result, the benefit function value is a positive number between 0 and 1. Similarly, if the action leads to a network with a higher delay, then the benefit function value it can obtain becomes a negative number between −1 and 0. Finally, if the MT chooses to remain at the same network, then the benefit function value is 0. As a result, the total benefit function is given by: f(s, a) = ωfb(s, a) + (1− ω)fd(s, a), s ∈ S, a ∈ As, (2.7) where fb(s, a) and fd(s, a) are both normalized (i.e., between 0 and 1), and ω is the weight given to the bandwidth aspect with 0 ≤ ω ≤ 1. This weight can be set differently for different types of applications (e.g., constant bit rate (CBR) voice traffic, file transfer protocol (FTP) data traffic). Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 28 We consider two factors for the penalty of the MT. First, the switching cost penalty function is represented by: gswitch(s, a) =  Ki,a, if i 6= a, 0, if i = a, (2.8) where Ki,a is the normalized switching cost from network i to network a. This penalty function captures the processing and signaling load incurred when the connection is migrated from one network to another. Second, we define the call dropping penalty function as: gdrop(s, a) =  0, if i = a, 0, if i 6= a, 0 ≤ v ≤ Vmin, v−Vmin Vmax−Vmin , if i 6= a, Vmin < v < Vmax, 1, if i 6= a, v ≥ Vmax, (2.9) where Vmax and Vmin denote the maximum and minimum velocity thresholds, respectively. When the MT moves faster, the probability that the connection will be dropped during the vertical handoff process increases. For example, if an MT moves out of a WLAN with a high speed, it may enter the area covered only by the UMTS network (hence lose the WLAN signal) before the WLAN-to-UMTS vertical handoff procedure is completed. As a result, the MT’s connection will be dropped. Consequently, the total penalty function of an MT is given by: g(s, a) = φgswitch(s, a) + (1− φ)κgdrop(s, a), s ∈ S, a ∈ As, (2.10) Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 29 where φ is the weight given to the switching cost factor with 0 ≤ φ ≤ 1, and κ ∈ [0, 1] is the MT’s risky index. This risky index represents the user’s preference, since some users would allow vertical handoffs in order to obtain better QoS although there is a risk that the connection may be dropped during handoff, whereas some others may refrain from switching whenever a risk is present. Finally, the reward function between two successive vertical handoff decision epochs is: r(s, a) = f(s, a)− g(s, a), s ∈ S, a ∈ As, (2.11) and is normalized within the range from 0 to 1. Recall this reward function considers the bandwidth and delay of all candidate networks, the signaling cost incurred when switching occurs, the call dropping probability, and the user’s risky preference. 2.1.3 Costs For each period of time that the MT uses a network, it will incur the following access cost (in monetary units per unit time): c(s, a) = ba Ca max m∈M {bm Cm} , s ∈ S, a ∈ As, (2.12) where bm is the available bandwidth in bps and Cm is the access cost of network m in monetary units per bit. This access cost is normalized such that its value is between 0 and 1. The user has a budget such that it is willing to spend up to Cmax monetary units per connection. Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 30 2.2 CMDP Optimality Equations In this section, we describe how to obtain the optimal policy. First, some concepts need to be clarified. A decision rule specifies the action selection for each state at a particular decision epoch. It can be expressed as δt : S → A, where δt represents the decision rule at decision epoch t. A policy pi = (δ1, δ2, . . . , δN) is a set of sequential decision rules to be used at all N decision epochs. Let vpi(s) denote the expected total reward between the first decision epoch and the connection termination, given that policy pi is used with initial state s. We can represent vpi(s) as: vpi(s) = Epis [ EN { N∑ t=1 r(st, at) }] , s ∈ S, (2.13) where Epis denotes the expectation with respect to the variables given policy pi and initial state s, and EN denotes the expectation with respect to the random variable N . The random variable N , which denotes the connection termination time, is assumed to be geometrically distributed with mean 1/(1 − λ). Equation (2.13) can be re-written as follows: vpi(s) = Epis { ∞∑ t=1 λt−1 r(st, at) } , s ∈ S, (2.14) where λ can also be interpreted as the discount factor of the model (i.e., 0 ≤ λ < 1). We need to find the policy that leads to this optimal solution while satisfying the expected total access cost constraint. We define a policy pi∗ = (δ∗1, δ ∗ 2, · · ·) to be optimal in Π if vpi ∗ (s) ≥ vpi(s) for all pi ∈ Π, where Π is the set of all possible policies. We assume stationary policy in this model, and a policy is said to be stationary if δt = δ for all t. Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 31 A stationary policy has the form pi = (δ, δ, · · ·), and for convenience we denote pi simply by δ. A policy is said to be deterministic if it chooses an action with certainty at each decision epoch. We refer to stationary deterministic policies as pure policies [32, pp. 22]. Since our objective is to maximize the expected discounted total reward (i.e., vpi(s)) subject to an access cost constraint, we can state the CMDP optimization problem as: maximize vpi(s) = Epis { ∞∑ t=1 λt−1 r(st, at) } , subject to Cpi(s) = Epis { ∞∑ t=1 λt−1 c(st, at) } ≤ Cmax, (2.15) where Cpi(s) denotes the expected discounted total access cost with respect to policy pi and initial state s ∈ S. To solve (2.15) without the constraint on the expected discounted total access cost, we can use the Policy Iteration Algorithm (PIA) [32] to solve the following optimality equations : v(s) = max a∈As { r(s, a) + ∑ s′∈ S λ P [s′ | s, a] v(s′) } , s ∈ S, (2.16) and the corresponding optimal policy is given as: δ∗(s) = arg max a∈As { r(s, a) + ∑ s′∈ S λ P [s′ | s, a] v(s′) } , s ∈ S. (2.17) However, since (2.15) has a constraint in it, we cannot use the PIA directly. We need to first use the Lagrangian approach [33, 41] to reduce the CMDP problem into an equivalent unconstrained MDP problem. By including the Lagrange multiplier β with β > 0, we have: r(s, a; β) = r(s, a)− βc(s, a), s ∈ S, a ∈ As, (2.18) Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 32 Algorithm 1 - The Policy Iteration Algorithm (PIA): 1: Set k = 0, and select an arbitrary decision rule δ0β(s) ∈ As for all s ∈ S. 2: (Policy evaluation) Obtain vkβ(s) for all s ∈ S by solving: (I − λ P) vkβ = rβ, where I is an identity matrix of size |S|. 3: (Policy improvement) Choose δk+1β (s) to satisfy: δk+1β (s) = arg max a∈As { r(s, a; β) + ∑ s′∈ S λ P [s′ | s, a] vkβ(s′) } , for all s ∈ S. 4: If δk+1β (s) = δ k β(s) for all s ∈ S, stop and set δ∗β(s) = δkβ(s). Otherwise, increment k by 1 and return to step 2. where r(s, a; β) is the Lagrangian reward function. After the Lagrangian approach, the new optimality equations are given by: vβ(s) = max a∈As { r(s, a; β) + ∑ s′∈ S λ P [s′ | s, a] vβ(s′) } , s ∈ S, (2.19) which can be solved by using the PIA with a fixed value of β. The procedures of the PIA algorithm is described in Algorithm 1. We denote the vectors vkβ = ( vkβ(s), s ∈ S ) , rβ = ( r ( s, δkβ(s); β ) , s ∈ S), and the matrix P = (P [s′ | s, δkβ(s)], s ∈ S, s′ ∈ S). The solutions of (2.19) correspond to the maximum expected discounted total reward vβ(s) and the pure policy δ ∗ β(s) which satisfies: δ∗β(s) = arg max a∈As { r(s, a; β) + ∑ s′∈ S λ P [s′ | s, a] vβ(s′) } , s ∈ S. (2.20) Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 33 Algorithm 2 - The Q-learning Algorithm (Part I): 1: Set β1 to an arbitrary number greater than zero, and set n = 1. 2: Solve for δ∗βn(s) in (2.20) via PIA for all s ∈ S. 3: Determine Cδ ∗ βn (s) in the following equation: Cδ ∗ βn (s) = c(s, δ∗βn(s)) + ∑ s′∈ S λ P [s′ | s, δ∗βn(s)] Cδ ∗ βn (s′), (2.21) for all s ∈ S. 4: Update the Lagrange multiplier by: βn+1 = βn + 1 n ( C̄δ ∗ βn − Cmax ) , (2.22) where C̄δ ∗ βn = 1 |S| ∑ s∈ S Cδ ∗ βn (s). (2.23) 5: If |βn+1 − βn| ≤ ², stop and set β∗ = βn+1. Otherwise, increment n by 1 and return to step 2. Note the pure policy δ∗β(s) specifies the network to choose in each state s such that the expected discounted total reward is maximized. When the CMDP problem is converted to an unconstrained MDP problem by a Lagrange multiplier β in (2.18), there is a relationship between the constraint (i.e, the user’s budget Cmax) and the Lagrange multiplier (i.e., β). In this thesis, we use the Q-learning algorithm (Algorithm 2) proposed in [41] to determine the proper β (i.e., β∗) for a feasible Cmax. Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 34 Algorithm 3 - The Q-learning Algorithm (Part II): 1: Perturb β∗ by some ∆β to obtain: β− = β∗ −∆β, β+ = β∗ +∆β. 2: Calculate the pure policies δ− and δ+ (using β− and β+) in (2.20) using the PIA. 3: Calculate the average expected discounted total access costs C̄− and C̄+ (using δ− and δ+) in (2.23). 4: Solve for q in the following equation: qC̄− + (1− q)C̄+ = Cmax. (2.24) 5: The optimal policy δ∗ can be obtained as: δ∗ = qδ− + (1− q)δ+. (2.25) Once β∗ has been obtained, we follow the procedure (Algorithm 3) in [34, 41] to find the optimal policy for the CMDP problem. As discussed in [34], the optimal policy for a CMDP with single constraint is a mixed policy of two pure policies. First, we perturb β∗ by some ∆β to obtain β− = β∗ − ∆β and β+ = β∗ + ∆β. Then, we calculate the pure policies δ− and δ+ in (2.20) (using β− and β+, respectively) via PIA and their corresponding average expected discounted total access costs C̄− and C̄+ in (2.23) (using δ− and δ+, respectively). Next, we define a parameter q such that qC− + (1 − q)C+ = Cmax. Thus the optimal policy δ ∗ of the CMDP problem is a randomized mixture of two Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 35 policies (i.e., δ− and δ+), such that at each decision epoch, the first policy is chosen with probability q and the second one is chosen with probability 1 − q. In other words, the optimal policy can be obtained as follows: δ∗ = qδ− + (1− q)δ+. This also can be viewed as: at each decision epoch, the selection to be made between the two policies is by flipping a biased coin, whose head and tail have probabilities q and 1 − q, respectively. If head appears, the optimal policy will be δ−, while if tail appears, the optimal policy will be δ+. 2.3 Monotone Optimal Vertical Handoff Policy In this section, we investigate the structure of the optimal policy. We provide conditions and show that the optimal policy is monotonic (non-decreasing or non-increasing) in some system states (e.g., available bandwidth, delay, velocity). The importance of these results regarding the structure of the optimal policy are as follows [32]: • They are attractive to decision makers. • They are easy to implement. • They enable efficient computation. The last point is especially valuable because when the optimal policy has a simple form, specialized algorithms can be developed to search only among policies that have Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 36 the same form as the optimal policy. A simple example of such kind of policy can be expressed of the form: δ∗(s) =  a1, if s < s ∗, a2, if s ≥ s∗, where a1 and a2 are two different actions and s ∗ is a threshold. This policy can be interpreted as follows. When the state of the system is less than s∗, it is optimal to use action a1; and when the state of the system is s ∗ or greater, it is optimal to use action a2. If an MDP problem has this kind of structure in the optimal policy, the problem of finding an optimal policy reduces to determining the threshold s∗ [32]. Given the proper selection of the Lagrange multiplier β (i.e., β∗), the unconstrained MDP in (2.19) has a stationary optimal policy. It can be shown that for a scenario with two wireless networks, the unconstrained MDP and the CMDP optimal vertical handoff policies are monotone in the available bandwidth, delay, and velocity. Monotonicity and the existence of two actions As = {1, 2} define a threshold policy. The threshold policies are optimal policies with a special structure that facilitates computation and implementation [32]. 2.3.1 Threshold Structure of Unconstrained MDP Since two wireless networks are considered (i.e.,M = {1, 2}), the current system state is given by s = [i, b1, d1, b2, d2, v, l], where i denotes the current network used by the MT, b1 and b2 denote the current available bandwidth in network 1 and 2, respectively, d1 and d2 denote the current delay in network 1 and 2, respectively, v denotes the current velocity Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 37 of the MT, and l denotes the current location type that the MT resides in. Recall that the unconstrained MDP can be solved by using PIA. From Algorithm 1, in each iteration k, we have: vk+1β (s) = max a∈As { r(s, a; β) + ∑ s′∈ S λ P [s′ | s, a] vkβ(s′) } , k = 0, 1, 2, . . . (2.26) and Qk+1β (s, a) = r(s, a; β) + ∑ s′∈ S λ P [s′ | s, a] vkβ(s′), k = 0, 1, 2, . . . (2.27) where we refer vβ(s) as the value function and Qβ(s, a) as the state-action reward func- tion. For any initial state s of vβ(s), the sequence v k β(s) generated by the PIA converges to the optimal expected discounted total reward for all s ∈ S. To establish the mono- tone structure of the optimal policy for any discount factor 0 ≤ λ < 1, the concept of supermodularity needs to be introduced. Definition 1. A function F (x, y) : X × Y → R is supermodular in (x, y) if F (x1, y1) + F (x2, y2) ≥ F (x1, y2) + F (x2, y1) for all x1, x2 ∈ X , y1, y2 ∈ Y, x1 > x2, y1 > y2. If the inequality is reversed, then the function F (x, y) is submodular. If the state-action reward function Qβ(s, a) is supermodular (or submodular) in ac- tion a and another variable in the state s (e.g., b2), then the optimal vertical handoff policy is monotone in that variable (i.e., b2). In fact, supermodularity is a sufficient condition for optimality of monotone policies [32], [42]. Based on Definition 1, if F (x, y) is supermodular (submodular) in (x, y), then y(x) = arg maxy F (x, y) is monotonically non-decreasing (non-increasing) in variable x [42]. Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 38 By supermodularity, we can see that the state-action reward function Qβ(s, a) being submodular in (b2, a), and supermodular in (d2, a) and (v, a) implies that the optimal vertical handoff policy δ∗β(s) is monotonically non-increasing in the available bandwidth, and non-decreasing in the delay and velocity, respectively. The methodology of proving the threshold structure of the optimal policy consists of the following two steps: 1. Proof on the monotonicity of the value function (Lemma 1); 2. Proof on the supermodularity/submodularity of the state-action reward function (Theorems 1, 2, and 3). Then, the threshold structure of the optimal vertical handoff policy follows straightfor- wardly. Lemma 1. For any discount factor 0 ≤ λ < 1, the optimal expected discounted total re- ward (i.e., the value function) is monotonically non-decreasing in the available bandwidth, and non-increasing in the delay and velocity. Proof. Please refer to Appendix A.1. In the following theorems, without loss of generality, we assume the current network in use is network 2 (i.e., i = 2). Theorem 1. For any discount factor 0 ≤ λ < 1, if the value function vβ(s) is a monotonically non-decreasing function of the available bandwidth, and the state-action Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 39 reward function Qβ(s, a) is submodular in (b2, a), that is, Qβc(i, b1, d1, b2, d2, v, l, 2)−Qβc(i, b1, d1, b2, d2, v, l, 1) ≥ Qβc(i, b1, d1, b2 + 1, d2, v, l, 2)−Qβc(i, b1, d1, b2 + 1, d2, v, l, 1), (2.28) then the optimal policy is deterministic and monotonically non-increasing in the available bandwidth component of the state s. Consequently, δ∗β(s) is of the form: δ∗β(i, b1, d1, b2, d2, v, l) =  a 6= i, if 0 ≤ b2 ≤ τb(i, b1, d1, d2, v, l), a = i, if b2 > τb(i, b1, d1, d2, v, l), (2.29) where τb(i, b1, d1, d2, v, l) defines the threshold for the rest of the elements of the state s for a given Lagrange multiplier β. Proof. Please refer to Appendix A.2. Theorem 2. For any discount factor 0 ≤ λ < 1, if the value function vβ(s) is a monotonically non-increasing function of the delay, and the state-action reward function Qβ(s, a) is supermodular in (d2, a), that is, Qβ(i, b1, d1, b2, d2, v, l, 2)−Qβ(i, b1, d1, b2, d2, v, l, 1) ≤ Qβ(i, b1, d1, b2, d2 + 1, v, l, 2)−Qβ(i, b1, d1, b2, d2 + 1, v, l, 1), (2.30) then the optimal policy is deterministic and monotonically non-decreasing in the delay component of the state s. Consequently, δ∗β(s) is of the form: δ∗β(i, b1, d1, b2, d2, v, l) =  a = i, if 0 ≤ d2 ≤ τd(i, b1, d1, b2, v, l), a 6= i, if d2 > τd(i, b1, d1, b2, v, l), (2.31) Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 40 where τd(i, b1, d1, b2, v, l) defines the threshold for the rest of the elements of the state s for a given Lagrange multiplier β. Proof. Please refer to Appendix A.3. Theorem 3. For any discount factor 0 ≤ λ < 1, if the value function vβ(s) is a mono- tonically non-increasing function of the velocity, and the state-action reward function Qβ(s, a) is supermodular in (v, a), that is, Qβ(i, b1, d1, b2, d2, v, l, 2)−Qβ(i, b1, d1, b2, d2, v, l, 1) ≤ Qβ(i, b1, d1, b2, d2, v + 1, l, 2)−Qβ(i, b1, d1, b2, d2, v + 1, l, 1), (2.32) then the optimal policy is deterministic and monotonically non-decreasing in the velocity component of the state s. Consequently, δ∗β(s) is of the form: δ∗β(i, b1, d1, b2, d2, v, l) =  a = i, if 0 ≤ v ≤ τv(i, b1, d1, b2, d2, l), a 6= i, if v > τv(i, b1, d1, b2, d2, l), (2.33) where τv(i, b1, d1, b2, d2, l) defines the threshold for the rest of the elements of the state s for a given Lagrange multiplier β. Proof. Please refer to Appendix A.4. 2.3.2 Threshold Structure of Constrained MDP Having shown the threshold structure of the unconstrained MDP optimal policy and based on the results described in Section 2.3.1 and [34], we can state that the constrained Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 41 optimal vertical handoff policy is a randomized mixture between two threshold policies. Corollary 1. There exists a stationary vertical handoff policy δ∗ that is the optimal solution of the CMDP equivalent to the one given in (2.15) such that δ∗ is a randomized mixture of two threshold vertical handoff policies as follows: δ∗(s) = qδ∗β−(s) + (1− q)δ∗β+(s), s ∈ S, (2.34) where δ∗β− and δ ∗ β+ are two unconstrained optimal policies with Lagrange multipliers β − and β+ that are of the form (2.29) for the available bandwidth, (2.31) for the delay, or (2.33) for the velocity. Proof. These results follow directly from Theorems 1, 2, or 3 and Theorem 4.3 of [34]. The existence of monotone policy with a structure allows the use of more efficient al- gorithms which exploit features such as the structured policy iteration and the structured modified policy iteration (cf. [32]). These algorithms seek the optimal policy δ∗ only in the subset of policies with certain structure (e.g., Πδ ⊂ Π). Consequently, computational effort will be considerably reduced by using structured algorithms. Structured algorithms also facilitate implementation because they can be used to find the threshold values for the optimal policy. As an example, the monotone policy iteration algorithm [32, pp. 259] can be used. When an optimal policy is strictly increasing, the action set As decreases in size with increasing s (e.g., b2) and hence reduces the number of actions which need to be evaluated by the algorithm. If at some state, say b̄2 = τb(i, b1, d1, d2, v, l), the action set As contains only one element (i.e., a∗), then no Chapter 2. CMDP Formulation of the Vertical Handoff Decision Problem 42 further maximization is required because the action will be optimal for all b2 ≥ b̄2. Thus, the threshold value is b̄2 and the optimal action is a ∗ for all b2 ≥ b̄2. 43 Chapter 3 Numerical Results and Discussions 3.1 Comparison between Different Vertical Handoff Decision Algorithms We compare the performance of our proposed CMDP-based vertical handoff decision algorithm with three other schemes. The first one is also based on CMDP but does not consider the impact of velocity in making the decisions (this scheme is denoted by CMDP- w/o-velocity). The second is the SAW algorithm introduced in [16], and the last algorithm we compare with is ELECTRE, which is also an MADM algorithm introduced in [19] for network selection. The performance metric is the expected total reward per connection. Two applications are considered: constant bit rate (CBR) voice traffic over user datagram protocol (UDP), and file transfer protocol (FTP) data traffic over transmission control protocol (TCP). We consider the scenario depicted in Fig. 2.1. There are three networks in the system: network 1 is a cellular network, network 2 is a WiMAX network, and network 3 is a WLAN. For the simulation parameters, the unit of bandwidth is 16 kbps, the unit of Chapter 3. Numerical Results and Discussions 44 delay is 60 ms, and the unit of the MT’s velocity is 8 km/hr. The time between two successive decision epochs is 15 sec. The bandwidth importance weight ω is 0.25 for CBR traffic and 0.9 for FTP traffic. The reason is that CBR traffic is more sensitive to delay, while FTP traffic is elastic. Other simulation parameters are summarized in Table. 3.1. Table 3.1: Summary of simulation parameters. Parameter Description Value b1max Maximum available bandwidth for network 1 3 units b2max Maximum available bandwidth for network 2 4 units b3max Maximum available bandwidth for network 3 5 units C1 Access cost of the cellular network (in monetary units per bit) 2 C2 Access cost of the WiMAX network (in monetary units per bit) 1.5 C3 Access cost of the WLAN (in monetary units per bit) 1.2 d1max Maximum packet delay for network 1 3 units d2max Maximum packet delay for network 2 3 units d3max Maximum packet delay for network 3 3 units Ki,a Switching cost (i, a ∈ {1, 2, 3}, i 6= a) 0.5 |M| Number of networks in the system 3 Continued on next page Chapter 3. Numerical Results and Discussions 45 Table 3.1 – continued from previous page Parameter Description Value Vmax Upper threshold of the MT’s velocity 3 units Vmin Lower threshold of the MT’s velocity 1 unit vmax Maximum possible velocity of the MT 3 units α Memory level for the Gauss-Markov model 0.5 Θ1 Area of location type 1 (in terms of percentage of total area) 50% Θ2 Area of location type 2 (in terms of percentage of total area) 25% Θ3 Area of location type 3 (in terms of percentage of total area) 25% κ Risky index of the user 0.5 µ Mean value of the MT’s velocity 1 unit ρ1 : ρ2 : ρ3 Ratio between user densities 1:1:8 σ Standard deviation of the MT’s velocity 0.1 unit φ Importance weight for switching cost 0.5 Note the bandwidth importance weight ω is 0.25 for CBR traffic and 0.9 for FTP traffic. The reason is that CBR traffic is more sensitive to delay, while FTP traffic is elastic. For the SAW and ELECTRE algorithms, the available bandwidth, delay, switching cost, and the MT’s velocity are considered when calculating the policy. The importance weights for these parameters are consistent with those used in the CMDP model. Once Chapter 3. Numerical Results and Discussions 46 the corresponding vertical handoff policies are calculated by the SAW and ELECTRE algorithms, the PIA is used to obtain the expected total reward achieved by each decision algorithm. The probability q that determines the randomized optimal policy in (2.25) is calcu- lated for different discount factors (i.e., different average connection durations). Specifi- cally, for λ equals to [0.9, 0.95, 0.966, 0.975, 0.98], the corresponding probabilities q are [0.61, 0.42, 0.53, 0.86, 0.46]. Moreover, the user’s budget on the expected total access cost is also predefined for different discount factors. Specifically, for λ equals to [0.9, 0.95, 0.966, 0.975, 0.98], the corresponding predefined constraints Cmax are [2, 4, 6, 8, 10]. 3.1.1 Results for CBR Voice Traffic over UDP The expected total reward of a user under different discount factors for CBR traffic is shown in Fig. 3.1. For all the four schemes considered here, the expected total reward increases as λ becomes larger. This is because the larger λ is, the longer the average du- ration of the connection becomes. With the same constraint on the expected total access cost, the CMDP algorithm achieves the highest expected total reward among the four schemes. For example, when λ equals to 0.975 (i.e., the average duration of connection is 10 mins), for which the predefined constraint is 8 monetary units, the CMDP algorithm achieves 25% higher expected total reward than the CMDP-w/o-velocity algorithm, 93% higher expected total reward than the SAW scheme, and 199% higher expected total reward than the ELECTRE algorithm. The reason is that when an MT does not con- Chapter 3. Numerical Results and Discussions 47 0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0 2 4 6 8 10 12 Discount Factor ( λ ) Ex pe ct ed T ot al R ew ar d Constrained MDP Constrained MDP w/o Velocity Simple Additive Weighting ELECTRE 25% 93% 199% Figure 3.1: Expected total reward under different discount factor λ for CBR traffic. sider its velocity, the connection might be dropped during the handoff process and needs to be re-established. The associated QoS degradation and extra signaling and process- ing costs decrease the actual reward it will gain by performing the handoff. SAW and ELECTRE algorithms achieve lower reward than the CMDP-w/o-velocity scheme, be- cause they do not consider the user’s budget for the connection, neither the long term effect of the action. In other words, SAW and ELECTRE choose actions only based on the instantaneous reward rather than the expected total reward. Fig. 3.2 shows the expected total reward of a user versus the mean of its velocity un- Chapter 3. Numerical Results and Discussions 48 8 10 12 14 16 18 20 22 24 0 1 2 3 4 5 6 7 8 Mean of User’s Velocity ( µ ) Ex pe ct ed T ot al R ew ar d Constrained MDP Constrained MDP w/o Velocity Simple Additive Weighting ELECTRE Figure 3.2: Expected total reward under different mean of user’s velocity µ for CBR traffic. der a budget of 8 monetary units for CBR traffic. As the user moves faster, the expected total reward of the CMDP algorithm decreases slightly. This is because the CMDP al- gorithm effectively avoids dropped calls by taking the user’s velocity into consideration. For example, handoffs are only performed when the user’s velocity is not likely to cause a dropped call. For the CMDP-w/o-velocity algorithm, the expected total reward de- creases as the user’s velocity increases. The reason is that as the user’s velocity becomes larger, the decrease in the actual reward (i.e., QoS degradation and extra signaling and processing costs) associated with the issue that the model does not consider the effect of user’s velocity becomes more significant. The SAW and ELECTRE algorithms still Chapter 3. Numerical Results and Discussions 49 3 4 5 6 7 8 9 10 11 12 13 0 5 10 15 User’s Budget on Expected Total Access Cost ( C max ) Ex pe ct ed T ot al R ew ar d Constrained MDP Constrained MDP w/o Velocity Simple Additive Weighting ELECTRE Figure 3.3: Expected total reward under different user’s budget on expected total access cost Cmax for CBR traffic. achieve lower expected total reward. The expected total reward a user can obtain versus its budget on the expected total access cost for CBR traffic is shown in Fig. 3.3. As the user’s budget increases, the expected total reward becomes larger. This occurs because the more money that a user can spend on a connection, the more reward it will obtain. For the same budget, the CMDP algorithm always achieves higher reward than the CMDP-w/o-velocity, SAW, and ELECTRE schemes. The reason is that the CMDP algorithm can fully utilize the user’s budget and avoid dropped calls to achieve the optimal reward, while the total reward obtained by the CMDP-w/o-velocity, SAW, and ELECTRE schemes are reduced because Chapter 3. Numerical Results and Discussions 50 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 1 2 3 4 5 6 7 8 9 10 Switching Cost ( Ki,a ) Ex pe ct ed T ot al R ew ar d Constrained MDP Constrained MDP w/o Velocity Simple Additive Weighting ELECTRE Figure 3.4: Expected total reward under different switching cost Ki,a for CBR traffic. of the dropped connections. Fig. 3.4 shows the expected total reward of users under different switching cost for CBR traffic. The budget used here is 8 monetary units. As we can see from the graph, when the switching cost (i.e, Ki,a) increases, the expected total reward of all four schemes decrease. The expected total reward of the CMDP algorithm decreases slower than the CMDP-w/o-velocity algorithm does. This is because as the switching costs increase, the decrease on the actual reward achieved by the CMDP-w/o-velocity scheme is also larger. Since for the same number of dropped calls, the extra signaling and processing costs increase as Ki,a becomes larger. Fig. 3.5 shows the expected total reward of a user versus the access cost of the cellular Chapter 3. Numerical Results and Discussions 51 0 0.5 1 1.5 2 2.5 3 3.5 4 0 5 10 15 Access Cost of the Cellualr Network ( C1 ) Ex pe ct ed T ot al R ew ar d Constrained MDP Constrained MDP w/o Velocity Simple Additive Weighting ELECTRE Figure 3.5: Expected total reward under different access cost of the cellular network C1 for CBR traffic. network under a budget of 8 monetary units for CBR traffic. As C1 increases (while C2 and C3 are fixed), the expected total reward become smaller for all four algorithms. The reason is that in order to take advantage of the cellular network, users need to pay more as the price of the cellular network increases. From another point of view, this can also be seen as the user’s budget becomes smaller. Thus, the expected total reward of the user decreases. For the same constraint on the expected total access cost, the CMDP scheme achieves better expected total reward than the CMDP-w/o-velocity, SAW, and ELECTRE schemes. Chapter 3. Numerical Results and Discussions 52 8 10 12 14 16 18 20 22 24 0 1 2 3 4 5 6 7 8 Mean of User’s Velocity ( µ ) Ex pe ct ed T ot al R ew ar d Constrained MDP Constrained MDP w/o Velocity Simple Additive Weighting ELECTRE Figure 3.6: Expected total reward under different mean of user’s velocity µ for FTP traffic. 3.1.2 Results for FTP Data Traffic over TCP Fig. 3.6 shows the expected total reward of a user versus the mean of its velocity under a budget of 8 monetary units for FTP traffic. Note when the MT’s velocity is high (µ = 24 km/h), the expected total reward achieved by the CMDP algorithm with FTP traffic is smaller than that achieved with CBR traffic. The reason is when an MT moves faster, the probability that it can switch from the cellular network to WiMAX or WLAN is lower. As a result, it cannot take advantage of the high bandwidth in WiMAX and WLAN (which is crucial for FTP traffic that relies on high bandwidth); hence is not able to achieve a high expected total reward when it is moving fast. Chapter 3. Numerical Results and Discussions 53 3 4 5 6 7 8 9 10 11 12 13 0 5 10 15 User’s Budget on Expected Total Access Cost ( C max ) Ex pe ct ed T ot al R ew ar d Constrained MDP Constrained MDP w/o Velocity Simple Additive Weighting ELECTRE Figure 3.7: Expected total reward under different user’s budget on expected total access cost Cmax for FTP traffic. The expected total reward a user can obtain versus its budget on the expected total access cost for FTP traffic is shown in Fig. 3.7. Similar to the CBR traffic case, we can see for the same budget, the CMDP algorithm always achieves higher expected total reward than the CMDP-w/o-velocity, SAW, and ELECTRE schemes. Note the expected total reward decreases dramatically when the user’s budget is below 3.5 monetary units. This also happens in Fig. 3.3, however for CBR traffic this decrease is less noticeable. Fig. 3.8 shows the expected total reward of a user versus the access cost of the cellular network under a budget of 8 monetary units for FTP traffic. For the same constraint on the expected total access cost, the CMDP scheme achieves higher expected total reward Chapter 3. Numerical Results and Discussions 54 0 0.5 1 1.5 2 2.5 3 3.5 4 0 5 10 15 Access Cost of the Cellualr Network ( C1 ) Ex pe ct ed T ot al R ew ar d Constrained MDP Constrained MDP w/o Velocity Simple Additive Weighting ELECTRE Figure 3.8: Expected total reward under different access cost of the cellular network C1 for FTP traffic. than the CMDP-w/o-velocity, SAW, and ELECTRE schemes. Note for FTP traffic, the reward does not decrease as much as in the CBR traffic case shown in Fig. 3.5. The reason is that since FTP traffic is more sensitive to bandwidth, the MT will sojourn more time in the WiMAX network and WLAN (i.e., less time in the cellular network) than in the CBR traffic case, such that it can take advantages of their higher bandwidth. As a result, the increase of the cellular network access cost does not influence the expected total reward in the FTP traffic case as much as in the CBR traffic case. Chapter 3. Numerical Results and Discussions 55 1 2 3 4 5 1 2 3 4 0 1 2 3 b3 b2 O pt im al A ct io n Figure 3.9: Structure of the unconstrained MDP optimal vertical handoff policy on the available bandwidth when s = [i = 3, b1 = 1, d1 = 3, d2 = 1, d3 = 3, v = 2, l = 3]. 3.2 Structure of the Optimal Policy Fig. 3.9 shows the structure of the unconstrained MDP optimal vertical handoff policy by varying the bandwidth of network 2 and network 3 (i.e., WiMAX and WLAN, re- spectively). The condition of the current state s is [i = 3, b1 = 1, d1 = 3, d2 = 1, d3 = 3, v = 2, l = 3]. We can see for the threshold structure of the optimal policy, given a fixed value of b2 (e.g., b2 = 3), the optimal policy chooses network 3 when b3 ≥ 3, and selects network 2 when b3 < 3. Similarly, for a fixed b3 (e.g., b3 = 2), the optimal policy chooses network 2 when b2 ≥ 3, and selects network 3 when b2 < 3. Chapter 3. Numerical Results and Discussions 56 1 2 3 1 2 3 0 1 2 d1 d2 O pt im al A ct io n Figure 3.10: Structure of the unconstrained MDP optimal vertical handoff policy on the packet delay when s = [i = 2, b1 = 1, b2 = 4, b3 = 2, d3 = 3, v = 1, l = 3]. Fig. 3.10 shows the structure of the unconstrained MDP optimal vertical handoff policy by varying the delay of network 1 and network 2 (i.e., cellular network andWiMAX, respectively). The condition of the current state s is [i = 2, b1 = 1, b2 = 4, b3 = 2, d3 = 3, v = 1, l = 3]. The threshold structure of the optimal policy shows that, for a fixed value of d2 (e.g., d2 = 3), the optimal policy chooses network 2 when d1 ≥ 3, and selects network 1 when d1 < 3. Similarly, for a fixed d1 (e.g., d1 = 2), the optimal policy chooses network 1 when d2 ≥ 3, and selects network 2 when d2 < 3. Fig. 3.11 shows the structure of the unconstrained MDP optimal vertical handoff policy by varying the bandwidth of network 3 (i.e., WLAN) and the mean value of the Chapter 3. Numerical Results and Discussions 57 1 2 3 1 2 3 4 5 0 1 2 3 v b3 O pt im al A ct io n Figure 3.11: Structure of the unconstrained MDP optimal vertical handoff policy on the mean of user’s velocity when s = [i = 3, b1 = 3, d1 = 3, b2 = 4, d2 = 1, d3 = 3, l = 3]. MT’s velocity. The condition of the current state s is [i = 3, b1 = 3, d1 = 3, b2 = 4, d2 = 1, d3 = 3, l = 3]. We can see for the threshold structure of the optimal policy, given a fixed value of b3 (e.g., b3 = 4), the optimal policy chooses not to handoff when v ≥ 2, and chooses to handoff when v < 2. If b3 = 2, the optimal policy does not perform vertical handoff when v ≥ 3, and chooses to handoff when v < 3. 58 Chapter 4 Conclusions and Future Work 4.1 Conclusions In this thesis, we proposed a CMDP-based vertical handoff decision algorithm for 4G het- erogeneous wireless networks. Our work considered the connection duration, the available bandwidth and delay of the candidate networks, MT’s velocity and location information, signaling load incurred on the network, network access cost, user’s preference, and user’s monetary budget for the vertical handoff decision. The algorithm is based on the CMDP formulation with the objective of maximizing the expected total reward of a connection. The constraint of the problem is that users have monetary budgets for their connections. By using the PIA and Q-learning algorithm, a stationary randomized policy is obtained when the connection termination time is geometrically distributed. Structural results on the optimal vertical handoff policy are derived by using the concept of supermodularity. We showed that the optimal policy is a threshold policy in the available bandwidth, delay, and velocity. Numerical results showed that the proposed CMDP-based vertical handoff decision algorithm outperforms other decision schemes in a wide range of conditions such as variations on connection Chapter 4. Conclusions and Future Work 59 duration, user’s velocity, user’s budget, traffic type, signaling cost, and monetary access cost. 4.2 Future Work For future possible extension of this work, the Semi-Markov decision process (SMDP) can be used to formulate the vertical handoff decision problem. In the current work, the time between each decision epoch (i.e., state transition time) is fixed. By using the SMDP, the time between state transitions may follow an arbitrary probability distribution. A special case of the SMDP is the continuous-time Markov decision process (CTMDP), in which the inter-transition times are exponentially distributed. By using this algorithm, a more realistic system model can be formulated. Another extension of this work can be in the direction of considering more constraints. For example, in addition to the constraint that users have monetary budgets for their connections, the power consumed for making a call should be less than some threshold too. In this case, the problem becomes a CMDP with multiple constraints [43]. 60 Bibliography [1] IEEE 802.11 Wireless Local Area Networks, http://www.ieee802.org/11/. [2] WiMAX Technology, http://www.wimaxforum.org/technology/. [3] 3rd Generation Partnership Project (3GPP), http://www.3gpp.org. [4] 3rd Generation Partnership Project 2 (3GPP2), http://www.3gpp2.org. [5] IEEE 802.21 Media Independent Handover Working Group, http://www.ieee802.org/21/. [6] J. McNair and F. Zhu, “Vertical Handoffs in Fourth-generation Multi-network En- vironments,” IEEE Wireless Communications, vol. 11, no. 3, pp. 8–15, June 2004. [7] W. Chen, J. Liu, and H. Huang, “An Adaptive Scheme for Vertical Handoff in Wireless Overlay Networks,” in Proc. of ICPAD’04, Newport Beach, CA, July 2004. [8] W. Wu, N. Banerjee, K. Basu, and S. Das, “SIP-based Vertical Handoff between WWANs and WLANs,” IEEE Communications Magazine, vol. 12, no. 3, pp. 66–72, June 2005. Bibliography 61 [9] Y.-C. Hsu and P.-F. Tsai, “A Practical Mechanism for Vertical Handoff in WLAN/3G Integrated Networks,” in Proc. of IEEE VTC’06-Spring, Melbourne, Australia, May 2006. [10] V. Azhari, M. Smadi, and T. Todd, “Fast Client-Based Connection Recovery for Soft WLAN-to-Cellular Vertical Handoff,” IEEE Transactions on Vehicular Technology, vol. 57, no. 2, pp. 1089–1102, March 2008. [11] L. Xia, L. G. Jiang, and C. He, “A Novel Fuzzy Logic Vertical Handoff Algorithm with Aid of Differential Prediction and Pre-Decision Method,” in Proc. of IEEE ICC’07, Glasgow, Scotland, June 2007. [12] N. Nasser, S. Guizani, and E. Al-Masri, “Middleware Vertical Handoff Manager: A Neural Network-based Solution,” in Proc. of IEEE ICC’07, Glasgow, Scotland, June 2007. [13] M. Yildirim and F. Alagoz, “A New Handover Decision Management Algorithm in Wireless Overlay Networks,” in Proc. of IEEE GLOBECOM’06, San Francisco, CA, November 2006. [14] M. Mani and N. Crespi, “Handover Criteria Considerations in Future Convergent Networks,” in Proc. of IEEE GLOBECOM’06, San Francisco, CA, November 2006. [15] A. V. Garmonov, S. H. Cheon, D. H. Yim, K. T. Han, Y. S. Park, A. Y. Savinkov, S. A. Filin, S. N. Moiseev, and M. S. Kondakov, “QoS-Oriented Intersystem Han- Bibliography 62 dover between IEEE 802.11b and Overlay Networks,” IEEE Transactions on Vehic- ular Technology, vol. 57, no. 2, pp. 1142–1154, March 2008. [16] W. Zhang, “Handover Decision Using Fuzzy MADM in Heterogeneous Networks,” in Proc. of IEEE WCNC’04, Atlanta, GA, March 2004. [17] R. Tawil, O. Salazar, and G. Pujolle, “Vertical Handoff Decision Scheme Using MADM for Wireless Networks,” in Proc. of IEEE WCNC’08, Las Vegas, NV, March/April 2008. [18] M. Stoyanova and P. Mahonen, “Algorithmic Approaches for Vertical Handoff in Heterogeneous Wireless Environment,” in Proc. of IEEE WCNC’07, Hong Kong, China, March 2007. [19] F. Bari and V. Leung, “Application of ELECTRE to Network Selection in a Hetero- geneous Wireless Network Environment,” in Proc. of IEEE WCNC’07, Hong Kong, China, March 2007. [20] K. Yang, I. Gondal, B. Qiu, and L. S. Dooley, “Combined SINR Based Vertical Hand- off Algorithm for Next Generation Heterogeneous Wireless Networks,” in Proc. of IEEE GLOBECOM’07, Washington, DC, November 2007. [21] M. Liu, Z. Li, X. Guo, and E. Dutkiewicz, “Performance Analysis and Optimization of Handoff Algorithms in Heterogeneous Wireless Networks,” IEEE Transaction on Mobile Computing, vol. 7, no. 7, pp. 846–857, July 2008. Bibliography 63 [22] S. Chien, H. Liu, A. L. Y. Low, C. Maciocco, and Y. Ho, “Smart Predictive Trigger for Effective Handover in Wireless Networks,” in Proc. of IEEE ICC’08, Beijing, China, May 2008. [23] N. Nasser, A. Hasswa, and H. Hassanein, “Handoff in Fourth Generation Hetero- geneous Networks,” IEEE Communications Magazine, vol. 44, no. 10, pp. 96–103, October 2006. [24] O. Ormond, J. Murphy, and G. Muntean, “Utility-based Intelligent Network Se- lection in Beyond 3G Systems,” in Proc. of IEEE ICC’06, Istanbul, Turkey, June 2006. [25] J. Zhang, H. C. Chan, and V. Leung, “A Location-Based Vertical Handoff Decision Algorithm for Heterogeneous Mobile Networks,” in Proc. of IEEE GLOBECOM’06, San Francisco, CA, November 2006. [26] Q. Guo, J. Zhu, and X. Xu, “An Adaptive Multi-criteria Vertical Handoff Decision Algorithm for Radio Heterogeneous Networks,” in Proc. of IEEE ICC’05, Seoul, Korea, May 2005. [27] W. Lee, E. Kim, J. Kim, I. Lee, and C. Lee, “Movement-Aware Vertical Handoff of WLAN and Mobile WiMAX for Seamless Ubiquitous Access,” IEEE Transactions on Consumers Electronics, vol. 53, no. 4, pp. 1268–1275, November 2007. Bibliography 64 [28] A. Zahran and B. Liang, “Performance Evaluation Framework for Vertical Handoff Algorithms in Heterogeneous Networks,” in Proc. of IEEE ICC’05, Seoul, Korea, May 2005. [29] J. Wang, R. V. Prasad, and I. Niemegeers, “Solving the Incertitude of Vertical Handovers in Heterogeneous Mobile Wireless Network Using MDP,” in Proc. of IEEE ICC’08, Beijing, China, May 2008. [30] E. Stevens-Navarro, Y. Lin, and V. W. S. Wong, “An MDP-based Vertical Handoff Decision Algorithm for Heterogeneous Wireless Networks,” IEEE Transactions on Vehicular Technology, vol. 57, no. 2, pp. 1243–1254, March 2008. [31] C. Sun, E. Stevens-Navarro, and V. Wong, “A Constrained MDP-based Vertical Handoff Decision Algorithm for 4G Wireless Networks,” in Proc. of IEEE ICC’08, Beijing, China, May 2008. [32] M. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Program- ming. John Wiley and Sons, 1994. [33] E. Altman, Constrained Markov Decision Processes. Chapman and Hall, 1999. [34] F. Beutler and K. Ross, “Optimal Policies for Controlled Markov Chains with a Constraint,” J. Math. Anal. Appl., vol. 112, pp. 236–252, 1985. Bibliography 65 [35] D. A. Dolgov and E. H. Durfee, “A Practical Mechanism for Vertical Handoff in WLAN/3G Integrated Networks,” in Proc. of the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI-05), Edinburgh, Scotland, August 2005. [36] B. Liang and Z. Haas, “Predictive Distance-Based Mobility Management for Mul- tidimensional PCS Networks,” IEEE/ACM Transactions on Networking, vol. 11, no. 5, pp. 718–732, October 2003. [37] S. Tang and W. Li, “Performance Analysis of the 3G Network with Complementary WLANs,” in Proc. of IEEE GLOBECOM’05, St. Louis, MO, November 2005. [38] A. Doufexi, E. Tameh, A. Nix, S. Armour, and A. Molina, “Hotspot Wireless LANs to Enhance the Performance of 3G and Beyond Cellular Networks,” IEEE Commu- nications Magazine, vol. 41, no. 7, pp. 58–65, July 2003. [39] The Network Simulator - ns-2, http://www.isi.edu/nsnam/ns. [40] WiMAX Module for ns-2 Simulator, http://ndsl.csie.cgu.edu.tw/wimaxns2.php. [41] V. Djonin and V. Krishnamurthy, “Q-Learning Algorithms for Constrained Markov Decision Process with Randomized Monotone Policies: Application to MIMO Trans- mission Control,” IEEE Transactions on Signal Processing, vol. 55, no. 5, pp. 2170– 2181, May 2007. [42] D. M. Topkis, Supermodularity and Complementary. Princeton University Press, 1998. Bibliography 66 [43] A. K. Karmokar, D. V. Djonin, and V. K. Bhargava, “Optimal and Suboptimal Packet Scheduling over Correlated Time Varying Flat Fading Channels,” IEEE Transactions on Wireless Communications, vol. 5, no. 2, pp. 446–456, February 2006. 67 Appendix A Proofs of Lemma 1 and Theorems 1, 2, and 3 A.1 Proof of Lemma 1 We will prove vβ(i, b1, d1, b2, d2, v, l) is monotone in available bandwidth, delay, and ve- locity. This consists of two steps: 1. To prove the reward function r(s, a) is monotone in available bandwidth, delay, and velocity; 2. To prove the sum of transition probabilities ∑ s′∈S P [s ′ | s, a] is monotone in avail- able bandwidth, delay, and velocity. First, the only part that relates to the bandwidth in the reward function (i.e., r(s, a)) is fb(s, a). Let b 1 a and b 2 a be two possible bandwidth values, and b 1 a ≥ b2a. We denote fb(s 1, a) as the value of fb(s, a) when ba = b 1 a, and fb(s 2, a) as the value of fb(s, a) when ba = b 2 a. Clearly from the definition of fb(s, a), fb(s 1, a) is greater than (or equal to) fb(s 2, a), since fb(s, a) is linear proportional to ba. As a result, the reward function r(s, a) is monotonically non-decreasing in the available bandwidth. Similarly, the only part that relates to the delay in the reward function is fd(s, a). Let d1a and d 2 a be two possible delay values, and d 1 a ≥ d2a. We denote fd(s1, a) as the value Appendix A. Proofs of Lemma 1 and Theorems 1, 2, and 3 68 of fd(s, a) when da = d 1 a, and fd(s 2, a) as the value of fd(s, a) when da = d 2 a. Clearly from the definition of fd(s, a), fd(s 1, a) is smaller than (or equal to) fd(s 2, a), since fd(s, a) is linear inverse proportional to da. As a result, the reward function r(s, a) is monotonically non-increasing in the delay. For the velocity, the only part that relates to it in the reward function is −q(s, a). As we can see from the definition of q(s, a) (2.9), when the velocity v becomes larger, the value of q(s, a) becomes larger too (or stays the same), which means −q(s, a) becomes smaller (or stays the same). Consequently, the reward function r(s, a) is monotonically non-increasing in velocity. Second, we assume the transition probability function P [s′ | s, a] satisfies the first order stochastic dominance condition. This implies when the system is in a better state (e.g., larger bandwidth, lower delay), its evolution will be in the region of better states with a higher probability. Specifically, take bandwidth for example, this implies the probability that the bandwidth in the current decision epoch exceeds some value will be greater when the bandwidth in the previous decision epoch is greater. In other words, the bandwidth in the next decision epoch is stochastically increasing with respect to the bandwidth in the current decision epoch. This is the condition under which the sum of transition probabilities (i.e., ∑ s′∈S P [s ′ | s, a]) is monotonically non-decreasing in the available bandwidth. Similarly, that the delay (velocity) in the next decision epoch is stochastically decreasing with respect to the delay (velocity) in the current decision epoch is the condition under which the sum of transition probabilities (i.e., ∑ s′∈S P [s ′ | s, a]) Appendix A. Proofs of Lemma 1 and Theorems 1, 2, and 3 69 is monotonically non-increasing in the delay (velocity). ¥ A.2 Proof of Theorem 1 To show that the optimal policy is monotonically non-increasing in the available band- width, we need to prove inductively that Qβ(s, a) is submodular in (b2, a). We will prove via mathematical induction that, for a suitable initialization, Qk+1β ([i, b1, d1, b2, d2, v, l], 2)−Qk+1β ([i, b1, d1, b2, d2, v, l], 1) = r([i, b1, d1, b2, d2, v, l], 2; β)− r([i, b1, d1, b2, d2, v, l], 1; β) + ∑ s′∈ S λP [b′1, d ′ 1 | b1, d1]P [b′2, d′2 | b2, d2]P [v′ | v]P [l′ | l] × (vkβ(j, b′1, d′1, b′2, d′2, v′, l′)− vkβ(j, b′1, d′1, b′2, d′2, v′, l′)) , (A.1) is monotonically non-increasing in available bandwidth b2. It holds if vβ(j, b ′ 1, d ′ 1, b ′ 2, d ′ 2, v ′, l′) has non-increasing difference in b2. Select v 0 β(j, b ′ 1, d ′ 1, b ′ 2, d ′ 2, v ′, l′) with non-increasing dif- ference in b2. Assume that v k β(j, b ′ 1, d ′ 1, b ′ 2, d ′ 2, v ′, l′) has non-increasing difference in b2, which implies that Qk+1β ([i, b1, d1, b2, d2, v, l], a) is submodular in (b2, a). We will now prove that vk+1β (j, b ′ 1, d ′ 1, b ′ 2, d ′ 2, v ′, l′) also has non-increasing difference in b2. That is, vk+1β (i, b1, d1, b2 + 1, d2, v, l)− vk+1β (i, b1, d1, b2, d2, v, l) ≤ vk+1β (i, b1, d1, b2, d2, v, l)− vk+1β (i, b1, d1, b2 − 1, d2, v, l), (A.2) or vk+1β (i, b1, d1, b2 + 1, d2, v, l)− vk+1β (i, b1, d1, b2, d2, v, l) −(vk+1β (i, b1, d1, b2, d2, v, l)− vk+1β (i, b1, d1, b2 − 1, d2, v, l)) ≤ 0. (A.3) Appendix A. Proofs of Lemma 1 and Theorems 1, 2, and 3 70 Let’s assume: vk+1β (i, b1, d1, b2 + 1, d2, v, l) = Q k+1 β ([i, b1, d1, b2 + 1, d2, v, l], a2), vk+1β (i, b1, d1, b2, d2, v, l) = Q k+1 β ([i, b1, d1, b2, d2, v, l], a1), and vk+1β (i, b1, d1, b2 − 1, d2, v, l) = Qk+1β ([i, b1, d1, b2 − 1, d2, v, l], a0), for some actions a2, a1, a0 ∈ As. Thus, we can re-write (A.2) as: Qk+1β ([i, b1, d1, b2 + 1, d2, v, l], a2)−Qk+1β ([i, b1, d1, b2, d2, v, l], a1) − ( Qk+1β ([i, b1, d1, b2, d2, v, l], a1)−Qk+1β ([i, b1, d1, b2 − 1, d2, v, l], a0) ) ≤ 0, (A.4) or Qk+1β ([i, b1, d1, b2 + 1, d2, v, l], a2)−Qk+1β ([i, b1, d1, b2, d2, v, l], a2)︸ ︷︷ ︸ W1 +Qk+1β ([i, b1, d1, b2, d2, v, l], a2)−Qk+1β ([i, b1, d1, b2, d2, v, l], a1)︸ ︷︷ ︸ X1 −Qk+1β ([i, b1, d1, b2, d2, v, l], a1) +Qk+1β ([i, b1, d1, b2, d2, v, l], a0)︸ ︷︷ ︸ Y1 − ( Qk+1β ([i, b1, d1, b2, d2, v, l], a0)−Qk+1β ([i, b1, d1, b2 − 1, d2, v, l], a0︸ ︷︷ ︸) ) Z1 ≤ 0, where X1 ≤ 0 and Y1 ≤ 0 by optimality. Note that in X1 and Y1 the optimal action is a1. Appendix A. Proofs of Lemma 1 and Theorems 1, 2, and 3 71 In addition, it follows from the induction hypothesis that: W1 = r([i, b1, d1, b2 + 1, d2, v, l], a2; β)− r([i, b1, d1, b2, d2, v, l], a2; β) + ∑ s′∈ S λ ( P [b′1, d ′ 1 | b1, d1]P [b′2, d′2 | b2 + 1, d2]P [v′ | v]P [l′ | l]vkβ(j, b′1, d′1, (b2 + 1)′, d′2, v′, l′) −P [b′1, d′1 | b1, d1]P [b′2, d′2 | b2, d2]P [v′ | v]P [l′ | l]vkβ(j, b′1, d′1, b′2, d′2, v′, l′) ) ≤ r([i, b1, d1, b2, d2, v, l], a0; β)− r([i, b1, d1, b2 − 1, d2, v, l], a0; β) + ∑ s′∈ S λ ( P [b′1, d ′ 1 | b1, d1]P [b′2, d′2 | b2, d2]P [v′ | v]P [l′ | l]vkβ(j, b′1, d′1, b′2, d′2, v′, l′) −P [b′1, d′1 | b1, d1]P [b′2, d′2 | b2 − 1, d2]P [v′ | v]P [l′ | l]vkβ(j, b′1, d′1, (b2 − 1)′, d′2, v′, l′) ) . The right-hand side (RHS) of the inequality comes from the expansion of Z1 which implies that W1 ≤ Z1. Therefore, it is shown that vk+1β (i, b1, d1, b2, d2, v, l) satisfies (A.2), which implies that Qβ(s, a) is submodular in (b2, a). ¥ A.3 Proof of Theorem 2 To show that the optimal policy is monotonically non-decreasing in the delay, we need to prove inductively that Qβ(s, a) is supermodular in (d2, a). We will prove via mathemat- ical induction that, for a suitable initialization, (A.1) is monotonically non-decreasing in the delay d2. The above holds if vβ(j, b ′ 1, d ′ 1, b ′ 2, d ′ 2, v ′, l′) has non-increasing differ- ence in d2. Select v 0 β(j, b ′ 1, d ′ 1, b ′ 2, d ′ 2, v ′, l′) with non-increasing difference in d2. As- sume that vkβ(j, b ′ 1, d ′ 1, b ′ 2, d ′ 2, v ′, l′) has non-increasing difference in d2, which implies that Qk+1β ([i, b1, d1, b2, d2, v, l], a) is supermodular in (d2, a). We will now prove that vk+1β (j, b ′ 1, d ′ 1, b ′ 2, d ′ 2, v ′, l′) also has non-increasing difference in Appendix A. Proofs of Lemma 1 and Theorems 1, 2, and 3 72 d2. That is, vk+1β (i, b1, d1, b2, d2 + 1, v, l)− vk+1β (i, b1, d1, b2, d2, v, l) ≤ vk+1β (i, b1, d1, b2, d2, v, l)− vk+1β (i, b1, d1, b2, d2 − 1, v, l), (A.5) or vk+1β (i, b1, d1, b2, d2 + 1, v, l)− vk+1β (i, b1, d1, b2, d2, v, l) −(vk+1β (i, b1, d1, b2, d2, v, l)− vk+1β (i, b1, d1, b2, d2 − 1, v, l)) ≤ 0. (A.6) Let’s assume: vk+1β (i, b1, d1, b2, d2 + 1, v, l) = Q k+1 β ([i, b1, d1, b2, d2 + 1, v, l], a2), vk+1β (i, b1, d1, b2, d2, v, l) = Q k+1 β ([i, b1, d1, b2, d2, v, l], a1), and vk+1β (i, b1, d1, b2, d2 − 1, v, l) = Qk+1β ([i, b1, d1, b2, d2 − 1, v, l], a0), for some actions a2, a1, a0 ∈ As. Thus, we can re-write (A.5) as: Qk+1β ([i, b1, d1, b2, d2 + 1, v, l], a2)−Qk+1β ([i, b1, d1, b2, d2, v, l], a1) − ( Qk+1β ([i, b1, d1, b2, d2, v, l], a1)−Qk+1β ([i, b1, d1, b2, d2 − 1, v, l], a0) ) ≤ 0, (A.7) or Qk+1β ([i, b1, d1, b2, d2 + 1, v, l], a2)−Qk+1β ([i, b1, d1, b2, d2, v, l], a2)︸ ︷︷ ︸ W2 +Qk+1β ([i, b1, d1, b2, d2, v, l], a2)−Qk+1β ([i, b1, d1, b2, d2, v, l], a1)︸ ︷︷ ︸ X2 −Qk+1β ([i, b1, d1, b2, d2, v, l], a1) +Qk+1β ([i, b1, d1, b2, d2, v, l], a0)︸ ︷︷ ︸ Y2 Appendix A. Proofs of Lemma 1 and Theorems 1, 2, and 3 73 − ( Qk+1β ([i, b1, d1, b2, d2, v, l], a0)−Qk+1β ([i, b1, d1, b2, d2 − 1, v, l], a0︸ ︷︷ ︸) ) Z2 ≤ 0, where X2 ≤ 0 and Y2 ≤ 0 by optimality. Note that in X2 and Y2 the optimal action is a1. In addition, it follows from the induction hypothesis that: W2 = r([i, b1, d1, b2, d2 + 1, v, l], a2; β)− r([i, b1, d1, b2, d2, v, l], a2; β) + ∑ s′∈ S λ ( P [b′1, d ′ 1 | b1, d1]P [b′2, d′2 | b2, d2 + 1]P [v′ | v]P [l′ | l]vkβ(j, b′1, d′1, b′2, (d2 + 1)′, v′, l′) −P [b′1, d′1 | b1, d1]P [b′2, d′2 | b2, d2]P [v′ | v]P [l′ | l]vkβ(j, b′1, d′1, b′2, d′2, v′, l′) ) ≤ r([i, b1, d1, b2, d2, v, l], a0; β)− r([i, b1, d1, b2, d2 − 1, v, l], a0; β) + ∑ s′∈ S λ ( P [b′1, d ′ 1 | b1, d1]P [b′2, d′2 | b2, d2]P [v′ | v]P [l′ | l]vkβ(j, b′1, d′1, b′2, d′2, v′, l′) −P [b′1, d′1 | b1, d1]P [b′2, d′2 | b2, d2 − 1]P [v′ | v]P [l′ | l]vkβ(j, b′1, d′1, b′2, (d2 − 1)′, v′, l′) ) . The RHS of the inequality comes from the expansion of Z2 which implies that W2 ≤ Z2. Therefore, it is shown that vk+1β (i, b1, d1, b2, d2, v, l) satisfies (A.5), which implies that Qβ(s, a) is supermodular in (d2, a). ¥ A.4 Proof of Theorem 3 To show that the optimal policy is monotonically non-decreasing in the velocity, we need to prove inductively that Qβ(s, a) is supermodular in (v, a). We will prove via mathematical induction that, for a suitable initialization, (A.1) is monotonically non- decreasing in the velocity v. It holds if vβ(j, b ′ 1, d ′ 1, b ′ 2, d ′ 2, v ′, l′) has non-increasing dif- ference in v. Select v0β(j, b ′ 1, d ′ 1, b ′ 2, d ′ 2, v ′, l′) with non-increasing difference in v. As- Appendix A. Proofs of Lemma 1 and Theorems 1, 2, and 3 74 sume that vkβ(j, b ′ 1, d ′ 1, b ′ 2, d ′ 2, v ′, l′) has non-increasing difference in v, which implies that Qk+1β ([i, b1, d1, b2, d2, v, l], a) is supermodular in (v, a). We will now prove that vk+1β (j, b ′ 1, d ′ 1, b ′ 2, d ′ 2, v ′, l′) also has non-increasing difference in v. That is, vk+1β (i, b1, d1, b2, d2, v + 1, l)− vk+1β (i, b1, d1, b2, d2, v, l) ≤ vk+1β (i, b1, d1, b2, d2, v, l)− vk+1β (i, b1, d1, b2, d2, v − 1, l), (A.8) or vk+1β (i, b1, d1, b2, d2, v + 1, l)− vk+1β (i, b1, d1, b2, d2, v, l) −(vk+1β (i, b1, d1, b2, d2, v, l)− vk+1β (i, b1, d1, b2, d2, v − 1, l)) ≤ 0. (A.9) Let’s assume: vk+1β (i, b1, d1, b2, d2, v + 1, l) = Q k+1 β ([i, b1, d1, b2, d2, v + 1, l], a2), vk+1β (i, b1, d1, b2, d2, v, l) = Q k+1 β ([i, b1, d1, b2, d2, v, l], a1), and vk+1β (i, b1, d1, b2, d2, v − 1, l) = Qk+1β ([i, b1, d1, b2, d2, v − 1, l], a0), for some actions a2, a1, a0 ∈ As. Thus, we can re-write (A.8) as: Qk+1β ([i, b1, d1, b2, d2, v + 1, l], a2)−Qk+1β ([i, b1, d1, b2, d2, v, l], a1) − ( Qk+1β ([i, b1, d1, b2, d2, v, l], a1)−Qk+1β ([i, b1, d1, b2, d2, v − 1, l], a0) ) ≤ 0, (A.10) or Qk+1β ([i, b1, d1, b2, d2, v + 1, l], a2)−Qk+1β ([i, b1, d1, b2, d2, v, l], a2)︸ ︷︷ ︸ W3 Appendix A. Proofs of Lemma 1 and Theorems 1, 2, and 3 75 +Qk+1β ([i, b1, d1, b2, d2, v, l], a2)−Qk+1β ([i, b1, d1, b2, d2, v, l], a1)︸ ︷︷ ︸ X3 −Qk+1β ([i, b1, d1, b2, d2, v, l], a1) +Qk+1β ([i, b1, d1, b2, d2, v, l], a0)︸ ︷︷ ︸ Y3 − ( Qk+1β ([i, b1, d1, b2, d2, v, l], a0)−Qk+1β ([i, b1, d1, b2, d2, v − 1, l], a0︸ ︷︷ ︸) ) Z3 ≤ 0, where X3 ≤ 0 and Y3 ≤ 0 by optimality. Note that in X3 and Y3 the optimal action is a1. In addition, it follows from the induction hypothesis that: W3 = r([i, b1, d1, b2, d2, v + 1, l], a2; β)− r([i, b1, d1, b2, d2, v, l], a2; β) + ∑ s′∈ S λ ( P [b′1, d ′ 1 | b1, d1]P [b′2, d′2 | b2, d2]P [v′ | v + 1]P [l′ | l]vkβ(j, b′1, d′1, b′2, d′2, (v + 1)′, l′) −P [b′1, d′1 | b1, d1]P [b′2, d′2 | b2, d2]P [v′ | v]P [l′ | l]vkβ(j, b′1, d′1, b′2, d′2, v′, l′) ) ≤ r([i, b1, d1, b2, d2, v, l], a0; β)− r([i, b1, d1, b2, d2, v − 1, l], a0; β) + ∑ s′∈ S λ ( P [b′1, d ′ 1 | b1, d1]P [b′2, d′2 | b2, d2]P [v′ | v]P [l′ | l]vkβ(j, b′1, d′1, b′2, d′2, v′, l′) −P [b′1, d′1 | b1, d1]P [b′2, d′2 | b2, d2]P [v′ | v − 1]P [l′ | l]vkβ(j, b′1, d′1, b′2, d′2, (v − 1)′, l′) ) . The RHS of the inequality comes from the expansion of Z3 which implies that W3 ≤ Z3. Therefore, it is shown that vk+1β (i, b1, d1, b2, d2, v, l) satisfies (A.8), which implies that Qβ(s, a) is supermodular in (v, a). ¥