UBC Theses and Dissertations
Optimistic Thompson sampling: strategic exploration in bandits and reinforcement learning
Zhang, Tianyue H.
Abstract
Reinforcement learning (RL) is an area of machine learning in which intelligent agents take sequential actions in an environment and learn from experience to maximize rewards. This is particularly useful when the dynamics of the environment are complex and explicit training data is limited, such as in autonomous driving, robotics, game playing, recommendation systems, finance, and health care. The main challenge in RL is to balance exploration (taking less-observed actions to gather information) and exploitation (taking actions with high historical reward). Stochastic multi-armed bandits (MABs) can be viewed as a simplified decision-making problem in which the agent's state is not changed by its actions. The rich existing bandit theory provides insights into tackling the exploration-exploitation trade-off in RL. In this work, we draw inspiration from the classical bandit algorithms upper confidence bound (UCB) and Thompson sampling (TS), and propose a novel algorithm for RL: optimistic Thompson sampling. Optimism encourages exploration: agents can gain information by being optimistic and playing actions that are estimated to be sub-optimal but have not been sufficiently sampled. We first discuss our performance metric for theoretical analysis, namely regret. Regret measures how an algorithm performs compared to the optimal strategy that would be followed if the rewards and dynamics of the environment were known. We then provide a novel theoretical analysis of the optimistic Thompson sampling (O-TS) [Chapelle and Li, 2011] algorithms in bandits. Next, we extend the algorithms to the Markov decision process (MDP) setting, a common representation for RL problems. We propose two novel algorithms, O-TS-MDP and O-TS-MDP+, and compare their performance to existing work both theoretically and with numerical experiments. Finally, we conclude with a discussion of future directions for strategic exploration in bandits and RL.
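The central idea in the abstract, playing only optimistic posterior samples, can be illustrated with a minimal sketch of O-TS for Bernoulli bandits in the spirit of Chapelle and Li [2011]: each arm keeps a Beta posterior, any sample that falls below its posterior mean is clipped up to that mean, and performance is tracked as the pseudo-regret accumulated against the best arm. This is an illustrative sketch rather than the thesis's own code; the arm means, horizon, and priors below are assumed values.

```python
import numpy as np

# Illustrative O-TS sketch for Bernoulli bandits (assumed problem instance).
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # hypothetical arm means
K, horizon = len(true_means), 5000
alpha, beta = np.ones(K), np.ones(K)     # Beta(1, 1) prior for each arm

regret = 0.0
for _ in range(horizon):
    samples = rng.beta(alpha, beta)                    # ordinary Thompson samples
    posterior_means = alpha / (alpha + beta)
    optimistic = np.maximum(samples, posterior_means)  # clip pessimistic samples up to the mean
    arm = int(np.argmax(optimistic))
    reward = float(rng.random() < true_means[arm])     # Bernoulli reward
    alpha[arm] += reward
    beta[arm] += 1.0 - reward
    regret += true_means.max() - true_means[arm]       # pseudo-regret vs. the best arm

print(f"cumulative pseudo-regret after {horizon} rounds: {regret:.1f}")
```

The only change from standard Thompson sampling is the clipping step, which keeps every sample on the optimistic side of the posterior mean and so never discourages trying an under-sampled arm.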
Item Metadata

Title | Optimistic Thompson sampling : strategic exploration in bandits and reinforcement learning
Creator | Zhang, Tianyue H.
Supervisor |
Publisher | University of British Columbia
Date Issued | 2023
Genre |
Type |
Language | eng
Date Available | 2023-09-08
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International
DOI | 10.14288/1.0435839
URI |
Degree |
Program |
Affiliation |
Degree Grantor | University of British Columbia
Graduation Date | 2023-11
Campus |
Scholarly Level | Graduate
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository | DSpace