UBC Theses and Dissertations
Optimistic Thompson sampling: strategic exploration in bandits and reinforcement learning
Zhang, Tianyue H.
Abstract
Reinforcement learning (RL) is an area of machine learning in which intelligent agents take sequential actions in an environment and learn from experience to maximize rewards. This is particularly useful when the dynamics of the environment are complex and explicit training data is limited, such as in autonomous driving, robotics, game playing, recommendation systems, finance, and health care. The main challenge in RL is to balance exploration (taking less-observed actions to gather information) and exploitation (taking actions with high historical reward). Stochastic multi-armed bandits (MABs) can be viewed as a simplified decision-making problem in which the agent's state is not changed by its actions. The rich existing bandit theory provides insights into tackling the exploration-exploitation trade-off in RL. In this work, we draw inspiration from the classical bandit algorithms upper confidence bound (UCB) and Thompson sampling (TS), and propose a novel algorithm for RL: optimistic Thompson sampling. Optimism encourages exploration: agents can gain information by being optimistic and playing actions that are estimated to be sub-optimal but have not been sufficiently sampled. We first discuss our performance metric for theoretical analysis, namely regret. Regret measures how an algorithm performs compared to the optimal strategy that would be followed if the rewards and dynamics of the environment were known. We then provide a novel theoretical analysis of the optimistic Thompson sampling (O-TS) [Chapelle and Li, 2011] algorithms in bandits. Next, we extend the algorithms to the Markov decision process (MDP) setting, a common representation for RL problems. We propose two novel algorithms, O-TS-MDP and O-TS-MDP+, and compare their performance to existing work both theoretically and with numerical experiments. Finally, we conclude with a discussion of future directions for strategic exploration in bandits and RL.
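The central idea in the abstract, playing only optimistic posterior samples, can be illustrated with a minimal sketch of O-TS for Bernoulli bandits in the spirit of Chapelle and Li [2011]: each arm keeps a Beta posterior, any sample that falls below its posterior mean is clipped up to that mean, and performance is tracked as the pseudo-regret accumulated against the best arm. This is an illustrative sketch rather than the thesis's own code; the arm means, horizon, and priors below are assumed values.

```python
import numpy as np

# Illustrative O-TS sketch for Bernoulli bandits (assumed problem instance).
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # hypothetical arm means
K, horizon = len(true_means), 5000
alpha, beta = np.ones(K), np.ones(K)     # Beta(1, 1) prior for each arm

regret = 0.0
for _ in range(horizon):
    samples = rng.beta(alpha, beta)                    # ordinary Thompson samples
    posterior_means = alpha / (alpha + beta)
    optimistic = np.maximum(samples, posterior_means)  # clip pessimistic samples up to the mean
    arm = int(np.argmax(optimistic))
    reward = float(rng.random() < true_means[arm])     # Bernoulli reward
    alpha[arm] += reward
    beta[arm] += 1.0 - reward
    regret += true_means.max() - true_means[arm]       # pseudo-regret vs. the best arm

print(f"cumulative pseudo-regret after {horizon} rounds: {regret:.1f}")
```

The only change from standard Thompson sampling is the clipping step, which keeps every sample on the optimistic side of the posterior mean and so never discourages trying an under-sampled arm.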
Item Metadata

Title | Optimistic Thompson sampling : strategic exploration in bandits and reinforcement learning
Creator | Zhang, Tianyue H.
Supervisor |
Publisher | University of British Columbia
Date Issued | 2023
Genre |
Type |
Language | eng
Date Available | 2023-09-08
Provider | Vancouver : University of British Columbia Library
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International
DOI | 10.14288/1.0435839
URI |
Degree |
Program |
Affiliation |
Degree Grantor | University of British Columbia
Graduation Date | 2023-11
Campus |
Scholarly Level | Graduate
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/
Aggregated Source Repository | DSpace