UBC Theses and Dissertations


Taking advantage of common assumptions in policy optimization and reinforcement learning

Lavington, Jonathan Wilder

Abstract

This work considers training conditional probability distributions, called policies, in simulated environments via gradient-based optimization methods. It begins by investigating the effects that complex model classes have on settings where a policy is learned by imitating expert data gathered through repeated environmental interaction. Next, it discusses how to build a gradient-based optimizer tailored specifically to policy optimization settings where querying gradient information is expensive. We then consider policy optimization settings that contain imperfect expert demonstrations, and design an algorithm that uses additional information available during training to improve both the policy's performance at test time and the efficiency of learning. Lastly, we consider how to generate behavioral data that satisfies hard constraints by combining learned inference artifacts with a special variant of sequential Monte Carlo.
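The imitation setting described above, learning a policy from expert data, can be illustrated with a minimal behavioral-cloning sketch. This is a hedged toy example, not code from the thesis: the environment (4 discrete states, 2 actions), the hypothetical deterministic expert, and the softmax policy class are all assumptions made for illustration. The policy's logits are fit by gradient descent on the negative log-likelihood of the expert's actions.

```python
import numpy as np

# Toy setup (assumed for illustration): 4 discrete states, 2 actions.
# The hypothetical expert picks action 0 in states 0-1 and action 1 in states 2-3.
rng = np.random.default_rng(0)
n_states, n_actions = 4, 2

# Expert dataset of (state, action) pairs gathered from interaction.
states = rng.integers(0, n_states, size=500)
actions = (states >= 2).astype(int)

W = np.zeros((n_states, n_actions))  # policy logits, one row per state

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

lr = 0.5
for _ in range(200):
    probs = softmax(W[states])                        # (N, n_actions)
    grad = probs.copy()
    grad[np.arange(len(actions)), actions] -= 1.0     # d(NLL)/d(logits)
    # Accumulate per-state gradients, then take an averaged descent step.
    G = np.zeros_like(W)
    np.add.at(G, states, grad)
    W -= lr * G / len(states)

policy = softmax(W)                  # learned conditional distribution pi(a|s)
learned_actions = policy.argmax(axis=1)
```

After training, the cloned policy concentrates probability on the expert's action in each state; the same maximum-likelihood recipe extends to the richer model classes whose effects the thesis investigates.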


Rights: Attribution-NoDerivatives 4.0 International