Logo du LIFL

Depuis le 1er janvier 2015 le LIFL et le LAGIS forment le laboratoire CRIStAL

  1. Doctoral studies

Thesis of

Victor Gabillon

Thursday 12 June 2014
Amphithéâtre de l'IRCICA

Budgeted Classification-based Policy Iteration

Directeur de Thèse : M. Mohammad GHAVAMZADEH, Inria Lille & Adobe Directeur de Thèse : M. Philippe PREUX, Université Lille 3 Rapporteurs : M. Peter AUER, Université de Leoben Rapporteurs : M. Shie MANNOR, Technion Membres : M. Olivier CAPPÉ, Télécom ParisTech Membres : M. Csaba SZEPESVÁRI, Université de l'Alberta

This dissertation is motivated by the study ofclass of reinforcement learning (RL) algorithms, called classification-based policy iteration (CBPI). Contrary to the standard RL methods, CBPI do not use an explicit representation for value function. Instead, they use rollouts and estimate the action-value function of the current policy at a collection of states. Using a training set built from these rollout estimates, the greedy policy is learned as the output of a classifier. Thus, the policy generated at each iteration of the algorithm, is no longer defined by a (approximated) value function, but instead by a classifier. In this thesis, we propose new algorithms that improve the performance of the existing CBPI methods, especially when they have a fixed budget of interaction with the environment. Our improvements are based on the following two shortcomings of the existing CBPI algorithms: 1) The rollouts that are used to estimate the action-value functions should be truncated and their number is limited, and thus, we have to deal with bias-variance tradeoff in estimating the rollouts, and 2) The rollouts are allocated uniformly over the states in the rollout set and the available actions, while a smarter allocation strategy could guarantee a more accurate training set for the classifier. We propose CBPI algorithms that address these issues, respectively, by: 1) the use of a value function approximation to improve the accuracy (balancing the bias and variance) of the rollout estimates, and 2) adaptively sampling the rollouts over the state-action pairs.


UMR 8022 - Laboratoire d'Informatique Fondamentale de Lille - Copyright © 2012 Sophie TISON - Crédits & Mentions légales

Page respectant XHTML et CSS.

Pour tout commentaire / Comments and remarks : webmaster