Hybrid Reward Architecture for Reinforcement Learning

Abstract

One of the main challenges in reinforcement learning (RL) is generalisation. In typical deep RL methods this is achieved by approximating the optimal value function with a low-dimensional representation using a deep network. While this approach works well in many domains, in domains where the optimal value function cannot easily be reduced to a low-dimensional representation, learning can be very slow and unstable.

This paper contributes towards tackling such challenging domains by proposing a new method, called Hybrid Reward Architecture (HRA). HRA takes as input a decomposed reward function and learns a separate value function for each component reward function. Because each component typically only depends on a subset of all features, the corresponding value function is much smoother and can be more easily approximated by a low-dimensional representation, enabling more effective learning. We demonstrate HRA on a toy problem and the Atari game Ms. Pac-Man, where HRA achieves above-human performance.

1 Introduction

In reinforcement learning (RL) (Sutton & Barto, 1998; Szepesvári, 2009), the goal is to find a behaviour policy that maximises the return—the discounted sum of rewards received over time—in a data-driven way. One of the main challenges of RL is to scale methods such that they can be applied to large, real-world problems. Because the state-space of such problems is typically massive, strong generalisation is required to learn a good policy efficiently. Mnih et al. (2015) achieved a big breakthrough in this area: by combining standard RL techniques with deep neural networks, they outperformed humans on a large number of Atari 2600 games, by learning a policy from pixels. The generalisation of their Deep Q-Networks (DQN) method is achieved by approximating the optimal value function. A value function plays an important role in RL, because it predicts the expected return, conditioned on a state or state-action pair. Once the optimal value function is known, an optimal policy can easily be derived by acting greedily with respect to it.
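For concreteness, these quantities can be written in standard RL notation (this formulation is common background and not quoted from the paper): the return, the optimal action-value function, and the greedy policy derived from it are

\[
G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad
Q^*(s,a) = \max_{\pi} \mathbb{E}\left[ G_t \mid S_t = s,\, A_t = a,\, \pi \right], \qquad
\pi^*(s) = \arg\max_a Q^*(s,a),
\]

where \(\gamma \in [0,1)\) is the discount factor. DQN approximates \(Q^*\) with a deep network and acts greedily with respect to that approximation.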

By modelling the current estimate of the optimal value function with a deep neural network, DQN performs strong generalisation on the value function, and hence on the policy. This generalisation behaviour is achieved by regularisation of the model for the optimal value function. However, if the optimal value function is very complex, then learning an accurate low-dimensional representation can be challenging or even impossible. Therefore, when the optimal value function cannot easily be reduced to a low-dimensional representation, we argue for applying a new, complementary form of regularisation on the target side. Specifically, we propose to replace the reward function with an alternative reward function that has a smoother optimal value function, but still yields a reasonable—but not necessarily optimal—policy when acting greedily.

A key observation behind regularisation on the target function is the difference between the performance objective, which specifies what type of behaviour is desired, and the training objective, which provides the feedback signal that modifies an agent’s behaviour. In RL, a single reward function often takes on both roles. However, the reward function that encodes the performance objective might be a poor training objective, resulting in slow or unstable learning. At the same time, a training objective can differ from the performance objective and still do well with respect to it.

Intrinsic motivation (Stout et al., 2005; Schmidhuber, 2010) uses the above observation to improve learning in sparse-reward domains. It achieves this by adding a domain-specific intrinsic reward signal to the reward coming from the environment. Typically, the intrinsic reward function is potential-based, which maintains optimality of the resulting policy. In our case, we define a training objective based on a different criterion: smoothness of the value function, such that it can easily be represented by a low-dimensional representation. Because of this different goal, adding a potential-based reward function to the original reward function is not a good strategy in our case, since this typically does not reduce the complexity of the optimal value function.
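As background (standard shaping theory, not a result of this paper), a potential-based intrinsic reward adds, for a transition from state \(s\) to state \(s'\), a term defined through a potential function \(\Phi\):

\[
F(s, s') = \gamma\,\Phi(s') - \Phi(s), \qquad
Q^*_{\text{shaped}}(s,a) = Q^*(s,a) - \Phi(s).
\]

Because the subtracted term \(\Phi(s)\) does not depend on the action, the ordering over actions in every state, and hence the optimal policy, is unchanged; by the same token, the complexity of \(Q^*\) is largely preserved, which is why shaping does not help with the smoothness criterion targeted here.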

Our main strategy for constructing a training objective is to decompose the reward function of the environment into n different reward functions. Each of them is assigned to a separate reinforcement-learning agent. Similar to the Horde architecture (Sutton et al., 2011), all these agents can learn in parallel on the same sample sequence by using off-policy learning. For action selection, each agent gives its values for the actions of the current state to an aggregator, which combines them into a single action-value for each action (for example, by averaging over all agents). Based on these action-values, the current action is selected (for example, by taking the greedy action). We test our approach on two domains: a toy problem, where an agent has to eat 5 randomly located fruits, and Ms. Pac-Man, a hard game from the ALE benchmark set (Bellemare et al., 2013).
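The aggregation and update scheme described above can be sketched in a few lines of Python. This is an illustrative simplification under our own assumptions (tabular per-head Q estimates, a mean aggregator, and plain per-head Q-learning updates); the actual heads in the paper are branches of a deep network, and the helper names are ours.

import numpy as np

# One agent ("head") per component reward function; each head k learns Q_k.
n_heads = 3      # number of component reward functions (assumed)
n_states = 10    # toy state space (assumed)
n_actions = 4    # toy action space (assumed)

# Tabular Q estimate per head; in the paper each head is a network branch.
q_heads = [np.zeros((n_states, n_actions)) for _ in range(n_heads)]

def aggregate(state):
    """Combine per-head action values, e.g. by averaging over heads."""
    per_head = np.stack([q[state] for q in q_heads])  # (n_heads, n_actions)
    return per_head.mean(axis=0)                      # one value per action

def select_action(state):
    """Act greedily with respect to the aggregated action values."""
    return int(np.argmax(aggregate(state)))

def update(state, action, next_state, rewards, alpha=0.1, gamma=0.99):
    """Off-policy update: every head learns from the same transition,
    each using only its own component reward rewards[k]."""
    for k, q in enumerate(q_heads):
        target = rewards[k] + gamma * q[next_state].max()
        q[state, action] += alpha * (target - q[state, action])

Acting then amounts to calling select_action at each step, while update is applied to every head on each observed transition; other aggregators, such as a sum over heads, fit the same interface.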
