[…] is to periodically set the weights of the target network to the current weights of the main network (e.g. Mnih et al. (2015)) or to use a Polyak-averaged (Polyak and Juditsky, 1992) version of the main network instead (Lillicrap et al., 2015). (A Polyak-averaged version of a parametric model M which is being trained is a model whose parameters are computed as an exponential moving average of the parameters of M over time.)
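As a concrete illustration of Polyak averaging, here is a minimal sketch; the parameter-dictionary representation and the coefficient tau are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def polyak_update(target_params, main_params, tau=0.005):
    """Exponential moving average of the main network's parameters.

    target <- (1 - tau) * target + tau * main, applied per parameter array.
    `target_params` and `main_params` are assumed to be dicts of numpy arrays;
    any other parameter container works analogously.
    """
    for name, main_value in main_params.items():
        target_params[name] = (1.0 - tau) * target_params[name] + tau * main_value
    return target_params
```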
2.3 Deep Deterministic Policy Gradients (DDPG)
Deep Deterministic Policy Gradients (DDPG) (Lillicrap et al., 2015) is a model-free RL algorithm for continuous action spaces. Here we sketch it only informally, see Lillicrap et al. (2015) for more details. In DDPG we maintain two neural networks: a target policy (also called an actor) π: S → A and an action-value function approximator (called the critic) Q: S × A → R. The critic's job is to approximate the actor's action-value function Q^π.
Episodes are generated using a behavioral policy which is a noisy version of the target policy, e.g. π_b(s) = π(s) + N(0, 1). The critic is trained in a similar way as the Q-function in DQN but the targets y_t are computed using actions outputted by the actor, i.e. y_t = r_t + γQ(s_{t+1}, π(s_{t+1})). The actor is trained with mini-batch gradient descent on the loss L_a = -E_s Q(s, π(s)), where s is sampled from the replay buffer. The gradient of L_a w.r.t. actor parameters can be computed by backpropagation through the combined critic and actor networks.
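For illustration only, a minimal PyTorch-style sketch of the two updates described above; the function names, batch layout and the discount value are assumptions of this sketch, not the paper's implementation.

```python
import torch

def ddpg_losses(critic, actor, target_critic, target_actor, batch, gamma=0.98):
    """Compute the critic and actor losses for one DDPG update.

    `batch` is assumed to be a dict of tensors: states s, actions a, rewards r,
    next states s2 (shapes chosen for illustration only).
    """
    s, a, r, s2 = batch["s"], batch["a"], batch["r"], batch["s2"]

    # Critic target: y_t = r_t + gamma * Q'(s_{t+1}, pi'(s_{t+1})), no gradient.
    with torch.no_grad():
        y = r + gamma * target_critic(s2, target_actor(s2))

    critic_loss = torch.mean((critic(s, a) - y) ** 2)

    # Actor loss: L_a = -E_s Q(s, pi(s)); gradients flow through the critic
    # into the actor parameters (the critic parameters are updated separately).
    actor_loss = -critic(s, actor(s)).mean()
    return critic_loss, actor_loss
```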
2.4 Universal Value Function Approximators ( UVFA)
Universal value Function Approximators(UVFA)(Schaul et al., 2015a) is an extension of dQn to
the setup where there is more than one goal we may try to achieve. Let g be the space of possible
goals. Every goal g E g corresponds to some reward functionTg:SXA>R. Every episode starts
with sampling a state-goal pair from some distribution P(so, g). The goal stays fixed for the whole
episode. At every timestep the agent gets as input not only the current state but also the current goal
T:SX9-7A and gets the reward rt=Tg(st, at). The Q-function now depends not only on a
state-action pair but also on a goal Q(st, at: 9)=E[Rt st, at, g). Schaul et al. (2015a) show that in
this setup it is possible to train an approximator to the Q-function using direct bootstrapping from the
Bellman equation just like in case of DQN) and that a greedy policy derived from it can generalize
to previously unseen state-action pairs. The extension of this approach to DDPG is straightforward
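A minimal sketch of what taking the goal as an additional network input can look like; the module name and layer sizes are arbitrary assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GoalConditionedQ(nn.Module):
    """Q(s, a, g) approximator that simply concatenates state, action and goal."""

    def __init__(self, state_dim, action_dim, goal_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + goal_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action, goal):
        # The goal is treated exactly like an extra part of the observation.
        return self.net(torch.cat([state, action, goal], dim=-1))
```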
3 Hindsight Experience Replay
3.1 A motivating example
Consider a bit-flipping environment with the state space S = {0, 1}^n and the action space A = {0, 1, ..., n-1} for some integer n in which executing the i-th action flips the i-th bit of the state. For every episode we sample uniformly an initial state as well as a target state and the policy gets a reward of -1 as long as it is not in the target state, i.e. r_g(s, a) = -[s ≠ g].
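For concreteness, a minimal sketch of such a bit-flipping environment; the class and method names are assumptions made for illustration.

```python
import numpy as np

class BitFlipEnv:
    """Bit-flipping environment: states and goals are length-n binary vectors,
    action i flips bit i, and the reward is -1 until the state equals the goal."""

    def __init__(self, n=40, seed=0):
        self.n = n
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.state = self.rng.integers(0, 2, self.n)
        self.goal = self.rng.integers(0, 2, self.n)
        return self.state.copy(), self.goal.copy()

    def step(self, action):
        self.state[action] ^= 1                      # flip the chosen bit
        done = np.array_equal(self.state, self.goal)
        reward = 0.0 if done else -1.0               # r_g(s, a) = -[s != g]
        return self.state.copy(), reward, done
```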
Standard RL algorithms are bound to fail in this environment for n > 40 because they will never experience any reward other than -1. Notice that using techniques for improving exploration (e.g. VIME (Houthooft et al., 2016), count-based exploration (Ostrovski et al., 2017) or bootstrapped DQN (Osband et al., 2016)) does not help here because the real problem is not a lack of diversity of the states being visited; rather, it is simply impractical to explore such a large state space. The standard solution to this problem would be to use a shaped reward function which is more informative and guides the agent towards the goal, e.g. r_g(s, a) = -||s - g||². While using a shaped reward solves the problem in our toy environment, it may be difficult to apply to more complicated problems. We investigate the results of reward shaping experimentally in Sec. 4.4.
Figure 1: Bit-flipping experiment. Success rate of DQN and DQN+HER as a function of the number of bits n.
Instead of shaping the reward we propose a different solution which does not require any domain knowledge. Consider an episode with
a state sequence s_1, ..., s_T and a goal g ≠ s_1, ..., s_T, which implies that the agent received a reward of -1 at every timestep. The pivotal idea behind our approach is to re-examine this trajectory with a different goal: while this trajectory may not help us learn how to achieve the state g, it definitely tells us something about how to achieve the state s_T. This information can be harvested by using an off-policy RL algorithm and experience replay where we replace g in the replay buffer by s_T. In addition we can still replay with the original goal g left intact in the replay buffer. With this modification at least half of the replayed trajectories contain rewards different from -1 and learning becomes much simpler. Fig. 1 compares the final performance of DQN with and without this additional replay technique which we call Hindsight Experience Replay (HER). DQN without HER can only solve the task for n ≤ 13 while DQN with HER easily solves the task for n up to 50. See Appendix A for the details of the experimental setup. Note that this approach combined with powerful function approximators (e.g. deep neural networks) allows the agent to learn how to achieve the goal g even if it has never observed it during training.
We more formally describe our approach in the following sections.
3.2 Multi-goal RL
We are interested in training agents which learn to achieve multiple different goals. We follow the approach from Universal Value Function Approximators (Schaul et al., 2015a), i.e. we train policies and value functions which take as input not only a state s ∈ S but also a goal g ∈ G. Moreover, we show that training an agent to perform multiple tasks can be easier than training it to perform only one task (see Sec. 4.3 for details) and therefore our approach may be applicable even if there is only one task we would like the agent to perform (a similar situation was recently observed by Pinto and Gupta (2016)).
We assume that every goal g ∈ G corresponds to some predicate f_g: S → {0, 1} and that the agent's goal is to achieve any state s that satisfies f_g(s) = 1. In the case when we want to exactly specify the desired state of the system we may use G = S and f_g(s) = [s = g]. The goals can also specify only some properties of the state, e.g. suppose that S = R² and we want to be able to achieve an arbitrary state with the given value of the x coordinate. In this case G = R and f_g((x, y)) = [x = g].
Moreover, we assume that given a state s we can easily find a goal g which is satisfied in this state. More formally, we assume that there is given a mapping m: S → G such that f_{m(s)}(s) = 1 for every s ∈ S. Notice that this assumption is not very restrictive and can usually be satisfied. In the case where each goal corresponds to a state we want to achieve, i.e. G = S and f_g(s) = [s = g], the mapping m is just the identity. For the case of 2-dimensional states and 1-dimensional goals from the previous paragraph this mapping is also very simple: m((x, y)) = x.
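To make the notation concrete, a small sketch of the predicate f_g and the mapping m for the two cases above; the helper names are hypothetical and exist only to illustrate the definitions.

```python
# Case G = S: the goal is an exact desired state.
def f_exact(state, goal):
    return 1 if state == goal else 0          # f_g(s) = [s = g]

def m_exact(state):
    return state                               # m is the identity

# Case S = R^2, G = R: the goal only constrains the x coordinate.
def f_x_only(state, goal_x):
    x, y = state
    return 1 if x == goal_x else 0             # f_g((x, y)) = [x = g]

def m_x_only(state):
    x, y = state
    return x                                   # m((x, y)) = x
```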
A universal policy can be trained using an arbitrary RL algorithm by sampling goals and initial states from some distributions, running the agent for some number of timesteps and giving it a negative reward at every timestep when the goal is not achieved, i.e. r_g(s, a) = -[f_g(s) = 0]. This does not, however, work very well in practice because this reward function is sparse and not very informative. In order to solve this problem we introduce the technique of Hindsight Experience Replay which is the crux of our approach.
3.3 Algorithm
The idea behind Hindsight Experience Replay (HER) is very simple: after experiencing some episode s_0, s_1, ..., s_T we store in the replay buffer every transition s_t → s_{t+1} not only with the original goal used for this episode but also with a subset of other goals. Notice that the goal being pursued influences the agent's actions but not the environment dynamics and therefore we can replay each trajectory with an arbitrary goal assuming that we use an off-policy RL algorithm like DQN (Mnih et al., 2015), DDPG (Lillicrap et al., 2015), NAF (Gu et al., 2016) or SDQN (Metz et al., 2017).
One choice which has to be made in order to use HER is the set of additional goals used for replay. In the simplest version of our algorithm we replay each trajectory with the goal m(s_T), i.e. the goal which is achieved in the final state of the episode. We experimentally compare different types and quantities of additional goals for replay in Sec. 4.5. In all cases we also replay each trajectory with the original goal pursued in the episode. See Alg. 1 for a more formal description of the algorithm.
Algorithm 1 Hindsight Experience Replay (HER)
Given:
• an off-policy RL algorithm A, ▷ e.g. DQN, DDPG, NAF, SDQN
• a strategy S for sampling goals for replay, ▷ e.g. S(s_0, ..., s_T) = m(s_T)
• a reward function r: S × A × G → R. ▷ e.g. r(s, a, g) = -[f_g(s) = 0]
Initialize A ▷ e.g. initialize neural networks
Initialize replay buffer R
for episode = 1, M do
   Sample a goal g and an initial state s_0.
   for t = 0, T-1 do
      Sample an action a_t using the behavioral policy from A:
         a_t ← π_b(s_t||g) ▷ || denotes concatenation
      Execute the action a_t and observe a new state s_{t+1}
   end for
   for t = 0, T-1 do
      r_t := r(s_t, a_t, g)
      Store the transition (s_t||g, a_t, r_t, s_{t+1}||g) in R ▷ standard experience replay
      Sample a set of additional goals for replay G := S(current episode)
      for g' ∈ G do
         r' := r(s_t, a_t, g')
         Store the transition (s_t||g', a_t, r', s_{t+1}||g') in R ▷ HER
      end for
   end for
   for t = 1, N do
      Sample a minibatch B from the replay buffer R
      Perform one step of optimization using A and minibatch B
   end for
end for
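A compact sketch of the relabeling step of Algorithm 1 using the final strategy; the episode representation, buffer format and the compute_reward helper are assumptions made for this illustration.

```python
def store_episode_with_her(replay_buffer, episode, goal, compute_reward, m):
    """Store each transition once with the original goal and once with the
    hindsight goal m(s_T) achieved at the end of the episode ('final' strategy).

    `episode` is a list of (s_t, a_t, s_{t+1}) tuples; `compute_reward(s, a, g)`
    plays the role of r(s_t, a_t, g) from Algorithm 1 and returns -1.0 until the
    goal predicate is satisfied, then 0.0.
    """
    final_state = episode[-1][2]
    hindsight_goal = m(final_state)            # a goal satisfied by s_T

    for s, a, s_next in episode:
        # Standard experience replay: original goal g.
        replay_buffer.append((s, a, compute_reward(s, a, goal), s_next, goal))
        # HER: the same transition relabeled with the hindsight goal.
        replay_buffer.append(
            (s, a, compute_reward(s, a, hindsight_goal), s_next, hindsight_goal)
        )
```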
HER may be seen as a form of implicit curriculum as the goals used for replay naturally shift from ones which are simple to achieve even by a random agent to more difficult ones. However, in contrast to explicit curriculum, HER does not require having any control over the distribution of initial environment states. Not only does HER learn with extremely sparse rewards, in our experiments it also performs better with sparse rewards than with shaped ones (see Sec. 4.4). These results are indicative of the practical challenges with reward shaping, and that shaped rewards would often constitute a compromise on the metric we truly care about (such as binary success/failure).
4 Experiments
The video presenting our experiments is available at https://goo.gl/smrqni.
This section is organized as follows. In Sec. 4.1 we introduce the multi-goal RL environments we use for the experiments as well as our training procedure. In Sec. 4.2 we compare the performance of DDPG with and without HER. In Sec. 4.3 we check if HER improves performance in the single-goal setup. In Sec. 4.4 we analyze the effects of using shaped reward functions. In Sec. 4.5 we compare different strategies for sampling additional goals for HER. In Sec. 4.6 we show the results of the experiments on the physical robot.
4.1 Environments
There are no standard environments for multi-goal RL and therefore we created our own environments. We decided to use manipulation environments based on an existing hardware robot to ensure that the challenges we face correspond as closely as possible to the real world. In all experiments we use a 7-DOF Fetch Robotics arm which has a two-fingered parallel gripper. The robot is simulated using the MuJoCo (Todorov et al., 2012) physics engine. The whole training procedure is performed in the simulation but we show in Sec. 4.6 that the trained policies perform well on the physical robot without any finetuning.
Figure 2: Different tasks: pushing (top row), sliding (middle row) and pick-and-place (bottom row). The red ball denotes the goal position.
Policies are represented as Multi-Layer Perceptrons (MLPs) with Rectified Linear Unit (ReLU) activation functions. Training is performed using the DDPG algorithm (Lillicrap et al., 2015) with Adam (Kingma and Ba, 2014) as the optimizer. For improved efficiency we use 8 workers which average the parameters after every update. See Appendix A for more details and the values of all hyperparameters.
We consider 3 different tasks:
1. Pushing. In this task a box is placed on a table in front of the robot and the task is to move it to the target location on the table. The robot fingers are locked to prevent grasping. The learned behaviour is a mixture of pushing and rolling.
2. Sliding. In this task a puck is placed on a long slippery table and the target position is outside of the robot's reach so that it has to hit the puck with such a force that it slides and then stops in the appropriate place due to friction.
3. Pick-and-place. This task is similar to pushing but the target position is in the air and the fingers are not locked. To make exploration in this task easier we recorded a single state in which the box is grasped and start half of the training episodes from this state.
States: The state of the system is represented in the MuJoCo physics engine and consists of angles and velocities of all robot joints as well as positions, rotations and velocities (linear and angular) of all objects.
Goals: Goals describe the desired position of the object (a box or a puck depending on the task) with some fixed tolerance of ε, i.e. G = R³ and f_g(s) = [|g - s_object| ≤ ε], where s_object is the position of the object in the state s. The mapping from states to goals used in HER is simply m(s) = s_object.
Rewards: Unless stated otherwise we use binary and sparse rewards r(s, a, g) = -[f_g(s') = 0], where s' is the state after the execution of the action a in the state s. We compare sparse and shaped reward functions in Sec. 4.4.
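A small sketch of this sparse reward and the underlying goal predicate; the helper names and the default tolerance value are assumptions made for illustration.

```python
import numpy as np

def f_g(object_position, goal, eps=0.05):
    """Goal predicate: 1 if the object is within eps of the goal position."""
    return 1 if np.linalg.norm(goal - object_position) <= eps else 0

def sparse_reward(next_object_position, goal, eps=0.05):
    """r(s, a, g) = -[f_g(s') = 0], evaluated on the state after the action."""
    return 0.0 if f_g(next_object_position, goal, eps) else -1.0
```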
State-goal distributions: For all tasks the initial position of the gripper is fixed, while the initial position of the object and the target are randomized. See Appendix A for details.
3 This was necessary because we could not successfully train any policies for this task without using the demonstration state. We have later discovered that training is possible without this trick if only the goal position is sometimes on the table and sometimes in the air.
Observations: In this paragraph "relative" means relative to the current gripper position. The policy is given as input the absolute position of the gripper, the relative position of the object and the target as well as the distance between the fingers. The Q-function is additionally given the linear velocity of the gripper and fingers as well as the relative linear and angular velocity of the object. We decided to restrict the input to the policy in order to make deployment on the physical robot easier.
Actions: None of the problems we consider require gripper rotation and therefore we keep it fixed. The action space is 4-dimensional. Three dimensions specify the desired relative gripper position at the next timestep. We use MuJoCo constraints to move the gripper towards the desired position but Jacobian-based control could be used instead. The last dimension specifies the desired distance between the 2 fingers which are position controlled.
Strategy S for sampling goals for replay: Unless stated otherwise HER uses replay with the goal corresponding to the final state in each episode, i.e. S(s_0, ..., s_T) = m(s_T). We compare different strategies for choosing which goals to replay with in Sec. 4.5.
4.2 Does HER improve performance?
In order to verify if HER improves performance we evaluate DDPG with and without HER on all 3 tasks. Moreover, we compare against DDPG with count-based exploration (Strehl and Littman, 2005; Kolter and Ng, 2009; Tang et al., 2016; Bellemare et al., 2016; Ostrovski et al., 2017). For HER we store each transition in the replay buffer twice: once with the goal used for the generation of the episode and once with the goal corresponding to the final state from the episode (we call this strategy final). In Sec. 4.5 we perform ablation studies of different strategies S for choosing goals for replay; here we include the best version from Sec. 4.5 in the plot for comparison.
Figure 3: Learning curves for the multi-goal setup. An episode is considered successful if the distance between the object and the goal at the end of the episode is less than 7cm for pushing and pick-and-place and less than 20cm for sliding. The results are averaged across 5 random seeds and shaded areas represent one standard deviation. The red curves correspond to the future strategy with k = 4 from Sec. 4.5 while the blue ones correspond to the final strategy. (Panels: pushing, sliding, pick-and-place; curves: DDPG, DDPG+count-based exploration, DDPG+HER, DDPG+HER (version from Sec. 4.5); x-axis: epoch number, every epoch = 800 episodes = 800×50 timesteps.)
From Fig. 3 it is clear that DDPG without HER is unable to solve any of the tasks and DDPG with count-based exploration is only able to make some progress on the sliding task. On the other hand, DDPG with HER solves all tasks almost perfectly. It confirms that HER is a crucial element which makes learning from sparse, binary rewards possible.
4 The target position is relative to the current object position.
5 The successful deployment on a physical robot (Sec. 4.6) confirms that our control model produces movements which are reproducible on the physical robot despite not being fully physically plausible.
6 We discretize the state space and use an intrinsic reward of the form α/√N, where α is a hyperparameter and N is the number of times the given state was visited. The discretization works as follows. We take the relative position of the box and the target and then discretize every coordinate using a grid with a stepsize β which is a hyperparameter. We have performed a hyperparameter search over α ∈ {0.032, 0.064, 0.125, 0.25, 0.5, 1, 2, 4, 8, 16, 32}, β ∈ {1cm, 2cm, 4cm, 8cm}. The best results were obtained using α = 1 and β = 1cm and these are the results we report.
7 We also evaluated DQN (without HER) on our tasks and it was not able to solve any of them.
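A minimal sketch of the count-based bonus described in this footnote; the dictionary-based count table and the state-key construction are assumptions made for illustration.

```python
import numpy as np
from collections import defaultdict

class CountBasedBonus:
    """Intrinsic reward alpha / sqrt(N(s)) over a discretized relative position."""

    def __init__(self, alpha=1.0, beta=0.01):
        self.alpha = alpha              # bonus scale
        self.beta = beta                # grid stepsize in meters (1cm)
        self.counts = defaultdict(int)

    def bonus(self, box_position, target_position):
        # Discretize every coordinate of the box-to-target offset on a beta grid.
        relative = np.asarray(box_position) - np.asarray(target_position)
        key = tuple(np.floor(relative / self.beta).astype(int))
        self.counts[key] += 1
        return self.alpha / np.sqrt(self.counts[key])
```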
Figure 4: Learning curves for the single-goal case. (Panels: pushing, sliding, pick-and-place; curves: DDPG, DDPG+count-based exploration, DDPG+HER; y-axis: success rate; x-axis: epoch number, every epoch = 800 episodes = 800×50 timesteps.)
4.3 Does HER improve performance even if there is only one goal we care about?
In this section we evaluate whether HER improves performance in the case where there is only one goal we care about. To this end, we repeat the experiments from the previous section but the goal state is identical in all episodes.
From Fig. 4 it is clear that DDPG+HER performs much better than pure DDPG even if the goal state is identical in all episodes. More importantly, comparing Fig. 3 and Fig. 4 we can also notice that HER learns faster if training episodes contain multiple goals, so in practice it is advisable to train on multiple goals even if we care only about one of them.
4.4 How does HER interact with reward shaping?
So far we only considered binary rewards of the form r(s, a, g) = -[|g - s_object| > ε]. In this section we check how the performance of DDPG with and without HER changes if we replace this reward with one which is shaped. We considered reward functions of the form r(s, a, g) = λ|g - s_object|^p - |g - s'_object|^p, where s' is the state of the environment after the execution of the action a in the state s and λ ∈ {0, 1}, p ∈ {1, 2} are hyperparameters.
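As an illustration of this family of shaped rewards, a small sketch; extracting the object positions from the states is assumed to happen outside this hypothetical helper.

```python
import numpy as np

def shaped_reward(goal, object_pos, next_object_pos, lam=1, p=2):
    """r(s, a, g) = lam * |g - s_object|^p - |g - s'_object|^p.

    With lam = 1 this rewards reducing the distance to the goal between
    consecutive states; with lam = 0 it only penalizes the remaining distance.
    """
    before = np.linalg.norm(goal - object_pos) ** p
    after = np.linalg.norm(goal - next_object_pos) ** p
    return lam * before - after
```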
Fig. 5 shows the results. Surprisingly neither DDPG, nor DDPG+HER was able to successfully solve any of the tasks with any of these reward functions. Our results are consistent with the fact that successful applications of RL to difficult manipulation tasks which do not use demonstrations usually have more complicated reward functions than the ones we tried (e.g. Popov et al. (2017)).
The following two reasons can cause shaped rewards to perform so poorly: (1) There is a huge discrepancy between what we optimize (i.e. a shaped reward function) and the success condition (i.e. is the object within some radius from the goal at the end of the episode); (2) Shaped rewards penalize for inappropriate behaviour (e.g. moving the box in a wrong direction) which may hinder exploration. It can cause the agent to learn not to touch the box at all if it can not manipulate it precisely and we noticed such behaviour in some of our experiments.
Our results suggest that domain-agnostic reward shaping does not work well (at least in the simple forms we have tried). Of course for every problem there exists a reward which makes it easy (Ng et al., 1999) but designing such shaped rewards requires a lot of domain knowledge and may in some cases not be much easier than directly scripting the policy. This strengthens our belief that learning from sparse, binary rewards is an important problem.
8 We also tried to rescale the distances, so that the range of rewards is similar to the case of binary rewards, clipping big distances and adding a simple (linear or quadratic) term encouraging the gripper to move towards the object, but none of these techniques has led to successful training.
Figure 5: Learning curves for the shaped reward r(s, a, g) = -|g - s'_object|² (it performed best among the shaped rewards we have tried). Both algorithms fail on all tasks. (Panels: pushing, sliding, pick-and-place; curves: DDPG, DDPG+HER; x-axis: epoch number, every epoch = 800 episodes = 800×50 timesteps.)
4.5 How many goals should we replay each trajectory with and how to choose them?
In this section we experimentally evaluate different strategies (i.e. S in Alg. 1) for choosing goals to use with HER. So far the only additional goals we used for replay were the ones corresponding to
the final state of the environment and we will call this strategy final. Apart from it we consider the following strategies:
• future - replay with k random states which come from the same episode as the transition being replayed and were observed after it,
• episode - replay with k random states coming from the same episode as the transition being replayed,
• random - replay with k random states encountered so far in the whole training procedure.
All of these strategies have a hyperparameter k which controls the ratio of HER data to data coming from normal experience replay in the replay buffer.
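A minimal sketch of these sampling strategies (plus final), assuming an episode is stored as a list of states and m maps a state to a goal it satisfies; all names here are illustrative.

```python
import random

def sample_her_goals(strategy, episode_states, t, all_states, m, k=4):
    """Return up to k hindsight goals for the transition at index t.

    episode_states: list s_0, ..., s_T of the current episode's states.
    all_states: states encountered so far in the whole training procedure.
    """
    if strategy == "final":
        candidates = [episode_states[-1]]                  # goal achieved at s_T
    elif strategy == "future":
        candidates = episode_states[t + 1:]                # states observed after t
    elif strategy == "episode":
        candidates = episode_states                        # any state of the episode
    elif strategy == "random":
        candidates = all_states                            # any state seen so far
    else:
        raise ValueError(strategy)

    chosen = random.choices(candidates, k=k) if candidates else []
    return [m(s) for s in chosen]
```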
The plots comparing different strategies and different values of k can be found in Fig. 6. We can see from the plots that all strategies apart from random solve pushing and pick-and-place almost perfectly regardless of the values of k. In all cases future with k equal to 4 or 8 performs best and it is the only strategy which is able to solve the sliding task almost perfectly. The learning curves for future with k = 4 can be found in Fig. 3. It confirms that the most valuable goals for replay are the ones which are going to be achieved in the near future. Notice that increasing the value of k above 8 degrades performance because the fraction of normal replay data in the buffer becomes very low.
Figure 6: Ablation study of different strategies for choosing additional goals for replay. The top row shows the highest (across the training epochs) test performance and the bottom row shows the average test performance across all training epochs. On the top right plot the curves for final, episode and future coincide as all these strategies achieve perfect performance on this task. (Panels: pushing, sliding, pick-and-place; curves: no HER, final, episode, future, random; x-axis: number of additional goals used to replay each transition with.)
Figure 7: The pick-and-place policy deployed on the physical robot.
4.6 Deployment on a physical robot
We took a policy for the pick-and-place task trained in the simulator (version with the future strategy and k = 4 from Sec. 4.5) and deployed it on a physical Fetch robot without any finetuning. The box position was predicted using a separately trained CNN using raw Fetch head camera images. See Appendix B for details.
Initially the policy succeeded in 2 out of 5 trials. It was not robust to small errors in the box position estimation because it was trained on perfect state coming from the simulation. After retraining the policy with Gaussian noise (std = 1cm) added to observations the success rate increased to 5/5. The video showing some of the trials is available at https://goo.gl/smrqni.
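A tiny sketch of the observation-noise augmentation mentioned above; the wrapper itself is hypothetical and only the 1cm standard deviation comes from the text.

```python
import numpy as np

def noisy_observation(observation, std=0.01, rng=None):
    """Add zero-mean Gaussian noise (std = 1cm, in meters) to the observation
    during training so the policy tolerates small position-estimation errors."""
    rng = np.random.default_rng() if rng is None else rng
    return np.asarray(observation) + rng.normal(0.0, std, size=np.shape(observation))
```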
5 Related work
The technique of experience replay was introduced in Lin (1992) and became very popular after it was used in the DQN agent playing Atari (Mnih et al., 2015). Prioritized experience replay (Schaul et al., 2015b) is an improvement to experience replay which prioritizes transitions in the replay buffer in order to speed up training. It is orthogonal to our work and both approaches can be easily combined.
Learning policies simultaneously for multiple tasks has been heavily explored in the context of policy search, e.g. Schmidhuber and Huber (1990); Caruana (1998); Da Silva et al. (2012); Kober et al. (2012); Devin et al. (2016); Pinto and Gupta (2016). Learning off-policy value functions for multiple tasks was investigated by Foster and Dayan (2002) and Sutton et al. (2011). Our work is most heavily based on Schaul et al. (2015a) who consider training a single neural network approximating multiple value functions. Learning to perform multiple tasks simultaneously has also been investigated for a long time in the context of Hierarchical Reinforcement Learning, e.g. Bakker and Schmidhuber (2004); Vezhnevets et al. (2017).
Our approach may be seen as a form of implicit curriculum learning (Elman, 1993; Bengio et al., 2009). While curriculum is now often used for training neural networks (e.g. Zaremba and Sutskever (2014); Graves et al. (2016)), the curriculum is almost always hand-crafted. The problem of automatic curriculum generation was approached by Schmidhuber (2004) who constructed an asymptotically optimal algorithm for this problem using program search. Another interesting approach is PowerPlay (Schmidhuber, 2013; Srivastava et al., 2013) which is a general framework for automatic task selection. Graves et al. (2017) consider a setup where there is a fixed discrete set of tasks and empirically evaluate different strategies for automatic curriculum generation in this setting. Another approach investigated by Sukhbaatar et al. (2017) and Held et al. (2017) uses self-play between the policy and a task-setter in order to automatically generate goal states which are on the border of what the current policy can achieve. Our approach is orthogonal to these techniques and can be combined with them.
9 We have also tried replaying the goals which are close to the ones achieved in the near future but it has not performed better than the future strategy.
10 The Q-function approximator was trained using exact observations. It does not have to be robust to noisy observations because it is not used during the deployment on the physical robot.