HIGH-DIMENSIONAL CONTINUOUS CONTROL USING GENERALIZED ADVANTAGE ESTIMATION

Published as a conference paper at ICLR 2016
Here, the subscript of E enumerates the variables being integrated over, where states and actions are sampled sequentially from the dynamics model P(s_{t+1} | s_t, a_t) and policy π_θ(a_t | s_t), respectively. The colon notation a:b refers to the inclusive range (a, a+1, ..., b). These formulas are well known and straightforward to obtain; they follow directly from Proposition 1, which will be stated shortly.
The choice Ψ_t = A^{π}(s_t, a_t) yields almost the lowest possible variance, though in practice the advantage function is not known and must be estimated. This statement can be intuitively justified by the following interpretation of the policy gradient: a step in the policy gradient direction should increase the probability of better-than-average actions and decrease the probability of worse-than-average actions. The advantage function, by its definition A^{π}(s, a) = Q^{π}(s, a) − V^{π}(s), measures whether the action is better or worse than the policy's default behavior. Hence, we should choose Ψ_t to be the advantage function A^{π}(s_t, a_t), so that the gradient term Ψ_t ∇_θ log π_θ(a_t | s_t) points in the direction of increased π_θ(a_t | s_t) if and only if A^{π}(s_t, a_t) > 0. See Greensmith et al. (2004) for a more rigorous analysis of the variance of policy gradient estimators and the effect of using a baseline.
We will introduce a parameter γ that allows us to reduce variance by downweighting rewards corresponding to delayed effects, at the cost of introducing bias. This parameter corresponds to the discount factor used in discounted formulations of MDPs, but we treat it as a variance-reduction parameter in an undiscounted problem; this technique was analyzed theoretically by Marbach & Tsitsiklis (2003); Kakade (2001b); Thomas (2014). The discounted value functions are given by:
V^{π,γ}(s_t) := E_{s_{t+1:∞}, a_{t:∞}} [ Σ_{l=0}^{∞} γ^l r_{t+l} ],        Q^{π,γ}(s_t, a_t) := E_{s_{t+1:∞}, a_{t+1:∞}} [ Σ_{l=0}^{∞} γ^l r_{t+l} ],    (4)

A^{π,γ}(s_t, a_t) := Q^{π,γ}(s_t, a_t) − V^{π,γ}(s_t).    (5)
The discounted approximation to the policy gradient is defined as follows:

g^γ := E_{s_{0:∞}, a_{0:∞}} [ Σ_{t=0}^{∞} A^{π,γ}(s_t, a_t) ∇_θ log π_θ(a_t | s_t) ].    (6)
The following section discusses how to obtain biased (but not too biased) estimators for A^{π,γ}, giving us noisy estimates of the discounted policy gradient in Equation (6).
Before proceeding, we will introduce the notion of a γ-just estimator of the advantage function, which is an estimator that does not introduce bias when we use it in place of A^{π,γ} (which is not known and must be estimated) in Equation (6) to estimate g^γ. Consider an advantage estimator Â_t(s_{0:∞}, a_{0:∞}), which may in general be a function of the entire trajectory.
Definition 1. The estimator Â_t is γ-just if

E_{s_{0:∞}, a_{0:∞}} [ Â_t(s_{0:∞}, a_{0:∞}) ∇_θ log π_θ(a_t | s_t) ] = E_{s_{0:∞}, a_{0:∞}} [ A^{π,γ}(s_t, a_t) ∇_θ log π_θ(a_t | s_t) ].    (7)

It follows immediately that if Â_t is γ-just for all t, then

E_{s_{0:∞}, a_{0:∞}} [ Σ_{t=0}^{∞} Â_t(s_{0:∞}, a_{0:∞}) ∇_θ log π_θ(a_t | s_t) ] = g^γ.    (8)
One sufficient condition for Â_t to be γ-just is that Â_t decomposes as the difference between two functions Q_t and b_t, where Q_t can depend on any trajectory variables but gives an unbiased estimator of the γ-discounted Q-function, and b_t is an arbitrary function of the states and actions sampled before a_t.
Proposition 1. Suppose that Â_t can be written in the form Â_t(s_{0:∞}, a_{0:∞}) = Q_t(s_{t:∞}, a_{t:∞}) − b_t(s_{0:t}, a_{0:t−1}) such that for all (s_t, a_t), E_{s_{t+1:∞}, a_{t+1:∞} | s_t, a_t} [ Q_t(s_{t:∞}, a_{t:∞}) ] = Q^{π,γ}(s_t, a_t). Then Â_t is γ-just.
¹Note that we have already introduced bias by using A^{π,γ} in place of A^{π}; here we are concerned with obtaining an unbiased estimate of g^γ, which is itself a biased estimate of the policy gradient of the undiscounted MDP.
The proof is provided in Appendix B. It is easy to verify that the following expressions are γ-just advantage estimators for Â_t:

• Σ_{l=0}^{∞} γ^l r_{t+l}
• A^{π,γ}(s_t, a_t)
• Q^{π,γ}(s_t, a_t)
• r_t + γ V^{π,γ}(s_{t+1}) − V^{π,γ}(s_t)
3 ADVANTAGE FUNCTION ESTIMATION
This section will be concerned with producing an accurate estimate Â_t of the discounted advantage function A^{π,γ}(s_t, a_t), which will then be used to construct a policy gradient estimator of the following form:
ĝ = (1/N) Σ_{n=1}^{N} Σ_{t=0}^{∞} Â_t^n ∇_θ log π_θ(a_t^n | s_t^n),    (9)

where n indexes over a batch of episodes.
Let V be an approximate value function. Define δ_t^V = r_t + γ V(s_{t+1}) − V(s_t), i.e., the TD residual of V with discount γ (Sutton & Barto, 1998). Note that δ_t^V can be considered as an estimate of the advantage of the action a_t. In fact, if we have the correct value function V = V^{π,γ}, then it is a γ-just advantage estimator, and in fact an unbiased estimator of A^{π,γ}:
E_{s_{t+1}} [ δ_t^{V^{π,γ}} ] = E_{s_{t+1}} [ r_t + γ V^{π,γ}(s_{t+1}) − V^{π,γ}(s_t) ] = E_{s_{t+1}} [ Q^{π,γ}(s_t, a_t) ] − V^{π,γ}(s_t) = A^{π,γ}(s_t, a_t).    (10)
However, this estimator is only γ-just for V = V^{π,γ}; otherwise it will yield biased policy gradient estimates.
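As a concrete illustration, the TD residual above can be computed for a whole rollout with a few lines of array arithmetic. The sketch below is not from the paper; it assumes a finite rollout stored as NumPy arrays, with `values` holding one extra bootstrap entry V(s_T) at the end.

import numpy as np

def td_residuals(rewards, values, gamma):
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), computed for t = 0, ..., T-1.
    # rewards: array of length T; values: array of length T + 1 (includes V(s_T)).
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    return rewards + gamma * values[1:] - values[:-1]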
Next, let us consider taking the sum of k of these δ terms, which we will denote by Â_t^{(k)}:
Â_t^{(1)} := δ_t^V = −V(s_t) + r_t + γ V(s_{t+1})    (11)
Â_t^{(2)} := δ_t^V + γ δ_{t+1}^V = −V(s_t) + r_t + γ r_{t+1} + γ² V(s_{t+2})    (12)
Â_t^{(3)} := δ_t^V + γ δ_{t+1}^V + γ² δ_{t+2}^V = −V(s_t) + r_t + γ r_{t+1} + γ² r_{t+2} + γ³ V(s_{t+3})    (13)
Â_t^{(k)} := Σ_{l=0}^{k−1} γ^l δ_{t+l}^V = −V(s_t) + r_t + γ r_{t+1} + ⋯ + γ^{k−1} r_{t+k−1} + γ^k V(s_{t+k})    (14)
These equations result from a telescoping sum, and we see that Â_t^{(k)} involves a k-step estimate of the returns, minus a baseline term −V(s_t). Analogously to the case of δ_t^V = Â_t^{(1)}, we can consider Â_t^{(k)} to be an estimator of the advantage function, which is only γ-just when V = V^{π,γ}. However, note that the bias generally becomes smaller as k → ∞, since the term γ^k V(s_{t+k}) becomes more heavily discounted, and the term −V(s_t) does not affect the bias. Taking k → ∞, we get
Â_t^{(∞)} = Σ_{l=0}^{∞} γ^l δ_{t+l}^V = −V(s_t) + Σ_{l=0}^{∞} γ^l r_{t+l},    (15)

which is simply the empirical returns minus the value function baseline.
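The telescoping identity behind Equation (14) is easy to check numerically. The following sketch is not from the paper and the array names are illustrative; it computes Â_t^{(k)} both ways, as a k-step return minus the baseline and as a discounted sum of TD residuals, and asserts they agree.

import numpy as np

def k_step_advantage(rewards, values, gamma, t, k):
    # -V(s_t) + r_t + gamma*r_{t+1} + ... + gamma^{k-1}*r_{t+k-1} + gamma^k * V(s_{t+k})
    ret = sum(gamma ** l * rewards[t + l] for l in range(k))
    return -values[t] + ret + gamma ** k * values[t + k]

def k_step_advantage_from_deltas(rewards, values, gamma, t, k):
    # Equivalent telescoping form: sum_{l=0}^{k-1} gamma^l * delta_{t+l}
    deltas = rewards + gamma * values[1:] - values[:-1]
    return sum(gamma ** l * deltas[t + l] for l in range(k))

rng = np.random.default_rng(0)
rewards = rng.normal(size=20)          # stand-in rewards r_0 ... r_19
values = rng.normal(size=21)           # stand-in value estimates V(s_0) ... V(s_20)
a = k_step_advantage(rewards, values, gamma=0.99, t=3, k=5)
b = k_step_advantage_from_deltas(rewards, values, gamma=0.99, t=3, k=5)
assert np.isclose(a, b)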
The generalized advantage estimator GAE(γ, λ) is defined as the exponentially-weighted average of these k-step estimators:
Â_t^{GAE(γ,λ)} := (1−λ) ( Â_t^{(1)} + λ Â_t^{(2)} + λ² Â_t^{(3)} + ⋯ )
              = (1−λ) ( δ_t^V + λ (δ_t^V + γ δ_{t+1}^V) + λ² (δ_t^V + γ δ_{t+1}^V + γ² δ_{t+2}^V) + ⋯ )
              = (1−λ) ( δ_t^V (1 + λ + λ² + ⋯) + γ δ_{t+1}^V (λ + λ² + λ³ + ⋯) + γ² δ_{t+2}^V (λ² + λ³ + λ⁴ + ⋯) + ⋯ )
              = (1−λ) ( δ_t^V · 1/(1−λ) + γ δ_{t+1}^V · λ/(1−λ) + γ² δ_{t+2}^V · λ²/(1−λ) + ⋯ )
              = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}^V    (16)
From Equation (16), we see that the advantage estimator has a remarkably simple formula involving a discounted sum of Bellman residual terms. Section 4 discusses an interpretation of this formula as the returns in an MDP with a modified reward function. The construction we used above is closely analogous to the one used to define TD(λ) (Sutton & Barto, 1998); however, TD(λ) is an estimator of the value function, whereas here we are estimating the advantage function.
There are two notable special cases of this formula, obtained by setting λ = 0 and λ = 1:

GAE(γ, 0):   Â_t := δ_t^V = r_t + γ V(s_{t+1}) − V(s_t)    (17)
GAE(γ, 1):   Â_t := Σ_{l=0}^{∞} γ^l δ_{t+l}^V = Σ_{l=0}^{∞} γ^l r_{t+l} − V(s_t)    (18)
GAE(γ, 1) is γ-just regardless of the accuracy of V, but it has high variance due to the sum of terms. GAE(γ, 0) is γ-just for V = V^{π,γ} and otherwise induces bias, but it typically has much lower variance. The generalized advantage estimator for 0 < λ < 1 makes a compromise between bias and variance, controlled by the parameter λ.
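In practice, Equation (16) is usually computed with a single backward pass over a finite rollout. The sketch below is a minimal implementation of this recursion, not code from the paper; it assumes the rollout is truncated at length T and that `values` contains a final entry V(s_T) used for bootstrapping.

import numpy as np

def gae_advantages(rewards, values, gamma, lam):
    # Generalized advantage estimates, Equation (16), truncated at the rollout length.
    # rewards: r_0 ... r_{T-1};  values: V(s_0) ... V(s_T)  (length T + 1).
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        # A_t = delta_t + (gamma * lam) * A_{t+1}
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

With lam=0 this reduces to the one-step estimator of Equation (17); with lam=1 it reduces to the bootstrapped empirical return minus the baseline, matching Equation (18) up to the truncation at T.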
4 INTERPRETATION AS REWARD SHAPING

Let Φ : S → ℝ be an arbitrary scalar-valued function on state space, and define the transformed reward function r̃ by

r̃(s, a, s′) = r(s, a, s′) + γ Φ(s′) − Φ(s),    (20)
which in turn defines a transformed MDP. This transformation leaves the discounted advantage function A^{π,γ} unchanged for any policy π. To see this, consider the discounted sum of rewards of a trajectory starting with state s_t:
Σ_{l=0}^{∞} γ^l r̃(s_{t+l}, a_{t+l}, s_{t+l+1}) = Σ_{l=0}^{∞} γ^l r(s_{t+l}, a_{t+l}, s_{t+l+1}) − Φ(s_t).    (21)
Letting Q̃^{π,γ}, Ṽ^{π,γ}, Ã^{π,γ} be the value and advantage functions of the transformed MDP, one obtains from the definitions of these quantities that
Q̃^{π,γ}(s, a) = Q^{π,γ}(s, a) − Φ(s)    (22)
Ṽ^{π,γ}(s) = V^{π,γ}(s) − Φ(s)    (23)
Ã^{π,γ}(s, a) = ( Q^{π,γ}(s, a) − Φ(s) ) − ( V^{π,γ}(s) − Φ(s) ) = A^{π,γ}(s, a)    (24)
Note that if Φ happens to be the state-value function V^{π,γ} from the original MDP, then the transformed MDP has the interesting property that Ṽ^{π,γ}(s) is zero at every state.

Note that Ng et al. (1999) showed that the reward shaping transformation leaves the policy gradient and optimal policy unchanged when our objective is to maximize the discounted sum of rewards Σ_{t=0}^{∞} γ^t r(s_t, a_t, s_{t+1}). In contrast, this paper is concerned with maximizing the undiscounted sum of rewards, where the discount γ is used as a variance-reduction parameter.
Having reviewed the idea of reward shaping, let us consider how we could use it to get a policy gradient estimate. The most natural approach is to construct policy gradient estimators that use discounted sums of shaped rewards r̃. However, Equation (21) shows that we would simply obtain the discounted sum of the original MDP's rewards minus a baseline term. Next, let us consider using a "steeper" discount γλ, where 0 ≤ λ ≤ 1. It is easy to see that the shaped reward r̃ equals the Bellman residual term δ^V introduced in Section 3, where we set Φ = V. Letting Φ = V, we see that
Σ_{l=0}^{∞} (γλ)^l r̃(s_{t+l}, a_{t+l}, s_{t+l+1}) = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}^V = Â_t^{GAE(γ,λ)}.    (25)
Hence, by considering the γλ-discounted sum of shaped rewards, we exactly obtain the generalized advantage estimators from Section 3. As shown previously, λ = 1 gives an unbiased estimate of g^γ, whereas λ < 1 gives a biased estimate.
To further analyze the effect of this shaping transformation and the parameters γ and λ, it will be useful to introduce the notion of a response function χ, which we define as follows:
χ(l; s_t, a_t) = E[ r_{t+l} | s_t, a_t ] − E[ r_{t+l} | s_t ].    (26)
Note that A^{π,γ}(s, a) = Σ_{l=0}^{∞} γ^l χ(l; s, a); hence the response function decomposes the advantage function across timesteps. The response function lets us quantify the temporal credit assignment problem: long-range dependencies between actions and rewards correspond to nonzero values of the response function for l ≫ 0.
Next, let us revisit the discount factor γ and the approximation we are making by using A^{π,γ} rather than A^{π,1}. The discounted policy gradient estimator from Equation (6) has a sum of terms of the form
∇_θ log π_θ(a_t | s_t) A^{π,γ}(s_t, a_t) = ∇_θ log π_θ(a_t | s_t) Σ_{l=0}^{∞} γ^l χ(l; s_t, a_t).    (27)

Using a discount γ < 1 corresponds to dropping the terms with l ≫ 1/(1−γ). Thus, the error introduced by this approximation will be small if χ decays rapidly as l increases, i.e., if the effect of an action on rewards is "forgotten" after roughly 1/(1−γ) timesteps. If the shaped reward r̃ were obtained using Φ = V^{π,γ}, the response function of the transformed MDP would be nonzero only at l = 0, so the shaping transformation would turn a temporally extended response into an immediate one.
Given that V^{π,γ} completely reduces the temporal spread of the response function, we can hope that a good approximation V ≈ V^{π,γ} partially reduces it. This observation suggests an interpretation of Equation (16): reshape the rewards using V to shrink the temporal extent of the response function, and then introduce a "steeper" discount γλ to cut off the noise arising from long delays, i.e., ignore terms ∇_θ log π_θ(a_t | s_t) δ_{t+l}^V where l ≫ 1/(1−γλ).
5 VALUE FUNCTION ESTIMATION
A variety of different methods can be used to estimate the value function (see, e.g., Bertsekas (2012)). When using a nonlinear function approximator to represent the value function, the simplest approach is to solve a nonlinear regression problem:
minimize_φ   Σ_{n=1}^{N} ‖ V_φ(s_n) − V̂_n ‖²,    (28)
where V̂_t = Σ_{l=0}^{∞} γ^l r_{t+l} is the discounted sum of rewards, and n indexes over all timesteps in a batch of trajectories. This is sometimes called the Monte Carlo or TD(1) approach for estimating the value function (Sutton & Barto, 1998).²
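As a small illustration of Equation (28), the regression targets and the unconstrained objective can be computed as follows. This is a sketch rather than the paper's code, and it assumes a finite, terminated trajectory.

import numpy as np

def discounted_returns(rewards, gamma):
    # Regression targets \hat{V}_t = sum_{l >= 0} gamma^l r_{t+l} for a terminated trajectory.
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def value_regression_objective(predictions, targets):
    # Unconstrained objective of Equation (28): sum_n || V_phi(s_n) - \hat{V}_n ||^2
    predictions = np.asarray(predictions, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    return float(np.sum((predictions - targets) ** 2))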
For the experiments in this work, we used a trust region method to optimize the value function in each iteration of a batch optimization procedure. The trust region helps us to avoid overfitting to the most recent batch of data. To formulate the trust region problem, we first compute σ² = (1/N) Σ_{n=1}^{N} ‖ V_{φ_old}(s_n) − V̂_n ‖², where φ_old is the parameter vector before optimization. Then we solve the following constrained optimization problem:
minimize_φ   Σ_{n=1}^{N} ‖ V_φ(s_n) − V̂_n ‖²

subject to   (1/N) Σ_{n=1}^{N} ‖ V_φ(s_n) − V_{φ_old}(s_n) ‖² / (2σ²) ≤ ε.    (29)
This constraint is equivalent to constraining the average KL divergence between the previous value function and the new value function to be smaller than ε, where the value function is taken to parameterize a conditional Gaussian distribution with mean V_φ(s) and variance σ².
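To see the equivalence, recall that the KL divergence between two Gaussians with the same variance σ² reduces to a scaled squared difference of their means, so averaging over the batch recovers exactly the constraint in Equation (29):

D_KL( N(V_{φ_old}(s_n), σ²) ‖ N(V_φ(s_n), σ²) ) = ( V_φ(s_n) − V_{φ_old}(s_n) )² / (2σ²),

(1/N) Σ_{n=1}^{N} D_KL(·) ≤ ε   ⇔   (1/N) Σ_{n=1}^{N} ‖ V_φ(s_n) − V_{φ_old}(s_n) ‖² / (2σ²) ≤ ε.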
We compute an approximate solution to the trust region problem using the conjugate gradient algorithm (Wright & Nocedal, 1999). Specifically, we are solving the quadratic program
minimize_φ   g^T (φ − φ_old)

subject to   (1/2) (φ − φ_old)^T H (φ − φ_old) ≤ ε,    (30)
where g is the gradient of the objective, and H = (1/N) Σ_n j_n j_n^T, where j_n = ∇_φ V_φ(s_n). Note that H is the "Gauss-Newton" approximation of the Hessian of the objective, and it is (up to a σ² factor) the Fisher information matrix when interpreting the value function as a conditional probability distribution. Using matrix-vector products v ↦ Hv to implement the conjugate gradient algorithm, we compute a step direction s ≈ −H^{−1} g. Then we rescale s → αs such that (1/2)(αs)^T H (αs) = ε and take φ = φ_old + αs. This procedure is analogous to the procedure we use for updating the policy, which is described further in Section 6 and based on Schulman et al. (2015).
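The following sketch illustrates the Gauss-Newton / conjugate-gradient step of Equation (30) for the special case of a linear value function V_φ(s) = φ · f(s), so that each Jacobian j_n is simply the feature vector f(s_n). It is an illustrative reconstruction rather than the paper's implementation; the names `features`, `targets`, and the default `epsilon` are assumptions.

import numpy as np

def trust_region_value_step(features, values_old, targets, epsilon=0.01, cg_iters=10):
    # One approximate trust-region update for a linear value function V_phi(s) = phi . f(s).
    # features:   (N, d) matrix of state features f(s_n)        [illustrative setup]
    # values_old: (N,) predictions of the old parameters
    # targets:    (N,) discounted-return targets \hat{V}_n
    # Returns the parameter increment (phi - phi_old).
    N = features.shape[0]
    # Gradient of the mean squared error at phi_old.
    g = (2.0 / N) * features.T @ (values_old - targets)

    def hvp(v):
        # Gauss-Newton Hessian-vector product: H v = (1/N) sum_n j_n (j_n . v)
        return features.T @ (features @ v) / N

    # Conjugate gradient for s ~= -H^{-1} g, using only matrix-vector products.
    x = np.zeros_like(g)
    r = -g.copy()
    p = r.copy()
    rr = r @ r
    for _ in range(cg_iters):
        Hp = hvp(p)
        alpha = rr / (p @ Hp + 1e-10)
        x += alpha * p
        r -= alpha * Hp
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new

    # Rescale so that 0.5 * (a s)^T H (a s) = epsilon, then return the step a s.
    scale = np.sqrt(2.0 * epsilon / (x @ hvp(x) + 1e-10))
    return scale * x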
6 EXPERIMENTS

We designed a set of experiments to investigate the following questions:

1. What is the empirical effect of varying λ ∈ [0, 1] and γ ∈ [0, 1] when optimizing episodic total reward using generalized advantage estimation?

2. Can generalized advantage estimation, along with trust region algorithms for policy and value function optimization, be used to optimize large neural network policies for challenging control problems?
²Another natural choice is to compute target values with an estimator based on the TD(λ) backup (Bertsekas, 2012; Sutton & Barto, 1998), mirroring the expression we use for policy gradient estimation: V̂_t = V_{φ_old}(s_t) + Σ_{l=0}^{∞} (γλ)^l δ_{t+l}. While we experimented with this choice, we did not notice a difference in performance from the estimator in Equation (28).
6.1 POLICY OPTIMIZATION ALGORITHM
While generalized advantage estimation can be used along with a variety of different policy gradient methods, for these experiments, we performed the policy updates using trust region policy optimization (TRPO) (Schulman et al., 2015). TRPO updates the policy by approximately solving the following constrained optimization problem each iteration:
minimize_θ   L_{θ_old}(θ)

subject to   D̄_KL^{θ_old}(π_{θ_old}, π_θ) ≤ ε

where   L_{θ_old}(θ) = (1/N) Σ_{n=1}^{N} [ π_θ(a_n | s_n) / π_{θ_old}(a_n | s_n) ] Â_n,
        D̄_KL^{θ_old}(π_{θ_old}, π_θ) = (1/N) Σ_{n=1}^{N} D_KL( π_{θ_old}(· | s_n) ‖ π_θ(· | s_n) ).    (31)
As described in Schulman et al. (2015), we approximately solve this problem by linearizing the objective and quadraticizing the constraint, which yields a step in the direction θ − θ_old ∝ −F^{−1} g, where F is the average Fisher information matrix and g is a policy gradient estimate. This policy update yields the same step direction as the natural policy gradient (Kakade, 2001a) and natural actor-critic (Peters & Schaal, 2008); however, it uses a different stepsize determination scheme and numerical procedure for computing the step.
Since prior work (Schulman et al., 2015) compared TRPO to a variety of different policy optimization algorithms, we will not repeat these comparisons; rather, we will focus on varying the γ, λ parameters of the policy gradient estimator while keeping the underlying algorithm fixed.
For completeness, the whole algorithm for iteratively updating the policy and value function is given below.

Initialize policy parameters θ_0 and value function parameters φ_0.
for i = 0, 1, 2, ... do
    Simulate the current policy π_{θ_i} until N timesteps are obtained.
    Compute δ_t^V at all timesteps t ∈ {1, 2, ..., N}, using V = V_{φ_i}.
    Compute Â_t = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}^V at all timesteps.
    Compute θ_{i+1} with the TRPO update, Equation (31).
    Compute φ_{i+1} with Equation (30).
end for
Note that the policy update θ_i → θ_{i+1} is performed using the value function V_{φ_i} for advantage estimation, not V_{φ_{i+1}}. Additional bias would have been introduced if we updated the value function first. To see this, consider the extreme case where we overfit the value function, and the Bellman residual r_t + γ V(s_{t+1}) − V(s_t) becomes zero at all timesteps; the policy gradient estimate would be zero.
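To make the ordering concrete, here is a minimal sketch of one outer iteration. It is not the paper's code: `collect_rollouts`, `value_fn.predict`, `policy.trpo_update`, and `value_fn.fit` are hypothetical placeholders, and the helpers `gae_advantages` and `discounted_returns` are the ones sketched earlier. The point it illustrates is that advantages use the value function from before the policy update.

def gae_iteration(policy, value_fn, collect_rollouts, gamma, lam):
    # One iteration of the loop above. Advantages use the *current* value function V_{phi_i};
    # the value function is refit only after the policy update, as discussed in the text.
    trajectories = collect_rollouts(policy)                      # simulate pi_{theta_i} for N timesteps
    for traj in trajectories:
        values = value_fn.predict(traj["states"])                # V_{phi_i}(s_0), ..., V_{phi_i}(s_T)
        traj["advantages"] = gae_advantages(traj["rewards"], values, gamma, lam)
        traj["value_targets"] = discounted_returns(traj["rewards"], gamma)   # Equation (28) targets
    policy.trpo_update(trajectories)                             # theta_{i+1} via Equation (31)
    value_fn.fit(trajectories)                                   # phi_{i+1} via Equation (30)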
6.2 EXPERIMENTAL SETUP
We evaluated our approach on the classic cart-pole balancing problem, as well as several challenging 3D locomotion tasks: (1) bipedal locomotion; (2) quadrupedal locomotion; (3) dynamically standing up, for the biped, which starts off lying on its back. The models are shown in Figure 1.
6.2.1 ARCHITECTURE
We used the same neural network architecture for all of the 3D robot tasks, which was a feedforward network with three hidden layers, with 100, 50, and 25 tanh units respectively. The same architecture was used for the policy and value function. The final output layer had linear activation. The value function estimator used the same architecture but with only one scalar output. For the simpler cart-pole task, we used a linear policy, and a neural network with one 20-unit hidden layer as the value function.
Figure 1: Top figures: robot models used for 3D locomotion. Bottom figures: a sequence of frames from the learned gaits. Videos are available at https://sites.google.com/site/gaepapersupp.
6.2.2 TASK DETAILS
For the cart-pole balancing task, we collected 20 trajectories per batch, with a maximum length of 1000 timesteps, using the physical parameters from Barto et al. (1983).
The simulated robot tasks were simulated using the MuJoCo physics engine (Todorov et al., 2012). The humanoid model has 33 state dimensions and 10 actuated degrees of freedom, while the quadruped model has 29 state dimensions and 8 actuated degrees of freedom. The initial state for these tasks consisted of a uniform distribution centered on a reference configuration. We used 50,000 timesteps per batch for bipedal locomotion, and 200,000 timesteps per batch for quadrupedal locomotion and bipedal standing. Each episode was terminated after 2000 timesteps if the robot had not reached a terminal state beforehand. The timestep was 0.01 seconds.
The reward functions are provided in the table below.

Task                      Reward
3D biped locomotion       v_fwd − 10^{−5} ‖u‖² − 10^{−5} ‖f_impact‖² + 0.2
Quadruped locomotion      v_fwd − 10^{−6} ‖u‖² − 10^{−3} ‖f_impact‖² + 0.05
Biped getting up          −(h_head − 1.5)² − 10^{−5} ‖u‖²

Here, v_fwd := forward velocity, u := vector of joint torques, f_impact := impact forces, h_head := height of the head.
In the locomotion tasks, the episode is terminated if the center of mass of the actor falls below a predefined height: 0.8 m for the biped, and 0.2 m for the quadruped. The constant offset in the reward function encourages longer episodes; otherwise the quadratic reward terms might lead to a policy that ends the episodes as quickly as possible.
6.3 EXPERIMENTAL RESULTS
All results are presented in terms of the cost, which is defined as negative reward and is minimized. Videos of the learned policies are available at https://sites.google.com/site/gaepapersupp. In plots, "No VF" means that we used a time-dependent baseline that did not depend on the state, rather than an estimate of the state value function. The time-dependent baseline was computed by averaging the return at each timestep over the trajectories in the batch.
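For reference, the "No VF" baseline described above can be computed as follows. This is an illustrative sketch and the array layout is an assumption: the return at each timestep is averaged over whichever trajectories in the batch reach that timestep.

import numpy as np

def time_dependent_baseline(batch_returns):
    # batch_returns: list of 1-D arrays, one per trajectory, of returns at each timestep.
    # Returns b_t = mean over trajectories of the return at timestep t (no state dependence).
    max_len = max(len(r) for r in batch_returns)
    sums = np.zeros(max_len)
    counts = np.zeros(max_len)
    for r in batch_returns:
        r = np.asarray(r, dtype=np.float64)
        sums[: len(r)] += r
        counts[: len(r)] += 1
    return sums / np.maximum(counts, 1)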
6.3.1 CART-POLE
The results are averaged across 21 experiments with different random seeds. Results are shown in Figure 2, and indicate that the best results are obtained at intermediate values of the parameters: γ ∈ [0.96, 0.99] and λ ∈ [0.92, 0.99].
Figure 2: Left: learning curves for the cart-pole task, using generalized advantage estimation with varying values of λ at γ = 0.99. The fastest policy improvement is obtained by intermediate values of λ in the range [0.92, 0.98]. Right: performance after 20 iterations of policy optimization, as γ and λ are varied. White means higher reward. The best results are obtained at intermediate values of both parameters.
Figure 3: Left: learning curves for 3D bipedal locomotion, averaged across nine runs of the algorithm. Right: learning curves for 3D quadrupedal locomotion, averaged across five runs.
6.3.2 3D BIPEDAL LOCOMOTION
Each trial took about 2 hours to run on a 16-core machine, where the simulation rollouts were parallelized, as were the function, gradient, and matrix-vector-product evaluations used when optimizing the policy and value function. Here, the results are averaged across 9 trials with different random seeds. The best performance is again obtained using intermediate values of γ ∈ [0.99, 0.995], λ ∈ [0.96, 0.99]. The result after 1000 iterations is a fast, smooth gait that is effectively completely stable. We can compute how much "real time" was used for this learning process: 0.01 seconds per timestep × 50,000 timesteps per batch × 1000 batches / (3600 · 24 seconds per day) ≈ 5.8 days. Hence, it is plausible that this algorithm could be run on a real robot, or multiple real robots learning in parallel, if there were a way to reset the state of the robot and ensure that it doesn't damage itself.
6.3.3 OTHER 3D ROBOT TASKS
The other two motor behaviors considered are quadrupedal locomotion and getting up off the ground for the 3D biped. Again, we performed 5 trials per experimental condition, with different random seeds (and initializations). The experiments took about 4 hours per trial on a 32-core machine. We performed a more limited comparison on these domains (due to the substantial computational resources required to run these experiments), fixing γ = 0.995 but varying λ ∈ {1, 0.96}, as well as an experimental condition with no value function. For quadrupedal locomotion, the best results are obtained using a value function with λ = 0.96 (Section 6.3.2). For 3D standing, the value function always helped, but the results are roughly the same for λ = 0.96 and λ = 1.