HIGH-DIMENSIONAL CONTINUOUS CONTROL USING GENERALIZED ADVANTAGE ESTIMATION

Published as a conference paper at ICLR 2016
Here, the subscript of E enumerates the variables being integrated over, where states and actions are sampled sequentially from the dynamics model P(s_{t+1} | s_t, a_t) and policy π_θ(a_t | s_t), respectively. The colon notation a:b refers to the inclusive range (a, a+1, ..., b). These formulas are well known and straightforward to obtain; they follow directly from Proposition 1, which will be stated shortly.
The choice Ψ_t = A^{π}(s_t, a_t) yields almost the lowest possible variance, though in practice the advantage function is not known and must be estimated. This statement can be intuitively justified by the following interpretation of the policy gradient: a step in the policy gradient direction should increase the probability of better-than-average actions and decrease the probability of worse-than-average actions. The advantage function, by its definition A^{π}(s, a) = Q^{π}(s, a) − V^{π}(s), measures whether the action is better or worse than the policy's default behavior. Hence, we should choose Ψ_t to be the advantage function A^{π}(s_t, a_t), so that the gradient term Ψ_t ∇_θ log π_θ(a_t | s_t) points in the direction of increased π_θ(a_t | s_t) if and only if A^{π}(s_t, a_t) > 0. See Greensmith et al. (2004) for a more rigorous analysis of the variance of policy gradient estimators and the effect of using a baseline.
We will introduce a parameter γ that allows us to reduce variance by downweighting rewards corresponding to delayed effects, at the cost of introducing bias. This parameter corresponds to the discount factor used in discounted formulations of MDPs, but we treat it as a variance-reduction parameter in an undiscounted problem; this technique was analyzed theoretically by Marbach & Tsitsiklis (2003); Kakade (2001b); Thomas (2014). The discounted value functions are given by:
V^{π,γ}(s_t) := E_{s_{t+1:∞}, a_{t:∞}} [ Σ_{l=0}^{∞} γ^l r_{t+l} ],        Q^{π,γ}(s_t, a_t) := E_{s_{t+1:∞}, a_{t+1:∞}} [ Σ_{l=0}^{∞} γ^l r_{t+l} ],    (4)

A^{π,γ}(s_t, a_t) := Q^{π,γ}(s_t, a_t) − V^{π,γ}(s_t).    (5)
The discounted approximation to the policy gradient is defined as follows:

g^γ := E_{s_{0:∞}, a_{0:∞}} [ Σ_{t=0}^{∞} A^{π,γ}(s_t, a_t) ∇_θ log π_θ(a_t | s_t) ].    (6)
The following section discusses how to obtain biased (but not too biased) estimators for A^{π,γ}, giving us noisy estimates of the discounted policy gradient in Equation (6).
Before proceeding, we will introduce the notion of a γ-just estimator of the advantage function, which is an estimator that does not introduce bias when we use it in place of A^{π,γ} (which is not known and must be estimated) in Equation (6) to estimate g^γ. Consider an advantage estimator Â_t(s_{0:∞}, a_{0:∞}), which may in general be a function of the entire trajectory.
Definition 1. The estimator Â_t is γ-just if

E_{s_{0:∞}, a_{0:∞}} [ Â_t(s_{0:∞}, a_{0:∞}) ∇_θ log π_θ(a_t | s_t) ] = E_{s_{0:∞}, a_{0:∞}} [ A^{π,γ}(s_t, a_t) ∇_θ log π_θ(a_t | s_t) ].    (7)

It follows immediately that if Â_t is γ-just for all t, then

E_{s_{0:∞}, a_{0:∞}} [ Σ_{t=0}^{∞} Â_t(s_{0:∞}, a_{0:∞}) ∇_θ log π_θ(a_t | s_t) ] = g^γ.    (8)
One sufficient condition for Â_t to be γ-just is that Â_t decomposes as the difference between two functions Q_t and b_t, where Q_t can depend on any trajectory variables but gives an unbiased estimator of the γ-discounted Q-function, and b_t is an arbitrary function of the states and actions sampled before a_t.
Proposition 1. Suppose that Â_t can be written in the form Â_t(s_{0:∞}, a_{0:∞}) = Q_t(s_{t:∞}, a_{t:∞}) − b_t(s_{0:t}, a_{0:t−1}) such that for all (s_t, a_t), E_{s_{t+1:∞}, a_{t+1:∞} | s_t, a_t} [ Q_t(s_{t:∞}, a_{t:∞}) ] = Q^{π,γ}(s_t, a_t). Then Â_t is γ-just.
¹Note that we have already introduced bias by using A^{π,γ} in place of A^{π}; here we are concerned with obtaining an unbiased estimate of g^γ, which is itself a biased estimate of the policy gradient of the undiscounted MDP.
The proof is provided in Appendix B. It is easy to verify that the following expressions are γ-just advantage estimators for Â_t:

• Σ_{l=0}^{∞} γ^l r_{t+l}
• A^{π,γ}(s_t, a_t)
• Q^{π,γ}(s_t, a_t)
• r_t + γ V^{π,γ}(s_{t+1}) − V^{π,γ}(s_t)
3 ADVANTAGE FUNCTION ESTIMATION
This section will be concerned with producing an accurate estimate Â_t of the discounted advantage function A^{π,γ}(s_t, a_t), which will then be used to construct a policy gradient estimator of the following form:
ĝ = (1/N) Σ_{n=1}^{N} Σ_{t=0}^{∞} Â_t^n ∇_θ log π_θ(a_t^n | s_t^n),    (9)

where n indexes over a batch of episodes.
Let V be an approximate value function. Define δ_t^V = r_t + γ V(s_{t+1}) − V(s_t), i.e., the TD residual of V with discount γ (Sutton & Barto, 1998). Note that δ_t^V can be considered as an estimate of the advantage of the action a_t. In fact, if we have the correct value function V = V^{π,γ}, then it is a γ-just advantage estimator, and in fact an unbiased estimator of A^{π,γ}:
E_{s_{t+1}} [ δ_t^{V^{π,γ}} ] = E_{s_{t+1}} [ r_t + γ V^{π,γ}(s_{t+1}) − V^{π,γ}(s_t) ] = E_{s_{t+1}} [ Q^{π,γ}(s_t, a_t) ] − V^{π,γ}(s_t) = A^{π,γ}(s_t, a_t).    (10)
However, this estimator is only γ-just for V = V^{π,γ}; otherwise it will yield biased policy gradient estimates.
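As a concrete illustration, the TD residual above can be computed for a whole rollout with a few lines of array arithmetic. The sketch below is not from the paper; it assumes a finite rollout stored as NumPy arrays, with `values` holding one extra bootstrap entry V(s_T) at the end.

import numpy as np

def td_residuals(rewards, values, gamma):
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), computed for t = 0, ..., T-1.
    # rewards: array of length T; values: array of length T + 1 (includes V(s_T)).
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    return rewards + gamma * values[1:] - values[:-1]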
Next, let us consider taking the sum of k of these δ terms, which we will denote by Â_t^{(k)}:
Â_t^{(1)} := δ_t^V = −V(s_t) + r_t + γ V(s_{t+1})    (11)
Â_t^{(2)} := δ_t^V + γ δ_{t+1}^V = −V(s_t) + r_t + γ r_{t+1} + γ² V(s_{t+2})    (12)
Â_t^{(3)} := δ_t^V + γ δ_{t+1}^V + γ² δ_{t+2}^V = −V(s_t) + r_t + γ r_{t+1} + γ² r_{t+2} + γ³ V(s_{t+3})    (13)
Â_t^{(k)} := Σ_{l=0}^{k−1} γ^l δ_{t+l}^V = −V(s_t) + r_t + γ r_{t+1} + ⋯ + γ^{k−1} r_{t+k−1} + γ^k V(s_{t+k})    (14)
These equations result from a telescoping sum, and we see that Â_t^{(k)} involves a k-step estimate of the returns, minus a baseline term −V(s_t). Analogously to the case of δ_t^V = Â_t^{(1)}, we can consider Â_t^{(k)} to be an estimator of the advantage function, which is only γ-just when V = V^{π,γ}. However, note that the bias generally becomes smaller as k → ∞, since the term γ^k V(s_{t+k}) becomes more heavily discounted, and the term −V(s_t) does not affect the bias. Taking k → ∞, we get
Â_t^{(∞)} = Σ_{l=0}^{∞} γ^l δ_{t+l}^V = −V(s_t) + Σ_{l=0}^{∞} γ^l r_{t+l},    (15)

which is simply the empirical returns minus the value function baseline.
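The telescoping identity behind Equation (14) is easy to check numerically. The following sketch is not from the paper and the array names are illustrative; it computes Â_t^{(k)} both ways, as a k-step return minus the baseline and as a discounted sum of TD residuals, and asserts they agree.

import numpy as np

def k_step_advantage(rewards, values, gamma, t, k):
    # -V(s_t) + r_t + gamma*r_{t+1} + ... + gamma^{k-1}*r_{t+k-1} + gamma^k * V(s_{t+k})
    ret = sum(gamma ** l * rewards[t + l] for l in range(k))
    return -values[t] + ret + gamma ** k * values[t + k]

def k_step_advantage_from_deltas(rewards, values, gamma, t, k):
    # Equivalent telescoping form: sum_{l=0}^{k-1} gamma^l * delta_{t+l}
    deltas = rewards + gamma * values[1:] - values[:-1]
    return sum(gamma ** l * deltas[t + l] for l in range(k))

rng = np.random.default_rng(0)
rewards = rng.normal(size=20)          # stand-in rewards r_0 ... r_19
values = rng.normal(size=21)           # stand-in value estimates V(s_0) ... V(s_20)
a = k_step_advantage(rewards, values, gamma=0.99, t=3, k=5)
b = k_step_advantage_from_deltas(rewards, values, gamma=0.99, t=3, k=5)
assert np.isclose(a, b)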
The generalized advantage estimator GAE(γ, λ) is defined as the exponentially-weighted average of these k-step estimators:
Â_t^{GAE(γ,λ)} := (1−λ) ( Â_t^{(1)} + λ Â_t^{(2)} + λ² Â_t^{(3)} + ⋯ )
              = (1−λ) ( δ_t^V + λ (δ_t^V + γ δ_{t+1}^V) + λ² (δ_t^V + γ δ_{t+1}^V + γ² δ_{t+2}^V) + ⋯ )
              = (1−λ) ( δ_t^V (1 + λ + λ² + ⋯) + γ δ_{t+1}^V (λ + λ² + λ³ + ⋯) + γ² δ_{t+2}^V (λ² + λ³ + λ⁴ + ⋯) + ⋯ )
              = (1−λ) ( δ_t^V · 1/(1−λ) + γ δ_{t+1}^V · λ/(1−λ) + γ² δ_{t+2}^V · λ²/(1−λ) + ⋯ )
              = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}^V    (16)
From Equation (16), we see that the advantage estimator has a remarkably simple formula involving a discounted sum of Bellman residual terms. Section 4 discusses an interpretation of this formula as the returns in an MDP with a modified reward function. The construction we used above is closely analogous to the one used to define TD(λ) (Sutton & Barto, 1998); however, TD(λ) is an estimator of the value function, whereas here we are estimating the advantage function.
There are two notable special cases of this formula, obtained by setting λ = 0 and λ = 1:

GAE(γ, 0):   Â_t := δ_t^V = r_t + γ V(s_{t+1}) − V(s_t)    (17)
GAE(γ, 1):   Â_t := Σ_{l=0}^{∞} γ^l δ_{t+l}^V = Σ_{l=0}^{∞} γ^l r_{t+l} − V(s_t)    (18)
GAE(γ, 1) is γ-just regardless of the accuracy of V, but it has high variance due to the sum of terms. GAE(γ, 0) is γ-just for V = V^{π,γ} and otherwise induces bias, but it typically has much lower variance. The generalized advantage estimator for 0 < λ < 1 makes a compromise between bias and variance, controlled by the parameter λ.
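In practice, Equation (16) is usually computed with a single backward pass over a finite rollout. The sketch below is a minimal implementation of this recursion, not code from the paper; it assumes the rollout is truncated at length T and that `values` contains a final entry V(s_T) used for bootstrapping.

import numpy as np

def gae_advantages(rewards, values, gamma, lam):
    # Generalized advantage estimates, Equation (16), truncated at the rollout length.
    # rewards: r_0 ... r_{T-1};  values: V(s_0) ... V(s_T)  (length T + 1).
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        # A_t = delta_t + (gamma * lam) * A_{t+1}
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

With lam=0 this reduces to the one-step estimator of Equation (17); with lam=1 it reduces to the bootstrapped empirical return minus the baseline, matching Equation (18) up to the truncation at T.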
4 INTERPRETATION AS REWARD SHAPING

Let Φ : S → ℝ be an arbitrary scalar-valued function on state space, and define the transformed reward function r̃ by

r̃(s, a, s′) = r(s, a, s′) + γ Φ(s′) − Φ(s),    (20)
which in turn defines a transformed MDP. This transformation leaves the discounted advantage function A^{π,γ} unchanged for any policy π. To see this, consider the discounted sum of rewards of a trajectory starting with state s_t:
Σ_{l=0}^{∞} γ^l r̃(s_{t+l}, a_{t+l}, s_{t+l+1}) = Σ_{l=0}^{∞} γ^l r(s_{t+l}, a_{t+l}, s_{t+l+1}) − Φ(s_t).    (21)
Letting Q̃^{π,γ}, Ṽ^{π,γ}, Ã^{π,γ} be the value and advantage functions of the transformed MDP, one obtains from the definitions of these quantities that
Q̃^{π,γ}(s, a) = Q^{π,γ}(s, a) − Φ(s)    (22)
Ṽ^{π,γ}(s) = V^{π,γ}(s) − Φ(s)    (23)
Ã^{π,γ}(s, a) = ( Q^{π,γ}(s, a) − Φ(s) ) − ( V^{π,γ}(s) − Φ(s) ) = A^{π,γ}(s, a)    (24)
Note that if Φ happens to be the state-value function V^{π,γ} from the original MDP, then the transformed MDP has the interesting property that Ṽ^{π,γ}(s) is zero at every state.

Note that Ng et al. (1999) showed that the reward shaping transformation leaves the policy gradient and optimal policy unchanged when our objective is to maximize the discounted sum of rewards Σ_{t=0}^{∞} γ^t r(s_t, a_t, s_{t+1}). In contrast, this paper is concerned with maximizing the undiscounted sum of rewards, where the discount γ is used as a variance-reduction parameter.
Having reviewed the idea of reward shaping, let us consider how we could use it to get a policy gradient estimate. The most natural approach is to construct policy gradient estimators that use discounted sums of shaped rewards r̃. However, Equation (21) shows that we would simply obtain the discounted sum of the original MDP's rewards minus a baseline term. Next, let us consider using a "steeper" discount γλ, where 0 ≤ λ ≤ 1. It is easy to see that the shaped reward r̃ equals the Bellman residual term δ^V introduced in Section 3, where we set Φ = V. Letting Φ = V, we see that
Σ_{l=0}^{∞} (γλ)^l r̃(s_{t+l}, a_{t+l}, s_{t+l+1}) = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}^V = Â_t^{GAE(γ,λ)}.    (25)
Hence, by considering the γλ-discounted sum of shaped rewards, we exactly obtain the generalized advantage estimators from Section 3. As shown previously, λ = 1 gives an unbiased estimate of g^γ, whereas λ < 1 gives a biased estimate.
To further analyze the effect of this shaping transformation and the parameters γ and λ, it will be useful to introduce the notion of a response function χ, which we define as follows:
χ(l; s_t, a_t) = E[ r_{t+l} | s_t, a_t ] − E[ r_{t+l} | s_t ].    (26)
Note that A^{π,γ}(s, a) = Σ_{l=0}^{∞} γ^l χ(l; s, a); hence the response function decomposes the advantage function across timesteps. The response function lets us quantify the temporal credit assignment problem: long-range dependencies between actions and rewards correspond to nonzero values of the response function for l ≫ 0.
Next, let us revisit the discount factor γ and the approximation we are making by using A^{π,γ} rather than A^{π,1}. The discounted policy gradient estimator from Equation (6) has a sum of terms of the form
∇_θ log π_θ(a_t | s_t) A^{π,γ}(s_t, a_t) = ∇_θ log π_θ(a_t | s_t) Σ_{l=0}^{∞} γ^l χ(l; s_t, a_t).    (27)

Using a discount γ < 1 corresponds to dropping the terms with l ≫ 1/(1−γ). Thus, the error introduced by this approximation will be small if χ decays rapidly as l increases, i.e., if the effect of an action on rewards is "forgotten" after roughly 1/(1−γ) timesteps. If the shaped reward r̃ were obtained using Φ = V^{π,γ}, the response function of the transformed MDP would be nonzero only at l = 0, so the shaping transformation would turn a temporally extended response into an immediate one.
Given that V^{π,γ} completely reduces the temporal spread of the response function, we can hope that a good approximation V ≈ V^{π,γ} partially reduces it. This observation suggests an interpretation of Equation (16): reshape the rewards using V to shrink the temporal extent of the response function, and then introduce a "steeper" discount γλ to cut off the noise arising from long delays, i.e., ignore terms ∇_θ log π_θ(a_t | s_t) δ_{t+l}^V where l ≫ 1/(1−γλ).
5 VALUE FUNCTION ESTIMATION
A variety of different methods can be used to estimate the value function (see, e.g., Bertsekas (2012)). When using a nonlinear function approximator to represent the value function, the simplest approach is to solve a nonlinear regression problem:
minimize_φ   Σ_{n=1}^{N} ‖ V_φ(s_n) − V̂_n ‖²,    (28)
where V̂_t = Σ_{l=0}^{∞} γ^l r_{t+l} is the discounted sum of rewards, and n indexes over all timesteps in a batch of trajectories. This is sometimes called the Monte Carlo or TD(1) approach for estimating the value function (Sutton & Barto, 1998).²
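As a small illustration of Equation (28), the regression targets and the unconstrained objective can be computed as follows. This is a sketch rather than the paper's code, and it assumes a finite, terminated trajectory.

import numpy as np

def discounted_returns(rewards, gamma):
    # Regression targets \hat{V}_t = sum_{l >= 0} gamma^l r_{t+l} for a terminated trajectory.
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def value_regression_objective(predictions, targets):
    # Unconstrained objective of Equation (28): sum_n || V_phi(s_n) - \hat{V}_n ||^2
    predictions = np.asarray(predictions, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    return float(np.sum((predictions - targets) ** 2))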
For the experiments in this work, we used a trust region method to optimize the value function in each iteration of a batch optimization procedure. The trust region helps us to avoid overfitting to the most recent batch of data. To formulate the trust region problem, we first compute σ² = (1/N) Σ_{n=1}^{N} ‖ V_{φ_old}(s_n) − V̂_n ‖², where φ_old is the parameter vector before optimization. Then we solve the following constrained optimization problem:
minimize_φ   Σ_{n=1}^{N} ‖ V_φ(s_n) − V̂_n ‖²

subject to   (1/N) Σ_{n=1}^{N} ‖ V_φ(s_n) − V_{φ_old}(s_n) ‖² / (2σ²) ≤ ε.    (29)
This constraint is equivalent to constraining the average KL divergence between the previous value function and the new value function to be smaller than ε, where the value function is taken to parameterize a conditional Gaussian distribution with mean V_φ(s) and variance σ².
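To see the equivalence, recall that the KL divergence between two Gaussians with the same variance σ² reduces to a scaled squared difference of their means, so averaging over the batch recovers exactly the constraint in Equation (29):

D_KL( N(V_{φ_old}(s_n), σ²) ‖ N(V_φ(s_n), σ²) ) = ( V_φ(s_n) − V_{φ_old}(s_n) )² / (2σ²),

(1/N) Σ_{n=1}^{N} D_KL(·) ≤ ε   ⇔   (1/N) Σ_{n=1}^{N} ‖ V_φ(s_n) − V_{φ_old}(s_n) ‖² / (2σ²) ≤ ε.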
We compute an approximate solution to the trust region problem using the conjugate gradient algorithm (Wright & Nocedal, 1999). Specifically, we are solving the quadratic program
minimize_φ   g^T (φ − φ_old)

subject to   (1/2) (φ − φ_old)^T H (φ − φ_old) ≤ ε,    (30)
where g is the gradient of the objective, and H = (1/N) Σ_n j_n j_n^T, where j_n = ∇_φ V_φ(s_n). Note that H is the "Gauss-Newton" approximation of the Hessian of the objective, and it is (up to a σ² factor) the Fisher information matrix when interpreting the value function as a conditional probability distribution. Using matrix-vector products v ↦ Hv to implement the conjugate gradient algorithm, we compute a step direction s ≈ −H^{−1} g. Then we rescale s → αs such that (1/2)(αs)^T H (αs) = ε and take φ = φ_old + αs. This procedure is analogous to the procedure we use for updating the policy, which is described further in Section 6 and based on Schulman et al. (2015).
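The following sketch illustrates the Gauss-Newton / conjugate-gradient step of Equation (30) for the special case of a linear value function V_φ(s) = φ · f(s), so that each Jacobian j_n is simply the feature vector f(s_n). It is an illustrative reconstruction rather than the paper's implementation; the names `features`, `targets`, and the default `epsilon` are assumptions.

import numpy as np

def trust_region_value_step(features, values_old, targets, epsilon=0.01, cg_iters=10):
    # One approximate trust-region update for a linear value function V_phi(s) = phi . f(s).
    # features:   (N, d) matrix of state features f(s_n)        [illustrative setup]
    # values_old: (N,) predictions of the old parameters
    # targets:    (N,) discounted-return targets \hat{V}_n
    # Returns the parameter increment (phi - phi_old).
    N = features.shape[0]
    # Gradient of the mean squared error at phi_old.
    g = (2.0 / N) * features.T @ (values_old - targets)

    def hvp(v):
        # Gauss-Newton Hessian-vector product: H v = (1/N) sum_n j_n (j_n . v)
        return features.T @ (features @ v) / N

    # Conjugate gradient for s ~= -H^{-1} g, using only matrix-vector products.
    x = np.zeros_like(g)
    r = -g.copy()
    p = r.copy()
    rr = r @ r
    for _ in range(cg_iters):
        Hp = hvp(p)
        alpha = rr / (p @ Hp + 1e-10)
        x += alpha * p
        r -= alpha * Hp
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new

    # Rescale so that 0.5 * (a s)^T H (a s) = epsilon, then return the step a s.
    scale = np.sqrt(2.0 * epsilon / (x @ hvp(x) + 1e-10))
    return scale * x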
6 EXPERIMENTS

We designed a set of experiments to investigate the following questions:

1. What is the empirical effect of varying λ ∈ [0, 1] and γ ∈ [0, 1] when optimizing episodic total reward using generalized advantage estimation?

2. Can generalized advantage estimation, along with trust region algorithms for policy and value function optimization, be used to optimize large neural network policies for challenging control problems?
²Another natural choice is to compute target values with an estimator based on the TD(λ) backup (Bertsekas, 2012; Sutton & Barto, 1998), mirroring the expression we use for policy gradient estimation: V̂_t = V_{φ_old}(s_t) + Σ_{l=0}^{∞} (γλ)^l δ_{t+l}. While we experimented with this choice, we did not notice a difference in performance from the estimator in Equation (28).
6.1 POLICY OPTIMIZATION ALGORITHM
While generalized advantage estimation can be used along with a variety of different policy gradient methods, for these experiments, we performed the policy updates using trust region policy optimization (TRPO) (Schulman et al., 2015). TRPO updates the policy by approximately solving the following constrained optimization problem each iteration:
minimize_θ   L_{θ_old}(θ)

subject to   D̄_KL^{θ_old}(π_{θ_old}, π_θ) ≤ ε

where   L_{θ_old}(θ) = (1/N) Σ_{n=1}^{N} [ π_θ(a_n | s_n) / π_{θ_old}(a_n | s_n) ] Â_n,
        D̄_KL^{θ_old}(π_{θ_old}, π_θ) = (1/N) Σ_{n=1}^{N} D_KL( π_{θ_old}(· | s_n) ‖ π_θ(· | s_n) ).    (31)
As described in Schulman et al. (2015), we approximately solve this problem by linearizing the objective and quadraticizing the constraint, which yields a step in the direction θ − θ_old ∝ −F^{−1} g, where F is the average Fisher information matrix and g is a policy gradient estimate. This policy update yields the same step direction as the natural policy gradient (Kakade, 2001a) and natural actor-critic (Peters & Schaal, 2008); however, it uses a different stepsize determination scheme and numerical procedure for computing the step.
Since prior work (Schulman et al., 2015) compared TRPO to a variety of different policy optimization algorithms, we will not repeat these comparisons; rather, we will focus on varying the γ, λ parameters of the policy gradient estimator while keeping the underlying algorithm fixed.
For completeness, the whole algorithm for iteratively updating the policy and value function is given below.

Initialize policy parameters θ_0 and value function parameters φ_0.
for i = 0, 1, 2, ... do
    Simulate the current policy π_{θ_i} until N timesteps are obtained.
    Compute δ_t^V at all timesteps t ∈ {1, 2, ..., N}, using V = V_{φ_i}.
    Compute Â_t = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}^V at all timesteps.
    Compute θ_{i+1} with the TRPO update, Equation (31).
    Compute φ_{i+1} with Equation (30).
end for
Note that the policy update θ_i → θ_{i+1} is performed using the value function V_{φ_i} for advantage estimation, not V_{φ_{i+1}}. Additional bias would have been introduced if we updated the value function first. To see this, consider the extreme case where we overfit the value function, and the Bellman residual r_t + γ V(s_{t+1}) − V(s_t) becomes zero at all timesteps; the policy gradient estimate would be zero.
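To make the ordering concrete, here is a minimal sketch of one outer iteration. It is not the paper's code: `collect_rollouts`, `value_fn.predict`, `policy.trpo_update`, and `value_fn.fit` are hypothetical placeholders, and the helpers `gae_advantages` and `discounted_returns` are the ones sketched earlier. The point it illustrates is that advantages use the value function from before the policy update.

def gae_iteration(policy, value_fn, collect_rollouts, gamma, lam):
    # One iteration of the loop above. Advantages use the *current* value function V_{phi_i};
    # the value function is refit only after the policy update, as discussed in the text.
    trajectories = collect_rollouts(policy)                      # simulate pi_{theta_i} for N timesteps
    for traj in trajectories:
        values = value_fn.predict(traj["states"])                # V_{phi_i}(s_0), ..., V_{phi_i}(s_T)
        traj["advantages"] = gae_advantages(traj["rewards"], values, gamma, lam)
        traj["value_targets"] = discounted_returns(traj["rewards"], gamma)   # Equation (28) targets
    policy.trpo_update(trajectories)                             # theta_{i+1} via Equation (31)
    value_fn.fit(trajectories)                                   # phi_{i+1} via Equation (30)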
6.2 EXPERIMENTAL SETUP
We evaluated our approach on the classic cart-pole balancing problem, as well as several challenging 3D locomotion tasks: (1) bipedal locomotion; (2) quadrupedal locomotion; (3) dynamically standing up, for the biped, which starts off lying on its back. The models are shown in Figure 1.
6.2.1 ARCHITECTURE
We used the same neural network architecture for all of the 3D robot tasks, which was a feedforward network with three hidden layers, with 100, 50, and 25 tanh units respectively. The same architecture was used for the policy and value function. The final output layer had linear activation. The value function estimator used the same architecture but with only one scalar output. For the simpler cart-pole task, we used a linear policy, and a neural network with one 20-unit hidden layer as the value function.
Figure 1: Top figures: robot models used for 3D locomotion. Bottom figures: a sequence of frames from the learned gaits. Videos are available at https://sites.google.com/site/gaepapersupp.
6.2.2 TASK DETAILS
For the cart-pole balancing task, we collected 20 trajectories per batch, with a maximum length of 1000 timesteps, using the physical parameters from Barto et al. (1983).
The simulated robot tasks were simulated using the MuJoCo physics engine (Todorov et al., 2012). The humanoid model has 33 state dimensions and 10 actuated degrees of freedom, while the quadruped model has 29 state dimensions and 8 actuated degrees of freedom. The initial state for these tasks consisted of a uniform distribution centered on a reference configuration. We used 50,000 timesteps per batch for bipedal locomotion, and 200,000 timesteps per batch for quadrupedal locomotion and bipedal standing. Each episode was terminated after 2000 timesteps if the robot had not reached a terminal state beforehand. The timestep was 0.01 seconds.
The reward functions are provided in the table below.

Task                      Reward
3D biped locomotion       v_fwd − 10^{−5} ‖u‖² − 10^{−5} ‖f_impact‖² + 0.2
Quadruped locomotion      v_fwd − 10^{−6} ‖u‖² − 10^{−3} ‖f_impact‖² + 0.05
Biped getting up          −(h_head − 1.5)² − 10^{−5} ‖u‖²

Here, v_fwd := forward velocity, u := vector of joint torques, f_impact := impact forces, h_head := height of the head.
In the locomotion tasks, the episode is terminated if the center of mass of the actor falls below a predefined height: 0.8 m for the biped, and 0.2 m for the quadruped. The constant offset in the reward function encourages longer episodes; otherwise the quadratic reward terms might lead to a policy that ends the episodes as quickly as possible.
6.3 EXPERIMENTAL RESULTS
All results are presented in terms of the cost, which is defined as negative reward and is minimized. Videos of the learned policies are available at https://sites.google.com/site/gaepapersupp. In plots, "No VF" means that we used a time-dependent baseline that did not depend on the state, rather than an estimate of the state value function. The time-dependent baseline was computed by averaging the return at each timestep over the trajectories in the batch.
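For reference, the "No VF" baseline described above can be computed as follows. This is an illustrative sketch and the array layout is an assumption: the return at each timestep is averaged over whichever trajectories in the batch reach that timestep.

import numpy as np

def time_dependent_baseline(batch_returns):
    # batch_returns: list of 1-D arrays, one per trajectory, of returns at each timestep.
    # Returns b_t = mean over trajectories of the return at timestep t (no state dependence).
    max_len = max(len(r) for r in batch_returns)
    sums = np.zeros(max_len)
    counts = np.zeros(max_len)
    for r in batch_returns:
        r = np.asarray(r, dtype=np.float64)
        sums[: len(r)] += r
        counts[: len(r)] += 1
    return sums / np.maximum(counts, 1)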
6.3.1 CART-POLE
The results are averaged across 21 experiments with different random seeds. Results are shown in Figure 2, and indicate that the best results are obtained at intermediate values of the parameters: γ ∈ [0.96, 0.99] and λ ∈ [0.92, 0.99].
Figure 2: Left: learning curves for the cart-pole task, using generalized advantage estimation with varying values of λ at γ = 0.99. The fastest policy improvement is obtained by intermediate values of λ in the range [0.92, 0.98]. Right: performance after 20 iterations of policy optimization, as γ and λ are varied. White means higher reward. The best results are obtained at intermediate values of both parameters.
Figure 3: Left: learning curves for 3D bipedal locomotion, averaged across nine runs of the algorithm. Right: learning curves for 3D quadrupedal locomotion, averaged across five runs.
6.3.2 3D BIPEDAL LOCOMOTION
Each trial took about 2 hours to run on a 16-core machine, where the simulation rollouts were parallelized, as were the function, gradient, and matrix-vector-product evaluations used when optimizing the policy and value function. Here, the results are averaged across 9 trials with different random seeds. The best performance is again obtained using intermediate values of γ ∈ [0.99, 0.995], λ ∈ [0.96, 0.99]. The result after 1000 iterations is a fast, smooth gait that is effectively completely stable. We can compute how much "real time" was used for this learning process: 0.01 seconds per timestep × 50,000 timesteps per batch × 1000 batches / (3600 · 24 seconds per day) ≈ 5.8 days. Hence, it is plausible that this algorithm could be run on a real robot, or multiple real robots learning in parallel, if there were a way to reset the state of the robot and ensure that it doesn't damage itself.
6.3.3 OTHER 3D ROBOT TASKS
The other two motor behaviors considered are quadrupedal locomotion and getting up off the ground for the 3D biped. Again, we performed 5 trials per experimental condition, with different random seeds (and initializations). The experiments took about 4 hours per trial on a 32-core machine. We performed a more limited comparison on these domains (due to the substantial computational resources required to run these experiments), fixing γ = 0.995 but varying λ ∈ {1, 0.96}, as well as an experimental condition with no value function. For quadrupedal locomotion, the best results are obtained using a value function with λ = 0.96 (Section 6.3.2). For 3D standing, the value function always helped, but the results are roughly the same for λ = 0.96 and λ = 1.