File name: Noisy Networks for Exploration.pdf
Development tool:
File size: 5 MB
Downloads: 0
Upload time: 2019-09-02
Description: The original paper on Noisy Networks for Exploration (NoisyNet DQN), suitable for beginners who want to learn about this approach to exploration in deep reinforcement learning.

Published as a conference paper at ICLR 2018
The quality of a policy π is assessed by the action-value function Q defined as

Q^π(x, a) = E_π[ Σ_{t≥0} γ^t R(x_t, a_t) ]    (1)

where E_π is the expectation over the distribution of the admissible trajectories (x_0, a_0, x_1, a_1, ...) obtained by executing the policy π starting from x_0 = x and a_0 = a. Therefore, the quantity Q^π(x, a) represents the expected γ-discounted cumulative reward collected by executing the policy π starting from x and a. A policy is optimal if no other policy yields a higher return. The action-value function of the optimal policy is Q*(x, a) = max_π Q^π(x, a).

The value function V^π for a policy π is defined as V^π(x) = E_{a∼π(·|x)}[Q^π(x, a)], and represents the expected γ-discounted return collected by executing the policy π starting from state x.
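To make the γ-discounted sum inside Eq. (1) concrete, here is a tiny illustrative Python snippet; the reward values are made up for the example and are not from the paper:

```python
# Gamma-discounted return of a short reward sequence, as summed inside Eq. (1).
def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.9**2 * 2.0 = 2.62
```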
2.2 DEEP REINFORCEMENT LEARNING
Deep reinforcement learning uses deep neural networks as function approximators for RL methods. Deep Q-Networks (DQN) (Mnih et al., 2015), the Dueling architecture (Wang et al., 2016), Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016), Trust Region Policy Optimisation (Schulman et al., 2015), Deep Deterministic Policy Gradient (Lillicrap et al., 2015) and distributional RL (C51) (Bellemare et al., 2017) are examples of such algorithms. They frame the RL problem as the minimisation of a loss function L(θ), where θ represents the parameters of the network. In our experiments we shall consider the DQN, Dueling and A3C algorithms.
DQN (Mnih et al., 2015) uses a neural network as an approximator for the action-value function of the optimal policy Q*(x, a). DQN's estimate of the optimal action-value function, Q(x, a; θ), is found by minimising the following loss with respect to the neural network parameters θ:

L(θ) = E_{(x,a,r,y)∼D}[ ( r + γ max_{b∈A} Q(y, b; θ⁻) − Q(x, a; θ) )² ]    (2)

where D is a distribution over transitions e = (x, a, r = R(x, a), y ∼ P(·|x, a)) drawn from a replay buffer of previously observed transitions. Here θ⁻ represents the parameters of a fixed and separate target network which is updated (θ⁻ ← θ) regularly to stabilise the learning. An ε-greedy policy is used to pick actions greedily according to the action-value function Q or, with probability ε, a random action is taken.
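As an illustration, a minimal PyTorch sketch of the loss in Eq. (2) might look as follows; the names (`q_net`, `target_net`, `done`) and batching conventions are illustrative assumptions, not part of the paper:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, x, a, r, y, done, gamma=0.99):
    # Q(x, a; theta): value of the action actually taken (a is a LongTensor of action indices).
    q_xa = q_net(x).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # r + gamma * max_b Q(y, b; theta^-), bootstrapped from the frozen target network.
        target = r + gamma * (1.0 - done) * target_net(y).max(dim=1).values
    return F.mse_loss(q_xa, target)
```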
The Dueling DQN (Wang et al., 2016) is an extension of the DQN architecture. The main difference is the use of a Dueling network architecture as opposed to the Q network in DQN. The Dueling network estimates the action-value function using two parallel sub-networks, the value and advantage sub-networks, which share a convolutional encoder. Let θ_conv, θ_V, and θ_A be, respectively, the parameters of the convolutional encoder f, of the value network V, and of the advantage network A, and let θ = {θ_conv, θ_V, θ_A} be their concatenation. The outputs of these two networks are combined as follows for every (x, a) ∈ X × A:

Q(x, a; θ) = V(f(x; θ_conv); θ_V) + A(f(x; θ_conv), a; θ_A) − (1/N_actions) Σ_b A(f(x; θ_conv), b; θ_A)    (3)

The Dueling algorithm then makes use of the double-DQN update rule (van Hasselt et al., 2016) to optimise θ:

L(θ) = E_{(x,a,r,y)∼D}[ ( r + γ Q(y, b*(y); θ⁻) − Q(x, a; θ) )² ]    (4)
s.t. b*(y) = arg max_{b∈A} Q(y, b; θ)    (5)

where the definition of the distribution D and the target network parameters θ⁻ are identical to those of DQN.
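A minimal sketch of the combination in Eq. (3), assuming the shared convolutional features f(x; θ_conv) have already been computed; the layer names are placeholders:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, feature_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(feature_dim, 1)               # V(f(x); theta_V)
        self.advantage = nn.Linear(feature_dim, n_actions)   # A(f(x), a; theta_A)

    def forward(self, features):
        v = self.value(features)                 # shape (batch, 1)
        a = self.advantage(features)             # shape (batch, n_actions)
        # Subtract the mean advantage, as in Eq. (3), so V and A are identifiable.
        return v + a - a.mean(dim=1, keepdim=True)
```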
In contrast to DQN and Dueling, A3C (Mnih et al., 2016) is a policy gradient algorithm. A3C's network directly learns a policy π and a value function V of its policy. The gradient of the loss on the A3C policy at step t, for the roll-out (x_{t+i}, a_{t+i} ∼ π(·|x_{t+i}; θ), r_{t+i})_{i=0}^{k}, is:

∇_θ L^π(θ) = −E[ Σ_{i=0}^{k} ∇_θ log π(a_{t+i}|x_{t+i}; θ) A(x_{t+i}, a_{t+i}; θ) + β Σ_{i=0}^{k} ∇_θ H(π(·|x_{t+i}; θ)) ]    (6)

where H(π(·|x_t; θ)) denotes the entropy of the policy π and β is a hyperparameter that trades off between optimising the advantage function and the entropy of the policy. The advantage function A(x_{t+i}, a_{t+i}; θ) is the difference between observed returns and estimates of the return produced by A3C's value network: A(x_{t+i}, a_{t+i}; θ) = Σ_{j=i}^{k−1} γ^{j−i} r_{t+j} + γ^{k−i} V(x_{t+k}; θ) − V(x_{t+i}; θ), with r_{t+j} being the reward at step t + j and V(x; θ) being the agent's estimate of the value function of state x.

The parameters of the value function are found to match on-policy returns; namely we have

L^V(θ) = E[ Σ_{i=0}^{k} ( Q̂_i − V(x_{t+i}; θ) )² ]    (7)

where Q̂_i is the return obtained by executing policy π starting in state x_{t+i}. In practice, and as in Mnih et al. (2016), we estimate Q̂_i as Q̂_i = Σ_{j=i}^{k−1} γ^{j−i} r_{t+j} + γ^{k−i} V(x_{t+k}; θ), where r_{t+j} are the rewards observed by the agent, and x_{t+k} is the k-th state observed when starting from observed state x_t. The overall A3C loss is then L(θ) = L^π(θ) + λ L^V(θ), where λ balances optimising the policy loss relative to the baseline value function loss.
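For concreteness, a minimal sketch of the k-step returns and the combined loss L(θ) = L^π(θ) + λ L^V(θ) could look as follows; the tensor names and hyperparameter values are illustrative assumptions:

```python
import torch

def a3c_loss(log_probs, entropies, values, rewards, bootstrap_value,
             gamma=0.99, beta=0.01, lam=0.5):
    # log_probs, entropies, values: tensors of shape (k,) collected over one roll-out;
    # rewards: list of k floats r_{t+i}; bootstrap_value: float estimate of V(x_{t+k}).
    returns, R = [], bootstrap_value
    for r in reversed(rewards):              # Q_i = sum_j gamma^{j-i} r_{t+j} + gamma^{k-i} V(x_{t+k})
        R = r + gamma * R
        returns.append(R)
    returns = torch.tensor(list(reversed(returns)))
    advantages = returns - values            # A(x_{t+i}, a_{t+i}; theta)
    policy_loss = -(log_probs * advantages.detach()).sum()   # policy term of Eq. (6)
    value_loss = advantages.pow(2).sum()                      # L^V(theta), Eq. (7)
    return policy_loss - beta * entropies.sum() + lam * value_loss
```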
3 NOISYNETS FOR REINFORCEMENT LEARNING
NoisyNets are neural networks whose weights and biases are perturbed by a parametric function of the noise. These parameters are adapted with gradient descent. More precisely, let y = f_θ(x) be a neural network parameterised by the vector of noisy parameters θ, which takes the input x and outputs y. We represent the noisy parameters θ as θ := μ + Σ ⊙ ε, where ζ := (μ, Σ) is a set of vectors of learnable parameters, ε is a vector of zero-mean noise with fixed statistics and ⊙ represents element-wise multiplication. The usual loss of the neural network is wrapped by an expectation over the noise ε: L̄(ζ) := E[L(θ)]. Optimisation now occurs with respect to the set of parameters ζ.
Consider a linear layer of a neural network with p inputs and q outputs, represented by

y = wx + b    (8)

where x ∈ R^p is the layer input, w ∈ R^{q×p} the weight matrix, and b ∈ R^q the bias. The corresponding noisy linear layer is defined as:

y := (μ^w + σ^w ⊙ ε^w) x + μ^b + σ^b ⊙ ε^b    (9)

where μ^w + σ^w ⊙ ε^w and μ^b + σ^b ⊙ ε^b replace w and b in Eq. (8), respectively. The parameters μ^w ∈ R^{q×p}, μ^b ∈ R^q, σ^w ∈ R^{q×p} and σ^b ∈ R^q are learnable, whereas ε^w ∈ R^{q×p} and ε^b ∈ R^q are noise random variables (the specific choices of this distribution are described below). We provide a graphical representation of a noisy linear layer in Fig. 4 (see Appendix B).
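A minimal PyTorch sketch of the noisy linear layer in Eq. (9) with independent Gaussian noise might look as follows; the class and the initialisation constants are an illustrative reading of the paper (see Sec. 3.2), not the authors' code:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__()
        bound = math.sqrt(3.0 / in_features)
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-bound, bound))
        self.mu_b = nn.Parameter(torch.empty(out_features).uniform_(-bound, bound))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma_init))

    def forward(self, x):
        # Sample fresh noise on every forward pass: theta = mu + sigma * eps.
        eps_w = torch.randn_like(self.sigma_w)
        eps_b = torch.randn_like(self.sigma_b)
        return F.linear(x, self.mu_w + self.sigma_w * eps_w,
                        self.mu_b + self.sigma_b * eps_b)
```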
We now turn to explicit instances of the noise distributions for linear layers in a noisy network. We explore two options: Independent Gaussian noise, which uses an independent Gaussian noise entry per weight, and Factorised Gaussian noise, which uses an independent noise per output and another independent noise per input. The main reason to use factorised Gaussian noise is to reduce the compute time of random number generation in our algorithms. This computational overhead is especially prohibitive in the case of single-thread agents such as DQN and Dueling. For this reason we use factorised noise for DQN and Dueling and independent noise for the distributed A3C, for which the compute time is not a major concern.

(a) Independent Gaussian noise: the noise applied to each weight and bias is independent, where each entry ε^w_{i,j} (respectively each entry ε^b_j) of the random matrix ε^w (respectively of the random vector ε^b) is drawn from a unit Gaussian distribution. This means that for each noisy linear layer, there are pq + q noise variables (for p inputs to the layer and q outputs).
(b) Factorised Gaussian noise: by factorising ε^w_{i,j}, we can use p unit Gaussian variables ε_i for the noise of the inputs and q unit Gaussian variables ε_j for the noise of the outputs (thus p + q unit Gaussian variables in total). Each ε^w_{i,j} and ε^b_j can then be written as

ε^w_{i,j} = f(ε_i) f(ε_j)    (10)
ε^b_j = f(ε_j)    (11)

where f is a real-valued function. In our experiments we used f(x) = sgn(x)√|x|. Note that for the bias in Eq. (11) we could have set f(x) = x, but we decided to keep the same output noise for weights and biases.
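A minimal sketch of generating factorised noise as in Eqs. (10)-(11); the helper names are illustrative:

```python
import torch

def f(x):
    # f(x) = sgn(x) * sqrt(|x|), applied element-wise.
    return x.sign() * x.abs().sqrt()

def factorised_noise(p, q):
    eps_in = f(torch.randn(p))                # one noise variable per input
    eps_out = f(torch.randn(q))               # one noise variable per output
    eps_w = torch.outer(eps_out, eps_in)      # eps^w_{i,j} = f(eps_j) f(eps_i), shape (q, p)
    eps_b = eps_out                           # eps^b_j = f(eps_j)
    return eps_w, eps_b
```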
Since the loss of a noisy network, L̄(ζ) = E[L(θ)], is an expectation over the noise, the gradients are straightforward to obtain:

∇L̄(ζ) = ∇E[L(θ)] = E[ ∇_{μ,Σ} L(μ + Σ ⊙ ε) ]    (12)

We use a Monte Carlo approximation to the above gradients, taking a single sample ξ at each step of optimisation:

∇L̄(ζ) ≈ ∇_{μ,Σ} L(μ + Σ ⊙ ξ)    (13)
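The single-sample gradient of Eq. (13) is ordinary backpropagation through μ and Σ: the gradient with respect to σ equals the gradient with respect to the sampled parameter, scaled element-wise by the noise. A small self-contained check, with an arbitrary stand-in for L(θ):

```python
import torch

mu = torch.randn(3, requires_grad=True)
sigma = torch.full((3,), 0.5, requires_grad=True)
eps = torch.randn(3)

theta = mu + sigma * eps          # theta = mu + sigma * eps (one noise sample)
loss = (theta ** 2).sum()         # stand-in for L(theta); dL/dtheta = 2 * theta
loss.backward()

assert torch.allclose(mu.grad, 2 * theta.detach())           # dL/dmu = dL/dtheta
assert torch.allclose(sigma.grad, 2 * theta.detach() * eps)  # dL/dsigma = dL/dtheta * eps
```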
3.1 DEEP REINFORCEMENT LEARNING WITH NOISYNETS
We now turn to our application of noisy networks to exploration in deep reinforcement learning. Noise drives exploration in many methods for reinforcement learning, providing a source of stochasticity external to the agent and the RL task at hand. Either the scale of this noise is manually tuned across a wide range of tasks (as is the practice in general purpose agents such as DQN or A3C) or it can be manually scaled per task. Here we propose automatically tuning the level of noise added to an agent for exploration, using the noisy networks training to drive down (or up) the level of noise injected into the parameters of a neural network, as needed.

A noisy network agent samples a new set of parameters after every step of optimisation. Between optimisation steps, the agent acts according to a fixed set of parameters (weights and biases). This ensures that the agent always acts according to parameters that are drawn from the current noise distribution.
Deep Q-Networks (DQN) and Dueling. We apply the following modifications to both DQN and Dueling: first, ε-greedy is no longer used; instead the policy greedily optimises the (randomised) action-value function. Secondly, the fully connected layers of the value network are parameterised as a noisy network, where the parameters are drawn from the noisy network parameter distribution after every replay step. We used factorised Gaussian noise as explained in (b) from Sec. 3. For replay, the current noisy network parameter sample is held fixed across the batch. Since DQN and Dueling take one step of optimisation for every action step, the noisy network parameters are re-sampled before every action. We call the new adaptations of DQN and Dueling, NoisyNet-DQN and NoisyNet-Dueling, respectively.
We now provide the details of the loss function that our variant of DQN is minimising. When replacing the linear layers by noisy layers in the network (respectively in the target network), the parameterised action-value function Q(x, a, ε; ζ) (respectively Q(y, b, ε′; ζ⁻)) can be seen as a random variable and the DQN loss becomes the NoisyNet-DQN loss:

L̄(ζ) = E[ E_{(x,a,r,y)∼D}[ ( r + γ max_{b∈A} Q(y, b, ε′; ζ⁻) − Q(x, a, ε; ζ) )² ] ]    (14)

where the outer expectation is with respect to the distribution of the noise variables ε for the noisy value function Q(x, a, ε; ζ) and the noise variables ε′ for the noisy target value function Q(y, b, ε′; ζ⁻). Computing an unbiased estimate of the loss is straightforward, as we only need to compute, for each transition in the replay buffer, one instance of the target network and one instance of the online network. We generate these noises independently to avoid bias due to correlation between the noise in the target network and the online network. Concerning the action choice, we generate another independent sample ε″ for the online network and we act greedily with respect to the corresponding output action-value function.
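A minimal sketch of the action choice described above: there is no ε-greedy step, and the exploration comes entirely from drawing a fresh noise sample inside the noisy layers before acting. It assumes a `q_net` built from noisy layers that resample noise on each forward pass, as in the earlier NoisyLinear sketch:

```python
import torch

@torch.no_grad()
def select_action(q_net, state):
    # The forward pass draws a fresh noise sample eps'', then we act greedily.
    q_values = q_net(state.unsqueeze(0))
    return int(q_values.argmax(dim=1).item())
```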
Similarly, the loss function for NoisyNet-Dueling is defined as:

L̄(ζ) = E[ E_{(x,a,r,y)∼D}[ ( r + γ Q(y, b*(y), ε′; ζ⁻) − Q(x, a, ε; ζ) )² ] ]    (15)
s.t. b*(y) = arg max_{b∈A} Q(y, b, ε″; ζ)    (16)

Both algorithms are provided in Appendix C.1.
Asynchronous Advantage Actor Critic (A3C). A3C is modified in a similar fashion to DQN: firstly, the entropy bonus of the policy loss is removed. Secondly, the fully connected layers of the network are parameterised as a noisy network, using independent Gaussian noise as explained in (a) from Sec. 3. In A3C, there is no explicit exploratory action selection scheme (such as ε-greedy); the chosen action is always drawn from the current policy. For this reason, an entropy bonus on the policy loss is often added to discourage updates leading to deterministic policies. However, when adding noisy weights to the network, sampling these parameters corresponds to choosing a different current policy, which naturally favours exploration. As a consequence of direct exploration in the policy space, the artificial entropy loss on the policy can thus be omitted. New parameters of the policy network are sampled after each step of optimisation, and since A3C uses n-step returns, optimisation occurs every n steps. We call this modification of A3C, NoisyNet-A3C.
Indeed, when replacing the linear layers by noisy linear layers (the parameters of the noisy network are now denoted ζ), we obtain the following estimate of the return via a roll-out of size k:

Q̂_i = Σ_{j=i}^{k−1} γ^{j−i} r_{t+j} + γ^{k−i} V(x_{t+k}; ζ)    (17)

As A3C is an on-policy algorithm, the gradients are unbiased when the noise of the network is consistent for the whole roll-out. Consistency among the action-value functions Q̂_i is ensured by letting the noise be the same throughout each roll-out, i.e., ∀i, ε_i = ε. Additional details are provided in Appendix A and the algorithm is given in Appendix C.2.
3.2 INITIALISATION OF NOISY NETWORKS
In the case of an unfactorised noisy network, the parameters μ and σ are initialised as follows. Each element μ_{i,j} is sampled from an independent uniform distribution U[−√(3/p), +√(3/p)], where p is the number of inputs to the corresponding linear layer, and each element σ_{i,j} is simply set to 0.017 for all parameters. This particular initialisation was chosen because similar values worked well for the supervised learning tasks described in Fortunato et al. (2017), where the initialisation of the variances of the posteriors and the variances of the prior are related. We have not tuned this parameter, but we believe different values on the same scale should provide similar results.

For factorised noisy networks, each element μ_{i,j} was initialised by a sample from an independent uniform distribution U[−1/√p, +1/√p] and each element σ_{i,j} was initialised to the constant σ_0/√p. The hyperparameter σ_0 is set to 0.5.
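A minimal sketch of the factorised-noise initialisation described above, assuming the layer stores its parameters as in the earlier NoisyLinear sketch; this is an illustrative helper, not the authors' code:

```python
import math
import torch
import torch.nn as nn

def init_factorised_noisy_params(in_features, out_features, sigma_0=0.5):
    # mu ~ U[-1/sqrt(p), 1/sqrt(p)]; sigma = sigma_0 / sqrt(p), with p = in_features.
    bound = 1.0 / math.sqrt(in_features)
    mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-bound, bound))
    mu_b = nn.Parameter(torch.empty(out_features).uniform_(-bound, bound))
    sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma_0 * bound))
    sigma_b = nn.Parameter(torch.full((out_features,), sigma_0 * bound))
    return mu_w, mu_b, sigma_w, sigma_b
```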
4 RESULTS
We evaluated the performance of noisy network agents on 57 Atari games (Bellemare et al., 2015) and compared to baselines that, without noisy networks, rely upon the original exploration methods (ε-greedy and entropy bonus).
4.1 TRAINING DETAILS AND PERFORMANCE
We used the random start no-ops scheme for training and evaluation as described in the original DQN paper (Mnih et al., 2015). The mode of evaluation is identical to that of Mnih et al. (2016), where randomised restarts of the games are used for evaluation after training has happened. The raw average scores of the agents are evaluated during training, every 1M frames in the environment, by suspending learning and evaluating the latest agent for 500K frames. Episodes are truncated at 108K frames (or 30 minutes of simulated play) (van Hasselt et al., 2016).
[Figure 1: Comparison of each NoisyNet agent versus its baseline according to Eq. (19); the maximum score is truncated at 250%. (a) Percentage improvement of NoisyNet-DQN over DQN (Mnih et al., 2015); (b) percentage improvement of NoisyNet-Dueling over Dueling (Wang et al., 2016); (c) percentage improvement of NoisyNet-A3C over A3C (Mnih et al., 2016).]
We consider three baseline agents: DQN (Mnih et al., 2015), the duel clip variant of the Dueling algorithm (Wang et al., 2016) and A3C (Mnih et al., 2016). The DQN and A3C agents were trained for 200M and 320M frames, respectively. In each case, we used the neural network architecture from the corresponding original papers for both the baseline and the NoisyNet variant. For the NoisyNet variants we used the same hyperparameters as in the respective original paper for the baseline.
We compared the absolute performance of agents using the human normalised score:

100 × (Score_Agent − Score_Random) / (Score_Human − Score_Random)    (18)

where the human and random scores are the same as those in Wang et al. (2016). Note that the human normalised score is zero for a random agent and 100 for human-level performance. Per-game maximum scores are computed by taking the maximum raw scores of the agent and then averaging over three seeds. However, for computing the human normalised scores in Figure 2, the raw scores are evaluated every 1M frames and averaged over three seeds. The overall agent performance is measured by both mean and median of the human normalised score across all 57 Atari games.
The aggregated results across all 57 Atari games are reported in Table 1, while the individual scores for each game are in Table 3 of Appendix E. The median human normalised score is improved in all agents by using NoisyNet, adding at least 18 (in the case of A3C) and at most 48 (in the case of DQN) percentage points to the median human normalised score. The mean human normalised score is also significantly improved for all agents. Interestingly, the Dueling case, which relies on multiple modifications of DQN, demonstrates that NoisyNet is orthogonal to several other improvements made to DQN.
            Baseline          NoisyNet          Improvement
            Mean   Median     Mean   Median     (on median)
DQN         319    83         379    123        48%
Dueling     524    132        633    172        30%
A3C         293    80         347    94         18%

Table 1: Comparison between the baseline DQN, Dueling and A3C and their NoisyNet versions in terms of mean and median human-normalised scores defined in Eq. (18). The last column reports the percentage improvement over the baseline in terms of the median human-normalised score.
We also compared the relative performance of NoisyNet agents to the respective baseline agent without noisy networks:

100 × (Score_NoisyNet − Score_Baseline) / (max(Score_Human, Score_Baseline) − Score_Random)    (19)
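For reference, the two scores in Eqs. (18) and (19) can be computed as follows (plain Python; the argument values are placeholders supplied by the caller):

```python
def human_normalised(score_agent, score_random, score_human):
    # Eq. (18): 0 for a random agent, 100 for human-level performance.
    return 100.0 * (score_agent - score_random) / (score_human - score_random)

def relative_to_baseline(score_noisy, score_baseline, score_human, score_random):
    # Eq. (19): improvement of the NoisyNet agent over its baseline.
    return 100.0 * (score_noisy - score_baseline) / (
        max(score_human, score_baseline) - score_random)
```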
As before, the per-game score is computed by taking the maximum performance for each game and then averaging over three seeds. The relative human normalised scores are shown in Figure 1. As can be seen, the performance of NoisyNet agents (DQN, Dueling and A3C) is better for the majority of games relative to the corresponding baseline, and in some cases by a considerable margin. Also, as is evident from the learning curves of Fig. 2, NoisyNet agents produce superior performance compared to their corresponding baselines throughout the learning process. This improvement is especially significant in the case of NoisyNet-DQN and NoisyNet-Dueling. In some games, NoisyNet agents provide an order of magnitude improvement on the performance of the vanilla agent, as can be seen in Table 3 of Appendix E, which gives a detailed breakdown of individual game scores, and in the learning-curve plots of Figs. 6-8 for DQN, Dueling and A3C, respectively. We also ran some experiments evaluating the performance of NoisyNet-A3C with factorised noise. We report the corresponding learning curves and scores in Appendix D (see the figure and Table 2 there). This result shows that using factorised noise does not lead to any significant decrease in the performance of A3C. On the contrary, it seems to have positive effects in terms of improving the median score as well as speeding up the learning process.
[Figure 2: Comparison of the learning curves of each NoisyNet agent versus its baseline according to the median human normalised score over games, plotted against millions of environment frames.]
4.2 ANALYSIS OF LEARNING IN NOISY LAYERS
In this subsection, we try to provide some insight into how noisy networks affect the learning process and the exploratory behaviour of the agent. In particular, we focus on analysing the evolution of the noise weights σ^w and σ^b throughout the learning process. We first note that, as L̄(ζ) is a positive and continuous function of ζ, there always exists a deterministic optimiser for the loss L̄(ζ) (defined in Eq. (14)). Therefore, one may expect that, to obtain the deterministic optimal solution, the neural network may learn to discard the noise entries by eventually pushing σ^w and σ^b towards 0.

To test this hypothesis we track the changes in σ^w throughout the learning process. Let σ_i denote the i-th weight of a noisy layer. We then define Σ̄, the mean absolute value of the σ_i of a noisy layer, as:

Σ̄ = (1/N_weights) Σ_i |σ_i|    (20)
Intuitively speaking, Σ̄ provides some measure of the stochasticity of the noisy layers. We report the learning curves of the average of Σ̄ across 3 seeds in Fig. 3 for a selection of Atari games in the NoisyNet-DQN agent. We observe that Σ̄ of the last layer of the network decreases as the learning proceeds in all cases, whereas in the case of the penultimate layer this only happens for 2 games out of 5 (Pong and Beam Rider) and in the remaining 3 games Σ̄ in fact increases. This shows that in the case of NoisyNet-DQN the agent does not necessarily evolve towards a deterministic solution, as one might have expected. Another interesting observation is that the way Σ̄ evolves differs significantly from one game to another and, in some cases, from one seed to another, as is evident from the error bars. This suggests that NoisyNet produces a problem-specific exploration strategy, as opposed to the fixed exploration strategy used in standard DQN.
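A minimal sketch of computing Σ̄ from Eq. (20) for a layer that exposes its σ^w parameters as in the earlier NoisyLinear sketch (an illustrative assumption):

```python
import torch

@torch.no_grad()
def sigma_bar(noisy_layer):
    # Mean absolute value of the sigma^w entries of one noisy layer.
    return noisy_layer.sigma_w.abs().mean().item()
```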
[Figure 3: Comparison of the learning curves of the average noise parameter Σ̄ across five Atari games (including Beam Rider, Breakout, Pong and Space Invaders) in NoisyNet-DQN, shown separately for the penultimate layer and the last layer of the network. The results are averaged across 3 seeds and error bars (± one standard deviation) are plotted; the x-axis is in millions of frames.]
5 CONCLUSION
We have presented a general method for exploration in deep reinforcement learning that shows significant performance improvements across many Atari games in three different agent architectures. In particular, we observe that in games such as Beam Rider, Asteroids and Freeway, where the standard DQN, Dueling and A3C perform poorly compared with the human player, NoisyNet-DQN, NoisyNet-Dueling and NoisyNet-A3C achieve superhuman performance, respectively. Although the improvements in performance might also come from the optimisation aspect, since the cost functions are modified, the uncertainty in the parameters of the networks introduced by NoisyNet is the only exploration mechanism of the method. Having weights with greater uncertainty introduces more variability into the decisions made by the policy, which has potential for exploratory actions, but further analysis needs to be done in order to disentangle the exploration and optimisation effects.

Another advantage of NoisyNet is that the amount of noise injected in the network is tuned automatically by the RL algorithm. This alleviates the need for any hyperparameter tuning (required with standard entropy bonus and ε-greedy types of exploration). This is also in contrast to many other methods that add intrinsic motivation signals that may destabilise learning or change the optimal policy. Another interesting feature of the NoisyNet approach is that the degree of exploration is contextual and varies from state to state based upon per-weight variances. While more gradients are needed, the gradients on the mean and variance parameters are related to one another by a computationally efficient affine function, thus the computational overhead is marginal. Automatic differentiation makes implementation of our method a straightforward adaptation of many existing methods. A similar randomisation technique can also be applied to LSTM units (Fortunato et al., 2017) and is easily extended to reinforcement learning; we leave this as future work.
Note that the NoisyNet exploration strategy is not restricted to the baselines considered in this paper. In fact, this idea can be applied to any deep RL algorithm that can be trained with gradient descent, including DDPG (Lillicrap et al., 2015), TRPO (Schulman et al., 2015) or distributional RL (C51) (Bellemare et al., 2017). As such, we believe this work is a step towards the goal of developing a universal exploration strategy.
Acknowledgements. We would like to thank Koray Kavukcuoglu, Oriol Vinyals, Daan Wierstra, Georg Ostrovski, Joseph Modayil, Simon Osindero, Chris Apps, Stephen Gaffney and many others at DeepMind for insightful discussions, comments and feedback on this work.
REFERENCES
Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. Advances in Neural Information Processing Systems, 19:49, 2007.

Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. arXiv preprint arXiv:1703.05449, 2017.

Marc Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Rémi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471-1479, 2016.

Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449-458, 2017.

Richard Bellman and Robert Kalaba. Dynamic Programming and Modern Control Theory. Academic Press, New York, 1965.

Dimitri Bertsekas. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 1995.

Chris M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108-116, 1995.

Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1613-1622, 2015.

Jeremy Fix and Matthieu Geist. Monte-Carlo swarm policy search. In Swarm and Evolutionary Computation, pp. 75-83. Springer, 2012.

Meire Fortunato, Charles Blundell, and Oriol Vinyals. Bayesian recurrent neural networks. arXiv preprint arXiv:1704.02798, 2017.

Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Maria Florina Balcan and Kilian Q. Weinberger (eds.), Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1050-1059, New York, NY, USA, 20-22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/gal16.html.

Matthieu Geist and Olivier Pietquin. Kalman temporal differences. Journal of Artificial Intelligence Research, 39:483-532, 2010a.

Matthieu Geist and Olivier Pietquin. Managing uncertainty within value function approximation in reinforcement learning. In Active Learning and Experimental Design Workshop (collocated with AISTATS 2010), Sardinia, Italy, volume 92, 2010b.

Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pp. 2348-2356, 2011.