File name: Noisy Networks for Exploration.pdf
Development tool:
File size: 5 MB
Downloads: 0
Upload time: 2019-09-02
Description: The original paper on Noisy Networks for Exploration (NoisyNet DQN), suitable for beginners who want to learn about this approach to exploration in deep reinforcement learning.

Published as a conference paper at ICLR 2018
The quality of a policy π is assessed by the action-value function Q defined as

Q^π(x, a) = E_π[ Σ_{t≥0} γ^t R(x_t, a_t) ]    (1)

where E_π is the expectation over the distribution of the admissible trajectories (x_0, a_0, x_1, a_1, ...) obtained by executing the policy π starting from x_0 = x and a_0 = a. Therefore, the quantity Q^π(x, a) represents the expected γ-discounted cumulative reward collected by executing the policy π starting from x and a. A policy is optimal if no other policy yields a higher return. The action-value function of the optimal policy is Q*(x, a) = max_π Q^π(x, a).

The value function V^π for a policy π is defined as V^π(x) = E_{a∼π(·|x)}[Q^π(x, a)], and represents the expected γ-discounted return collected by executing the policy π starting from state x.
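To make the γ-discounted sum inside Eq. (1) concrete, here is a tiny illustrative Python snippet; the reward values are made up for the example and are not from the paper:

```python
# Gamma-discounted return of a short reward sequence, as summed inside Eq. (1).
def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.9**2 * 2.0 = 2.62
```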
2.2 DEEP REINFORCEMENT LEARNING
Deep reinforcement learning uses deep neural networks as function approximators for RL methods. Deep Q-Networks (DQN) (Mnih et al., 2015), the Dueling architecture (Wang et al., 2016), Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016), Trust Region Policy Optimisation (Schulman et al., 2015), Deep Deterministic Policy Gradient (Lillicrap et al., 2015) and distributional RL (C51) (Bellemare et al., 2017) are examples of such algorithms. They frame the RL problem as the minimisation of a loss function L(θ), where θ represents the parameters of the network. In our experiments we shall consider the DQN, Dueling and A3C algorithms.
DQN (Mnih et al., 2015) uses a neural network as an approximator for the action-value function of the optimal policy Q*(x, a). DQN's estimate of the optimal action-value function, Q(x, a; θ), is found by minimising the following loss with respect to the neural network parameters θ:

L(θ) = E_{(x,a,r,y)∼D}[ ( r + γ max_{b∈A} Q(y, b; θ⁻) − Q(x, a; θ) )² ]    (2)

where D is a distribution over transitions e = (x, a, r = R(x, a), y ∼ P(·|x, a)) drawn from a replay buffer of previously observed transitions. Here θ⁻ represents the parameters of a fixed and separate target network which is updated (θ⁻ ← θ) regularly to stabilise the learning. An ε-greedy policy is used to pick actions greedily according to the action-value function Q or, with probability ε, a random action is taken.
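As an illustration, a minimal PyTorch sketch of the loss in Eq. (2) might look as follows; the names (`q_net`, `target_net`, `done`) and batching conventions are illustrative assumptions, not part of the paper:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, x, a, r, y, done, gamma=0.99):
    # Q(x, a; theta): value of the action actually taken (a is a LongTensor of action indices).
    q_xa = q_net(x).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # r + gamma * max_b Q(y, b; theta^-), bootstrapped from the frozen target network.
        target = r + gamma * (1.0 - done) * target_net(y).max(dim=1).values
    return F.mse_loss(q_xa, target)
```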
The Dueling DQN (Wang et al., 2016) is an extension of the DQN architecture. The main difference is the use of a Dueling network architecture as opposed to the Q network in DQN. The Dueling network estimates the action-value function using two parallel sub-networks, the value and advantage sub-networks, which share a convolutional encoder. Let θ_conv, θ_V, and θ_A be, respectively, the parameters of the convolutional encoder f, of the value network V, and of the advantage network A, and let θ = {θ_conv, θ_V, θ_A} be their concatenation. The outputs of these two networks are combined as follows for every (x, a) ∈ X × A:

Q(x, a; θ) = V(f(x; θ_conv); θ_V) + A(f(x; θ_conv), a; θ_A) − (1/N_actions) Σ_b A(f(x; θ_conv), b; θ_A)    (3)

The Dueling algorithm then makes use of the double-DQN update rule (van Hasselt et al., 2016) to optimise θ:

L(θ) = E_{(x,a,r,y)∼D}[ ( r + γ Q(y, b*(y); θ⁻) − Q(x, a; θ) )² ]    (4)
s.t. b*(y) = arg max_{b∈A} Q(y, b; θ)    (5)

where the definition of the distribution D and the target network parameters θ⁻ are identical to those of DQN.
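A minimal sketch of the combination in Eq. (3), assuming the shared convolutional features f(x; θ_conv) have already been computed; the layer names are placeholders:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, feature_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(feature_dim, 1)               # V(f(x); theta_V)
        self.advantage = nn.Linear(feature_dim, n_actions)   # A(f(x), a; theta_A)

    def forward(self, features):
        v = self.value(features)                 # shape (batch, 1)
        a = self.advantage(features)             # shape (batch, n_actions)
        # Subtract the mean advantage, as in Eq. (3), so V and A are identifiable.
        return v + a - a.mean(dim=1, keepdim=True)
```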
In contrast to DQN and Dueling, A3C (Mnih et al., 2016) is a policy gradient algorithm. A3C's network directly learns a policy π and a value function V of its policy. The gradient of the loss on the A3C policy at step t, for the roll-out (x_{t+i}, a_{t+i} ∼ π(·|x_{t+i}; θ), r_{t+i})_{i=0}^{k}, is:

∇_θ L^π(θ) = −E[ Σ_{i=0}^{k} ∇_θ log π(a_{t+i}|x_{t+i}; θ) A(x_{t+i}, a_{t+i}; θ) + β Σ_{i=0}^{k} ∇_θ H(π(·|x_{t+i}; θ)) ]    (6)

where H(π(·|x_t; θ)) denotes the entropy of the policy π and β is a hyperparameter that trades off between optimising the advantage function and the entropy of the policy. The advantage function A(x_{t+i}, a_{t+i}; θ) is the difference between observed returns and estimates of the return produced by A3C's value network: A(x_{t+i}, a_{t+i}; θ) = Σ_{j=i}^{k−1} γ^{j−i} r_{t+j} + γ^{k−i} V(x_{t+k}; θ) − V(x_{t+i}; θ), with r_{t+j} being the reward at step t + j and V(x; θ) being the agent's estimate of the value function of state x.

The parameters of the value function are found to match on-policy returns; namely we have

L^V(θ) = E[ Σ_{i=0}^{k} ( Q̂_i − V(x_{t+i}; θ) )² ]    (7)

where Q̂_i is the return obtained by executing policy π starting in state x_{t+i}. In practice, and as in Mnih et al. (2016), we estimate Q̂_i as Q̂_i = Σ_{j=i}^{k−1} γ^{j−i} r_{t+j} + γ^{k−i} V(x_{t+k}; θ), where r_{t+j} are the rewards observed by the agent, and x_{t+k} is the k-th state observed when starting from observed state x_t. The overall A3C loss is then L(θ) = L^π(θ) + λ L^V(θ), where λ balances optimising the policy loss relative to the baseline value function loss.
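For concreteness, a minimal sketch of the k-step returns and the combined loss L(θ) = L^π(θ) + λ L^V(θ) could look as follows; the tensor names and hyperparameter values are illustrative assumptions:

```python
import torch

def a3c_loss(log_probs, entropies, values, rewards, bootstrap_value,
             gamma=0.99, beta=0.01, lam=0.5):
    # log_probs, entropies, values: tensors of shape (k,) collected over one roll-out;
    # rewards: list of k floats r_{t+i}; bootstrap_value: float estimate of V(x_{t+k}).
    returns, R = [], bootstrap_value
    for r in reversed(rewards):              # Q_i = sum_j gamma^{j-i} r_{t+j} + gamma^{k-i} V(x_{t+k})
        R = r + gamma * R
        returns.append(R)
    returns = torch.tensor(list(reversed(returns)))
    advantages = returns - values            # A(x_{t+i}, a_{t+i}; theta)
    policy_loss = -(log_probs * advantages.detach()).sum()   # policy term of Eq. (6)
    value_loss = advantages.pow(2).sum()                      # L^V(theta), Eq. (7)
    return policy_loss - beta * entropies.sum() + lam * value_loss
```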
3 NOISYNETS FOR REINFORCEMENT LEARNING
NoisyNets are neural networks whose weights and biases are perturbed by a parametric function of the noise. These parameters are adapted with gradient descent. More precisely, let y = f_θ(x) be a neural network parameterised by the vector of noisy parameters θ, which takes the input x and outputs y. We represent the noisy parameters θ as θ := μ + Σ ⊙ ε, where ζ := (μ, Σ) is a set of vectors of learnable parameters, ε is a vector of zero-mean noise with fixed statistics and ⊙ represents element-wise multiplication. The usual loss of the neural network is wrapped by an expectation over the noise ε: L̄(ζ) := E[L(θ)]. Optimisation now occurs with respect to the set of parameters ζ.
Consider a linear layer of a neural network with p inputs and q outputs, represented by

y = wx + b    (8)

where x ∈ R^p is the layer input, w ∈ R^{q×p} the weight matrix, and b ∈ R^q the bias. The corresponding noisy linear layer is defined as:

y := (μ^w + σ^w ⊙ ε^w) x + μ^b + σ^b ⊙ ε^b    (9)

where μ^w + σ^w ⊙ ε^w and μ^b + σ^b ⊙ ε^b replace w and b in Eq. (8), respectively. The parameters μ^w ∈ R^{q×p}, μ^b ∈ R^q, σ^w ∈ R^{q×p} and σ^b ∈ R^q are learnable, whereas ε^w ∈ R^{q×p} and ε^b ∈ R^q are noise random variables (the specific choices of this distribution are described below). We provide a graphical representation of a noisy linear layer in Fig. 4 (see Appendix B).
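A minimal PyTorch sketch of the noisy linear layer in Eq. (9) with independent Gaussian noise might look as follows; the class and the initialisation constants are an illustrative reading of the paper (see Sec. 3.2), not the authors' code:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__()
        bound = math.sqrt(3.0 / in_features)
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-bound, bound))
        self.mu_b = nn.Parameter(torch.empty(out_features).uniform_(-bound, bound))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma_init))

    def forward(self, x):
        # Sample fresh noise on every forward pass: theta = mu + sigma * eps.
        eps_w = torch.randn_like(self.sigma_w)
        eps_b = torch.randn_like(self.sigma_b)
        return F.linear(x, self.mu_w + self.sigma_w * eps_w,
                        self.mu_b + self.sigma_b * eps_b)
```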
We now turn to explicit instances of the noise distributions for linear layers in a noisy network. We explore two options: Independent Gaussian noise, which uses an independent Gaussian noise entry per weight, and Factorised Gaussian noise, which uses an independent noise per output and another independent noise per input. The main reason to use factorised Gaussian noise is to reduce the compute time of random number generation in our algorithms. This computational overhead is especially prohibitive in the case of single-thread agents such as DQN and Dueling. For this reason we use factorised noise for DQN and Dueling and independent noise for the distributed A3C, for which the compute time is not a major concern.

(a) Independent Gaussian noise: the noise applied to each weight and bias is independent, where each entry ε^w_{i,j} (respectively each entry ε^b_j) of the random matrix ε^w (respectively of the random vector ε^b) is drawn from a unit Gaussian distribution. This means that for each noisy linear layer, there are pq + q noise variables (for p inputs to the layer and q outputs).
(b) Factorised Gaussian noise: by factorising ε^w_{i,j}, we can use p unit Gaussian variables ε_i for the noise of the inputs and q unit Gaussian variables ε_j for the noise of the outputs (thus p + q unit Gaussian variables in total). Each ε^w_{i,j} and ε^b_j can then be written as

ε^w_{i,j} = f(ε_i) f(ε_j)    (10)
ε^b_j = f(ε_j)    (11)

where f is a real-valued function. In our experiments we used f(x) = sgn(x)√|x|. Note that for the bias in Eq. (11) we could have set f(x) = x, but we decided to keep the same output noise for weights and biases.
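A minimal sketch of generating factorised noise as in Eqs. (10)-(11); the helper names are illustrative:

```python
import torch

def f(x):
    # f(x) = sgn(x) * sqrt(|x|), applied element-wise.
    return x.sign() * x.abs().sqrt()

def factorised_noise(p, q):
    eps_in = f(torch.randn(p))                # one noise variable per input
    eps_out = f(torch.randn(q))               # one noise variable per output
    eps_w = torch.outer(eps_out, eps_in)      # eps^w_{i,j} = f(eps_j) f(eps_i), shape (q, p)
    eps_b = eps_out                           # eps^b_j = f(eps_j)
    return eps_w, eps_b
```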
Since the loss of a noisy network, L̄(ζ) = E[L(θ)], is an expectation over the noise, the gradients are straightforward to obtain:

∇L̄(ζ) = ∇E[L(θ)] = E[ ∇_{μ,Σ} L(μ + Σ ⊙ ε) ]    (12)

We use a Monte Carlo approximation to the above gradients, taking a single sample ξ at each step of optimisation:

∇L̄(ζ) ≈ ∇_{μ,Σ} L(μ + Σ ⊙ ξ)    (13)
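The single-sample gradient of Eq. (13) is ordinary backpropagation through μ and Σ: the gradient with respect to σ equals the gradient with respect to the sampled parameter, scaled element-wise by the noise. A small self-contained check, with an arbitrary stand-in for L(θ):

```python
import torch

mu = torch.randn(3, requires_grad=True)
sigma = torch.full((3,), 0.5, requires_grad=True)
eps = torch.randn(3)

theta = mu + sigma * eps          # theta = mu + sigma * eps (one noise sample)
loss = (theta ** 2).sum()         # stand-in for L(theta); dL/dtheta = 2 * theta
loss.backward()

assert torch.allclose(mu.grad, 2 * theta.detach())           # dL/dmu = dL/dtheta
assert torch.allclose(sigma.grad, 2 * theta.detach() * eps)  # dL/dsigma = dL/dtheta * eps
```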
3.1 DEEP REINFORCEMENT LEARNING WITH NOISYNETS
We now turn to our application of noisy networks to exploration in deep reinforcement learning. Noise drives exploration in many methods for reinforcement learning, providing a source of stochasticity external to the agent and the RL task at hand. Either the scale of this noise is manually tuned across a wide range of tasks (as is the practice in general purpose agents such as DQN or A3C) or it can be manually scaled per task. Here we propose automatically tuning the level of noise added to an agent for exploration, using the noisy networks training to drive down (or up) the level of noise injected into the parameters of a neural network, as needed.

A noisy network agent samples a new set of parameters after every step of optimisation. Between optimisation steps, the agent acts according to a fixed set of parameters (weights and biases). This ensures that the agent always acts according to parameters that are drawn from the current noise distribution.
Deep Q-Networks (DQN) and Dueling. We apply the following modifications to both DQN and Dueling: first, ε-greedy is no longer used; instead the policy greedily optimises the (randomised) action-value function. Secondly, the fully connected layers of the value network are parameterised as a noisy network, where the parameters are drawn from the noisy network parameter distribution after every replay step. We used factorised Gaussian noise as explained in (b) from Sec. 3. For replay, the current noisy network parameter sample is held fixed across the batch. Since DQN and Dueling take one step of optimisation for every action step, the noisy network parameters are re-sampled before every action. We call the new adaptations of DQN and Dueling, NoisyNet-DQN and NoisyNet-Dueling, respectively.
We now provide the details of the loss function that our variant of DQN is minimising. When replacing the linear layers by noisy layers in the network (respectively in the target network), the parameterised action-value function Q(x, a, ε; ζ) (respectively Q(y, b, ε′; ζ⁻)) can be seen as a random variable and the DQN loss becomes the NoisyNet-DQN loss:

L̄(ζ) = E[ E_{(x,a,r,y)∼D}[ ( r + γ max_{b∈A} Q(y, b, ε′; ζ⁻) − Q(x, a, ε; ζ) )² ] ]    (14)

where the outer expectation is with respect to the distribution of the noise variables ε for the noisy value function Q(x, a, ε; ζ) and the noise variables ε′ for the noisy target value function Q(y, b, ε′; ζ⁻). Computing an unbiased estimate of the loss is straightforward, as we only need to compute, for each transition in the replay buffer, one instance of the target network and one instance of the online network. We generate these noises independently to avoid bias due to correlation between the noise in the target network and the online network. Concerning the action choice, we generate another independent sample ε″ for the online network and we act greedily with respect to the corresponding output action-value function.
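A minimal sketch of the action choice described above: there is no ε-greedy step, and the exploration comes entirely from drawing a fresh noise sample inside the noisy layers before acting. It assumes a `q_net` built from noisy layers that resample noise on each forward pass, as in the earlier NoisyLinear sketch:

```python
import torch

@torch.no_grad()
def select_action(q_net, state):
    # The forward pass draws a fresh noise sample eps'', then we act greedily.
    q_values = q_net(state.unsqueeze(0))
    return int(q_values.argmax(dim=1).item())
```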
Similarly, the loss function for NoisyNet-Dueling is defined as:

L̄(ζ) = E[ E_{(x,a,r,y)∼D}[ ( r + γ Q(y, b*(y), ε′; ζ⁻) − Q(x, a, ε; ζ) )² ] ]    (15)
s.t. b*(y) = arg max_{b∈A} Q(y, b, ε″; ζ)    (16)

Both algorithms are provided in Appendix C.1.
Asynchronous Advantage Actor Critic (A3C). A3C is modified in a similar fashion to DQN: firstly, the entropy bonus of the policy loss is removed. Secondly, the fully connected layers of the network are parameterised as a noisy network, using independent Gaussian noise as explained in (a) from Sec. 3. In A3C, there is no explicit exploratory action selection scheme (such as ε-greedy); the chosen action is always drawn from the current policy. For this reason, an entropy bonus on the policy loss is often added to discourage updates leading to deterministic policies. However, when adding noisy weights to the network, sampling these parameters corresponds to choosing a different current policy, which naturally favours exploration. As a consequence of direct exploration in the policy space, the artificial entropy loss on the policy can thus be omitted. New parameters of the policy network are sampled after each step of optimisation, and since A3C uses n-step returns, optimisation occurs every n steps. We call this modification of A3C, NoisyNet-A3C.
Indeed, when replacing the linear layers by noisy linear layers (the parameters of the noisy network are now denoted ζ), we obtain the following estimate of the return via a roll-out of size k:

Q̂_i = Σ_{j=i}^{k−1} γ^{j−i} r_{t+j} + γ^{k−i} V(x_{t+k}; ζ)    (17)

As A3C is an on-policy algorithm, the gradients are unbiased when the noise of the network is consistent for the whole roll-out. Consistency among the action-value functions Q̂_i is ensured by letting the noise be the same throughout each roll-out, i.e., ∀i, ε_i = ε. Additional details are provided in Appendix A and the algorithm is given in Appendix C.2.
3.2 INITIALISATION OF NOISY NETWORKS
In the case of an unfactorised noisy network, the parameters μ and σ are initialised as follows. Each element μ_{i,j} is sampled from an independent uniform distribution U[−√(3/p), +√(3/p)], where p is the number of inputs to the corresponding linear layer, and each element σ_{i,j} is simply set to 0.017 for all parameters. This particular initialisation was chosen because similar values worked well for the supervised learning tasks described in Fortunato et al. (2017), where the initialisation of the variances of the posteriors and the variances of the prior are related. We have not tuned this parameter, but we believe different values on the same scale should provide similar results.

For factorised noisy networks, each element μ_{i,j} was initialised by a sample from an independent uniform distribution U[−1/√p, +1/√p] and each element σ_{i,j} was initialised to the constant σ_0/√p. The hyperparameter σ_0 is set to 0.5.
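A minimal sketch of the factorised-noise initialisation described above, assuming the layer stores its parameters as in the earlier NoisyLinear sketch; this is an illustrative helper, not the authors' code:

```python
import math
import torch
import torch.nn as nn

def init_factorised_noisy_params(in_features, out_features, sigma_0=0.5):
    # mu ~ U[-1/sqrt(p), 1/sqrt(p)]; sigma = sigma_0 / sqrt(p), with p = in_features.
    bound = 1.0 / math.sqrt(in_features)
    mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-bound, bound))
    mu_b = nn.Parameter(torch.empty(out_features).uniform_(-bound, bound))
    sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma_0 * bound))
    sigma_b = nn.Parameter(torch.full((out_features,), sigma_0 * bound))
    return mu_w, mu_b, sigma_w, sigma_b
```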
4 RESULTS
We evaluated the performance of noisy network agents on 57 Atari games (Bellemare et al., 2015) and compared to baselines that, without noisy networks, rely upon the original exploration methods (ε-greedy and entropy bonus).
4.1 TRAINING DETAILS AND PERFORMANCE
We used the random start no-ops scheme for training and evaluation as described in the original DQN paper (Mnih et al., 2015). The mode of evaluation is identical to that of Mnih et al. (2016), where randomised restarts of the games are used for evaluation after training has happened. The raw average scores of the agents are evaluated during training, every 1M frames in the environment, by suspending learning and evaluating the latest agent for 500K frames. Episodes are truncated at 108K frames (or 30 minutes of simulated play) (van Hasselt et al., 2016).
[Figure 1: Comparison of each NoisyNet agent versus its baseline according to Eq. (19); the maximum score is truncated at 250%. (a) Percentage improvement of NoisyNet-DQN over DQN (Mnih et al., 2015); (b) percentage improvement of NoisyNet-Dueling over Dueling (Wang et al., 2016); (c) percentage improvement of NoisyNet-A3C over A3C (Mnih et al., 2016).]
We consider three baseline agents: DQN (Mnih et al., 2015), the duel clip variant of the Dueling algorithm (Wang et al., 2016) and A3C (Mnih et al., 2016). The DQN and A3C agents were trained for 200M and 320M frames, respectively. In each case, we used the neural network architecture from the corresponding original papers for both the baseline and the NoisyNet variant. For the NoisyNet variants we used the same hyperparameters as in the respective original paper for the baseline.
We compared the absolute performance of agents using the human normalised score:

100 × (Score_Agent − Score_Random) / (Score_Human − Score_Random)    (18)

where the human and random scores are the same as those in Wang et al. (2016). Note that the human normalised score is zero for a random agent and 100 for human-level performance. Per-game maximum scores are computed by taking the maximum raw scores of the agent and then averaging over three seeds. However, for computing the human normalised scores in Figure 2, the raw scores are evaluated every 1M frames and averaged over three seeds. The overall agent performance is measured by both mean and median of the human normalised score across all 57 Atari games.
The aggregated results across all 57 Atari games are reported in Table 1, while the individual scores for each game are in Table 3 of Appendix E. The median human normalised score is improved in all agents by using NoisyNet, adding at least 18 (in the case of A3C) and at most 48 (in the case of DQN) percentage points to the median human normalised score. The mean human normalised score is also significantly improved for all agents. Interestingly, the Dueling case, which relies on multiple modifications of DQN, demonstrates that NoisyNet is orthogonal to several other improvements made to DQN.
            Baseline          NoisyNet          Improvement
            Mean   Median     Mean   Median     (on median)
DQN         319    83         379    123        48%
Dueling     524    132        633    172        30%
A3C         293    80         347    94         18%

Table 1: Comparison between the baseline DQN, Dueling and A3C and their NoisyNet versions in terms of mean and median human-normalised scores defined in Eq. (18). The last column reports the percentage improvement over the baseline in terms of the median human-normalised score.
We also compared the relative performance of NoisyNet agents to the respective baseline agent without noisy networks:

100 × (Score_NoisyNet − Score_Baseline) / (max(Score_Human, Score_Baseline) − Score_Random)    (19)
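For reference, the two scores in Eqs. (18) and (19) can be computed as follows (plain Python; the argument values are placeholders supplied by the caller):

```python
def human_normalised(score_agent, score_random, score_human):
    # Eq. (18): 0 for a random agent, 100 for human-level performance.
    return 100.0 * (score_agent - score_random) / (score_human - score_random)

def relative_to_baseline(score_noisy, score_baseline, score_human, score_random):
    # Eq. (19): improvement of the NoisyNet agent over its baseline.
    return 100.0 * (score_noisy - score_baseline) / (
        max(score_human, score_baseline) - score_random)
```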
As before, the per-game score is computed by taking the maximum performance for each game and then averaging over three seeds. The relative human normalised scores are shown in Figure 1. As can be seen, the performance of NoisyNet agents (DQN, Dueling and A3C) is better for the majority of games relative to the corresponding baseline, and in some cases by a considerable margin. Also, as is evident from the learning curves of Fig. 2, NoisyNet agents produce superior performance compared to their corresponding baselines throughout the learning process. This improvement is especially significant in the case of NoisyNet-DQN and NoisyNet-Dueling. In some games, NoisyNet agents provide an order of magnitude improvement on the performance of the vanilla agent, as can be seen in Table 3 of Appendix E, which gives a detailed breakdown of individual game scores, and in the learning-curve plots of Figs. 6-8 for DQN, Dueling and A3C, respectively. We also ran some experiments evaluating the performance of NoisyNet-A3C with factorised noise. We report the corresponding learning curves and scores in Appendix D (see the figure and Table 2 there). This result shows that using factorised noise does not lead to any significant decrease in the performance of A3C. On the contrary, it seems to have positive effects in terms of improving the median score as well as speeding up the learning process.
[Figure 2: Comparison of the learning curves of each NoisyNet agent versus its baseline according to the median human normalised score over games, plotted against millions of environment frames.]
4.2 ANALYSIS OF LEARNING IN NOISY LAYERS
In this subsection, we try to provide some insight into how noisy networks affect the learning process and the exploratory behaviour of the agent. In particular, we focus on analysing the evolution of the noise weights σ^w and σ^b throughout the learning process. We first note that, as L̄(ζ) is a positive and continuous function of ζ, there always exists a deterministic optimiser for the loss L̄(ζ) (defined in Eq. (14)). Therefore, one may expect that, to obtain the deterministic optimal solution, the neural network may learn to discard the noise entries by eventually pushing σ^w and σ^b towards 0.

To test this hypothesis we track the changes in σ^w throughout the learning process. Let σ_i denote the i-th weight of a noisy layer. We then define Σ̄, the mean absolute value of the σ_i of a noisy layer, as:

Σ̄ = (1/N_weights) Σ_i |σ_i|    (20)
Intuitively speaking, Σ̄ provides some measure of the stochasticity of the noisy layers. We report the learning curves of the average of Σ̄ across 3 seeds in Fig. 3 for a selection of Atari games in the NoisyNet-DQN agent. We observe that Σ̄ of the last layer of the network decreases as the learning proceeds in all cases, whereas in the case of the penultimate layer this only happens for 2 games out of 5 (Pong and Beam Rider) and in the remaining 3 games Σ̄ in fact increases. This shows that in the case of NoisyNet-DQN the agent does not necessarily evolve towards a deterministic solution, as one might have expected. Another interesting observation is that the way Σ̄ evolves differs significantly from one game to another and, in some cases, from one seed to another, as is evident from the error bars. This suggests that NoisyNet produces a problem-specific exploration strategy, as opposed to the fixed exploration strategy used in standard DQN.
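A minimal sketch of computing Σ̄ from Eq. (20) for a layer that exposes its σ^w parameters as in the earlier NoisyLinear sketch (an illustrative assumption):

```python
import torch

@torch.no_grad()
def sigma_bar(noisy_layer):
    # Mean absolute value of the sigma^w entries of one noisy layer.
    return noisy_layer.sigma_w.abs().mean().item()
```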
[Figure 3: Comparison of the learning curves of the average noise parameter Σ̄ across five Atari games (including Beam Rider, Breakout, Pong and Space Invaders) in NoisyNet-DQN, shown separately for the penultimate layer and the last layer of the network. The results are averaged across 3 seeds and error bars (± one standard deviation) are plotted; the x-axis is in millions of frames.]
5 CONCLUSION
We have presented a general method for exploration in deep reinforcement learning that shows significant performance improvements across many Atari games in three different agent architectures. In particular, we observe that in games such as Beam Rider, Asteroids and Freeway, where the standard DQN, Dueling and A3C perform poorly compared with the human player, NoisyNet-DQN, NoisyNet-Dueling and NoisyNet-A3C achieve superhuman performance, respectively. Although the improvements in performance might also come from the optimisation aspect, since the cost functions are modified, the uncertainty in the parameters of the networks introduced by NoisyNet is the only exploration mechanism of the method. Having weights with greater uncertainty introduces more variability into the decisions made by the policy, which has potential for exploratory actions, but further analysis needs to be done in order to disentangle the exploration and optimisation effects.

Another advantage of NoisyNet is that the amount of noise injected in the network is tuned automatically by the RL algorithm. This alleviates the need for any hyperparameter tuning (required with standard entropy bonus and ε-greedy types of exploration). This is also in contrast to many other methods that add intrinsic motivation signals that may destabilise learning or change the optimal policy. Another interesting feature of the NoisyNet approach is that the degree of exploration is contextual and varies from state to state based upon per-weight variances. While more gradients are needed, the gradients on the mean and variance parameters are related to one another by a computationally efficient affine function, thus the computational overhead is marginal. Automatic differentiation makes implementation of our method a straightforward adaptation of many existing methods. A similar randomisation technique can also be applied to LSTM units (Fortunato et al., 2017) and is easily extended to reinforcement learning; we leave this as future work.
Note that the NoisyNet exploration strategy is not restricted to the baselines considered in this paper. In fact, this idea can be applied to any deep RL algorithm that can be trained with gradient descent, including DDPG (Lillicrap et al., 2015), TRPO (Schulman et al., 2015) or distributional RL (C51) (Bellemare et al., 2017). As such, we believe this work is a step towards the goal of developing a universal exploration strategy.
Acknowledgements. We would like to thank Koray Kavukcuoglu, Oriol Vinyals, Daan Wierstra, Georg Ostrovski, Joseph Modayil, Simon Osindero, Chris Apps, Stephen Gaffney and many others at DeepMind for insightful discussions, comments and feedback on this work.
REFERENCES
Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. Advances in Neural Information Processing Systems, 19:49, 2007.

Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. arXiv preprint arXiv:1703.05449, 2017.

Marc Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Rémi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471-1479, 2016.

Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pp. 449-458, 2017.

Richard Bellman and Robert Kalaba. Dynamic Programming and Modern Control Theory. Academic Press, New York, 1965.

Dimitri Bertsekas. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 1995.

Chris M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108-116, 1995.

Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1613-1622, 2015.

Jeremy Fix and Matthieu Geist. Monte-Carlo swarm policy search. In Swarm and Evolutionary Computation, pp. 75-83. Springer, 2012.

Meire Fortunato, Charles Blundell, and Oriol Vinyals. Bayesian recurrent neural networks. arXiv preprint arXiv:1704.02798, 2017.

Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Maria Florina Balcan and Kilian Q. Weinberger (eds.), Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1050-1059, New York, NY, USA, 20-22 Jun 2016. PMLR. URL http://proceedings.mlr.press/v48/gal16.html.

Matthieu Geist and Olivier Pietquin. Kalman temporal differences. Journal of Artificial Intelligence Research, 39:483-532, 2010a.

Matthieu Geist and Olivier Pietquin. Managing uncertainty within value function approximation in reinforcement learning. In Active Learning and Experimental Design Workshop (collocated with AISTATS 2010), Sardinia, Italy, volume 92, 2010b.

Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pp. 2348-2356, 2011.