Description: the original Dueling DQN paper, suitable for beginners who want to become familiar with deep reinforcement learning and the dueling DQN architecture.

Dueling Network Architectures for Deep Reinforcement Learning
The results of Schaul et al. (2016) are the current published state-of-the-art.

2. Background

We consider a sequential decision making setup, in which an agent interacts with an environment over discrete time steps; see Sutton & Barto (1998) for an introduction. In the Atari domain, for example, the agent perceives a video s_t consisting of M image frames: s_t = (x_{t-M+1}, ..., x_t) ∈ S at time step t. The agent then chooses an action from a discrete set a_t ∈ A = {1, ..., |A|} and observes a reward signal r_t produced by the game emulator.

The agent seeks to maximize the expected discounted return, where we define the discounted return as R_t = Σ_{τ=t}^{∞} γ^{τ-t} r_τ. In this formulation, γ ∈ [0, 1] is a discount factor that trades off the importance of immediate and future rewards.

For an agent behaving according to a stochastic policy π, the values of the state-action pair (s, a) and the state s are defined as follows:

    Q^π(s, a) = E[ R_t | s_t = s, a_t = a, π ],  and
    V^π(s) = E_{a∼π(s)}[ Q^π(s, a) ].        (1)

The preceding state-action value function (Q function for short) can be computed recursively with dynamic programming:

    Q^π(s, a) = E_{s'}[ r + γ E_{a'∼π(s')}[ Q^π(s', a') ] | s, a, π ].

We define the optimal Q^*(s, a) = max_π Q^π(s, a). Under the deterministic policy a = arg max_{a'∈A} Q^*(s, a'), it follows that V^*(s) = max_a Q^*(s, a). From this, it also follows that the optimal Q function satisfies the Bellman equation:

    Q^*(s, a) = E_{s'}[ r + γ max_{a'} Q^*(s', a') | s, a ].        (2)

We define another important quantity, the advantage function, relating the value and Q functions:

    A^π(s, a) = Q^π(s, a) − V^π(s).        (3)

Note that E_{a∼π(s)}[ A^π(s, a) ] = 0. Intuitively, the value function V measures how good it is to be in a particular state s. The Q function, however, measures the value of choosing a particular action when in this state. The advantage function subtracts the value of the state from the Q function to obtain a relative measure of the importance of each action.
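To make the identities around equation (3) concrete, here is a minimal numerical sketch (the Q values and policy probabilities are made up for illustration), verifying that the advantages average to zero under π:

    import numpy as np

    Q = np.array([1.0, 2.0, 5.0])    # hypothetical Q^pi(s, a) values for three actions
    pi = np.array([0.2, 0.3, 0.5])   # hypothetical stochastic policy pi(a | s)

    V = np.dot(pi, Q)                # V^pi(s) = E_{a~pi(s)}[Q^pi(s, a)]
    A = Q - V                        # A^pi(s, a) = Q^pi(s, a) - V^pi(s), equation (3)

    print(V)                         # 3.3
    print(A)                         # [-2.3, -1.3, 1.7]
    print(np.dot(pi, A))             # ~0: E_{a~pi(s)}[A^pi(s, a)] = 0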
2.1. Deep Q-networks

The value functions as described in the preceding section are high dimensional objects. To approximate them, we can use a deep Q-network: Q(s, a; θ) with parameters θ. To estimate this network, we optimize the following sequence of loss functions at iteration i:

    L_i(θ_i) = E_{s,a,r,s'}[ (y_i^DQN − Q(s, a; θ_i))^2 ],        (4)

with

    y_i^DQN = r + γ max_{a'} Q(s', a'; θ^−),        (5)

where θ^− represents the parameters of a fixed and separate target network. We could attempt to use standard Q-learning to learn the parameters of the network Q(s, a; θ) online. However, this estimator performs poorly in practice. A key innovation in (Mnih et al., 2015) was to freeze the parameters of the target network Q(s', a'; θ^−) for a fixed number of iterations while updating the online network Q(s, a; θ_i) by gradient descent. This greatly improves the stability of the algorithm. The specific gradient update is

    ∇_{θ_i} L_i(θ_i) = E_{s,a,r,s'}[ (y_i^DQN − Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i) ].

This approach is model free in the sense that the states and rewards are produced by the environment. It is also off-policy because these states and rewards are obtained with a behavior policy (epsilon greedy in DQN) different from the online policy that is being learned.
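As a concrete reading of equations (4)-(5), the following is a minimal PyTorch sketch, not the authors' implementation: q_net and q_target are hypothetical online and target networks mapping a batch of states to per-action Q values, the batch layout is an assumption, and the terminal-state mask is the usual implementation detail rather than part of the equations.

    import torch
    import torch.nn.functional as F

    def dqn_loss(q_net, q_target, batch, gamma=0.99):
        """Sketch of the DQN loss of equations (4)-(5).

        batch: (states, actions, rewards, next_states, dones) tensors, where
        actions is a 1-D int64 tensor and dones is 1.0 for terminal transitions.
        """
        s, a, r, s_next, done = batch
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; theta_i)
        with torch.no_grad():                                   # frozen target network
            max_next_q = q_target(s_next).max(dim=1).values     # max_a' Q(s', a'; theta^-)
            y = r + gamma * (1.0 - done) * max_next_q           # y_i^DQN, equation (5)
        return F.mse_loss(q_sa, y)                              # squared error of equation (4)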
Another key ingredient behind the success of DQN is experience replay (Lin, 1993; Mnih et al., 2015). During learning, the agent accumulates a dataset D_t = {e_1, e_2, ..., e_t} of experiences e_t = (s_t, a_t, r_t, s_{t+1}) from many episodes. When training the Q-network, instead of only using the current experience as prescribed by standard temporal-difference learning, the network is trained by sampling mini-batches of experiences from D uniformly at random. The sequence of losses thus takes the form

    L_i(θ_i) = E_{(s,a,r,s') ∼ U(D)}[ (y_i^DQN − Q(s, a; θ_i))^2 ].

Experience replay increases data efficiency through re-use of experience samples in multiple updates and, importantly, it reduces variance as uniform sampling from the replay buffer reduces the correlation among the samples used in the update.
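A uniform replay buffer of the kind described above can be sketched in a few lines; the class below is a hypothetical illustration (capacity and batch size are arbitrary choices), not the implementation used in the paper.

    import random
    from collections import deque

    class ReplayBuffer:
        """Minimal sketch of uniform experience replay."""

        def __init__(self, capacity=1_000_000):
            self.buffer = deque(maxlen=capacity)      # oldest experiences are discarded

        def add(self, s, a, r, s_next, done):
            self.buffer.append((s, a, r, s_next, done))

        def sample(self, batch_size=32):
            # Uniform sampling, i.e. (s, a, r, s') ~ U(D) in the loss above.
            return random.sample(self.buffer, batch_size)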
2.2. Double Deep Q-networks

The previous section described the main components of DQN as presented in (Mnih et al., 2015). In this paper, we use the improved Double DQN (DDQN) learning algorithm of van Hasselt et al. (2015). In Q-learning and DQN, the max operator uses the same values to both select and evaluate an action. This can therefore lead to overoptimistic value estimates (van Hasselt, 2010). To mitigate this problem, DDQN uses the following target:

    y_i^DDQN = r + γ Q(s', arg max_{a'} Q(s', a'; θ_i); θ^−).        (6)

DDQN is the same as for DQN (see Mnih et al. (2015)), but with the target y_i^DQN replaced by y_i^DDQN. The pseudo-code for DDQN is presented in Appendix A.
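The change from equation (5) to equation (6) is small in code: the online network selects the next action and the target network evaluates it. A hedged sketch, under the same assumptions as the DQN sketch above:

    import torch
    import torch.nn.functional as F

    def ddqn_loss(q_net, q_target, batch, gamma=0.99):
        """Sketch of a DDQN update using the target of equation (6)."""
        s, a, r, s_next, done = batch
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            a_star = q_net(s_next).argmax(dim=1, keepdim=True)      # arg max_a' Q(s', a'; theta_i)
            next_q = q_target(s_next).gather(1, a_star).squeeze(1)  # Q(s', a*; theta^-)
            y = r + gamma * (1.0 - done) * next_q                   # y_i^DDQN
        return F.mse_loss(q_sa, y)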
2.3. Prioritized Replay

A recent innovation in prioritized experience replay (Schaul et al., 2016) built on top of DDQN and further improved the state-of-the-art. Their key idea was to increase the replay probability of experience tuples that have a high expected learning progress (as measured via the proxy of absolute TD-error). This led to both faster learning and to better final policy quality across most games of the Atari benchmark suite, as compared to uniform experience replay.

To strengthen the claim that our dueling architecture is complementary to algorithmic innovations, we show that it improves performance for both the uniform and the prioritized replay baselines (for which we picked the easier to implement rank-based variant), with the resulting prioritized dueling variant holding the new state-of-the-art.

3. The Dueling Network Architecture

The key insight behind our new architecture, as illustrated in Figure 2, is that for many states, it is unnecessary to estimate the value of each action choice. For example, in the Enduro game setting, knowing whether to move left or right only matters when a collision is imminent. In some states, it is of paramount importance to know which action to take, but in many other states the choice of action has no repercussion on what happens. For bootstrapping based algorithms, however, the estimation of state values is of great importance for every state.

To bring this insight to fruition, we design a single Q-network architecture, as illustrated in Figure 1, which we refer to as the dueling network. The lower layers of the dueling network are convolutional as in the original DQNs (Mnih et al., 2015). However, instead of following the convolutional layers with a single sequence of fully connected layers, we instead use two sequences (or streams) of fully connected layers. The streams are constructed such that they have the capability of providing separate estimates of the value and advantage functions. Finally, the two streams are combined to produce a single output Q function. As in (Mnih et al., 2015), the output of the network is a set of Q values, one for each action.

Since the output of the dueling network is a Q function, it can be trained with the many existing algorithms, such as DDQN and SARSA. In addition, it can take advantage of any improvements to these algorithms, including better replay memories, better exploration policies, intrinsic motivation and so on.

The module that combines the two streams of fully-connected layers to output a Q estimate requires very thoughtful design.

From the expressions for advantage Q^π(s, a) = V^π(s) + A^π(s, a) and state-value V^π(s) = E_{a∼π(s)}[ Q^π(s, a) ], it follows that E_{a∼π(s)}[ A^π(s, a) ] = 0. Moreover, for a deterministic policy, a* = arg max_{a'∈A} Q(s, a'), it follows that Q(s, a*) = V(s) and hence A(s, a*) = 0.

Let us consider the dueling network shown in Figure 1, where we make one stream of fully-connected layers output a scalar V(s; θ, β), and the other stream output an |A|-dimensional vector A(s, a; θ, α). Here, θ denotes the parameters of the convolutional layers, while α and β are the parameters of the two streams of fully-connected layers.

Using the definition of advantage, we might be tempted to construct the aggregating module as follows:

    Q(s, a; θ, α, β) = V(s; θ, β) + A(s, a; θ, α).        (7)

Note that this expression applies to all (s, a) instances; that is, to express equation (7) in matrix form we need to replicate the scalar, V(s; θ, β), |A| times.

However, we need to keep in mind that Q(s, a; θ, α, β) is only a parameterized estimate of the true Q-function. Moreover, it would be wrong to conclude that V(s; θ, β) is a good estimator of the state-value function, or likewise that A(s, a; θ, α) provides a reasonable estimate of the advantage function.

Equation (7) is unidentifiable in the sense that given Q we cannot recover V and A uniquely. To see this, add a constant to V(s; θ, β) and subtract the same constant from A(s, a; θ, α). This constant cancels out, resulting in the same Q value. This lack of identifiability is mirrored by poor practical performance when this equation is used directly.

To address this issue of identifiability, we can force the advantage function estimator to have zero advantage at the chosen action. That is, we let the last module of the network implement the forward mapping

    Q(s, a; θ, α, β) = V(s; θ, β) + ( A(s, a; θ, α) − max_{a'∈A} A(s, a'; θ, α) ).        (8)

Now, for a* = arg max_{a'∈A} Q(s, a'; θ, α, β) = arg max_{a'∈A} A(s, a'; θ, α), we obtain Q(s, a*; θ, α, β) = V(s; θ, β). Hence, the stream V(s; θ, β) provides an estimate of the value function, while the other stream produces an estimate of the advantage function.

An alternative module replaces the max operator with an average:

    Q(s, a; θ, α, β) = V(s; θ, β) + ( A(s, a; θ, α) − (1/|A|) Σ_{a'} A(s, a'; θ, α) ).        (9)

On the one hand this loses the original semantics of V and A because they are now off-target by a constant, but on the other hand it increases the stability of the optimization: with (9) the advantages only need to change as fast as the mean, instead of having to compensate any change to the optimal action's advantage in (8). We also experimented with a softmax version of equation (8), but found it to deliver similar results to the simpler module of equation (9). Hence, all the experiments reported in this paper use the module of equation (9).

Note that while subtracting the mean in equation (9) helps with identifiability, it does not change the relative rank of the A (and hence Q) values, preserving any greedy or ε-greedy policy based on the Q values from equation (7). When acting, it suffices to evaluate the advantage stream to make decisions.

It is important to note that equation (9) is viewed and implemented as part of the network and not as a separate algorithmic step. Training of the dueling architectures, as with standard Q networks (e.g. the deep Q-network of Mnih et al. (2015)), requires only back-propagation. The estimates V(s; θ, β) and A(s, a; θ, α) are computed automatically without any extra supervision or algorithmic modifications.

As the dueling architecture shares the same input-output interface with standard Q networks, we can recycle all learning algorithms with Q networks (e.g., DDQN and SARSA) to train the dueling architecture.
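The aggregating module of equation (9) reduces to a one-line operation on the outputs of the two streams. The sketch below is an illustration, not the authors' code; it assumes the streams have already produced a batch of scalars v and advantage vectors a. Replacing the mean with a max over actions would give the variant of equation (8).

    import torch

    def aggregate_dueling(v, a):
        """Sketch of the aggregating module of equation (9).

        v: [batch, 1] output of the value stream, V(s; theta, beta)
        a: [batch, num_actions] output of the advantage stream, A(s, .; theta, alpha)
        Returns Q(s, .; theta, alpha, beta) with the mean advantage subtracted,
        which makes the decomposition identifiable while preserving the ranking
        (and hence the greedy action) of the Q values.
        """
        return v + (a - a.mean(dim=1, keepdim=True))

    # Tiny usage example with made-up numbers:
    v = torch.tensor([[2.0]])
    a = torch.tensor([[1.0, -1.0, 0.0]])
    print(aggregate_dueling(v, a))    # tensor([[3., 1., 2.]])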
4. Experiments

We now show the practical performance of the dueling network. We start with a simple policy evaluation task and then show larger scale results for learning policies for general Atari game-playing.

4.1. Policy evaluation

We start by measuring the performance of the dueling architecture on a policy evaluation task. We choose this particular task because it is very useful for evaluating network architectures, as it is devoid of confounding factors such as the choice of exploration strategy, and the interaction between policy improvement and policy evaluation.

In this experiment, we employ temporal difference learning (without eligibility traces, i.e., λ = 0) to learn Q values. More specifically, given a behavior policy π, we seek to estimate the state-action value Q^π(·, ·) by optimizing the sequence of costs of equation (4), with target

    y_i = r + γ E_{a'∼π(s')}[ Q(s', a'; θ_i) ].

The above update rule is the same as that of Expected SARSA (van Seijen et al., 2009). We, however, do not modify the behavior policy as in Expected SARSA.
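The policy-evaluation target above can be sketched as follows. The helper for ε-greedy probabilities is a hypothetical convenience of this sketch; in the experiment below the behavior policy is ε-greedy with respect to the optimal Q values, so the probabilities would be built from those.

    import torch

    def epsilon_greedy_probs(greedy_actions, num_actions, epsilon=0.001):
        """Probabilities of an epsilon-greedy behavior policy: a uniformly random
        action with probability epsilon, otherwise the given greedy action.
        greedy_actions: [batch] int64 tensor of greedy action indices."""
        batch = greedy_actions.shape[0]
        pi = torch.full((batch, num_actions), epsilon / num_actions)
        pi[torch.arange(batch), greedy_actions] += 1.0 - epsilon
        return pi

    def expected_sarsa_target(reward, q_next, pi_next, gamma=0.99):
        """y_i = r + gamma * E_{a'~pi(s')}[Q(s', a'; theta_i)].
        q_next, pi_next: [batch, num_actions] tensors."""
        return reward + gamma * (pi_next * q_next).sum(dim=1)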
To evaluate the learned Q values, we choose a simple environment where the exact Q^π(s, a) values can be computed separately for all (s, a) ∈ S × A. This environment, which we call the corridor, is composed of three connected corridors. A schematic drawing of the corridor environment is shown in Figure 3. The agent starts from the bottom left corner of the environment and must move to the top right to get the largest reward. A total of 5 actions are available: go up, down, left, right and no-op. We also have the freedom of adding an arbitrary number of no-op actions. In our setup, the two vertical sections both have 10 states while the horizontal section has 50.

We use an ε-greedy policy as the behavior policy π, which chooses a random action with probability ε or an action according to the optimal Q function arg max_{a∈A} Q^*(s, a) with probability 1 − ε. In our experiments, ε is chosen to be 0.001.

We compare a single-stream Q architecture with the dueling architecture on three variants of the corridor environment with 5, 10 and 20 actions respectively. The 10 and 20 action variants are formed by adding no-ops to the original environment. We measure performance by Squared Error (SE) against the true state values: SE = Σ_{s∈S, a∈A} (Q(s, a; θ) − Q^π(s, a))^2. The single-stream architecture is a three layer MLP with 50 units on each hidden layer. The dueling architecture is also composed of three layers. After the first hidden layer of 50 units, however, the network branches off into two streams, each of them a two layer MLP with 25 hidden units. The results of the comparison are summarized in Figure 3.

Figure 3. (a) The corridor environment. The star marks the starting state. The redness of a state signifies the reward the agent receives upon arrival. The game terminates upon reaching either reward state. The agent's actions are going up, down, left, right and no action. Plots (b), (c) and (d) show squared error for policy evaluation with 5, 10, and 20 actions against the number of iterations, on a log-log scale. The dueling network (Duel) consistently outperforms a conventional single-stream network (Single), with the performance gap increasing with the number of actions.

The results show that with 5 actions, both architectures converge at about the same speed. However, when we increase the number of actions, the dueling architecture performs better than the traditional Q-network. In the dueling network, the stream V(s; θ, β) learns a general value that is shared across many similar actions at s, hence leading to faster convergence. This is a very promising result because many control tasks with large action spaces have this property, and consequently we should expect that the dueling network will often lead to much faster convergence than a traditional single stream network. In the following section, we will indeed see that the dueling network results in substantial gains in performance in a wide range of Atari games.
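For concreteness, the two architectures compared in the corridor experiment above can be sketched as follows; the hidden sizes come from the text, while the choice of ReLU activations and the exact output layout are assumptions of this sketch.

    import torch
    import torch.nn as nn

    class SingleStreamMLP(nn.Module):
        """Three-layer MLP with 50 units per hidden layer, as described above."""
        def __init__(self, state_dim, num_actions):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 50), nn.ReLU(),
                nn.Linear(50, 50), nn.ReLU(),
                nn.Linear(50, num_actions),
            )

        def forward(self, s):
            return self.net(s)

    class DuelingMLP(nn.Module):
        """Shared 50-unit first layer, then two 25-unit streams combined with the
        mean-subtracted module of equation (9)."""
        def __init__(self, state_dim, num_actions):
            super().__init__()
            self.shared = nn.Sequential(nn.Linear(state_dim, 50), nn.ReLU())
            self.value = nn.Sequential(nn.Linear(50, 25), nn.ReLU(), nn.Linear(25, 1))
            self.advantage = nn.Sequential(nn.Linear(50, 25), nn.ReLU(), nn.Linear(25, num_actions))

        def forward(self, s):
            h = self.shared(s)
            v, a = self.value(h), self.advantage(h)
            return v + (a - a.mean(dim=1, keepdim=True))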
4.2. General Atari Game-Playing

We perform a comprehensive evaluation of our proposed method on the Arcade Learning Environment (Bellemare et al., 2013), which is composed of 57 Atari games. The challenge is to deploy a single algorithm and architecture, with a fixed set of hyper-parameters, to learn to play all the games given only raw pixel observations and game rewards. This environment is very demanding because it is both comprised of a large number of highly diverse games and the observations are high-dimensional.

We follow closely the setup of van Hasselt et al. (2015) and compare to their results using single-stream Q-networks. We train the dueling network with the DDQN algorithm as presented in Appendix A. At the end of this section, we incorporate prioritized experience replay (Schaul et al., 2016).

Our network architecture has the same low-level convolutional structure of DQN (Mnih et al., 2015; van Hasselt et al., 2015). There are 3 convolutional layers followed by 2 fully-connected layers. The first convolutional layer has 32 8x8 filters with stride 4, the second 64 4x4 filters with stride 2, and the third and final convolutional layer consists of 64 3x3 filters with stride 1. As shown in Figure 1, the dueling network splits into two streams of fully connected layers. The value and advantage streams both have a fully-connected layer with 512 units. The final hidden layers of the value and advantage streams are both fully-connected, with the value stream having one output and the advantage stream as many outputs as there are valid actions (the number of actions ranges between 3 and 18 in the ALE environment). We combine the value and advantage streams using the module described by Equation (9). Rectifier non-linearities (Fukushima, 1980) are inserted between all adjacent layers.
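Putting the pieces together, here is a hedged PyTorch sketch of the dueling network just described, assuming the standard 84x84, 4-frame Atari input of Mnih et al. (2015); the flattened size of 3136 follows from that assumption.

    import torch
    import torch.nn as nn

    class DuelingDQN(nn.Module):
        """Sketch of the Atari dueling network described above; layer sizes follow
        the text, everything else is an assumption of this sketch."""

        def __init__(self, num_actions, in_channels=4):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            )
            self.value = nn.Sequential(
                nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, 1))
            self.advantage = nn.Sequential(
                nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, num_actions))

        def forward(self, x):
            h = self.conv(x).flatten(start_dim=1)
            v = self.value(h)                        # V(s; theta, beta), shape [batch, 1]
            a = self.advantage(h)                    # A(s, .; theta, alpha), shape [batch, |A|]
            return v + (a - a.mean(dim=1, keepdim=True))   # equation (9)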
We adopt the optimizers and hyper-parameters of van Hasselt et al. (2015), with the exception of the learning rate, which we chose to be slightly lower (we do not do this for Double DQN as it can deteriorate its performance). Since both the advantage and the value stream propagate gradients to the last convolutional layer in the backward pass, we rescale the combined gradient entering the last convolutional layer by 1/√2. This simple heuristic mildly increases stability. In addition, we clip the gradients to have their norm less than or equal to 10. This clipping is not standard practice in deep RL, but common in recurrent network training (Bengio et al., 2013).

To isolate the contributions of the dueling architecture, we re-train DDQN with a single stream network using exactly the same procedure as described above. Specifically, we apply gradient clipping, and use 1024 hidden units for the first fully-connected layer of the network so that both architectures (dueling and single) have roughly the same number of parameters. We refer to this re-trained model as Single Clip, while the original trained model of van Hasselt et al. (2015) is referred to as Single.

As in (van Hasselt et al., 2015), we start the game with up to 30 no-op actions to provide random starting positions for the agent. To evaluate our approach, we measure improvement in percentage (positive or negative) in score over the better of human and baseline agent scores:

    (Score_Agent − Score_Baseline) / ( max{Score_Human, Score_Baseline} − Score_Random ).        (10)

We took the maximum over human and baseline agent scores as it prevents insignificant changes from appearing as large improvements when neither the agent in question nor the baseline are doing well. For example, an agent that achieves 2% human performance should not be interpreted as two times better when the baseline agent achieves 1% human performance. We also chose not to measure performance in terms of percentage of human performance alone because a tiny difference relative to the baseline on some games can translate into hundreds of percent in human performance difference.
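Equation (10) is straightforward to compute; the sketch below (a hypothetical helper, with made-up scores) also illustrates the 2%-versus-1% example discussed above.

    def relative_improvement(score_agent, score_baseline, score_human, score_random):
        """Sketch of the improvement metric of equation (10): change in score
        relative to the better of the human and baseline scores, with the
        random-play score subtracted in the denominator."""
        return (score_agent - score_baseline) / (
            max(score_human, score_baseline) - score_random)

    # Hypothetical numbers: a tiny absolute gain over a weak baseline stays small.
    print(relative_improvement(2.0, 1.0, 100.0, 0.0))    # 0.01, i.e. a 1% improvement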
The results for the wide suite of 57 games are summarized in Table 1. Detailed results are presented in the Appendix.

Table 1. Mean and median scores across all 57 Atari games, measured in percentages of human performance.

                        30 no-ops              Human starts
                        Mean      Median       Mean      Median
    Prior. Duel Clip    591.9%    172.1%       567.0%    115.3%
    Prior. Single       434.6%    123.7%       386.7%    112.9%
    Duel Clip           373.1%    151.5%       343.8%    117.1%
    Single Clip         341.2%    132.6%       302.8%    114.1%
    Single              307.3%    117.8%       332.9%    110.9%
    Nature DQN          227.9%    79.1%        219.6%    68.5%

Using this 30 no-ops performance measure, it is clear that the dueling network (Duel Clip) does substantially better than the Single Clip network of similar capacity. It also does considerably better than the baseline (Single) of van Hasselt et al. (2015). For comparison we also show results for the deep Q-network of Mnih et al. (2015), referred to as Nature DQN.

Figure 4 shows the improvement of the dueling network over the baseline Single network of van Hasselt et al. (2015). Again, we see that the improvements are often very dramatic.

Figure 4. Improvements of dueling architecture over the baseline Single network of van Hasselt et al. (2015), using the metric described in Equation (10). Bars to the right indicate by how much the dueling network outperforms the single-stream network.

As shown in Table 1, Single Clip performs better than Single. We verified that this gain was mostly brought in by gradient clipping. For this reason, we incorporate gradient clipping in all the new approaches.

Duel Clip does better than Single Clip on 75.4% of the games (43 out of 57). It also achieves higher scores compared to the Single baseline on 80.7% (46 out of 57) of the games. Of all the games with 18 actions, Duel Clip is better 86.6% of the time (26 out of 30). This is consistent with the findings of the previous section. Overall, our agent (Duel Clip) achieves human level performance on 42 out of 57 games. Raw scores for all the games, as well as measurements in human performance percentage, are presented in the appendix.

Robustness to human starts. One shortcoming of the 30 no-ops metric is that an agent does not necessarily have to generalize well to play the Atari games. Due to the deterministic nature of the Atari environment, from a unique starting point, an agent could learn to achieve good performance by simply remembering sequences of actions.

To obtain a more robust measure, we adopt the methodology of Nair et al. (2015). Specifically, for each game, we use 100 starting points sampled from a human expert's trajectory. From each of these points, an evaluation episode is launched for up to 108,000 frames. The agents are evaluated only on rewards accrued after the starting point. We refer to this metric as human starts.

As shown in Table 1, under the Human Starts metric, Duel Clip once again outperforms the single stream variants. In particular, our agent does better than the single baseline on 70.2% (40 out of 57) of the games, and on games of 18 actions Duel Clip is 83.3% better (25 out of 30).

Combining with Prioritized Experience Replay. The dueling architecture can be easily combined with other algorithmic improvements. In particular, prioritization of the experience replay has been shown to significantly improve performance of Atari games (Schaul et al., 2016). Furthermore, as prioritization and the dueling architecture address very different aspects of the learning process, their combination is promising. So in our final experiment, we investigate the integration of the dueling architecture with prioritized experience replay. We use the prioritized variant of DDQN (Prior. Single) as the new baseline algorithm, which replaces the uniform sampling of experience tuples with rank-based prioritized sampling. We keep all the parameters of the prioritized replay as described in (Schaul et al., 2016), namely a priority exponent of 0.7, and an annealing schedule on the importance sampling exponent from 0.5 to 1. We combine this baseline with our dueling architecture (as above), and again use gradient clipping (Prior. Duel Clip).

Note that, although orthogonal in their objectives, these extensions (prioritization, dueling and gradient clipping) interact in subtle ways. For example, prioritization interacts with gradient clipping, as sampling transitions with high absolute TD-errors more often leads to gradients with higher norms. To avoid adverse interactions, we roughly re-tuned the learning rate and the gradient clipping norm on a subset of 9 games. As a result of rough tuning, we settled on 6.25 × 10^-5 for the learning rate and 10 for the gradient clipping norm (the same as in the previous section).

When evaluated on all 57 Atari games, our prioritized dueling agent performs significantly better than both the prioritized baseline agent and the dueling agent alone. The full mean and median performance against the human performance percentage is shown in Table 1. When initializing the games using up to 30 no-op actions, we observe mean and median scores of 591% and 172% respectively. The direct comparison between the prioritized baseline and prioritized dueling versions, using the metric described in Equation (10), is presented in Figure 5.

Figure 5. Improvements of dueling architecture over Prioritized DDQN baseline, using the same metric as Figure 4. Again, the dueling architecture leads to significant improvements over the single-stream baseline on the majority of games.

The combination of prioritized replay and the dueling network results in vast improvements over the previous state-of-the-art in the popular ALE benchmark.

Saliency maps. To better understand the roles of the value and the advantage streams, we compute saliency maps (Simonyan et al., 2013). More specifically, to visualize the salient part of the image as seen by the value stream, we compute the absolute value of the Jacobian of V with respect to the input frames: |∇_s V(s; θ)|. Similarly, to visualize the salient part of the image as seen by the advantage stream, we compute |∇_s A(s, arg max_{a'} A(s, a'); θ)|. Both quantities are of the same dimensionality as the input frames and therefore can be visualized easily alongside the input frames.

Here we place the gray scale input frames in the green and blue channel and the saliency maps in the red channel. All three channels together form an RGB image. Figure 2 depicts the value and advantage saliency maps on the Enduro game for two different time steps. As observed in the introduction, the value stream pays attention to the horizon, where the appearance of a car could affect future performance. The value stream also pays attention to the score. The advantage stream, on the other hand, cares more about cars that are on an immediate collision course.
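The saliency computations described above are a direct use of automatic differentiation. The sketch below is an illustration, not the authors' code, and assumes the network exposes conv, value and advantage sub-modules as in the earlier network sketch.

    import torch

    def value_saliency(dueling_net, frames):
        """|grad_s V(s; theta)|: absolute Jacobian of the value stream with
        respect to the input frames (summed over the batch to get a scalar)."""
        frames = frames.detach().clone().requires_grad_(True)
        h = dueling_net.conv(frames).flatten(start_dim=1)
        dueling_net.value(h).sum().backward()
        return frames.grad.abs()              # same shape as the input frames

    def advantage_saliency(dueling_net, frames):
        """|grad_s A(s, argmax_a' A(s, a'); theta)| for the advantage stream."""
        frames = frames.detach().clone().requires_grad_(True)
        h = dueling_net.conv(frames).flatten(start_dim=1)
        a = dueling_net.advantage(h)
        a.max(dim=1).values.sum().backward()  # advantage of the maximizing action
        return frames.grad.abs()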
5. Discussion

The advantage of the dueling architecture lies partly in its ability to learn the state-value function efficiently. With every update of the Q values in the dueling architecture, the value stream V is updated; this contrasts with the updates in a single-stream architecture, where only the value for one of the actions is updated while the values for all other actions remain untouched. This more frequent updating of the value stream in our approach allocates more resources to V, and thus allows for better approximation of the state values, which in turn need to be accurate for temporal-difference-based methods like Q-learning to work (Sutton & Barto, 1998). This phenomenon is reflected in the experiments, where the advantage of the dueling architecture over single-stream Q networks grows when the number of actions is large.

Furthermore, the differences between Q-values for a given state are often very small relative to the magnitude of Q. For example, after training with DDQN on the game of Seaquest, the average action gap (the gap between the Q values of the best and the second best action in a given state) across visited states is roughly 0.04, whereas the average state value across those states is about 15. This difference in scales means that small amounts of noise in the updates can lead to reorderings of the actions, and thus make the nearly greedy policy switch abruptly. The dueling architecture, with its separate advantage stream, is robust to such effects.
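As a rough numerical illustration of this scale argument (with made-up numbers, not measurements from the paper): if the action gap is 0.04 and per-action noise has a similar magnitude, the greedy action flips in a sizable fraction of updates.

    import numpy as np

    rng = np.random.default_rng(0)
    q = np.array([15.02, 14.98, 14.60])        # made-up Q values: action gap 0.04, state value ~15
    noise = rng.normal(scale=0.05, size=(1000, 3))
    greedy = (q + noise).argmax(axis=1)
    # Fraction of noisy estimates in which the greedy action is no longer action 0:
    print((greedy != 0).mean())                # roughly 0.3 with this noise scale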
6. Conclusions

We introduced a new neural network architecture that decouples value and advantage in deep Q-networks, while sharing a common feature learning module. The new dueling architecture, in combination with some algorithmic improvements, leads to dramatic improvements over existing approaches for deep RL in the challenging Atari domain. The results presented in this paper are the new state-of-the-art in this popular domain.

References

Ba, J., Mnih, V., and Kavukcuoglu, K. Multiple object recognition with visual attention. In ICLR, 2015.

Baird, L. C. Advantage updating. Technical Report WL-TR-93-1146, Wright-Patterson Air Force Base, 1993.

Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013.

Bellemare, M. G., Ostrovski, G., Guez, A., Thomas, P. S., and Munos, R. Increasing the action gap: New operators for reinforcement learning. In AAAI, 2016. To appear.

Bengio, Y., Boulanger-Lewandowski, N., and Pascanu, R. Advances in optimizing recurrent networks. In ICASSP, pp. 8624-8628, 2013.

Fukushima, K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36:193-202, 1980.

Guo, X., Singh, S., Lee, H., Lewis, R. L., and Wang, X. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In NIPS, pp. 3338-3346, 2014.

Harmon, M. E. and Baird, L. C. Multi-player residual advantage learning with general function approximation. Technical Report WL-TR-1065, Wright-Patterson Air Force Base, 1996.

Harmon, M. E., Baird, L. C., and Klopf, A. H. Advantage updating applied to a differential game. In G. Tesauro, D. S. Touretzky, and Leen, T. K. (eds.), NIPS, 1995.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436-444, 2015.

Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702, 2015.

Lin, L. J. Reinforcement learning for robots using neural networks. PhD thesis, School of Computer Science, Carnegie Mellon University, 1993.

Maddison, C. J., Huang, A., Sutskever, I., and Silver, D. Move evaluation in Go using deep convolutional neural networks. In ICLR, 2015.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De Maria, A., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., Legg, S., Mnih, V., Kavukcuoglu, K., and Silver, D. Massively parallel methods for deep reinforcement learning. In Deep Learning Workshop, ICML, 2015.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. In ICLR, 2016.

Schulman, J., Moritz, P., Levine, S., Jordan, M. I., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 01 2016.

Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Stadie, B. C., Levine, S., and Abbeel, P. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.

Sutton, R. S. and Barto, A. G. Introduction to reinforcement learning. MIT Press, 1998.

Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In NIPS, pp. 1057-1063, 2000.

van Hasselt, H. Double Q-learning. NIPS, 23:2613-2621, 2010.

van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. arXiv preprint arXiv:1509.06461, 2015.

van Seijen, H., van Hasselt, H., Whiteson, S., and Wiering, M. A theoretical and empirical analysis of Expected Sarsa. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp. 177-184, 2009.

Watter, M., Springenberg, J. T., Boedecker, J., and Riedmiller, M. A. Embed to control: A locally linear latent dynamics model for control from raw images. In NIPS, 2015.