File name:
Deep.Reinforcement.Learning.Han.-.Maxim.Lapan.pdf
Development tool:
File size: 12 MB
Downloads: 0
Upload date: 2019-08-18
Description: Deep Reinforcement Learning Hands-On
by Maxim Lapan
Table of Contents
Deep Reinforcement Learning Hands-On
Why subscribe?
PacktPub.com
Contributors
About the author
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
1. What is Reinforcement Learning?
Learning - supervised, unsupervised, and reinforcement
RL formalisms and relations
Reward
The agent
The environment
Actions
Observations
Markov decision processes
Markov process
Markov reward process
Markov decision process
Summary
2. OpenAI Gym
The anatomy of the agent
Hardware and software requirements
OpenAI Gym API
Action space
Observation space
The environment
Creation of the environment
The CartPole session
The random CartPole agent
The extra Gym functionality - wrappers and monitors
Wrappers
Monitor
Summary
3. Deep Learning with PyTorch
Tensors
Creation of tensors
Scalar tensors
Tensor operations
GPU tensors
Gradients
Tensors and gradients
NN building blocks
Custom layers
Final glue - loss functions and optimizers
Loss functions
Optimizers
Monitoring with TensorBoard
TensorBoard 101
Plotting stuff
Example - GAN on Atari images
Summary
4. The Cross-Entropy Method
Taxonomy of RL methods
Practical cross-entropy
Cross-entropy on CartPole
Cross-entropy on FrozenLake
Theoretical background of the cross-entropy method
Summary
5. Tabular Learning and the Bellman Equation
Value, state, and optimality
The Bellman equation of optimality
Value of action
The value iteration method
Value iteration in practice
Q-learning for FrozenLake
Summary
6. Deep Q-Networks
Real-life value iteration
Tabular Q-learning
Deep Q-learning
Interaction with the environment
SGD optimization
Correlation between steps
The Markov property
The final form of DQN training
DQN on Pong
Wrappers
DQN model
Training
Running and performance
Your model in action
Summary
7. DQN Extensions
The PyTorch Agent Net library
Agent
Agent's experience
Experience buffer
Gym env wrappers
Basic DQN
N-step DQN
Implementation
Double DQN
Implementation
Results
Noisy networks
Implementation
Results
Prioritized replay buffer
Implementation
Results
Dueling DQN
Implementation
Results
Categorical DQN
Implementation
Results
Combining everything
Implementation
Results
Summary
References
8. Stocks Trading Using RL
Trading
Data
Problem statements and key decisions
The trading environment
Model
Training code
Results
The feed-forward model
The convolution model
Things to try
Summary
9. Policy Gradients - An Alternative
Values and policy
Why policy?
Policy representation
Policy gradients
The REINFORCE method
The CartPole example
Results
Policy-based versus value-based methods
REINFORCE issues
Full episodes are required
High gradients variance
Exploration
Correlation between samples
PG on CartPole
Results
PG on Pong
Results
Summary
10. The Actor-Critic Method
Variance reduction
CartPole variance
Actor-critic
A2C on Pong
A2C on Pong results
Tuning hyperparameters
Learning rate
Entropy beta
Count of environments
Batch size
Summary
11. Asynchronous Advantage Actor-Critic
Correlation and sample efficiency
Adding an extra A to A2C
Multiprocessing in Python
A3C - data parallelism
Results
A3C - gradients parallelism
Results
Summary
12. Chatbots Training with RL
Chatbots overview
Deep NLP basics
Recurrent Neural Networks
Embeddings
Encoder-Decoder
Training of seq2seq
Log-likelihood training
Bilingual evaluation understudy (BLEU) score
RL in seq2seq
Self-critical sequence training
The chatbot example
The example structure
Modules: cornell.py and data.py
BLEU score and utils.py
Model
Training: cross-entropy
Running the training
Checking the data
Testing the trained model
Training: SCST
Running the SCST training
Results
Telegram bot
Summary
13. Web Navigation
Web navigation
Browser automation and RL
Mini World of Bits benchmark
OpenAI Universe
Installation
Actions and observations
Environment creation
MiniWoB stability
Simple clicking approach
Grid actions
Example overview
Model
Training code
Starting containers
Training process
Checking the learned policy
Issues with simple clicking
Human demonstrations
Recording the demonstrations
Recording format
Training using demonstrations
Results
TicTacToe problem
Adding text description
Results
Things to try
Summary
14. Continuous Action Space
Why a continuous space?
Action space
Environments
The Actor-Critic (A2C) method
Implementation
Results
Using models and recording videos
Deterministic policy gradients
Exploration
Implementation
Results
Recording videos
Distributional policy gradients
Architecture
Implementation
Results
Things to try
Summary
15. Trust Regions - TRPO, PPO, and ACKTR
Introduction
Roboschool
A2C baseline
Results
Videos recording
Proximal Policy Optimization
Implementation
Results
Trust Region Policy Optimization
Implementation
Results
A2C using ACKTR
Implementation
Results
Summary
16. Black-Box Optimization in RL
Black-box methods
Evolution strategies
ES on CartPole
Results
ES on HalfCheetah
Results
Genetic algorithms
GA on CartPole
Results
GA tweaks
Deep GA
Novelty search
GA on Cheetah
Results
Summary
References
17. Beyond Model-Free - Imagination
Model-based versus model-free
Model imperfections
Imagination-augmented agent
The environment model
The rollout policy