
Editor's Recommendation

From the ground up to a thorough understanding: learn not only what works, but why it works.

About the Book

Starting from the most basic concepts of reinforcement learning, this book first introduces the fundamental analysis tools, namely the Bellman equation and the Bellman optimality equation, then extends to model-based and model-free reinforcement learning algorithms, and finally to reinforcement learning methods based on function approximation. Throughout, it emphasizes introducing concepts, analyzing problems, and analyzing algorithms from a mathematical perspective, rather than the programming implementation of the algorithms.
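For context, the Bellman equation for state values mentioned above can be written in a standard generic form (the notation here is illustrative and may differ slightly from the book's own symbols):

v_\pi(s) = \sum_{a} \pi(a \mid s) \Big[ \sum_{r} p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \Big], \quad \forall s \in \mathcal{S}

It expresses the value of each state as the expected immediate reward plus the discounted value of the successor state when actions are selected according to the policy \pi.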

No prior background in reinforcement learning is required; readers only need some knowledge of probability theory and linear algebra. For readers who already have a foundation in reinforcement learning, the book can deepen their understanding of certain problems and offer new perspectives.

Table of Contents

  • Copyright Information
  • About the Author
  • About the Book
  • Preface
  • Overview of this Book
  • Chapter 1 Basic Concepts
  • 1.1 A grid world example
  • 1.2 State and action
  • 1.3 State transition
  • 1.4 Policy
  • 1.5 Reward
  • 1.6 Trajectories, returns, and episodes
  • 1.7 Markov decision processes
  • 1.8 Summary
  • 1.9 Q&A
  • Chapter 2 State Values and the Bellman Equation
  • 2.1 Motivating example 1: Why are returns important?
  • 2.2 Motivating example 2: How to calculate returns?
  • 2.3 State values
  • 2.4 The Bellman equation
  • 2.5 Examples for illustrating the Bellman equation
  • 2.6 Matrix-vector form of the Bellman equation
  • 2.7 Solving state values from the Bellman equation
  • 2.8 From state value to action value
  • 2.9 Summary
  • 2.10 Q&A
  • Chapter 3 Optimal State Values and the Bellman Optimality Equation
  • 3.1 Motivating example: How to improve policies?
  • 3.2 Optimal state values and optimal policies
  • 3.3 The Bellman optimality equation
  • 3.4 Solving an optimal policy from the BOE
  • 3.5 Factors that influence optimal policies
  • 3.6 Summary
  • 3.7 Q&A
  • Chapter 4 Value Iteration and Policy Iteration
  • 4.1 Value iteration
  • 4.2 Policy iteration
  • 4.2.1 Algorithm analysis
  • 4.3 Truncated policy iteration
  • 4.4 Summary
  • 4.5 Q&A
  • Chapter 5 Monte Carlo Methods
  • 5.1 Motivating example: Mean estimation
  • 5.2 MC Basic: The simplest MC-based algorithm
  • 5.3 MC Exploring Starts
  • 5.4 MC ϵ-Greedy: Learning without exploring starts
  • 5.5 Exploration and exploitation of ϵ-greedy policies
  • 5.6 Summary
  • 5.7 Q&A
  • Chapter 6 Stochastic Approximation
  • 6.1 Motivating example: Mean estimation
  • 6.2 Robbins-Monro algorithm
  • 6.3 Dvoretzky’s convergence theorem
  • 6.4 Stochastic gradient descent
  • 6.5 Summary
  • 6.6 Q&A
  • Chapter 7 Temporal-Difference Methods
  • 7.1 TD learning of state values
  • 7.1.3 Convergence analysis
  • 7.2 TD learning of action values: Sarsa
  • 7.3 TD learning of action values: n-step Sarsa
  • 7.4 TD learning of optimal action values: Q-learning
  • 7.5 A unified viewpoint
  • 7.6 Summary
  • 7.7 Q&A
  • Chapter 8 Value Function Approximation
  • 8.1 Value representation: From table to function
  • 8.2 TD learning of state values with function approximation
  • 8.3 TD learning of action values with function approximation
  • 8.4 Deep Q-learning
  • 8.5 Summary
  • 8.6 Q&A
  • Chapter 9 Policy Gradient Methods
  • 9.1 Policy representation: From table to function
  • 9.2 Metrics for defining optimal policies
  • 9.3 Gradients of the metrics
  • 9.4 Monte Carlo policy gradient (REINFORCE)
  • 9.5 Summary
  • 9.6 Q&A
  • Chapter 10 Actor-Critic Methods
  • 10.1 The simplest actor-critic algorithm (QAC)
  • 10.2 Advantage actor-critic (A2C)
  • 10.3 Off-policy actor-critic
  • 10.4 Deterministic actor-critic
  • 10.5 Summary
  • 10.6 Q&A
  • Appendix A Preliminaries for Probability Theory
  • Appendix B Measure-Theoretic Probability Theory
  • Appendix C Convergence of Sequences
  • C.1 Convergence of deterministic sequences
  • C.2 Convergence of stochastic sequences
  • Appendix D Preliminaries for Gradient Descent
  • Bibliography
  • Symbols
  • Index

Publisher

Tsinghua University Press

Founded in June 1980, Tsinghua University Press is a comprehensive publishing house supervised by the Ministry of Education and run by Tsinghua University. Rooted in the renowned Tsinghua campus and carrying forward the Tsinghua motto of "自强不息,厚德载物" (Self-Discipline and Social Commitment), the press has grown rapidly in just over two decades. It has consistently upheld a publishing direction of promoting science, technology, and culture and serving the national strategy of invigorating the country through science and education, taking university textbooks and science and technology books as its main publishing task. It has also established several publishing funds to promote academic exchange and the development of publishing, gradually forming a distinctive focus on high-quality textbooks and academic monographs and building a strong brand in educational publishing.