Detailed bibliographic record

Source: 三民書局 (San Min Book Co.)

Reinforcement learning : an introduction

  • Author: Sutton, Richard S., author.
  • Other authors:
    • Barto, Andrew G., author.
  • Other titles:
    • Adaptive computation and machine learning.
  • Publication: Cambridge, Massachusetts : The MIT Press
  • Edition: Second edition.
  • Series: Adaptive computation and machine learning
  • Subject: Reinforcement learning.
  • ISBN: 9780262039246 (hbk.): US$69.70; 0262039249 (hbk.)
  • Material type: Book
  • Contents note: Includes bibliographical references and index. Machine-generated contents note:
    • 1. Introduction -- 1.1. Reinforcement Learning -- 1.2. Examples -- 1.3. Elements of Reinforcement Learning -- 1.4. Limitations and Scope -- 1.5. An Extended Example: Tic-Tac-Toe -- 1.6. Summary -- 1.7. Early History of Reinforcement Learning
    • 2. Multi-armed Bandits -- 2.1. A k-armed Bandit Problem -- 2.2. Action-value Methods -- 2.3. The 10-armed Testbed -- 2.4. Incremental Implementation -- 2.5. Tracking a Nonstationary Problem -- 2.6. Optimistic Initial Values -- 2.7. Upper-Confidence-Bound Action Selection -- 2.8. Gradient Bandit Algorithms -- 2.9. Associative Search (Contextual Bandits) -- 2.10. Summary
    • 3. Finite Markov Decision Processes -- 3.1. The Agent-Environment Interface -- 3.2. Goals and Rewards -- 3.3. Returns and Episodes -- 3.4. Unified Notation for Episodic and Continuing Tasks -- 3.5. Policies and Value Functions -- 3.6. Optimal Policies and Optimal Value Functions -- 3.7. Optimality and Approximation -- 3.8. Summary
    • 4. Dynamic Programming -- 4.1. Policy Evaluation (Prediction) -- 4.2. Policy Improvement -- 4.3. Policy Iteration -- 4.4. Value Iteration -- 4.5. Asynchronous Dynamic Programming -- 4.6. Generalized Policy Iteration -- 4.7. Efficiency of Dynamic Programming -- 4.8. Summary
    • 5. Monte Carlo Methods -- 5.1. Monte Carlo Prediction -- 5.2. Monte Carlo Estimation of Action Values -- 5.3. Monte Carlo Control -- 5.4. Monte Carlo Control without Exploring Starts -- 5.5. Off-policy Prediction via Importance Sampling -- 5.6. Incremental Implementation -- 5.7. Off-policy Monte Carlo Control -- 5.8. *Discounting-aware Importance Sampling -- 5.9. *Per-decision Importance Sampling -- 5.10. Summary
    • 6. Temporal-Difference Learning -- 6.1. TD Prediction -- 6.2. Advantages of TD Prediction Methods -- 6.3. Optimality of TD(0) -- 6.4. Sarsa: On-policy TD Control -- 6.5. Q-learning: Off-policy TD Control -- 6.6. Expected Sarsa -- 6.7. Maximization Bias and Double Learning -- 6.8. Games, Afterstates, and Other Special Cases -- 6.9. Summary
    • 7. n-step Bootstrapping -- 7.1. n-step TD Prediction -- 7.2. n-step Sarsa -- 7.3. n-step Off-policy Learning -- 7.4. *Per-decision Methods with Control Variates -- 7.5. Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm -- 7.6. *A Unifying Algorithm: n-step Q(σ) -- 7.7. Summary
    • 8. Planning and Learning with Tabular Methods -- 8.1. Models and Planning -- 8.2. Dyna: Integrated Planning, Acting, and Learning -- 8.3. When the Model Is Wrong -- 8.4. Prioritized Sweeping -- 8.5. Expected vs. Sample Updates -- 8.6. Trajectory Sampling -- 8.7. Real-time Dynamic Programming -- 8.8. Planning at Decision Time -- 8.9. Heuristic Search -- 8.10. Rollout Algorithms -- 8.11. Monte Carlo Tree Search -- 8.12. Summary of the Chapter -- 8.13. Summary of Part I: Dimensions
    • 9. On-policy Prediction with Approximation -- 9.1. Value-function Approximation -- 9.2. The Prediction Objective (VE) -- 9.3. Stochastic-gradient and Semi-gradient Methods -- 9.4. Linear Methods -- 9.5. Feature Construction for Linear Methods -- 9.5.1. Polynomials -- 9.5.2. Fourier Basis -- 9.5.3. Coarse Coding -- 9.5.4. Tile Coding -- 9.5.5. Radial Basis Functions -- 9.6. Selecting Step-Size Parameters Manually -- 9.7. Nonlinear Function Approximation: Artificial Neural Networks -- 9.8. Least-Squares TD -- 9.9. Memory-based Function Approximation -- 9.10. Kernel-based Function Approximation -- 9.11. Looking Deeper at On-policy Learning: Interest and Emphasis -- 9.12. Summary
    • 10. On-policy Control with Approximation -- 10.1. Episodic Semi-gradient Control -- 10.2. Semi-gradient n-step Sarsa -- 10.3. Average Reward: A New Problem Setting for Continuing Tasks -- 10.4. Deprecating the Discounted Setting -- 10.5. Differential Semi-gradient n-step Sarsa -- 10.6. Summary
    • 11. *Off-policy Methods with Approximation -- 11.1. Semi-gradient Methods -- 11.2. Examples of Off-policy Divergence -- 11.3. The Deadly Triad -- 11.4. Linear Value-function Geometry -- 11.5. Gradient Descent in the Bellman Error -- 11.6. The Bellman Error is Not Learnable -- 11.7. Gradient-TD Methods -- 11.8. Emphatic-TD Methods -- 11.9. Reducing Variance -- 11.10. Summary
    • 12. Eligibility Traces -- 12.1. The λ-return -- 12.2. TD(λ) -- 12.3. n-step Truncated λ-return Methods -- 12.4. Redoing Updates: Online λ-return Algorithm -- 12.5. True Online TD(λ) -- 12.6. *Dutch Traces in Monte Carlo Learning -- 12.7. Sarsa(λ) -- 12.8. Variable λ and γ -- 12.9. Off-policy Traces with Control Variates -- 12.10. Watkins's Q(λ) to Tree-Backup(λ) -- 12.11. Stable Off-policy Methods with Traces -- 12.12. Implementation Issues -- 12.13. Conclusions
    • 13. Policy Gradient Methods -- 13.1. Policy Approximation and its Advantages -- 13.2. The Policy Gradient Theorem -- 13.3. REINFORCE: Monte Carlo Policy Gradient -- 13.4. REINFORCE with Baseline -- 13.5. Actor-Critic Methods -- 13.6. Policy Gradient for Continuing Problems -- 13.7. Policy Parameterization for Continuous Actions -- 13.8. Summary
    • 14. Psychology -- 14.1. Prediction and Control -- 14.2. Classical Conditioning -- 14.2.1. Blocking and Higher-order Conditioning -- 14.2.2. The Rescorla-Wagner Model -- 14.2.3. The TD Model -- 14.2.4. TD Model Simulations -- 14.3. Instrumental Conditioning -- 14.4. Delayed Reinforcement -- 14.5. Cognitive Maps -- 14.6. Habitual and Goal-directed Behavior -- 14.7. Summary
    • 15. Neuroscience -- 15.1. Neuroscience Basics -- 15.2. Reward Signals, Reinforcement Signals, Values, and Prediction Errors -- 15.3. The Reward Prediction Error Hypothesis -- 15.4. Dopamine -- 15.5. Experimental Support for the Reward Prediction Error Hypothesis -- 15.6. TD Error/Dopamine Correspondence -- 15.7. Neural Actor-Critic -- 15.8. Actor and Critic Learning Rules -- 15.9. Hedonistic Neurons -- 15.10. Collective Reinforcement Learning -- 15.11. Model-based Methods in the Brain -- 15.12. Addiction -- 15.13. Summary
    • 16. Applications and Case Studies -- 16.1. TD-Gammon -- 16.2. Samuel's Checkers Player -- 16.3. Watson's Daily-Double Wagering -- 16.4. Optimizing Memory Control -- 16.5. Human-level Video Game Play -- 16.6. Mastering the Game of Go -- 16.6.1. AlphaGo -- 16.6.2. AlphaGo Zero -- 16.7. Personalized Web Services -- 16.8. Thermal Soaring
    • 17. Frontiers -- 17.1. General Value Functions and Auxiliary Tasks -- 17.2. Temporal Abstraction via Options -- 17.3. Observations and State -- 17.4. Designing Reward Signals -- 17.5. Remaining Issues -- 17.6. Experimental Support for the Reward Prediction Error Hypothesis.
  • Summary note: "Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives while interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the field's key ideas and algorithms."--
  • System no.: 005434714
  • Holdings information

    The significantly expanded and updated new edition of a widely used text on reinforcement learning, one of the most active research areas in artificial intelligence.
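
    For readers skimming the contents note above, here is a minimal sketch of one of the simplest methods it lists, the ε-greedy action-value method for the k-armed bandit (sections 2.2 to 2.4), assuming a Gaussian testbed in the spirit of the 10-armed testbed of section 2.3. The names bandit_reward, EPSILON, and the chosen constants are illustrative and are not taken from the book's own code.

        import random

        K = 10          # number of arms (cf. "The 10-armed Testbed", section 2.3)
        EPSILON = 0.1   # probability of exploring a random arm
        STEPS = 1000    # number of pulls

        # Hidden true value of each arm; the agent only observes sampled rewards.
        true_values = [random.gauss(0.0, 1.0) for _ in range(K)]

        def bandit_reward(action):
            # Noisy reward for pulling the chosen arm (illustrative reward model).
            return random.gauss(true_values[action], 1.0)

        Q = [0.0] * K   # incremental action-value estimates (section 2.4)
        N = [0] * K     # how many times each arm has been pulled

        for _ in range(STEPS):
            if random.random() < EPSILON:
                action = random.randrange(K)                # explore
            else:
                action = max(range(K), key=lambda a: Q[a])  # exploit the greedy arm
            reward = bandit_reward(action)
            N[action] += 1
            Q[action] += (reward - Q[action]) / N[action]   # incremental sample average

        print("estimated arm values:", [round(q, 2) for q in Q])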
