Adaptive Importance Sampling Actor-Critic Algorithms
First published: 2010-03-08
Abstract: In off-policy Actor-Critic (AC) reinforcement learning, the Critic can use importance sampling to reduce the bias of value function estimates, but importance sampling does not take the variance of the estimates into account, so algorithm performance tends to be unstable. To reduce the estimation variance, an AC learning algorithm based on adaptive importance sampling (AIS) is proposed. The algorithm applies AIS to an AC method based on least squares temporal difference learning with eligibility traces, and it reuses the data samples collected during policy updates. On top of the importance weight, it introduces a balance factor that controls the trade-off between the bias and the variance of the policy gradient estimate; the value of the balance factor is selected automatically from the samples and policies by importance-weighted cross-validation. Simulation results on a queuing problem show that the proposed AC algorithm is stable in performance and learns quickly.
Keywords: policy gradient; adaptive importance sampling; importance-weighted cross-validation; least squares temporal difference; AC learning
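The role of the balance factor can be illustrated with a minimal sketch. The abstract does not give the exact form in which the factor enters, so the code below assumes one common choice: the factor `nu` in [0, 1] is applied as an exponent on the importance weight. The function names `flattened_importance_weights` and `weighted_return_estimate` and the sample numbers are hypothetical and are not taken from the paper.

```python
def flattened_importance_weights(target_probs, behavior_probs, nu):
    """Return (pi_target / pi_behavior) ** nu for each sample.

    Assumption: the balance factor nu acts as an exponent on the
    importance weight. nu = 0 ignores the weights (low variance, biased);
    nu = 1 gives ordinary importance sampling (unbiased, high variance).
    """
    if not 0.0 <= nu <= 1.0:
        raise ValueError("nu must lie in [0, 1]")
    if len(target_probs) != len(behavior_probs):
        raise ValueError("probability lists must have the same length")
    return [(p / q) ** nu for p, q in zip(target_probs, behavior_probs)]


def weighted_return_estimate(returns, weights):
    """Average of weight * return over the collected samples."""
    if len(returns) != len(weights):
        raise ValueError("returns and weights must have the same length")
    return sum(w * r for w, r in zip(weights, returns)) / len(returns)


if __name__ == "__main__":
    # Hypothetical action probabilities under the target and behavior policies.
    target = [0.9, 0.1, 0.5]
    behavior = [0.3, 0.5, 0.5]
    returns = [1.0, 2.0, 3.0]

    for nu in (0.0, 0.5, 1.0):
        weights = flattened_importance_weights(target, behavior, nu)
        print(nu, weights, weighted_return_estimate(returns, weights))
```

In the algorithm described in the abstract, the value of the balance factor is not fixed by hand as in this loop; it is chosen from the samples and policies by importance-weighted cross-validation, which this sketch does not implement.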