Dictionary lookup: policy iteration
- Web example sentences related to policy iteration [Note: this content is sourced from the web and is for reference only]
-
To better describe the topological relationships of an environment, geodesic distance is substituted for the Euclidean distance used in an ordinary Gaussian function, and a policy iteration reinforcement learning method based on geodesic Gaussian basis functions is proposed.
为更好地描述环境的拓扑关系,采用测地线距离来替换普通高斯函数中的欧氏距离,提出一种基于测地高斯基函数的策略迭代强化学习方法。
-
However, the enormous number of states makes the inverse of the transition probability matrix (which is of size 9×10^8) computationally prohibitive, and thus complicates applying the policy iteration method for Markov decision processes to our problem.
然而,此一庞大的状态数目使得转置机率的反矩阵(其大小为 9×10^8)无法计算出来,也因此复杂化了应用马可夫决策过程中的「策略叠代法」(policy-iteration method)来解决我们的问题。
-
Based on the MDP, an algorithm including numerical iteration and policy iteration is then proposed.
文末仿真结果验证了该方法的正确性和有效性。
-
Reinforcement learning theory and methods are applied to the JLQ model, and a Q-function-based policy iteration algorithm is designed to optimize system performance.
将强化学习的理论和方法应用于JLQ模型,设计基于Q函数的策略迭代算法,以优化系统性能。
-
The appropriate selection of basis functions directly influences the learning performance of a policy iteration method during value function approximation.
在策略迭代结强化学习方法的值函数逼近过程中,基函数的合理选择直接影响方法的性能。
-
This idea arises from the curse of dimensionality in computation: for example, in Markov decision processes with a very large action space, it is not practical to compute policy improvements using general policy iteration or value iteration.
当系统的计算出现维数灾难时,比如在Markov决策过程的求解问题中,如果系统的动作空间非常之大,那么利用一般的策略递归算法或值递归算法,来进行策略的改进计算是不实际的。
-
Semi-Markov decision processes were studied using M-step lookahead policy iteration based on performance potentials.
运用基于性能势的M步向前异步策略迭代算法研究了半Markov决策过程优化问题。
-
We propose a risk-sensitive evolutionary policy iteration algorithm for solving reinforcement learning problems and discuss the optimality of the policies it produces.
我们提出了风险敏感度渐进策略递归激励学习算法并对策略的最优性进行了讨论。
-
Given a suitable performance function and an initial control that keeps the closed-loop system bounded, a sequence of controls that progressively improves performance can be obtained by policy iteration, and the state sequence under the corresponding feedback control forms an ergodic Markov chain.
在合适的性能指标并能找到一个使系统性能有界的控制的前提下,通过策略迭代可以求出逐步改善系统性能的控制序列,同时得到状态序列在相应反馈控制作用下构成遍历的马尔可夫链。
-
A neural network is then used to represent the estimates of the potentials, and the parameterized TD(0) learning formulas and algorithm are derived for approximate policy evaluation. Using the approximate potential values and approximate policy iteration, a unified neuro-dynamic programming optimization approach is then proposed for both performance criteria.
根据定义式,建立性能势在平均和折扣性能准则下统一的即时差分公式,并利用一个神经元网络来表示性能势的估计值,导出参数TD(0)学习公式和算法,进行逼近策略评估;然后,根据性能势的逼近值,通过逼近策略迭代来实现两种准则下统一的神经元动态规划(neuro-dynamic programming,NDP)优化方法。
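The example sentences above refer to policy iteration without showing what the procedure looks like. The following is a minimal sketch only, assuming a hypothetical 3-state, 2-action MDP whose transitions, rewards, and discount factor (0.9) are made up for illustration and do not come from any of the quoted sources. It alternates policy evaluation and greedy policy improvement until the policy stops changing. Evaluation here uses repeated sweeps instead of inverting the transition matrix, the step that one of the sentences describes as prohibitive for very large state spaces.

```python
# Minimal policy iteration sketch on a hypothetical toy MDP.
# All states, actions, rewards, and GAMMA are illustrative assumptions.

GAMMA = 0.9  # discount factor (assumed)

# P[s][a] is a list of (next_state, probability) pairs.
P = {
    0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(2, 1.0)]},
    2: {0: [(2, 1.0)], 1: [(2, 1.0)]},  # state 2 is absorbing
}
# R[s][a] is the expected immediate reward for taking action a in state s.
R = {
    0: {0: 0.0, 1: 0.0},
    1: {0: 0.0, 1: 1.0},
    2: {0: 0.0, 1: 0.0},
}


def q_value(s, a, v):
    """One-step lookahead value of taking action a in state s."""
    return R[s][a] + GAMMA * sum(p * v[s2] for s2, p in P[s][a])


def evaluate_policy(policy, tol=1e-10):
    """Iterative policy evaluation: sweep until values change by < tol."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = q_value(s, policy[s], v)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v


def improve_policy(policy, v):
    """Greedy improvement; keep the current action unless another is strictly better."""
    new_policy = {}
    for s in P:
        best = policy[s]
        for a in P[s]:
            if q_value(s, a, v) > q_value(s, best, v) + 1e-12:
                best = a
        new_policy[s] = best
    return new_policy


def policy_iteration():
    """Alternate evaluation and improvement until the policy is stable."""
    policy = {s: 0 for s in P}  # arbitrary initial policy
    while True:
        v = evaluate_policy(policy)
        new_policy = improve_policy(policy, v)
        if new_policy == policy:
            return policy, v
        policy = new_policy


if __name__ == "__main__":
    final_policy, values = policy_iteration()
    print(final_policy, values)
```

Keeping the current action on ties is a design choice in this sketch; it prevents the loop from switching back and forth between equally good policies.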