Abstract — Partially Observable Markov Decision Processes (POMDPs) have been widely applied in fields including robot navigation, machine maintenance, marketing, and medical diagnosis [1], but their exact solution is inefficient in both space and time. This paper investigates Smooth Partially Observable Value Approximation (SPOVA) [2], which approximates belief values with a differentiable function and then uses gradient descent to update them. This POMDP approximation algorithm is applied to the pole-balancing problem with regulation. Simulation results show that this regulated approach can estimate state transition probabilities and improve its policy simultaneously.
Keywords — POMDP; SPOVA; pole-balancing.
Introduction
Markov Decision Processes (MDPs) have proven to be a useful framework for solving a variety of problems in areas including robotics, economics, and manufacturing. Unfortunately, many real-world problems cannot be modeled as MDPs, especially when their states are only partially observable. Partially Observable Markov Decision Processes (POMDPs) extend the MDP framework to include states that are not fully observable. With this extension we can model more practical problems, although the solution methods that exist for MDPs are no longer directly applicable.
POMDP algorithms are far more computationally intensive than their MDP counterparts. The added complexity stems from uncertainty about the true state, which must be represented as a probability distribution over the states. POMDP algorithms therefore operate on probability distributions (belief states), whereas MDP algorithms work on a finite number of discrete states. This difference turns an optimization problem over a discrete space into one defined over a continuous belief space.
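To make the belief-state idea concrete, the following Python sketch implements the standard Bayesian belief update used throughout the POMDP literature: predict the next-state distribution through the transition model, then reweight by the observation likelihood. The array shapes and the toy numbers are illustrative assumptions, not values from this paper.

```python
import numpy as np

# T[a, s, s2] = P(s2 | s, a); O[a, s2, o] = P(o | s2, a); b[s] = current belief.
def belief_update(b, a, o, T, O):
    """Posterior belief after taking action a and receiving observation o."""
    # Predict: push the current belief through the transition model.
    predicted = T[a].T @ b              # predicted[s2] = sum_s T[a, s, s2] * b[s]
    # Correct: weight each predicted state by the observation likelihood.
    unnormalized = O[a, :, o] * predicted
    return unnormalized / unnormalized.sum()

# Tiny usage example: 2 states, 1 action, 2 observations.
T = np.array([[[0.9, 0.1], [0.2, 0.8]]])   # shape (|A|, |S|, |S|)
O = np.array([[[0.8, 0.2], [0.3, 0.7]]])   # shape (|A|, |S|, |O|)
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, o=1, T=T, O=O))
```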
... middle of paper ...
... total discounted reward over an infinite horizon. The expected reward for policy π starting from belief b_0 is defined as
J^\pi(b_0) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \beta^t\, r(s_t, a_t) \,\Big|\, b_0, \pi\Big] \qquad (3)
where 0 < \beta < 1 is the discount factor.
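As background for equation (3) and the abstract's description, SPOVA replaces the exact value function's hard max over α-vectors with a smooth, differentiable approximation that can be trained by gradient descent. Below is a minimal Python sketch of that idea, assuming the soft-max form V_k(b) = (Σ_i (α_i · b)^k)^(1/k) reported in [2]; the vector set, exponent k, learning rate, and target value are illustrative assumptions, not values from this paper.

```python
import numpy as np

def spova_value(b, alphas, k=8):
    """Smooth stand-in for V(b) = max_i alpha_i . b (dot products assumed > 0)."""
    dots = alphas @ b                      # shape (num_vectors,)
    return (dots ** k).sum() ** (1.0 / k)

def spova_gradient(b, alphas, k=8):
    """Gradient of spova_value with respect to each alpha-vector."""
    dots = alphas @ b
    v = (dots ** k).sum() ** (1.0 / k)
    # dV/d(alpha_i) = V^(1-k) * (alpha_i . b)^(k-1) * b
    return (v ** (1 - k)) * (dots ** (k - 1))[:, None] * b[None, :]

# One temporal-difference-style update nudging V(b) toward a target value.
alphas = np.array([[1.0, 0.2], [0.3, 0.9]])
b = np.array([0.6, 0.4])
target, lr = 1.1, 0.05
err = target - spova_value(b, alphas)
alphas += lr * err * spova_gradient(b, alphas)
```

As k grows, V_k(b) approaches the exact piecewise-linear max over α-vectors, so k trades smoothness (and hence ease of gradient descent) against approximation accuracy.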
Works Cited
[1] A. R. Cassandra, "A survey of POMDP applications," in AAAI Fall Symposium on Planning with Partially Observable Markov Decision Processes, 1998.
[2] R. Parr and S. Russell, "Approximating Optimal Policies for Partially Observable Stochastic Domains," in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 1995.
[3] E. J. Sondik, "The Optimal Control of Partially Observable Markov Decision Processes," Ph.D. thesis, Stanford University, Stanford, CA, 1971.
[4] R. E. Bellman, Dynamic Programming. Princeton, NJ: Princeton University Press, 1957.
[5] R. H. Cannon, Dynamics of Physical Systems. New York: McGraw-Hill, 1967.