A Bellman equation (also known as a dynamic programming equation), named after its discoverer, Richard Bellman, is a necessary condition for optimality associated with the mathematical optimization method known as dynamic programming. It breaks a dynamic optimization problem into simpler subproblems, writing the value of a decision problem at a certain point in time in terms of the payoff from some initial choices and the value of the remaining decision problem that results from those initial choices, as Bellman's Principle of Optimality prescribes.
The Bellman equation was first applied to engineering control theory and to other topics in applied mathematics, and subsequently became an important tool in economic theory.
Almost any problem which can be solved using optimal control theory can also be solved by analyzing the appropriate Bellman equation. However, the term 'Bellman equation' usually refers to the dynamic programming equation associated with discrete-time optimization problems. In continuous-time optimization problems, the analogous equation is a partial differential equation which is usually called the Hamilton–Jacobi–Bellman equation.
To understand the Bellman equation, it helps to define several intuitive concepts that underlie the dynamic programming optimization approach. First, any optimization problem has some objective: minimizing travel time, minimizing cost, maximizing profits, maximizing utility, and so on. The mathematical function that describes this objective is called the objective function.
Dynamic programming breaks a multi-period planning problem into simpler steps at different points in time. Therefore, it requires keeping track of how the decision situation is evolving over time. The information about the current situation which is needed to make a correct decision is called the state (See Bellman, 1957, Ch. III.2).^{[1]}^{[2]} For example, to decide how much to consume and spend at each point in time, a person would need to know (among other things) her initial wealth. Therefore, wealth would be one of her state variables, but there would probably be others.
The variables chosen at any given point in time are often called the control variables. For example, given her current wealth, a person might decide how much to consume now. Choosing the control variables now may be equivalent to choosing the next state; more generally, the next state is affected by other factors in addition to the current control. For example, in the simplest case, today's wealth (the state) and consumption (the control) might exactly determine tomorrow's wealth (the new state), though typically other factors will affect tomorrow's wealth too.
The dynamic programming approach describes the optimal plan by finding a rule that tells what the controls should be, given any possible value of the state. For example, if consumption (c) depends only on wealth (W), we would seek a rule c(W) that gives consumption as a function of wealth. Such a rule, determining the controls as a function of the states, is called a policy function (See Bellman, 1957, Ch. III.2).^{[1]}
Finally, by definition, the optimal decision rule is the one that achieves the best possible value of the objective. For example, if someone chooses consumption, given wealth, in order to maximize happiness (assuming happiness H can be represented by a mathematical function, such as a utility function), then each level of wealth will be associated with some highest possible level of happiness, H(W). The best possible value of the objective, written as a function of the state, is called the value function.
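As a concrete illustration, consider a standard "cake-eating" special case of this consumption example (a textbook simplification, not part of the general setup above): utility from consumption is logarithmic, wealth evolves as W_{t+1} = W_{t} − c_{t}, and future utility is discounted by a factor 0 < β < 1. Under these assumptions the optimal policy and value functions take a well-known closed form:

$$ c(W) = (1 - \beta)\, W, \qquad H(W) = \frac{\ln W}{1 - \beta} + \text{constant}. $$

Here c(W) is the policy function (consume a fixed fraction of current wealth) and H(W) is the value function (the highest attainable discounted sum of utility starting from wealth W).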
Richard Bellman showed that a dynamic optimization problem in discrete time can be stated in a recursive, stepbystep form by writing down the relationship between the value function in one period and the value function in the next period. The relationship between these two value functions is called the Bellman equation.
Let the state at time t be x_{t}. For a decision that begins at time 0, we take as given the initial state x_{0}. At any time, the set of possible actions depends on the current state; we can write this as a_{t} ∈ Γ(x_{t}), where the action a_{t} represents one or more control variables and Γ(x_{t}) denotes the set of actions available in state x_{t}. We also assume that the state changes from x to a new state T(x,a) when action a is taken, and that the current payoff from taking action a in state x is F(x,a). Finally, we assume impatience, represented by a discount factor 0 < β < 1.
Under these assumptions, an infinitehorizon decision problem takes the following form:
$$ V(x_{0}) \;=\; \max_{\{a_{t}\}_{t=0}^{\infty}} \sum_{t=0}^{\infty} \beta^{t} F(x_{t}, a_{t}), $$

subject to the constraints

$$ a_{t} \in \Gamma(x_{t}), \qquad x_{t+1} = T(x_{t}, a_{t}), \qquad t = 0, 1, 2, \ldots $$
Notice that we have defined notation V(x_{0}) to represent the optimal value that can be obtained by maximizing this objective function subject to the assumed constraints. This function is the value function. It is a function of the initial state variable x_{0}, since the best possible value obtainable depends on the initial situation.
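To connect this notation with the earlier consumption example, one possible (purely illustrative) instantiation is:

$$ x_{t} = W_{t} \ (\text{wealth}), \quad a_{t} = c_{t} \ (\text{consumption}), \quad \Gamma(x) = (0, x], \quad T(x, a) = x - a, \quad F(x, a) = \ln a, $$

so that the objective above is the discounted sum of log-utility of consumption, subject to consumption never exceeding current wealth.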
The dynamic programming method breaks this decision problem into smaller subproblems. Richard Bellman's Principle of Optimality describes how to do this:
Principle of Optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. (See Bellman, 1957, Chap. III.3.)^{[1]}^{[2]}^{[3]}
In computer science, a problem that can be broken apart like this is said to have optimal substructure.
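To see what optimal substructure looks like in code, here is a minimal sketch of a finite-horizon, discretized version of the consumption problem in Python. The grid, horizon, discount factor, and function names are invented for illustration; the point is only that each (state, periods-left) subproblem is solved once and reused via memoization.

```python
import math
from functools import lru_cache

BETA = 0.95       # discount factor (illustrative)
HORIZON = 10      # number of remaining decision periods (finite-horizon version)

@lru_cache(maxsize=None)
def value(wealth: int, periods_left: int) -> float:
    """Best attainable discounted log-utility from integer `wealth`
    with `periods_left` decisions remaining."""
    if periods_left == 0 or wealth == 0:
        return 0.0  # no decisions left (or nothing to consume): no further utility
    best = float("-inf")
    # Try every feasible consumption level (the control variable).
    for c in range(1, wealth + 1):
        candidate = math.log(c) + BETA * value(wealth - c, periods_left - 1)
        best = max(best, candidate)
    return best

print(value(20, HORIZON))  # each subproblem is computed once, then cached
```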
As suggested by the Principle of Optimality, we will consider the first decision separately, setting aside all future decisions (we will start afresh from time 1 with the new state x_{1}). Collecting the future decisions in brackets on the right, the previous problem is equivalent to:
$$ V(x_{0}) \;=\; \max_{a_{0}} \left\{ F(x_{0}, a_{0}) + \beta \left[ \max_{\{a_{t}\}_{t=1}^{\infty}} \sum_{t=1}^{\infty} \beta^{t-1} F(x_{t}, a_{t}) : a_{t} \in \Gamma(x_{t}),\ x_{t+1} = T(x_{t}, a_{t}),\ t \geq 1 \right] \right\} $$

subject to the constraints

$$ a_{0} \in \Gamma(x_{0}), \qquad x_{1} = T(x_{0}, a_{0}). $$
Here we are choosing a_{0}, knowing that our choice will cause the time 1 state to be x_{1} = T(x_{0},a_{0}). That new state will then affect the decision problem from time 1 on. The whole future decision problem appears inside the brackets on the right.
So far it seems we have only made the problem uglier by separating today's decision from future decisions. But we can simplify by noticing that what is inside the brackets on the right is the value of the time 1 decision problem, starting from state x_{1} = T(x_{0},a_{0}).
Therefore we can rewrite the problem as a recursive definition of the value function:
$$ V(x_{0}) \;=\; \max_{a_{0} \in \Gamma(x_{0})} \left\{ F(x_{0}, a_{0}) + \beta\, V(x_{1}) \right\}, \qquad \text{subject to } x_{1} = T(x_{0}, a_{0}). $$

This is the Bellman equation. It can be simplified even further if we drop time subscripts and plug in the value of the next state:

$$ V(x) \;=\; \max_{a \in \Gamma(x)} \left\{ F(x, a) + \beta\, V\bigl(T(x, a)\bigr) \right\}. $$
The Bellman equation is classified as a functional equation, because solving it means finding the unknown function V, which is the value function. Recall that the value function describes the best possible value of the objective, as a function of the state x. By calculating the value function, we will also find the function a(x) that describes the optimal action as a function of the state; this is called the policy function.
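A common way to compute V and the policy function a(x) numerically is value iteration: start from a guess and repeatedly apply the right-hand side of the Bellman equation until it stops changing. The sketch below does this for a discretized cake-eating problem; the wealth grid, discount factor, tolerance, and names are assumptions made purely for illustration.

```python
import numpy as np

BETA = 0.95                  # discount factor (illustrative)
GRID = np.arange(0, 101)     # integer wealth levels 0..100 (the states x)
TOL = 1e-8                   # convergence tolerance

def solve_bellman():
    """Iterate V <- max_a { F(x, a) + beta * V(T(x, a)) } until convergence."""
    V = np.zeros(GRID.size)                   # initial guess for the value function
    policy = np.zeros(GRID.size, dtype=int)   # best action a(x) at each state
    while True:
        V_new = np.empty_like(V)
        for x in GRID:
            if x == 0:
                V_new[x], policy[x] = 0.0, 0   # nothing left to consume
                continue
            # Feasible actions Gamma(x): consume c = 1..x; next state T(x, c) = x - c.
            c = np.arange(1, x + 1)
            candidates = np.log(c) + BETA * V[x - c]   # F(x, c) + beta * V(T(x, c))
            best = int(np.argmax(candidates))
            V_new[x], policy[x] = candidates[best], c[best]
        if np.max(np.abs(V_new - V)) < TOL:
            return V_new, policy
        V = V_new

V, policy = solve_bellman()
print(policy[50])   # close to (1 - BETA) * 50 = 2.5 on this integer grid
```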
Dynamic programming can be especially useful in stochastic decisions, that is, optimization problems affected by random shocks. For example, consider a problem exactly like the one discussed above, except that x_{t + 1} is a random variable, which may be influenced by x_{t} and a_{t} but is not determined by them exactly. We can describe this case by defining the probability distribution of x_{t + 1} conditional on x_{t} and a_{t}, for example,

$$ x_{t+1} \sim G(\,\cdot \mid x_{t}, a_{t}\,). $$

Given this probability law determining x_{t + 1} conditional on x_{t} and a_{t}, the Bellman equation can be written as

$$ V(x) \;=\; \max_{a \in \Gamma(x)} \left\{ F(x, a) + \beta\, \mathbb{E}_{G}\!\left[ V(x_{t+1}) \mid x_{t} = x,\ a_{t} = a \right] \right\}, $$
where E_{G} represents a conditional expectation under distribution G.
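The same value-iteration idea carries over to the stochastic case: the only change is that V of the next state is replaced by an expectation over the shock distribution G. The sketch below assumes, purely for illustration, that next-period wealth is current wealth minus consumption plus a small random income shock with a known discrete distribution; the grid, shock values, and probabilities are invented.

```python
import numpy as np

BETA = 0.95
GRID = np.arange(0, 101)              # wealth states
SHOCKS = np.array([0, 1, 2])          # possible income shocks (illustrative)
PROBS = np.array([0.25, 0.5, 0.25])   # their probabilities (the distribution G)
TOL = 1e-8

def expected_value(V, next_states):
    """E_G[ V(x') | x, a ]: average V over the shock distribution,
    clipping next-period wealth to the top of the grid."""
    x_next = np.minimum(next_states[:, None] + SHOCKS[None, :], GRID[-1])
    return (V[x_next] * PROBS).sum(axis=1)

V = np.zeros(GRID.size)
while True:
    V_new = np.empty_like(V)
    for x in GRID:
        if x == 0:
            # No feasible consumption: zero current payoff, wait for next period's income.
            V_new[x] = BETA * expected_value(V, np.array([0]))[0]
            continue
        c = np.arange(1, x + 1)                       # feasible consumption levels
        V_new[x] = np.max(np.log(c) + BETA * expected_value(V, x - c))
    if np.max(np.abs(V_new - V)) < TOL:
        break
    V = V_new

print(V[50])
```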
The first known application of a Bellman equation in economics is due to Martin Beckmann and Richard Muth.^{[4]} Martin Beckmann also wrote extensively on consumption theory using the Bellman equation in 1959. His work influenced Edmund S. Phelps, among others.
A celebrated economic application of a Bellman equation is Merton's seminal 1973 article on the intertemporal capital asset pricing model.^{[5]} (See also Merton's portfolio problem.) The solution to Merton's theoretical model, one in which investors choose between income today and future income or capital gains, is a form of Bellman's equation. Because economic applications of dynamic programming usually result in a Bellman equation that is a difference equation, economists refer to dynamic programming as a "recursive method."
Stokey, Lucas & Prescott describe stochastic and nonstochastic dynamic programming in considerable detail, giving many examples of how to employ dynamic programming to solve problems in economic theory.^{[6]} This book led to dynamic programming being employed to solve a wide range of theoretical problems in economics, including optimal economic growth, resource extraction, principal–agent problems, public finance, business investment, asset pricing, factor supply, and industrial organization. Ljungqvist & Sargent apply dynamic programming to study a variety of theoretical questions in monetary policy, fiscal policy, taxation, economic growth, search theory, and labor economics.^{[7]} Dixit & Pindyck showed the value of the method for thinking about capital budgeting.^{[8]} Anderson adapted the technique to business valuation, including privately held businesses.^{[9]}
Using dynamic programming to solve concrete problems is complicated by informational difficulties, such as choosing the unobservable discount rate. There are also computational issues, the main one being the curse of dimensionality, which arises from the vast number of possible actions and potential state variables that must be considered before an optimal strategy can be selected. For an extensive discussion of computational issues, see Miranda & Fackler,^{[10]} and Meyn 2007.^{[11]}
In reinforcement learning, a Bellman equation refers to a recursion for expected rewards. For example, the expected reward for being in a particular state s and following some fixed policy π has the Bellman equation:

$$ V^{\pi}(s) \;=\; R\bigl(s, \pi(s)\bigr) + \gamma \sum_{s'} P\bigl(s' \mid s, \pi(s)\bigr)\, V^{\pi}(s'), $$

where R(s, a) is the expected immediate reward from taking action a in state s, P(s' | s, a) is the probability of transitioning to state s', and 0 ≤ γ < 1 is the discount factor.
This equation describes the expected reward for taking the action prescribed by some policy π.
The equation for the optimal policy is referred to as the Bellman optimality equation:

$$ V^{*}(s) \;=\; \max_{a} \left\{ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \right\}. $$
It describes the reward for taking the action giving the highest expected return.
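As a minimal illustration of both recursions, the sketch below builds a tiny, made-up Markov decision process (the transition probabilities, rewards, and discount factor are invented) and then (i) solves the Bellman equation for a fixed policy π by direct linear solution and (ii) solves the Bellman optimality equation by value iteration.

```python
import numpy as np

# A tiny, made-up MDP: 3 states, 2 actions.
# P[a, s, s2] = probability of moving from state s to s2 under action a.
P = np.array([
    [[0.8, 0.2, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.2, 0.8]],   # action 0
    [[0.0, 0.9, 0.1],
     [0.0, 0.1, 0.9],
     [0.0, 0.0, 1.0]],   # action 1
])
R = np.array([[0.0, 1.0],     # R[s, a] = expected immediate reward
              [0.5, 2.0],
              [1.0, 0.0]])
GAMMA = 0.9                   # discount factor

def evaluate_policy(pi):
    """Bellman equation for a fixed policy: solve V = R_pi + gamma * P_pi V directly."""
    n = R.shape[0]
    P_pi = P[pi, np.arange(n), :]       # transitions when state s uses action pi[s]
    R_pi = R[np.arange(n), pi]          # rewards when state s uses action pi[s]
    return np.linalg.solve(np.eye(n) - GAMMA * P_pi, R_pi)

def value_iteration(tol=1e-10):
    """Bellman optimality equation: iterate V <- max_a { R(s, a) + gamma * E[V(s')] }."""
    V = np.zeros(R.shape[0])
    while True:
        Q = R + GAMMA * (P @ V).T       # Q[s, a] = R[s, a] + gamma * sum_s2 P[a, s, s2] V[s2]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal values and a greedy optimal policy
        V = V_new

V_star, pi_star = value_iteration()
print("greedy optimal policy:", pi_star)
print("its value (policy evaluation):", evaluate_policy(pi_star))
```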
