To keep things simple, we start with the case of a passive learning agent using a state-based representation in a known, accessible environment. In passive learning, the environment generates state transitions and the agent perceives them.1 Consider an agent trying to learn the utilities of the states shown in Figure 20.1 (a). We assume, for now, that it is provided with a model My giving the probability of a transition from state i to state j, as in Figure 20.1(b). In each training sequence, the agent starts in state (1,1) and experiences a sequence of state transitions until it reaches one of the terminal states (3,2) or (3,3), where it receives a reward.2 A typical set of training sequences might look like this:

(1,1 )->(!, 2)—(1,3)->(2,3)-(2,2)->(2,3)->(3,3) ±1 (1,1)—(1,2)—(1, !)—(!, 2)-<l, 1)^(2,1)^(2,2)-<2,3)-+(3,3) ±1 (1? !)->.( 1,2)^(2,2)-Kl, 2)^(1,3)—(2,3)-*(l, 3)-<2,3)-K3,3) ±i (1,1)—K2,1)^(2,2)^(2,1)—Kl, D—(1,2)—»(1.3)—»(2,3)—*-(2,2)—»(3,2) -1 (1,1)-K2,1)-K1,1)-<1,2)-K2,2)^(3,2)^1

The object is to use the information about rewards to learn the expected utility U(i) associated with each nonterminal state. We will make one big simplifying assumption: the utility of a sequence is the sum of the rewards accumulated in the states of the sequence. That is, the utility
function is additive in the sense defined on page 502. We define the reward-to-go of a state as the sum of the rewards from that state until a terminal state is reached. Given this definition, it is . ^..- .- easy to see that the expected utility of a state is the expected reward-to-go of that state. ' S: The generic agent design for passive reinforcement learning of utilities is shown in Figure 20.2.

The agent keeps an estimate U of the utilities of the states, a table N of counts of how many times each state was seen, and a table M of transition probabilities from state to state. We assume that each percept e is enough to determine the STATE (i.e., the state is accessible), the agent can determine the REWARD component of a percept, and the agent can tell if a percept indicates a TERMINAL? state. In general, an agent can update its current estimated utilities after each observed transition. The key to reinforcement learning lies in the algorithm for updating the utility values given the training sequences. The following subsections discuss three possible approaches to UPDATE.