Original Paper

Decision Transformer: Reinforcement Learning via Sequence Modeling

What’s good about this model?

Technique

The core technique used in this paper is feeding a transformer a sequence of states, actions, and rewards-to-go.

What’s reward-to-go?

Intuitively, we might want to use the cumulative reward as a measure of how well an agent is doing. But when judging how good an action is, we don't care about anything that happened before it; what matters is everything that comes after, in other words, the action's consequences. So we use another measure called the reward-to-go, which is the sum of all future rewards the agent expects to receive from a certain time step $t$ onwards within an episode:

$$ \hat{R}_t = \sum_{t'=t}^{T} R(s_{t'}, a_{t'}, s_{t'+1}) $$

$$ \hat{R}_t = \sum_{k=0}^{T-t} \gamma^k r_{t+k} $$

The cumulative reward, for reference and comparison with the reward-to-go, is given as $G = \sum_{k=0}^{T-1} \gamma^k r_k$.
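As a concrete illustration, here is a minimal sketch of computing the reward-to-go for every timestep of a recorded episode. The helper name `rewards_to_go` and the NumPy-based implementation are my own, not from the paper:

```python
import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    """Compute R_hat_t = sum_{k=0}^{T-t} gamma^k * r_{t+k} for every timestep t."""
    rtg = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    # Walk backwards so each step reuses the already-accumulated future return.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Example: rewards_to_go([1, 0, 2]) -> array([3., 2., 2.])
```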

The Sequence

Now, the sequence fed to the transformer is like so:

$$ \tau = (\hat{R}_1, s_1, a_1, \hat{R}_2, s_2, a_2,\cdots, \hat{R}_T, s_T, a_T) $$

As explained earlier, the sequences fed to the transformer during training are just suboptimal trajectories (e.g., learning shortest paths from a dataset of random-walk trajectories).

The output sequence from the transformer is just the predicted actions $(a_1, a_2, \cdots, a_T)$.
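Here is a quick sketch of how a trajectory could be flattened into this interleaved sequence; the `build_sequence` helper is illustrative, not from the paper's code:

```python
def build_sequence(returns_to_go, states, actions):
    """Interleave per-timestep triples into tau = (R_hat_1, s_1, a_1, ..., R_hat_T, s_T, a_T)."""
    tau = []
    for rtg, s, a in zip(returns_to_go, states, actions):
        tau.extend([rtg, s, a])
    return tau
```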

Embeddings

We must embed the sequence into a vector space and add positional embeddings before feeding it to the transformer.
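Below is a minimal PyTorch-style sketch of one way to do this, assuming continuous states and actions, a scalar return-to-go, and a known maximum episode length; the class and parameter names (`TrajectoryEmbedding`, `d_model`, `max_ep_len`) are illustrative. Here the positional information is a learned per-timestep embedding shared by the return, state, and action tokens of the same step.

```python
import torch
import torch.nn as nn

class TrajectoryEmbedding(nn.Module):
    """Illustrative sketch: project returns-to-go, states, and actions into a
    shared d_model-dimensional space and add a learned timestep embedding."""

    def __init__(self, state_dim, act_dim, d_model, max_ep_len):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)           # scalar return-to-go
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Linear(act_dim, d_model)
        self.embed_timestep = nn.Embedding(max_ep_len, d_model)

    def forward(self, rtg, states, actions, timesteps):
        # rtg: (B, T, 1), states: (B, T, state_dim), actions: (B, T, act_dim),
        # timesteps: (B, T) integer indices within the episode.
        time_emb = self.embed_timestep(timesteps)        # (B, T, d_model)
        r = self.embed_rtg(rtg) + time_emb
        s = self.embed_state(states) + time_emb
        a = self.embed_action(actions) + time_emb
        # Interleave into (R_1, s_1, a_1, ..., R_T, s_T, a_T): shape (B, 3T, d_model).
        B, T, D = r.shape
        tokens = torch.stack([r, s, a], dim=2).reshape(B, 3 * T, D)
        return tokens
```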