Why Temporal Difference (TD) Learning?
- model-free
- exact tabular methods (like tabular Q-learning) become computationally infeasible when the state space is huge, and waiting for the full return the agent actually collects is impractical for long or continuing tasks
- TD learning tackles this by bootstrapping: it estimates the target from the current value estimate of the next state instead of using the full return that is actually collected by the agent
How?
TD learning updates the value function based on the error between the TD target (an estimate of the return) and the current value estimate; this difference is called the TD error.
- the idea is similar to proportional control (from control systems): the correction applied is proportional to the error
$$
V(S_t) \leftarrow V(S_t) + \alpha\,[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]
$$
- $R_{t+1} + \gamma V(S_{t+1})$ is the TD target
- $\alpha$ is the learning rate and $\gamma$ is the discount factor (a minimal code sketch of this update follows below)
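As a concrete illustration, here is a minimal sketch of the tabular TD(0) update in Python. The function name, the 5-state value table, and the transition numbers are assumptions made for this example; environment interaction is left out.

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)].

    If s_next is terminal, pass a state whose value is fixed at 0.
    """
    td_target = r + gamma * V[s_next]   # estimate of the return from state s
    td_error = td_target - V[s]         # TD error: target minus current estimate
    V[s] += alpha * td_error            # move the estimate a fraction alpha toward the target
    return td_error

# Toy usage with arbitrary numbers: a value table for 5 states and
# one observed transition (s=0, reward=1.0, next state=1).
V = np.zeros(5)
err = td0_update(V, s=0, r=1.0, s_next=1)
print(V, err)   # V[0] moved from 0.0 to 0.1 (alpha * td_error)
```

The same update is applied after every observed transition, so learning happens online rather than only at the end of an episode.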
This originates from the following equation:
$$
V(S_t) \leftarrow (1 - \alpha)V(S_t) + \alpha G_t
$$
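The connection is just algebra: rearranging the blended form gives the error-correction form used above,
$$
(1 - \alpha)V(S_t) + \alpha G_t = V(S_t) + \alpha\,[G_t - V(S_t)]
$$
and replacing the full return $G_t$ with the bootstrapped estimate $R_{t+1} + \gamma V(S_{t+1})$ recovers the TD(0) update.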
The idea behind this is that the learning rate determines the balance between the current estimate and the return $G_t$ (which is replaced by the TD target in TD learning); a numerical sketch follows this list.
- as $\alpha \rightarrow 1$, the update is dominated by the return/target, so the estimate changes a lot (more learning)
- as $\alpha \rightarrow 0$, the update is dominated by the current value estimate, so it barely changes (less learning)
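A tiny numerical sketch of this trade-off (the estimate, target, and learning rates below are arbitrary numbers chosen only for illustration), treating the update as a convex combination of the current estimate and the target:

```python
def blended_update(v, g, alpha):
    """V <- (1 - alpha) * V + alpha * G: convex combination of estimate and target."""
    return (1 - alpha) * v + alpha * g

v_current, g_target = 0.0, 10.0  # current estimate and a hypothetical target
for alpha in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"alpha={alpha:.1f} -> new V = {blended_update(v_current, g_target, alpha):.1f}")
# alpha=0.0 keeps the old estimate unchanged (no learning),
# alpha=1.0 jumps all the way to the target (maximal learning),
# intermediate values interpolate between the two.
```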