Update Rule in Temporal Difference
- by Betamoo
The update rule for TD(0) Q-learning:
Q(t-1) = (1 - alpha) * Q(t-1) + alpha * (Reward(t-1) + gamma * Max(Q(t)))
Then take either the current best action (to exploit) or a random action (to explore).
Where Max(Q(t)) is the maximum Q-value that can be obtained in the next state...
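For concreteness, here is a minimal Python sketch of that update with epsilon-greedy action selection. The tabular layout (Q as a NumPy array of shape (n_states, n_actions)) and the helper names are assumptions for illustration, not part of the original question:

    import numpy as np

    def td0_q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
        # One-step target: immediate reward plus discounted best next Q
        target = reward + gamma * np.max(Q[s_next])
        # Blend the old estimate with the target, exactly as in the rule above
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

    def epsilon_greedy(Q, s, epsilon=0.1):
        # Random action with probability epsilon (explore),
        # otherwise the current best action (exploit)
        if np.random.rand() < epsilon:
            return np.random.randint(Q.shape[1])
        return int(np.argmax(Q[s]))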
But in TD(1) I think the update rule will be:
Q(t-2) = (1 - alpha) * Q(t-2) + alpha * (Reward(t-2) + gamma * Reward(t-1) + gamma^2 * Max(Q(t)))
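Under that reading (a two-step lookahead from t-2), the update folds in both observed rewards before bootstrapping from the best Q two steps ahead. A sketch in the same assumed tabular setup as above; the helper name is hypothetical:

    def td1_q_update(Q, s, a, r1, r2, s_next2, alpha=0.1, gamma=0.9):
        # Two-step target: reward at t-2, discounted reward at t-1,
        # then bootstrap from the best Q in the state two steps later
        target = r1 + gamma * r2 + gamma**2 * np.max(Q[s_next2])
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target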