Nstep discounts specification is confusing
According to the documentation at https://ymd_h.gitlab.io/cpprb/features/nstep/,
the Nstep target is defined as follows:
G _t ^N =\sum _{k=0} ^{N-1} \gamma ^k r_{t+k} + \gamma ^N \max _a Q(s_{t+N},a)
ReplayBuffer with the Nstep keyword replaces Nstep["next"] (e.g. "next_obs") and Nstep["rew"] (e.g. "rew") as follows:
\begin{aligned}
s_{t+1} &\to s_{t+N} \cr
r_t &\to \sum _{k=0}^{N-1} \gamma ^k r_{t+k}
\end{aligned}
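The replacement above can be sketched in plain Python. This is only an illustration of the two formulas, not cpprb's implementation; the function names `nstep_reward` and `nstep_transition` are hypothetical, and `next_observations[k]` is assumed to hold s_{t+k+1}.

```python
def nstep_reward(rewards, gamma):
    """Return sum_{k=0}^{N-1} gamma^k * r_{t+k} for a window of N rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def nstep_transition(rewards, next_observations, gamma):
    """Collapse N consecutive steps into one N-step transition.

    rewards[k] is r_{t+k}; next_observations[k] is assumed to be s_{t+k+1},
    so the last entry is s_{t+N}.
    """
    return nstep_reward(rewards, gamma), next_observations[-1]


# N = 3, gamma = 0.5: 1.0 + 0.5 * 2.0 + 0.25 * 4.0
rew, next_obs = nstep_transition([1.0, 2.0, 4.0], ["s1", "s2", "s3"], gamma=0.5)
```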
The buffer also returns "discounts" as
\gamma ^{N-1}
This specification is confusing because users have to multiply by gamma themselves:
sample["discounts"] * gamma * tf.reduce_max(Q(sample["next_obs"]), axis=1)
To avoid the additional multiplication, it would be better to change "discounts" to return
\gamma ^N
This change breaks compatibility, so we should bump the major version to v10.
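To make the difference concrete, here is a minimal sketch of the target computation under both conventions, using scalars instead of tensors. The helper `nstep_target` and the sample values are hypothetical; the only assumption about the buffer is what is stated above, namely that "discounts" currently holds gamma^(N-1) and would hold gamma^N after the change.

```python
import math


def nstep_target(rew, discount, max_next_q):
    """N-step target: rew + discount * max_a Q(s_{t+N}, a)."""
    return rew + discount * max_next_q


gamma, N = 0.99, 3
rew = 1.5          # stands in for the discounted N-step reward sum
max_next_q = 2.0   # stands in for max_a Q(s_{t+N}, a)

# Current behaviour: "discounts" is gamma^(N-1), so the user multiplies by gamma.
current = nstep_target(rew, gamma ** (N - 1) * gamma, max_next_q)

# Proposed behaviour: "discounts" is gamma^N and is used as returned.
proposed = nstep_target(rew, gamma ** N, max_next_q)

assert math.isclose(current, proposed)
```

Both conventions give the same target; the proposal only moves the extra factor of gamma out of user code and into the buffer.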