Nstep discounts specification is confusing

The Nstep target is defined as follows:

G _t ^N =\sum _{k=0} ^{N-1} \gamma ^k r_{t+k} + \gamma ^N \max _a Q(s_{t+N},a)

When ReplayBuffer is constructed with the Nstep keyword, it replaces the values stored under Nstep["next"] (e.g. "next_obs") and Nstep["rew"] (e.g. "rew") as follows:

\begin{aligned}
s_{t+1} &\to s_{t+N} \cr
r_t &\to \sum _{k=0}^{N-1} \gamma ^k r_{t+k}
\end{aligned}
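As a minimal sketch of the reward replacement above, the discounted sum can be computed like this. The helper name `nstep_reward` is hypothetical and is not part of the library; it only illustrates the formula.

```python
def nstep_reward(rewards, gamma):
    """Return sum_{k=0}^{N-1} gamma^k * rewards[k], where N = len(rewards)."""
    total = 0.0
    discount = 1.0  # gamma^0
    for r in rewards:
        total += discount * r
        discount *= gamma  # advance gamma^k to gamma^(k+1)
    return total


# N = 3, gamma = 0.5: 1.0 + 0.5 + 0.25
print(nstep_reward([1.0, 1.0, 1.0], 0.5))  # 1.75
```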

The buffer also returns "discounts" as

\gamma ^{N-1}

This specification is confusing because users have to multiply by gamma once more themselves:

sample["discounts"] * gamma * tf.reduce_max(Q(sample["next_obs"]), axis=1)

To avoid this additional multiplication, it would be better to change the returned "discounts" to

\gamma ^N
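The following sketch checks that the two conventions give the same target, so the proposal only moves the extra factor of gamma from user code into the buffer. The numbers for `nstep_rew` and `max_next_q` are made up for illustration, and the variable names are assumptions rather than library fields.

```python
import math

gamma = 0.99
N = 3

# Made-up stand-ins for the summed N-step reward and max_a Q(s_{t+N}, a).
nstep_rew = 2.5
max_next_q = 10.0

current_discounts = gamma ** (N - 1)  # value returned today
proposed_discounts = gamma ** N       # value proposed in this issue

# Today: the user multiplies by gamma once more.
target_current = nstep_rew + current_discounts * gamma * max_next_q
# Proposed: "discounts" can be used directly.
target_proposed = nstep_rew + proposed_discounts * max_next_q

assert math.isclose(target_current, target_proposed)
```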

This change breaks backward compatibility, so we should bump the major version to v10.
