Important Ideas from Bandit Algorithms

Motivations for Analysis and Key Concepts

1. Key Ideas

In a single-arm bandit, we can simplify this: \(Q_{k}(a) = Q_{k}\), since the action argument can be dropped when there is only one action, and \(N_t(a)\) reduces to a simple count of how many times the arm has been pulled (written \(k\) below). The estimates then become:

\[\begin{align*} &Q_2 = R_1 \\ &Q_3 = \frac{R_1 + R_2}{2} \\ &Q_4 = \frac{R_1 + R_2 + R_3}{3} \\ &\vdots\\ &Q_{k+1} = \frac{1}{k} \sum_{i=1}^{k} R_i \end{align*}\]

so that \(Q_{k+1}\) is the estimate after the arm has been pulled \(k\) times. Rewriting it in terms of the previous estimate \(Q_k\):

\[\begin{equation*} \begin{split} Q_{k+1} &= \frac{1}{k} \left( R_k + \sum_{i=1}^{k-1}R_i\right)\\ &= \frac{1}{k} \left( R_k + (k-1)Q_k\right)\\ &= \frac{1}{k} \left( R_k + (k-1)Q_k + Q_k - Q_k\right)\\ &= \frac{1}{k} \left( R_k + kQ_k - Q_k\right)\\ &= Q_k + \frac{1}{k} \left( R_k - Q_k\right) \end{split} \end{equation*}\]

where only the previous estimate \(Q_k\) and the latest reward \(R_k\) are needed to compute the new estimate. This is called the incremental update rule.
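
As a quick illustration, here is a minimal sketch of the single-arm incremental update in Python (the function and variable names are my own, chosen for illustration, not from the text):

```python
import random

def estimate_single_arm(num_pulls, true_mean=1.0, noise=0.5, seed=0):
    """Estimate a single arm's mean reward with the incremental update rule."""
    rng = random.Random(seed)
    q = 0.0                                # current estimate Q_k
    for k in range(1, num_pulls + 1):
        r = rng.gauss(true_mean, noise)    # sample reward R_k from the arm
        q = q + (1.0 / k) * (r - q)        # Q_{k+1} = Q_k + (1/k)(R_k - Q_k)
    return q

print(estimate_single_arm(10_000))         # approaches the arm's true mean (here 1.0)
```

Because the step size is \(1/k\), the estimate is exactly the sample average of the rewards seen so far, yet only the current estimate and the pull count need to be stored.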

Note that a similar equation can be written for the multi-armed bandit problem; it looks like this:

\[Q_{k+1}(a) = Q_k(a) + \frac{1}{N_k(a)}(R_k - Q_k(a))\]

The way to look at it is:

\[\text{current value} = \text{previous value} + \text{scaling factor} \cdot [\text{target value} - \text{previous value}]\]

which says that, starting from the previous value, we take a step towards the target value whose size is set by the scaling factor.
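
A minimal sketch of the per-action version of the same update, assuming we simply keep a count and an estimate for each arm (the class name and interface are mine, chosen for illustration):

```python
class IncrementalEstimates:
    """Per-action sample-average estimates maintained with the incremental update rule."""

    def __init__(self, num_actions):
        self.q = [0.0] * num_actions   # Q(a): current value estimate for each action
        self.n = [0] * num_actions     # N(a): number of times each action was taken

    def update(self, action, reward):
        """Apply Q_{k+1}(a) = Q_k(a) + (1 / N_k(a)) * (R_k - Q_k(a))."""
        self.n[action] += 1
        self.q[action] += (reward - self.q[action]) / self.n[action]

# Usage: after each pull, feed in the chosen arm and the observed reward.
est = IncrementalEstimates(num_actions=3)
est.update(action=0, reward=1.0)
est.update(action=0, reward=0.0)
est.update(action=2, reward=0.5)
print(est.q)   # [0.5, 0.0, 0.5] -- each entry is the sample average for that arm
```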