I've found that micro task 5.1 does not send any reward when the learner outputs a space in the correct position (that is, while listening to the teacher's answer) on every second step; the learner's reward function is never called. What happens is that the micro 5.1 environment sends reward=None in such cases, and BaseTask's try_reward drops those calls without forwarding them to the child class's reward function. Why the reward is set to None is beyond me; the code gets complicated in that place, with many things done through event handlers etc., and it's hard to debug.
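To illustrate the behavior I'm describing, here is a minimal sketch of the dispatch pattern (the real CommAI-env code is more involved, but the observed effect is the same: a None reward never reaches the child task's reward handler):

```python
# Hedged sketch, not the actual CommAI-env implementation: a base class
# whose try_reward silently drops reward=None before the overridable
# reward() handler ever sees it.

class BaseTask:
    def __init__(self):
        self.reward_calls = []  # record handler invocations for illustration

    def try_reward(self, reward):
        # If the environment produced reward=None for this step,
        # the call is dropped right here...
        if reward is not None:
            self.reward(reward)

    def reward(self, reward):
        # ...so this handler only ever sees -1, 0 or 1.
        self.reward_calls.append(reward)

task = BaseTask()
for r in (1, None, 0, None, -1):
    task.try_reward(r)

print(task.reward_calls)  # -> [1, 0, -1]; the two None steps are missing
```

A subclass overriding reward() therefore cannot even tell that those steps happened, which is exactly why my own reward logging showed gaps.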
Could this somehow be by design? The relevant challenge specs read:
The environment sends reward (-1, 0 or 1) and data (one byte) to the agent and receives the agent's action
(one byte) in response. This happens in a continuous cycle. During a single simulation step, the environment
processes the received action from the agent and sends reward with new data to the agent; the agent
processes this input and sends an action back to the environment.
I understood this text as meaning that a reward is given at every step, and the diagram above is also pretty clear on this. This has cost me a few hours of debugging, because I couldn't imagine that the reward() function is not always called. Note that reward=None and reward=0 are not the same thing; the latter also happens sometimes (at task instance switches, I think), which is very confusing if left as-is, and could even be exploited to detect task switches.
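The None-vs-0 distinction can be made explicit on the agent side. A small hypothetical helper (classify_reward is my own name, not part of CommAI-env) shows the three per-step cases a learner actually faces:

```python
# Agent-side sketch (hypothetical helper, not CommAI-env API): keep
# "no reward sent" (None) and "explicit zero reward" distinguishable
# instead of silently mixing them.

def classify_reward(reward):
    """Map a raw per-step reward to a descriptive label."""
    if reward is None:
        return 'dropped'   # environment sent no reward this step
    if reward == 0:
        return 'zero'      # explicit 0, e.g. at a task instance switch
    return 'signal'        # -1 or +1 teacher feedback

steps = [1, None, 0, None, -1]
print([classify_reward(r) for r in steps])
# -> ['signal', 'dropped', 'zero', 'dropped', 'signal']
```

An agent watching for the 'zero' case could use it as a task-switch marker, which is the exploit I mean above.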
This probably affects later tasks as well.
Test code: https://pastebin.com/WjgJsq8W
You will see that it almost always outputs "qm: False", meaning that reward() is not called when question_mode == True.