Experiments in RL

This sequence of blog posts conducts a few experiments in reinforcement learning. We will use the common-practice tools that led to agents winning Atari games at a super-human level. Since introductions to the methodology for playing those games can be found everywhere on the web, we will use the tools below without explaining them:

Parallel environments:

The Atari games showed that millions to billions of agent steps are necessary to reach a good learning level. In order to keep the computation time acceptable, agents should be trained in parallel on the GPU.

Actor-Critic algorithm:

We use one of the standard algorithms that led to the current improvements in reinforcement learning. As mentioned, tutorials on this algorithm can be found everywhere, e.g. …

The idea of this blog is open-ended. We will apply the methodology to a toy problem and see how far we get. Our toy problem will be sports bets with quotes, and the agents have the freedom to place an amount on a certain bet or not to bet at all. Obviously, the goal is to learn how to obtain a win on average. Whether this is possible is an open question. For now, the following sequence of blog posts is planned:

Part 1: Setup of the parallel environment and a test of whether parallelism on the GPU delivers the promised compute-time benefits.
Part 2: Reward shaping and investigation of improvements.

Hopefully, the first parts will lead to further improvement ideas, which will then be the subject of subsequent parts.

Short description of agent environments

In the world of Atari games, pre-built environments are used, and rightfully so: in principle, the agent observes the pixel state of the game screen and takes an action (e.g. when the agent tries to land on the moon, whether it is accelerating or braking). The agent then receives a reward from the environment, and so on. The gymnasium module and its successors do all the heavy lifting, and the work is to find an algorithm that lets the agent learn. For our toy model we can program an environment from scratch. We follow the canonical definition of a Python environment class and implement the class methods as follows (a minimal code skeleton is sketched after the method descriptions):

RL environment

step:

We have an action as input. The environment keeps track of the current state the agent is in.
With the implemented reward scheme, the environment returns a reward, a done = false status, and the next state. This goes on as long as the episode lasts: in Atari games this is normally the "game over" state, at which point done = true is returned. In our toy model we have the possibility to place n randomly chosen bets, and then the episode is over.

reset:

This resets the environment to the start state for a new episode. As we will randomly choose matches from our base data, we will just continue to do so, even after an episode has ended.

render:

For an Atari game, the rendering is obviously the pixel state (the screen) and the score after the taken action. In our model we can print out the reward received and the win or loss of the bet.
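
To make this concrete, a minimal skeleton of such a class could look like the sketch below. It is not the final implementation: the class name, the record layout (home quote as the first feature) and the simple home-win reward are assumptions for illustration only.

```python
# Minimal sketch of a custom environment class with the three methods described
# above. Class/attribute names, the record layout (home quote first) and the
# simple home-win reward are illustrative assumptions, not the final design.
import random

class BettingEnv:
    def __init__(self, records, n_bets=10):
        self.records = records        # list of (state, label) pairs from the base data
        self.n_bets = n_bets          # episode length: number of bets per episode
        self.steps_taken = 0
        self.last_reward = 0.0
        self.current = None

    def reset(self):
        """Start a new episode and return the first state."""
        self.steps_taken = 0
        self.last_reward = 0.0
        self.current = random.choice(self.records)
        return self.current[0]

    def step(self, action):
        """Take a stake amount as action (0 = no bet) and return (state, reward, done)."""
        state, label = self.current
        home_quote = state[0]         # assumption: the home quote is the first feature
        # toy reward scheme: net win (quote - 1) * stake on a home win, lost stake otherwise
        self.last_reward = action * (home_quote - 1.0) if label == 1 else -action
        self.steps_taken += 1
        done = self.steps_taken >= self.n_bets
        self.current = random.choice(self.records)   # next randomly chosen match
        return self.current[0], self.last_reward, done

    def render(self):
        """No pixel screen here: just print the last reward (win or loss of the bet)."""
        print(f"step {self.steps_taken}: reward = {self.last_reward:.2f}")
```

An episode then consists of one reset() followed by n_bets calls to step(), exactly as described above.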

Implementation of the step function

The record for a sports bet contains the quotes for home win, away win and draw, the probabilities derived from these quotes, and the label, which is 1 for a home win and 0 for a draw or an away win. Everything that is not the label we call the state. Here we already see that the definition of the model's states leaves ample room for design choices. In the Atari game example this is self-evident: the pixel state of the screen defines the state sufficiently.

Home Quote | Away Quote | Draw Quote | Home Prob | Away Prob | Draw Prob | No Home Win Prob | Label
1.57       | 3.99       | 4.44       | 0.57      | 0.23      | 0.20      | 0.43             | 1.0
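
The probability columns are consistent with the common convention of normalizing the inverse quotes, which removes the bookmaker's margin. A quick sketch that reproduces the row above under this assumption:

```python
# Reproduce the probability columns from the quotes, assuming they are the
# normalized inverse quotes (i.e. the bookmaker's margin has been removed).
quotes = {"home": 1.57, "away": 3.99, "draw": 4.44}

inverse = {k: 1.0 / q for k, q in quotes.items()}
overround = sum(inverse.values())                    # > 1 because of the margin
probs = {k: v / overround for k, v in inverse.items()}

print({k: round(p, 2) for k, p in probs.items()})    # approx. 0.57 / 0.23 / 0.20
print(round(1.0 - probs["home"], 2))                 # "No Home Win Prob": approx. 0.43
```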

We will use agents that place their actions in parallel. Thus, the step function will accept a vector of actions. All operations will be vectorized where possible and computed on the GPU.

RL environment

This is the simplest environment in this domain: a match with its quotes is drawn at random, and the reward is simply the amount won or lost. The action is 0 if no bet is placed; otherwise a bet can be placed up to a maximum amount. Here we see that we have full freedom in shaping the reward function. We will not go into detail about the standard actor-critic algorithm that we are going to use.

An interesting question, however, is why learning should work at all in the described case: we choose the sequence of bets randomly, and they are definitely completely independent of each other. One of the reasons might be that the methods used are model-free: they do not learn transition probabilities between states but learn policies directly.

Now we have a simple environment that lets agents learn in parallel. So, let's start the agents and see which rewards and returns we are getting.
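
Before we do, here is a minimal sketch of what such a vectorized step could look like in PyTorch. The tensor layout (home quote in column 0, label 1.0 for a home win) and the unit stakes in the usage lines are illustrative assumptions, not the implementation used for the experiments.

```python
# Sketch of a vectorized step for many parallel agents (PyTorch).
# quotes: (n_matches, 3) tensor with the home quote in column 0 (assumption),
# labels: (n_matches,) tensor with 1.0 for a home win, 0.0 otherwise.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def step(actions, quotes, labels):
    """actions: (n_agents,) stake amounts, 0 means no bet placed."""
    n_agents = actions.shape[0]
    # each agent gets an independent, randomly drawn match
    idx = torch.randint(0, quotes.shape[0], (n_agents,), device=device)
    home_quote = quotes[idx, 0]
    won = labels[idx] == 1.0
    # reward: net win (quote - 1) * stake on a home win, lost stake otherwise
    reward = torch.where(won, actions * (home_quote - 1.0), -actions)
    next_state = quotes[idx]          # here the state is simply the quote vector
    return next_state, reward

# usage: 1024 agents, each betting one unit on the example match from the table
quotes = torch.tensor([[1.57, 3.99, 4.44]], device=device)
labels = torch.tensor([1.0], device=device)
states, rewards = step(torch.ones(1024, device=device), quotes, labels)
print(rewards.mean().item())          # approx. 0.57, since the example match is a home win
```

Because every agent draws its own match index, the whole batch of bets is settled with a handful of tensor operations on the GPU.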

Key Takeaways from Part 1:

Environments:

We have sketched a simple environment from scratch and understood that reward shaping leaves a lot of freedom to engineer the environment.

States:

With tabular data we can feature-engineer our state space; this is basically a classic feature-engineering task. If we evaluate the accuracy of a non-RL machine learning model, the implicit policy is simply: bet if the predicted probability is greater than 0.5, and do not bet if it is smaller than 0.5. Thus, the feature engineering influences the predicted probability and, through it, the policy in this simple manner. So, can we find a state definition that facilitates agent learning?
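
For reference, here is the implicit baseline policy just described in code form; the 0.5 threshold and the unit stake are exactly the assumptions stated above.

```python
import torch

def threshold_policy(home_prob, threshold=0.5):
    """Non-RL baseline: bet one unit if the (predicted) home-win probability
    exceeds the threshold, otherwise place no bet."""
    return (home_prob > threshold).float()

# example: a home probability of 0.57 leads to a bet, 0.43 does not
print(threshold_policy(torch.tensor([0.57, 0.43])))   # tensor([1., 0.])
```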

Next steps:

Part 1: Setup of the parallel environment and a test of whether parallelism on the GPU delivers the promised compute-time benefits.
Part 2: Reward shaping and investigation of improvements.


