The original Kelly coinflip game has a natural way to make it more general and difficult: randomize the 3 key parameters (edge, max wealth, number of rounds) and hide them, turning it into a partially observable Markov decision process (POMDP).

To solve a POMDP one often has to record the entire history, but with a judicious choice of the distributions the 3 parameters are randomized from, we can both plausibly reflect the human subjects’ expectations and allow the history to be summarized in a few sufficient statistics.

A convenient distribution for such a cap would be the Pareto distribution, parameterized by a minimum x_m and a tail parameter α. If x_m is set to something like 100 (the agent thinks the cap is >$100), then for most rounds the agent cannot update this belief, but towards the end, as its wealth exceeds $100, it does update the posterior—if it is allowed to make a bet of $50 at $100 (so the cap must be at least $150), then x_m becomes 150 while α remains the same, and so on. In this case, the maximum observed wealth is the sufficient statistic. The posterior mean of the cap is then α·x_m / (α − 1). A reasonable prior might be α = 5 and x_m = 200, in which case the mean wealth cap is $250.
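Concretely, the update and posterior mean look like this (an illustrative R snippet, not part of the original analysis; posteriorCap is a hypothetical helper name):

## Posterior over the wealth cap under a Pareto(x_m, alpha) prior, where the only
## observation is the maximum wealth the player has held without the cap binding:
posteriorCap <- function(maxWealthSeen, xm=200, alpha=5) {
    xmPost <- max(xm, maxWealthSeen)  # seeing wealth above x_m just raises the minimum
    c(xm=xmPost, mean=alpha * xmPost / (alpha - 1)) }
posteriorCap(100)  # nothing learned yet: posterior mean stays $250
posteriorCap(300)  # cap must be >=$300, so the posterior mean rises to $375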

The maximum payout would be revealed if and when subjects placed a bet that if successful would make their balance greater than or equal to the cap. We set the cap at $250.

This can be extended to scenarios where we do learn something about the stopping time (similarly to the coin-flip probability) by doing Bayesian updates on sp, or where sp changes over rounds. If we had a distribution N(300, 25) over the stopping time, then the probability of stopping at each round changes inside the value function, with an additional argument n for rounds elapsed serving as the sufficient statistic.

So continuing with the R hash table, it might go like this:
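(The following is a hedged sketch rather than the original code: the hazard helper derived from the N(300,25) prior, the whole-dollar bet grid, and the maxRounds cap are simplifying assumptions of mine.)

## Illustrative sketch: value the game when the stopping time is ~N(300,25), so the
## per-round stopping probability depends on rounds elapsed n; an R environment
## serves as the hash table over states.
cache <- new.env(hash=TRUE, parent=emptyenv())

hazard <- function(n, mean=300, sd=25) {
    ## P(the game stops at round n | it has survived the first n rounds)
    p <- (pnorm(n+1, mean, sd) - pnorm(n, mean, sd)) / (1 - pnorm(n, mean, sd))
    min(max(p, 0), 1) }

V <- function(w, n, p=0.6, cap=250, maxRounds=400) {
    if (w <= 0 || w >= cap || n >= maxRounds) { return(min(max(w, 0), cap)) }
    key <- paste(w, n, sep="_")
    if (exists(key, envir=cache, inherits=FALSE)) { return(get(key, envir=cache)) }
    sp <- hazard(n)
    ## betting x: with probability sp the game stops & we keep w; otherwise we win/lose the bet
    values <- sapply(1:w, function(x)
        sp*w + (1-sp) * (p     * V(min(w+x, cap), n+1, p, cap, maxRounds) +
                         (1-p) * V(w-x,           n+1, p, cap, maxRounds)))
    best <- max(values)
    assign(key, best, envir=cache)
    best }
## V(25, 0) # whole-dollar bets over up to maxRounds rounds: feasible but slow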

This has the difficulty that, hypothetically, a game might go on indefinitely, but since a player must eventually hit $0 or $250 and stop playing, it turns out like the penny simulator—after a few thousand rounds, the probability of not yet having hit $0 or $250 is so small it can’t be represented by most software. The real problem is that this expands the state space enough to again make it unevaluable, so it helps to put an upper bound on the number of rounds. The problem then becomes: play the game knowing that it might stop with a certain probability sp (eg 0.01) each round, or stops at an upper limit on the number of rounds (eg b = 300).

What is sp as a distribution? Well, one might believe that every round there is a ~1% chance of the game stopping, in which case sp follows the memoryless geometric distribution (not the exponential distribution, since this is a discrete setup); in that case, one must discount each additional round n by 1% cumulatively. (Discounting in general can be seen as reflecting a probability of stopping at each stage.)

One way to calculate this would be to assume that the player doesn’t learn anything about the stopping time while playing, and merely has a prior over the stopping time; for example, N(300, 25). The game stopping can be seen as a third probabilistic response to a player’s bet: with probability sp the game ends, the number of bets left immediately becomes b = 0, and they win their current wealth, ie the value V(w, 0); if it doesn’t end, the value is the usual 0.6*V(w+x, b-1) + 0.4*V(w-x, b-1). In full, the value of an action/bet x is (1-sp)*(0.6*V(w+x, b-1) + 0.4*V(w-x, b-1)) + sp*V(w, 0).

Another variant would be to make the stopping time uncertain; strictly speaking, the game wasn’t fixed at exactly 300 rounds, that was just their guess at how many rounds a reasonably fast player might get in before the clock expired. From the player’s perspective, it is unknown how many rounds they will get to play although they can guess that they won’t be allowed to play more than a few hundred, and this might affect the optimal strategy.

With b = 300, there are so many rounds available to bet that the details don’t matter as much; while if there are only a few bets, the maximizing behavior is to bet a lot, so the regret for small b is also probably small; the largest regret for not knowing p is probably somewhere in between, as with the KC vs decision tree comparison, in the small-to-medium range of b.

Why is the Bayesian decision tree able to perform so close to the known-probability decision tree? I would guess that it is because all agents bet small amounts in the early rounds (to avoid gambler’s ruin, since they have hundreds of rounds left to reach the ceiling), and this gives them data for free to update towards p=0.6; the agent which knows that p=0.6 can bet a little more precisely early on, but this advantage over a few early rounds doesn’t wind up being much of a help in the long run.

And since we know from earlier that an agent following the optimal policy believing p=0.6 would earn $246.606, it follows that the Bayesian agent, due to its ignorance, loses ~$1. So the price of ignorance (regret) in this scenario is surprisingly small.

For our prior, w = 25, and b = 300, we can evaluate run memo (7+1) (3+1) 300 25 :: Dual and find that the subjective EV is $207.238 and the actual EV is $245.676.

This prior leads to somewhat more aggressive betting in short games, as the subjective EV is higher, but overall it performs much like the decision tree.

With the additional parameters, the memoized version becomes impossibly slow, so I rewrite it to use R environments as a poor man’s hash table:
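The environment trick itself is simple (the sketch below is illustrative rather than the original code; memoEnv is a hypothetical helper name): an environment created with new.env(hash=TRUE) behaves like a mutable hash table via exists/get/assign, avoiding the copy-on-modify overhead of lists and the per-call overhead of memoise.

memoEnv <- function(fn) {
    cache <- new.env(hash=TRUE, parent=emptyenv())
    function(...) {
        key <- paste(..., sep="_")  # stringify the argument values as the hash key
        if (exists(key, envir=cache, inherits=FALSE)) {
            get(key, envir=cache)
        } else {
            result <- fn(...)
            assign(key, result, envir=cache)
            result } } }

The recursive value function then calls its memoEnv-wrapped version instead of itself, exactly as with memoise, but with far less overhead per lookup.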

We know that the subjects know that 100% is not an option (as that would be boring) and that 50% or less is impossible too, since they were promised an edge; but it can’t be too high because then everyone would win quickly and nothing would be learned. I think a reasonable prior would be something like 70%, but with a lot of room for surprise. Let’s assume a weak prior around 70%, equivalent to playing 10 flips and winning 7, so Beta(7, 3).
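As a quick sanity check on how weak that prior is (an illustrative R snippet, not part of the original analysis):

qbeta(c(0.05, 0.50, 0.95), 7, 3)  # ~0.45, ~0.71, ~0.90: centered near 70% but spanning roughly 45-90%
7 / (7 + 3)                       # prior mean: 0.7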

Ideally we would use the prior of the participants in the experiment, to compare how well they did against the Bayesian decision tree and estimate how much mistaken beliefs cost them. Unfortunately, they were not surveyed on this, and the authors give no indication of how much of an edge the subjects expected before starting the game, so for analysis we’ll make up a plausible prior.

Going back to the R implementation, we add in the two additional parameters and the beta estimation, replacing the hardwired probabilities with the posterior estimate:
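(Again a hedged sketch rather than the original code: f takes the extra state (wins, n), uses the posterior mean of the Beta(7,3) prior, and is wrapped with memoise; whole-dollar bets and the hardcoded $250 cap are simplifications of mine.)

library(memoise)

## Illustrative sketch: Bayesian decision-tree value function; the state is
## (wealth, bets left, wins observed, flips observed).
f <- function(w, b, wins, n) {
    if (w <= 0 || w >= 250 || b == 0) { return(min(max(w, 0), 250)) }
    p <- (7 + wins) / (10 + n)  # posterior mean of a Beta(7,3) prior after `wins` wins in `n` flips
    values <- sapply(1:w, function(x)
        p     * fMemo(min(w+x, 250), b-1, wins+1, n+1) +
        (1-p) * fMemo(w-x,           b-1, wins,   n+1))
    max(values) }
fMemo <- memoise(f)
## fMemo(25, 300, 0, 0) # correct, but far too slow & memory-hungry at b=300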

This beta Bayesian update is easy and doesn’t require calling out to a library: with a uniform beta prior (Beta(1,1)), updating on binomial (win/loss) data is conjugate, with the simple closed-form posterior Beta(1+wins, 1+n−wins); the expectation of that beta is (1+wins) / (2+n). Then our expected values in f simply change from 0.6/0.4 to expectation/(1−expectation).

The curse of dimensionality can be tamed here a little by observing that the history of coin-flip outcomes is unnecessary—it doesn’t matter what order the coin flips happen in; only the number of wins and the total number of coin flips matter (they are the sufficient statistics). So we only need to augment wealth/rounds/bet-amount with wins/n. And one can compute p from wins/n by Bayesian updates of a prior on the observed coin flip results, treating it as a binomial problem of estimating p distributed according to the beta distribution.

A variant on this problem arises when the probability of winning is not given, so one doesn’t know in advance that p=0.6. How do you make decisions as coin flips happen and provide information about p? Since p is unknown and only partially observed, this turns the coin-flip MDP into a POMDP. A POMDP can be turned back into an MDP and solved with a decision tree by including the history of observations as additional parameters to explore over.

In the general game, it performs poorly because it self-limits to $250 in the many games where the max cap is higher, and it often wipes out to $0 in games where the edge is small or negative.

One baseline would be to simply take our pre-existing exact solution V / VPplan and run it on the generalized Kelly game, ignoring observations and assuming 300 rounds. It will probably perform reasonably well, since it plays optimally in at least one possible setting, and so serves as a baseline:

and then we can modify nshepperd’s Haskell code to make the edge and wealth cap adjustable parameters, run memo over the 1000 sampled sets of parameters, and take the mean. Using the raw samples turns out to be infeasible in terms of RAM, probably because higher wealth caps (which are Pareto-distributed and can be extremely large) increase the dimensionality too much to evaluate in my ~25GB of RAM, so the wealth cap itself must be shrunk, to TODO $1000?, which unfortunately muddles the meaning of the estimate since it is now a lower bound of an upper bound, or to reverse it, the value of clairvoyant play assuming that many games have a lower wealth cap than they actually do.

The Haskell version is the easiest to work with, but Haskell doesn’t have convenient distributions to sample from, so instead I generate 1000 samples from the prior distribution of parameters in Python (reusing some of the Gym code):

We can approximate it with our pre-existing value function for a known stopping time/edge/max wealth, sampling parameter values from the prior; for example, we might draw 1000 values from N(300,25), the Pareto, the Beta etc, calculate the value of each set of parameters, and take the mean value. Since the value of information is always zero or positive, the mean value with known parameters must be an upper bound on the POMDP’s true value: we can only do worse in each game by being ignorant of the true parameters and needing to infer them as we play. So if we calculate a value of $240 or whatever, we know the POMDP must have a value <$240, and that gives a performance target to aim for.
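A sketch of that sampling-and-averaging estimate (illustrative only; valueKnown is a hypothetical stand-in for the known-parameter value function, which in practice was the Haskell memo code):

set.seed(2017)
k      <- 1000
edges  <- rbeta(k, 7, 3)                     # Beta(7,3) prior on the win probability
caps   <- 200 * runif(k)^(-1/5)              # Pareto(x_m=200, alpha=5) via inverse transform
rounds <- pmax(1, round(rnorm(k, 300, 25)))  # N(300,25) prior on the number of rounds
## upperBound <- mean(mapply(function(p, cap, b) valueKnown(p, cap, b, wealth=25),
##                           edges, caps, rounds))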

What is the value of this full POMDP game with uncertain wealth cap, stopping time, and edge?

Since the Bayesian decision tree is too hard to compute in full, we need a different approach. Tabular approaches in general will have difficulty, as the full history makes the state-space vastly larger; but so do even just the sufficient statistics of wins/losses + rounds elapsed + maximum observed wealth.

One approach which should be able to cope with the complexities is deep reinforcement learning, whose neural networks can optimize well in difficult environments. Given the sufficient statistics, we don’t need to train an explicitly Bayesian agent: a deep RL agent trained on random instances can learn a reactive policy which adapts ‘online’ and approximates the Bayes-optimal agent (Ortega et al 2019 is a good review of DRL/Bayesian RL), which, as we have already seen, is intractable to compute exactly even for small problems like this.

Unfortunately, the sparsity of rewards in the Kelly coinflip game (receiving a reward only every ~300 steps) does make solving the game much harder for DRL agents.

To create a deep RL agent for the generalized Kelly coin-flip game, I will use the keras-rl library of agents, which is based on the Keras deep learning framework (backed by TensorFlow); deep learning is often sensitive to choice of hyperparameters, so for hyperparameter optimization, I use the hyperas wrapper around hyperopt (paper).

keras-rl offers primarily DQN (Mnih et al 2013) and DDPG (Lillicrap et al 2015).

DQN is a possibility, but for efficiency most DQN implementations require/assume a fixed number of actions, which in our case of penny-level betting implies having 25,000 distinct unordered actions. I tried prototyping DQN for the simple Kelly coin-flip game using the keras-rl DQN example, but after hours it still had not received any rewards—it repeatedly bet more than or all of its current wealth, busting out and never managing to go 300 rounds to receive any reward and begin learning. (This sort of issue with sparse rewards is a common problem with DQN and deep RL in general, as the most common forms of exploration will flail around at random rather than make deep targeted exploration of particular strategies or try to seek out new regions of state-space which might contain rewards.) One possibility would have been to ‘cheat’ by using the exact value function to provide optimal trajectories for the DQN agent to learn from.

DDPG isn’t like tabular learning; it is a policy gradient method (Karpathy explains policy gradients). With deep policy gradients, the basic idea is that the ‘actor’ NN repeatedly outputs an action based on the current state, and then at the end of the episode, given the final total reward, if the reward is higher than expected, all the neurons which contributed to the actions get strengthened so as to be more likely to produce those actions, and if the reward is lower than expected, they are weakened. This is a very indiscriminate, error-filled way to train neurons, but it does work if run over enough episodes for the errors to cancel out; the noise can be reduced by training an additional ‘critic’ NN to predict the rewards from actions, based on an experience replay buffer. The NNs are usually fairly small (to avoid overfitting & because of the weak supervision), use RELUs, and the actor NN often has batchnorm added to it. (To prevent them from learning too fast and interfering with each other, a copy of each, called the ‘target network’, is also maintained.) For exploration, some random noise is added to the final action; not epsilon-random noise (like in DQN) but a sort of random walk, to push the continuous action consistently lower or higher (and avoid the noise canceling out).

A continuous action has less of an unordered-action/curse-of-dimensionality problem, which is what motivated switching from DQN to DDPG (example). Unfortunately, DDPG also had initialization problems with overbetting, as its chosen actions would quickly drift to always betting either very large positive or negative numbers, which get converted to betting everything or $0 respectively (which then always results in no rewards via either busting & getting 0 or going 300 rounds & receiving a reward of 0), and it would never return to meaningful bets. To deal with the drift/busting-out problem, I decided to add some sort of constraint on the agent: make it choose a continuous action 0–1 (squashed by a final sigmoid activation rather than linear or tanh), and convert it into a valid bet, so all bets would be meaningful and it would have less of a chance to overbet and bust out immediately. (This trick might also be usable in DQN if we discretize 0–1 into, say, 100 percentiles.) My knowledge of Keras is weak, so I was unsure how to convert an action 0–1 to a fraction of the agent’s current wealth, although I did figure out how to multiply it by $250 before passing it in. This resulted in meaningful progress but was sensitive to hyperparameters—for example, the DDPG appeared to train worse with a very large experience replay buffer than with the Lillicrap et al 2015 setting of 10,000.

Hyperparameter-wise, Lillicrap et al 2015 used, for the low-dimensional (non-ALE screen) RL problems similar to the general Kelly coin-flip, the settings:

2 hidden layers of 400 & 300 RELU neurons into a final tanh

an experience replay buffer of 10,000

Adam SGD optimization algorithm with learning rate: 10^−4

discount/gamma: 0.99

minibatch size: 64

an Ornstein-Uhlenbeck process random walk, μ = 0, σ = 0.2, θ = 0.15

For the simple Kelly coin-flip problem, the previous experiment with training based on the exact value function demonstrates that 400+300 neurons would be overkill, and probably a much smaller 3-layer network with <128 neurons would be more than adequate; the discount/gamma being set to 0.99 for Kelly is questionable, as almost no steps result in a reward, and a discount rate of 0.99 would imply for a 300-round game that the final reward from the perspective of the first round would be almost worthless (since 0.99^300 ≈ 0.049), so a discount rate of 0.999 or even just 1 would make more sense; because the reward is delayed, I expect the gradients to not be so helpful, and perhaps a lower learning rate or larger minibatch size would be required; finally, the exploration noise is probably too small, as the random walk would tend to increase or decrease bet amounts by less than $1, so probably a larger σ would be better. With these settings, DDPG can achieve an average per-step/round reward of ~0.8, as compared to the optimal average reward of 0.82 ($246 over 300 steps/rounds).

For the general Kelly coin-flip problem, the optimal average reward is unknown, but initial tries get ~0.7 (~$210). To try to discover better performance, I set up hyperparameter optimization using hyperas / hyperopt over the following parameters:

number of neurons in each layer in the actor and critic (16–128)

size of the experience replay buffer (0–100000)

exploration noise: μ (N(0, 5)), θ (0–1), σ (N(3, 3))

learning rates: Adam, main NNs (log-uniform −7 to −2; roughly, 0.14–0.00…); Adam, target NNs (*)

minibatch size (8–2056)

Each agent/sample is trained ~8h and is evaluated on mean total reward over 2000 episodes, with 120 samples total.

Full source code for the generalized Kelly coin-flip game DDPG:

import numpy as np
import gym

from hyperas.distributions import choice, randint, uniform, normal, loguniform
from hyperas import optim
from hyperopt import Trials, STATUS_OK, tpe

from keras.models import Sequential, Model
from keras.layers import Dense, Activation, Flatten, Input, concatenate, Lambda, BatchNormalization
from keras.optimizers import Adam

from rl.agents import DDPGAgent
from rl.memory import SequentialMemory
from rl.random import OrnsteinUhlenbeckProcess

def model(x_train, y_train, x_test, y_test):
    ENV_NAME = 'KellyCoinflipGeneralized-v0'
    gym.undo_logger_setup()

    # Get the environment and extract the number of actions.
    env = gym.make(ENV_NAME)
    np.random.seed(123)
    env.seed(123)
    nb_actions = 1

    # Next, we build a very simple model.
    actor = Sequential()
    actor.add(Flatten(input_shape=(1,) + (5,)))
    actor.add(Dense({{choice([16, 32, 64, 96, 128])}}))
    actor.add(BatchNormalization())
    actor.add(Activation('relu'))
    actor.add(Dense({{choice([16, 32, 64, 96, 128])}}))
    actor.add(BatchNormalization())
    actor.add(Activation('relu'))
    actor.add(Dense({{choice([16, 32, 64, 96, 128])}}))
    actor.add(BatchNormalization())
    actor.add(Activation('relu'))
    # Pass into a single sigmoid to force a choice 0-1, corresponding to fraction of total possible wealth.
    # It would be better to multiply the fraction against one's *current* wealth to reduce the pseudo-shift
    # in optimal action with increasing wealth, but how do we set up a multiplication against the first
    # original input in the Flatten layer? This apparently can't be done as a Sequential...
    actor.add(Dense(nb_actions))
    actor.add(BatchNormalization())
    actor.add(Activation('sigmoid'))
    actor.add(Lambda(lambda x: x * 250))
    print(actor.summary())

    action_input = Input(shape=(nb_actions,), name='action_input')
    observation_input = Input(shape=(1,) + (5,), name='observation_input')
    flattened_observation = Flatten()(observation_input)
    x = concatenate([action_input, flattened_observation])
    x = Dense({{choice([16, 32, 64, 96, 128])}})(x)
    x = Activation('relu')(x)
    x = Dense({{choice([16, 32, 64, 96, 128])}})(x)
    x = Activation('relu')(x)
    x = Dense({{choice([16, 32, 64, 96, 128])}})(x)
    x = Activation('relu')(x)
    x = Dense(nb_actions)(x)
    x = Activation('linear')(x)
    critic = Model(inputs=[action_input, observation_input], outputs=x)
    print(critic.summary())

    memory = SequentialMemory(limit={{randint(100000)}}, window_length=1)
    random_process = OrnsteinUhlenbeckProcess(size=nb_actions, theta={{uniform(0, 1)}},
                                              mu={{normal(0, 5)}}, sigma={{normal(3, 3)}})
    agent = DDPGAgent(nb_actions=nb_actions, actor=actor, critic=critic,
                      critic_action_input=action_input, memory=memory,
                      nb_steps_warmup_critic=301, nb_steps_warmup_actor=301,
                      random_process=random_process, gamma=1,
                      target_model_update={{loguniform(-7, -2)}},
                      batch_size={{choice([8, 16, 32, 64, 256, 512, 1024, 2056])}})
    agent.compile(Adam(lr={{loguniform(-7, -2)}}), metrics=['mae'])

    # Train; ~120 steps/s, so train for ~8 hours:
    agent.fit(env, nb_steps=3000000, visualize=False, verbose=1, nb_max_episode_steps=1000)

    # After training is done, we save the final weights.
    agent.save_weights('ddpg_{}_weights.h5f'.format(ENV_NAME), overwrite=True)

    # Finally, evaluate our algorithm for n episodes.
    h = agent.test(env, nb_episodes=2000, visualize=False, nb_max_episode_steps=1000)
    reward = np.mean(h.history['episode_reward'])
    print("Reward:", reward)
    return {'loss': -reward, 'status': STATUS_OK, 'model': agent}

def data():
    return [], [], [], []

if __name__ == '__main__':
    best_run, best_model = optim.minimize(model=model, data=data, algo=tpe.suggest,
                                          max_evals=120, # 8h each, 4 per day, 30 days, so 120 trials
                                          trials=Trials())
    print("Best performing model chosen hyper-parameters:")
    print(best_run)

How could this be improved?