Introduction

Recent breakthroughs in game playing AIs have caused a massive surge in interest in tabula rasa reinforcement learning methods⁷, that is, methods that use no prior knowledge or expert heuristics but instead achieve learning by huge iterations of self-play and basic knowledge of game rules.

Moreover the idea of combining data-driven learning techniques with more “classical” symbolic AI ones has also seen a recent surge⁴ ⁶: layers of neural networks feeding into classical symbolic systems are being studied as a way to minimize known issues with pure neural network/reinforcement learning based approaches - namely the opaqueness of the AI decision process and the huge training data sets required for convergence.

This type of architecture composed of model-free learners and model-based solvers seems particularly promising as it echoes the dual-process theory of cognitive neuroscience whereby learning and reasoning arise by the interplay between a fast, opaque and intuitive process — System 1 — and a slow, logical, analytical one — System 2. (Although not proven, evidence supporting this dual-process theory exists in experiments like the Iowa Gambling Task² and several other studies³).

Even more recent breakthroughs (see DeepMind’s AlphaZero⁵) involve embedding a look-ahead Monte Carlo search algorithm (MCS) within the training loops, resulting in very rapid improvements in learning rates and precision.

Imandra

At Aesthetic Integration, we’ve developed Imandra™, a powerful automated reasoning engine (belonging to the “classical” symbolic AI space) with, amongst its many features, first-class support for a very powerful and efficient form of state-space exploration called Principal Region Decomposition. This feature allows one to compute a symbolic representation of the possible unique behaviours of a program subject to a given query constraint, and to sample possible input values belonging to each particular region.

When it comes to RL (Reinforcement Learning)— as we demonstrate below — we model the environment state along with all possible (valid) state transitions as a (potentially infinite) state machine. Imandra’s decomposition machinery then allows us to compute N-many symbolic state transitions ahead. The word ‘symbolic’ in the last sentence is key – most relevant models have infinitely many (or infeasibly many) possible states, so an efficient way of reasoning about “all possible” transitions is important. Monte Carlo Search (MCS) methods also compute N-many steps ahead, however they do so either in a naive way that can hardly be applied to domains more complex than a board game, or require very complex and ad-hoc evaluation/pruning policies. This is a key reason why we believe Imandra paves the way to scalable generic RL.

Compared to more wide-spread symbolic reasoning systems, like logic programming languages (e.g. Prolog) or rule systems, Imandra can express significantly more powerful concepts with greater precision, trading complete automation for a much more powerful logic based on symbolic analysis of recursive functions and datatypes, an arsenal of advanced reasoning techniques and counterexample-producing search strategies that are complete for key fundamental reasoning tasks.

We thus believe that Imandra’s capabilities have the potential to significantly extend, complement and improve on current state-of-the-art AI approaches in several different areas:

development of safer AI systems: safety-critical fields like avionics, autonomous driving systems or bioinformatics based on statistical-AIs or incomplete search methods are doomed to behave in unexpected ways, potentially leading to loss of life in situations that could easily be avoided by introducing exhaustive search methods and quantitative analysis. With Imandra, we can use decompositions of a system state-space to create test scenarios automatically and derive quantitative coverage metrics to help determine how much of the search space has been explored. This rigorous approach to state-space exploration has the potential to bring about much more complete and safe training procedures.

improvements over current black-box methods: the push for non-symbolic methods lead to the well-known problem of the emergence of opaque decision procedures: we can see that a trained machine performs the task at hand, but we don’t know how or why it does what we’ve trained it to do. A symbolic-statistical symbiosis (so called deep symbolic reinforcement learning) has the potential to significantly improve on this, by mapping high-dimensional low-level raw input into a lower dimensional, highly abstract state space, such as a representation reasonable through Imandra.

faster convergence of reinforcement-learning based approaches: while completely generic, current tabula-rasa reinforcement learning methods require an absurd amount of computing power and humongously large training sets. Recent research has shown how utilising symbolic search methods can significantly improve the convergence of RL methods, especially in the initial phases.

In the remainder of this post we’ll explore this last hypothesis by going over the results of a recent paper¹ showcasing how embedding a Monte Carlo search algorithm within a reinforcement learning loop results in appreciably faster convergence, and extend this idea by replacing the MCS phase with Imandra’s decomposition leading to even faster convergence.

The Proof of Concept

For our PoC implementation we’ve decided to reproduce the simple experimental setup from the QM-learning paper¹, using a simple TicTacToe game and a table-based Q-learning; we’re aware that this approach won’t scale past simple games but we’re focusing on proving the viability of this approach first and on scaling it later, thus we’ll not investigate more complex neural-network based approaches in this initial post.

We’re also aware that TicTacToe is not a very exciting game to solve and brute-force methods can solve it much more efficiently than using reinforcement-learning techniques, but it’s a simple enough game that the entire implementation can fit in a few dozens lines and so it’s a perfect fit for analysis purposes; moreover our PoC is written in such a way to be completely game-generic and would work without any change on any 2-player 0-sum game (and can be extended to other games with very minor changes).

Our implementation of TicTacToe is written in logic-mode IML (a pure subset of OCaml for which Imandra has a formal semantics) and thus is directly formally analysable and “reasonable”. This means that we can use the game model to ask Imandra questions about or to enumerate possible future game states under arbitrary constraints. This is going to be crucial for our decomposition-based agent.

Q(M|D)-learning

Without going too much into details (there’s plenty of literature available⁷), the idea behind Q learning is for an agent to learn a policy used to determine which is the best action to take when in a particular game state.

In its simplest form (and the one we’re going to use), Q-Learning is just a table of Q-values (long-term rewards) for every <state,action> pair possible in the game environment — this table is updated during learning matches using the following Bellman equation:

with α ∈ [0,1] = learning rate, γ ∈ [0,1] = discount factor for future rewards, a ranging over “future available actions” and R = reward.

In pseudocode, the entire learning algorithm looks like:

#1 for each episode:

#2 s = initial state

#3 loop until termination:

#4 if ε-greedy():

#5 a = random_action(Q(s))

#6 else:

#7 if exists Q(s) > 0:

#8 a = argmax(Q(s))

#9 else:

#10 a = random_action_preferring_unexplored_ones(Q(s))

#11 R, s’ = step (s, a)

#12 Q(s,a) = Q(s,a) + α[R+γmaxₐQ(s’,a’) — Q(s, a)]

#13 s = s’

where ε-greedy is an agent-defined function determining the exploration policy, we’re using two different policies depending on the agent:

static ε-greedy defined as rand 1.0 <= ε

defined as dynamic decreasing ε-greedy defined as:

rand 1.0 <= ε*(1+cos(episode_n*π*(2*learning_matches)⁻¹))-ε

The change from Q to QM-learning consists in changing line #10 to:

#10 a = mcs(s, time_limit)

where mcs is defined roughly as:

#1 as = available_actions(s)

#2 for i = 0; i < as.size && time < time_limit; i = i+1%as.size:

#3 s’ = step s as[i]

#4 score[i] += get_score(random_playout (s’))

#5 visits[i] += 1

#6 for i = 0 to as.size:

#7 score[i] = score[i]/visits[i]

#8 return argmax(score)

By embedding this Monte Carlo search, we’re effectively adding a version of the playout and backup stages of MCTS to the Q-learning loop.

Finally, to get QD-learning (Q-learning using Imandra’s decomposition) we instead replace line #10 with:

#10 a = idf_s (s, max_lookahead)

where idf_s is defined roughly as:

#1 regions = idf.decompose game_model

state:s

template:(repeat max_lookahead AnyMove)

#2 samples = map idf.sample_region regions

#3 for i=0 to samples.size:

#4 moves = samples[i]

#5 s’ = play(s, moves)

#6 score[moves[0]] += get_score(random_playout(s’))

#7 visits[moves[0]] += 1

#8 for i = 0 to as.size:

#9 score[i] = score[i]/visits[i]

#10 return argmax(score)

The core of this algorithm resides in the idf.decompose function. That function is produced by our IDF framework (we’ve written about it, with executable examples, in our imandra_tools notebook) by statically analysing our logic-mode implementation of TicTacToe and, when invoked, through a complex dance of static analysis, symbolic reasoning and code synthesis, returns all the possible regions of state space that are reachable under a side-condition (in this case our side-condition will be TicTacToe.is_valid_move ), exploiting region symmetries to efficiently collapse distinct sequences of moves resulting in identical final states and thus potentially reducing the exploration space of more complex games.

This approach has the significant advantage over the MCS one in that in its simplest incarnation (as we described above) guarantees exploration of all possible distinct paths of length max_lookahead towards a future game state.

In more complex game with massively larger state-space it is often not viable to produce this wide lookahead tree, and thus a random sampling of regions will be needed (using policies like UCT as used in MCTS) instead of an exhaustive search through them, but the IDF approach will still be able to explore many more unique regions of behavior than a traditional MCS approach, resulting in the Q-agent to converge faster. Moreover even if forced to use a non-exhaustive search, our approach still allows us to keep track of the unexplored regions and mark them for future exploration, and to use model aware pruning algorithms for better results.

In other words, we use our IDF framework as a completely generic way of generating MC-like search machines that are model-aware, without doing any of the work of manually encoding complex game heuristics or game-specific search algorithms, instead inferring them automatically through formal reasoning over the game rules.

The benchmarks

We’ve implemented and compared 6 different agents against a random player:

- QStatic : classic ε-greedy Q-Agent with static ε

- QDynamic : classic ε-greedy Q-Agent with dynamically decreasing ε

- QMStatic : ε-greedy Q-Agent with static ε + time limited MCS

- QMDynamic : ε-greedy Q-Agent with dynamically decreasing ε + time limited MCS

- QDStatic : ε-greedy Q-Agent with static ε + move-limited decomposition lookahead

- QDDynamic : ε-greedy Q-Agent with dynamically decreasing ε + move-limited decomposition lookahead

The hyper-parameters are the ones used in the QM paper we’re comparing against, namely:

α = 0.1

γ = 0.9

ε = 0.3 (0.5 in the dynamic version)

Every agent starts with an initial win rate of 50%, and does 30_000 “learning” matches, where ε-greediness is turned on, inserting random exploratory episodes (that is, letting the agent explore new paths of actions), and a subsequent 20_000 matches of playouts with ε-greediness turned off where the agent plays in a pure exploitative manner (that is, making the agent exploit the learned paths of actions, without trying to explore new ones, thus consolidating learned knowledge).

The above benchmarks confirm the findings that embedding a limited search inside a Q-learning loop leads to significantly faster and appreciably higher convergence rates than “pure” Q-learning agents.

It also confirms our hypothesis that replacing this limited search from MCS to a decomposition-based approach leads to even faster and higher convergence rates.

Interestingly we see that while for the classic Q and QM agents the dynamic ε-greedy policy, while slower, makes the agent converge to a higher rate of wins, the QD agents converge to roughly the same value. This is the case simply due to the small state-space of TicTacToe, which results in the QD players not particularly benefitting from ε-greediness — in games with appreciably larger state spaces we would see the QDDynamic agent too eventually converge to a higher win rate than the QDStatic one.

What’s Next

As we’ve seen from this small and self-contained PoC, we at AI believe that augmenting current state-of-the-art RL techniques with symbolic processing capabilities offered by Imandra can lead to significant improvements in learning rates, and in particular we see Imandra’s state space decomposition facilities as a major new player in the reinforcement-learning space.

Some possible future research areas involve using decomposition to directly sample Q-values out of the game model + current game state and making more use of Imandra’s theorem-proving capabilities in the symbolic-reasoning phase.

We intend to further this research by extending this approach to neural-network based approaches, thus transcending the limitations of the simple table-based Q-learning approach showcased in the above PoC and attempting to scale this approach beyond simple board games.

We’re also committed (and on our way) to building what we call “Imandra Reasoning Machines”, a cloud-based, automatically scalable, reasoning infrastructure — think AWS lambdas, but tailored to symbolic reasoning.

Make sure to follow our work at on Imandra by checking out our interactive notebooks and other posts if what we’re doing interests you!

References

¹ https://arxiv.org/pdf/1802.05944v2.pdf

² https://www.sciencedirect.com/science/article/pii/0010027794900183

³ https://scottbarrykaufman.com/wp-content/uploads/2014/04/dual-process-theory-Evans_Stanovich_PoPS13.pdf

⁴ https://www.ijcai.org/proceedings/2018/0002.pdf

⁵ https://arxiv.org/pdf/1712.01815.pdf

⁶ https://arxiv.org/pdf/1609.05518.pdf

⁷ http://incompleteideas.net/book/bookdraft2017nov5.pdf