The AIXI formula (2.3) gives a precise, mathematical description of the optimal behaviour in essentially any situation. Unfortunately, the formula itself is incomputable, and cannot directly be used in a practical agent. Nonetheless, having a description of the right behaviour is still useful when constructing practical agents, since it tells us what behaviour we are trying to approximate. The following three sections describe three substantially different approximation approaches, all of which have demonstrated convincing experimental performance. Sect. 2.4.4 connects UAI with recent deep learning results.

2.4.1 MC-AIXI-CTW

MC-AIXI-CTW [85] is the most direct approximation of AIXI. It combines the Monte Carlo Tree Search algorithm for approximating expectimax planning, and the Context Tree Weighting algorithm for approximating Solomonoff induction. We describe these two methods next.

Planning with sampling. The expectimax planning principle described in Sect. 2.3.4 requires exponential time to compute, as it simulates all future possibilities in the planning tree seen in Fig. 2.2. This is generally far too slow for practical purposes.

A more efficient approach is to randomly sample paths in the planning tree, as illustrated in Fig. 2.3. Simulating a single random path \(a_te_t\dots a_me_m\) only takes a small, constant amount of time. The average return from a number of such simulated paths gives an approximation \(\hat{V}(\ae _{<t}a_t)\) of the value. The accuracy of the approximation improves with the number of samples.

A simple way to use the sampling idea is to keep generating samples for as long as time allows. When an action must be chosen, the choice can be made based on the current approximation. The sampling idea thus gives rise to an anytime algorithm that can be run for as long as desired, and whose (expected) output quality increases with time.
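As a minimal sketch of this anytime scheme in Python, the value estimate is simply a running average of sampled returns; the helper sample_rollout, which simulates one random path \(a_te_t\dots a_me_m\) and returns its total reward, is an assumed interface for illustration only:

```python
import time

def estimate_value(history, action, sample_rollout, time_budget=0.1):
    """Anytime Monte Carlo approximation of the value of taking `action`."""
    total, n = 0.0, 0
    deadline = time.monotonic() + time_budget
    while time.monotonic() < deadline:    # keep sampling while time allows
        total += sample_rollout(history, action)
        n += 1
    return total / n if n else 0.0        # average return approximates V

def choose_action(history, actions, sample_rollout):
    """When an action must be chosen, act on the current approximation."""
    return max(actions, key=lambda a: estimate_value(history, a, sample_rollout))
```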

Monte Carlo Tree Search. The Monte Carlo Tree Search (MCTS) algorithm [2, 11, 36] adds a few tricks to the sampling idea to increase its efficiency. The sampling idea and the MCTS algorithm are illustrated in Fig. 2.3.

One of the key ideas of MCTS is to optimise the informativeness of each sample. First, the sampling of the next percept \(e_k\) given a (partially simulated) history \(\ae _{<k}a_k\) should always be done according to the current best estimate of the environment distribution; that is, according to \(M(e_k\mid \ae _{<k}a_k)\) for Solomonoff-based agents.

The sampling of actions is more subtle. The agent itself is responsible for selecting the actions, and actions that the agent knows it will not take are pointless to simulate. As an analogy, when buying a car, I focus the bulk of my cognitive resources on evaluating the feasible options (say, the Ford and the Honda) and only briefly consider clearly infeasible options such as a luxurious Ferrari. Samples should be focused on plausible actions.

One way to make this idea more precise is to think of the sampling choice as a multi-armed bandit problem (a kind of “slot machine” found in casinos). Bandit problems offer a clean mathematical theory for studying the allocation of resources between arms (actions) with unknown returns (value). One of the ideas emerging from the bandit literature is the upper confidence bound (UCB) algorithm, which uses optimistic value estimates \(V^+\). Optimistic value estimates add an exploration bonus to actions that have received comparatively little attention. The bonus means that a greedy agent choosing actions that optimise \(V^+\) will spend a sufficient amount of resources exploring, while still converging on the best action asymptotically.
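As an illustration, here is a minimal Python sketch of the UCB idea for a plain bandit problem; the arm payout probabilities and the exploration constant c are our own example choices:

```python
import math, random

def ucb_bandit(pull_arm, n_arms, n_rounds, c=math.sqrt(2)):
    """Minimal UCB1 sketch: pull_arm(i) returns a stochastic reward in [0, 1].

    Each round we act greedily on the optimistic estimate
        V+(i) = mean_reward(i) + c * sqrt(ln(t) / pulls(i)),
    whose second term is the exploration bonus for rarely tried arms.
    """
    pulls = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, n_rounds + 1):
        if 0 in pulls:                       # try every arm once first
            i = pulls.index(0)
        else:
            i = max(range(n_arms),
                    key=lambda j: means[j] + c * math.sqrt(math.log(t) / pulls[j]))
        r = pull_arm(i)
        pulls[i] += 1
        means[i] += (r - means[i]) / pulls[i]   # incremental mean update
    return means, pulls

# Example: two slot machines with payout probabilities 0.4 and 0.6.
arms = [lambda: float(random.random() < 0.4), lambda: float(random.random() < 0.6)]
means, pulls = ucb_bandit(lambda i: arms[i](), n_arms=2, n_rounds=1000)
```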

The MCTS algorithm uses the UCB algorithm for action sampling, and also uses some dynamic programming techniques to reuse sampling results in a clever way. The MCTS algorithm first caught the attention of AI researchers for its impressive performance in computer Go [22]. Go is infamous for its vast playout trees, and allowed the MCTS sampling ideas to shine.
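A bare-bones sketch of such a search over histories follows (the actual variant used by MC-AIXI-CTW is called \(\rho \)UCT [85]); model.actions and model.sample_percept are assumed interfaces standing in for the learned environment model:

```python
import math

class Node:
    """Sampling statistics for one (partially simulated) history."""
    def __init__(self):
        self.visits = 0
        self.value = 0.0      # running average of sampled returns
        self.children = {}    # action or percept -> child Node

def ucb_pick(node, actions, c=1.4):
    """Choose the action maximising the optimistic estimate V+."""
    def v_plus(a):
        child = node.children.get(a)
        if child is None or child.visits == 0:
            return math.inf                    # try every action once
        return child.value + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(actions, key=v_plus)

def simulate(node, model, history, depth):
    """Sample one path down the planning tree; back up its return."""
    if depth == 0:
        return 0.0
    node.visits += 1
    a = ucb_pick(node, model.actions)
    a_node = node.children.setdefault(a, Node())
    # sample the next percept (including its reward) from the learned model
    e, reward = model.sample_percept(history + [a])
    e_node = a_node.children.setdefault(e, Node())
    ret = reward + simulate(e_node, model, history + [a, e], depth - 1)
    a_node.visits += 1
    a_node.value += (ret - a_node.value) / a_node.visits  # reuse across samples
    return ret

def plan(model, history, horizon, n_samples=1000):
    """Run n_samples simulations, then act greedily on the estimates."""
    root = Node()
    for _ in range(n_samples):
        simulate(root, model, history, horizon)
    return max(model.actions,
               key=lambda a: root.children.setdefault(a, Node()).value)
```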

Induction with contexts. Computing the universal probability \(M(e_t\mid \ae _{<t}a_t)\) of a next percept requires infinite computational resources. To be precise, conditional probabilities for the distribution M are only limit computable [48]. We next describe how probabilities can be computed efficiently with the context tree weighting algorithm (CTW) [86] under some simplifying assumptions.

One of the key features of Solomonoff induction and UAI is the use of histories (Sect. 2.3.1), and the arbitrarily long time dependencies they allow for. For example, action \(a_1\) may affect the percept \(e_{1000}\). This is desirable, since the real world sometimes behaves this way. If I buried a treasure in my backyard 10 years ago, chances are I would find it if I dug there today. However, in most cases, it is the most recent part of the history that is most useful when predicting the next percept. For example, the most recent five minutes are almost always more relevant than a five-minute time slot from a week ago for predicting what is going to happen next.

We define the context of length c of a history as the last c actions and percepts of the history.
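As a minimal illustration (assuming the history is stored as a list of interleaved action and percept symbols), extracting a context is a one-liner:

```python
def context(history, c):
    """Return the context of length c: the last c symbols of the history.

    If fewer than c symbols are available, the whole history serves as
    the context.
    """
    return history[-c:] if c > 0 else []

# Example: the length-2 context of the history [a1, e1, a2, e2] is [a2, e2].
```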

Relying on contexts for prediction makes induction not only computationally faster, but also conceptually easier. For example, if my current context is 0011, then I can use previous instances where I have been in the same context to predict the next percept.

If, say, the context 0011 has occurred three times before and was followed by a 1 in two of those cases, then \(P(1)=2/3\) would be a reasonable prediction. (Laplace’s rule gives a slightly different estimate.) Humans often make predictions this way. For example, when predicting whether I will like the food at a Vietnamese restaurant, I use my experience from previous visits to Vietnamese restaurants.
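The following sketch makes this precise for plain binary sequences: it counts what followed each earlier occurrence of the current context, and returns both the raw frequency estimate and the Laplace-rule estimate (add-one smoothing). The function name and the string representation are ours, for illustration only:

```python
def predict_from_context(sequence, c):
    """Estimate P(next symbol = '1') from earlier occurrences of the
    current length-c context."""
    ctx = sequence[-c:]
    followers = [sequence[i + c]
                 for i in range(len(sequence) - c)
                 if sequence[i:i + c] == ctx]
    ones = followers.count('1')
    freq = ones / len(followers) if followers else 0.5
    laplace = (ones + 1) / (len(followers) + 2)   # Laplace's rule
    return freq, laplace

# Context 0011 occurred three times before, followed by 1, 0, 1:
seq = '0011100110001110011'
print(predict_from_context(seq, 4))   # (0.666..., 0.6)
```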

One question that arises when doing induction with contexts is how long or specific the context should be. Should I use the experience from all Vietnamese restaurants I have ever been to, or only this particular Vietnamese restaurant? Using the latter, I may have very limited data (especially if I have never been to the restaurant before!). On the other hand, a too unspecific context is not useful either: basing my prediction on all restaurants I have ever been to (and not only the Vietnamese ones) will probably make the prediction too unspecific. Table 2.2 summarises the tradeoff between short and long contexts, which is nicely solved by the CTW algorithm.

Table 2.2 The tradeoff between context lengths:
Short context: more data, less precision.
Long context: less data, greater precision.

The right choice of context length depends on a few different parameters. First, it depends on how much data is available. At the beginning of an agent’s lifetime, the history will be short, and mainly shorter contexts will have a chance to produce an adequate amount of data for prediction. Later in the agent’s life, the context can often be more specific, due to the greater amount of accumulated experience.

Second, the ideal context length may depend on the context itself, as the following example aptly demonstrates. Assume you just heard the word cup or cop. Due to the similarity of the words, you are unable to tell which of them it was. If the most recent two words (i.e. the context) were fill the, you can infer that the word was cup, since fill the cop makes little sense. However, if the most recent two words were from the, then further context is required, as both drink from the cup and run from the cop are intelligible statements.

Context Tree Weighting. The Context Tree Weighting (CTW) algorithm is a clever way of adapting the context length based both on the amount of data available and on the context. Similar to how Solomonoff induction uses a sum over all possible computer programs, the CTW algorithm uses a sum over all possible context trees up to a maximum depth D. For example, there are five context trees of depth \(D\le 2\): the single-leaf tree, the depth-1 tree that always uses a context of length 1, and three depth-2 trees that split the 0-branch, the 1-branch, or both.

The structure of a tree encodes when a longer context is needed, and when a shorter context suffices (or is better due to a lack of data). For example, the single-leaf tree corresponds to an iid process, where context is never necessary. The tree of depth \(D=1\) posits that contexts of length 1 are always the appropriate choice. The tree that splits only the 0-branch says that if the context is 1, then that context suffices, but if the most recent symbol is 0, then a context of length two is necessary. Veness et al. [85] offer a more detailed description.

For a given maximum depth D, there are \(O(2^{2^D})\) different trees. The trees can be given binary encodings; the coding of a tree \(\varGamma \) is denoted \({ CL}(\varGamma )\). Each tree \(\varGamma \) gives a probability \(\varGamma (e_t\mid \ae _{<t}a_t)\) for the next percept, given the context it prescribes using. Combining all the predictions yields the CTW distribution:
$$\begin{aligned} { CTW}(e_{<t}\mid a_{<t}) = \sum _{\varGamma }2^{-{ CL}(\varGamma )}\varGamma (e_{<t}\mid a_{<t}) \end{aligned}$$
(2.4)
The CTW distribution is tightly related to the Solomonoff-Hutter distribution (2.1), the primary difference being the replacing of computer programs with context trees. Naively computing \({ CTW}(e_t\mid \ae _{<t}a_t)\) takes double-exponential time. However, the CTW algorithm [86] can compute the prediction \({ CTW}(e_t\mid \ae _{<t}a_t)\) in O(D) time. That is, for fixed D, it is a constant-time operation to compute the probability of a next percept for the current history. This should be compared with the infinite computational resources required to compute the Solomonoff-Hutter distribution M.
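For concreteness, here is a deliberately naive sketch of Eq. (2.4) for plain binary sequences, ignoring actions and truncating short initial contexts (which real implementations handle by padding). It enumerates all context trees up to depth D, scores each with the Krichevsky-Trofimov (KT) estimator at its leaves, and mixes the results with weights \(2^{-{ CL}(\varGamma )}\). This takes double-exponential time; the actual CTW algorithm computes the same mixture recursively in O(D) time per symbol:

```python
from fractions import Fraction
from itertools import product

def enumerate_trees(depth):
    """All context trees of depth <= depth: 'leaf' or {'0': sub, '1': sub}."""
    if depth == 0:
        return ['leaf']
    subtrees = enumerate_trees(depth - 1)
    return ['leaf'] + [{'0': l, '1': r} for l, r in product(subtrees, subtrees)]

def code_length(tree, depth):
    """CL(tree): one bit per node, except forced leaves at maximum depth."""
    if depth == 0:                    # a node at maximum depth must be a leaf
        return 0
    if tree == 'leaf':
        return 1
    return 1 + code_length(tree['0'], depth - 1) + code_length(tree['1'], depth - 1)

def leaf_for(tree, context):
    """Follow the context (most recent symbol first) to the prescribed leaf."""
    node, path = tree, []
    for sym in context:
        if node == 'leaf':
            break
        path.append(sym)
        node = node[sym]
    return tuple(path)

def tree_probability(tree, sequence, depth):
    """Gamma(e_{1:n}): product of KT estimates at the leaves the tree selects."""
    counts = {}                       # leaf -> [#0s, #1s] seen at that leaf
    prob = Fraction(1)
    for t, sym in enumerate(sequence):
        context = sequence[max(0, t - depth):t][::-1]   # most recent first
        c = counts.setdefault(leaf_for(tree, context), [0, 0])
        # Krichevsky-Trofimov estimator: (count + 1/2) / (total + 1)
        prob *= Fraction(2 * c[int(sym)] + 1, 2 * (c[0] + c[1]) + 2)
        c[int(sym)] += 1
    return prob

def ctw_naive(sequence, depth=2):
    """Equation (2.4): mixture over all trees, weighted by 2^{-CL(tree)}."""
    return sum(Fraction(1, 2 ** code_length(tree, depth))
               * tree_probability(tree, sequence, depth)
               for tree in enumerate_trees(depth))

# Predict the next bit as a ratio of two mixture probabilities.
seq = '00110110'
print(float(ctw_naive(seq + '1', 2) / ctw_naive(seq, 2)))
```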

Despite its computational efficiency, the CTW distribution manages to make a weighted prediction based on all context trees within the maximum depth D. The relative weighting between different context trees changes as the history grows, reflecting the success and failure of different context trees in accurately predicting the next percept. In the beginning, the shallower trees will have most of the weight due to their shorter code length. Later on, when the benefit of using longer contexts starts to pay off due to the greater availability of data, the deeper trees will gradually gain an advantage, and absorb most of the weight from the shallower trees. Note that CTW handles partially observable environments, a notoriously hard problem in AI.

MC-AIXI-CTW. Combining the MCTS algorithm for planning with the CTW approximation for induction yields the MC-AIXI-CTW agent. Since it is history-based, MC-AIXI-CTW handles hidden states gracefully (as long as long-term dependencies are not too important). The MC-AIXI-CTW agent can run on a standard desktop computer, and achieves impressive practical performance. For example, MC-AIXI-CTW can learn to play Rock Paper Scissors, TicTacToe, Kuhn Poker, and even Pacman, just by trying actions and observing percepts, and without additional knowledge about the rules of the game [85]. For computational reasons, in Pacman the agent did not view the entire screen, only a compressed version telling it the direction of ghosts and the nearness of food pellets (16 bits in total). Although less informative, this drastically reduced the number of bits per interaction cycle, and allowed for a reasonably short context. The less informative percepts thereby actually made the task computationally easier.
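Schematically, the resulting agent couples the two components in a simple interaction loop. The following sketch uses env, model, and plan as assumed interfaces; it illustrates the architecture, not the actual API of [85]:

```python
def run_agent(env, model, plan, horizon, n_cycles):
    """Sketch of the MC-AIXI-CTW interaction loop.

    The planner (e.g. the MCTS sketch above) uses the model (e.g. a CTW
    predictor) to sample percepts, and the model is refined every cycle.
    """
    history = []
    for _ in range(n_cycles):
        action = plan(model, history, horizon)   # MCTS with CTW as model
        percept = env.step(action)               # percept includes the reward
        history += [action, percept]
        model.update(action, percept)            # refine the CTW mixture
    return history
```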

Other approximations of Solomonoff induction. Despite these impressive results, a major drawback of the CTW approximation of Solomonoff induction is that CTW-agents cannot learn time dependencies longer than the maximum depth D of the context trees. This means that MC-AIXI-CTW will underperform in situations where long-term memory is required.

A few other approaches to approximating Solomonoff induction have been explored, though they are generally less well-developed than CTW.

A seemingly minor generalisation of CTW is to allow loops in context trees. Such loops allow context trees of a limited depth to remember arbitrarily long dependencies, and can significantly improve performance in domains where this is important [12]. However, the loops break some of the clean mathematics of CTW, and predictions can no longer be computed in constant time. Instead, practical implementations must rely on approximations such as simulated annealing to estimate probabilities.

The speed prior [71] is a version of the universal distribution M where the prior is based on both program length and program runtime. The reduced probability of programs with long runtime makes the speed prior computable. It still requires exponential or double-exponential computation time, however [18]. Recent results show that program-based compression can be done incrementally [19]. These results can potentially lead to the development of a more efficient anytime-version of the speed prior. It is an open question whether such a distribution can be made sufficiently efficient to be practically useful.