The Rubik’s cube

The 3 × 3 × 3 Rubik’s cube consists of smaller cubes called cubelets. These are classified by their sticker count: centre, edge and corner cubelets have 1, 2 and 3 stickers, respectively. The Rubik’s cube has 26 cubelets with 54 stickers in total. Each sticker has one of six colours, with one colour per face. In the solved state, all stickers on each face of the cube are the same colour. Given that the set of stickers on each cubelet is unique (that is, there is only one cubelet with white, red and green stickers), the 54 stickers themselves can be uniquely identified in any legal configuration of the Rubik’s cube. The representation given to the DNN encodes the colour of each sticker at each location using a one-hot encoding. As there are six possible colours and 54 stickers in total, this results in a state representation of size 324.
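As an illustration, the following minimal Python sketch builds such a one-hot representation; the sticker indexing and colour numbering are arbitrary conventions chosen for the example, not the exact ones used in our implementation.

```python
import numpy as np

NUM_STICKERS = 54   # 6 faces x 9 stickers
NUM_COLOURS = 6

def encode_state(sticker_colours):
    """One-hot encode a cube state.

    sticker_colours: length-54 sequence of colour indices in 0..5
    (an illustrative indexing convention).
    Returns a flat vector of size 54 * 6 = 324.
    """
    one_hot = np.zeros((NUM_STICKERS, NUM_COLOURS), dtype=np.float32)
    one_hot[np.arange(NUM_STICKERS), sticker_colours] = 1.0
    return one_hot.reshape(-1)

# Solved cube: stickers 0-8 are colour 0, stickers 9-17 colour 1, and so on.
solved = np.repeat(np.arange(NUM_COLOURS), 9)
assert encode_state(solved).shape == (324,)
```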

Moves are represented using face notation: a move is a letter stating which face to rotate. F, B, L, R, U and D correspond to turning the front, back, left, right, up and down faces, respectively. Each face name is in reference to a fixed front face. A clockwise rotation is represented with a single letter, whereas a letter followed by an apostrophe represents an anticlockwise rotation. For example: R rotates the right face by 90° clockwise, while R′ rotates it by 90° anticlockwise.

The Rubik’s cube state space has \(4.3 \times 10^{19}\) possible states. Any valid Rubik’s cube state can be optimally solved with at most 26 moves in the quarter-turn metric, or 20 moves in the half-turn metric22,25. The quarter-turn metric treats 180° rotations as two moves, whereas the half-turn metric treats 180° rotations as one move. We use the quarter-turn metric.

Additional combinatorial puzzles

Sliding puzzles

Another combinatorial puzzle we use to test DeepCubeA is the n-piece sliding puzzle. In the n puzzle, n square sliding tiles, numbered from 1 to n, are positioned in a square grid of side length \(\sqrt{n + 1}\), with one empty tile position. Thus, the 15 puzzle consists of 15 tiles in a 4 × 4 grid, the 24 puzzle consists of 24 tiles in a 5 × 5 grid, the 35 puzzle consists of 35 tiles in a 6 × 6 grid and the 48 puzzle consists of 48 tiles in a 7 × 7 grid. Moves are made by swapping the empty position with any tile that is horizontally or vertically adjacent to it. For all of these puzzles, the representation given to the neural network uses a one-hot encoding to specify which piece (tile or blank position) is in each position. For example, the dimension of the input to the neural network for the 15 puzzle is 16 × 16 = 256. The 15 puzzle has \(16!/2 \approx 1.0 \times 10^{13}\) possible states, the 24 puzzle has \(25!/2 \approx 7.7 \times 10^{24}\) possible states, the 35 puzzle has \(36!/2 \approx 1.8 \times 10^{41}\) possible states and the 48 puzzle has \(49!/2 \approx 3.0 \times 10^{62}\) possible states. Any valid 15 puzzle configuration can be solved with at most 80 moves33,34. The maximum number of moves required to optimally solve the 24 puzzle, 35 puzzle and 48 puzzle is not known.
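The move rule can be illustrated by generating the successors of an n puzzle state, as in the Python sketch below; the row-major tuple representation and the use of 0 for the blank are conventions chosen for the example.

```python
import math

def neighbours(state):
    """Generate successor states of an n puzzle position.

    state: tuple in row-major order, with 0 marking the blank
    (an illustrative convention, not necessarily ours).
    """
    n = int(math.isqrt(len(state)))           # side length of the board
    blank = state.index(0)
    r, c = divmod(blank, n)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < n and 0 <= nc < n:       # stay on the board
            j = nr * n + nc
            s = list(state)
            s[blank], s[j] = s[j], s[blank]   # slide the adjacent tile
            yield tuple(s)

# Solved 15 puzzle: tiles 1..15 followed by the blank.
goal = tuple(range(1, 16)) + (0,)
assert len(list(neighbours(goal))) == 2       # blank in a corner: two moves
```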

Lights Out

Lights Out contains \(N^2\) lights on an N × N board. The lights can be either on or off. The representation given to the DNN is a vector of size \(N^2\). Each element is 1 if the corresponding light is on and 0 if the corresponding light is off.

Sokoban

The Sokoban environment we use is a 10 × 10 grid that contains four boxes that an agent needs to push onto four targets. In addition to the agent, boxes and targets, Sokoban also contains walls. The representation given to the DNN contains four binary vectors of size \(10^2\) that represent the positions of the agent, boxes, targets and walls. Given that boxes can only be pushed, not pulled, some actions are irreversible. For example, a box pushed into a corner can no longer be moved, creating a sampling problem because some states are unreachable when starting from the goal state. To address this, for each training state, we start from the goal state and allow boxes to be pulled instead of pushed.
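This reverse scrambling can be sketched as follows; the coordinate-set representation and helper names are illustrative assumptions rather than the environment's actual implementation.

```python
import random

def reverse_scramble(goal_agent, goal_boxes, walls, k):
    """Generate a Sokoban training state by starting from the goal and
    pulling boxes instead of pushing them (a minimal sketch; the real
    environment also encodes agent, boxes, targets and walls as the
    four binary grids described above). Walls are assumed to include
    the border of the board.
    """
    agent, boxes = goal_agent, set(goal_boxes)
    for _ in range(k):
        dr, dc = random.choice([(-1, 0), (1, 0), (0, -1), (0, 1)])
        nxt = (agent[0] + dr, agent[1] + dc)      # cell the agent steps into
        if nxt in walls or nxt in boxes:
            continue                              # blocked: try another move
        behind = (agent[0] - dr, agent[1] - dc)   # cell behind the agent
        if behind in boxes:
            boxes.remove(behind)
            boxes.add(agent)                      # pull the box along
        agent = nxt
    return agent, boxes
```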

Deep approximate value iteration

Value iteration15 is a dynamic programming algorithm14,16 that iteratively improves a cost-to-go function J. In traditional value iteration, J takes the form of a lookup table where the cost-to-go J(s) is stored in a table for all possible states s. Value iteration loops through each state s and updates J(s) until convergence:

$$J(s) \leftarrow \min_a \sum_{s'} P^a(s, s')\left(g^a(s, s') + \gamma J(s')\right)$$ (3)

Here \(P^a(s, s')\) is the transition matrix representing the probability of transitioning from state s to state s′ by taking action a; \(g^a(s, s')\) is the cost associated with transitioning from state s to s′ by taking action a; γ is the discount factor. In principle, this update equation can also be applied to the puzzles investigated in this Article. However, as these puzzles are deterministic, the transition function is a degenerate probability mass function for each action, simplifying equation (3). Furthermore, because we wish to assign equal importance to all costs, γ = 1. Therefore, we can update J(s) using equation (1).
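In code, the simplified update reduces to a minimization over successor states; the helpers A(s, a) and g(s, a) below mirror the notation of equation (1) and are assumed interfaces.

```python
def updated_cost_to_go(J, s, actions, A, g):
    """Equation (3) with a deterministic transition function and
    gamma = 1, i.e. equation (1): each action a leads to the single
    successor A(s, a) at cost g(s, a), so the expectation over s'
    collapses. J is any callable giving the current cost-to-go estimate.
    """
    return min(g(s, a) + J(A(s, a)) for a in actions)
```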

However, given the size of the state space of the Rubik’s cube, maintaining a table to store the cost-to-go of each state is not feasible. Therefore, we resort to approximate value iteration16. Instead of representing the cost-to-go function as a lookup table, we approximate the cost-to-go function using a parameterized function \(j_\theta\), with parameters θ. This function is implemented using a DNN. Therefore, we call the resulting algorithm DAVI:

Algorithm 1: DAVI

Input:
  B: Batch size
  K: Maximum number of scrambles
  M: Training iterations
  C: How often to check for convergence
  \(\epsilon\): Error threshold
Output:
  θ: Trained neural network parameters

θ ← initialize_parameters()
\(\theta_{\mathrm{e}}\) ← θ
for m = 1 to M do
  X ← get_scrambled_states(B, K)
  for \(x_i\) ∈ X do
    \(y_i \leftarrow \min_a \left(g^a(x_i, A(x_i, a)) + j_{\theta_{\mathrm{e}}}(A(x_i, a))\right)\)
  \(\theta, \mathrm{loss} \leftarrow \mathrm{train}(j_\theta, X, \mathbf{y})\)
  if (m mod C = 0) and (\(\mathrm{loss} < \epsilon\)) then
    \(\theta_{\mathrm{e}}\) ← θ
Return θ

To train the DNN, we have two sets of parameters: the parameters being trained, θ, and the parameters used to obtain an improved estimate of the cost-to-go, \(\theta_{\mathrm{e}}\). The output of \(j_{\theta_{\mathrm{e}}}(s)\) is set to 0 if s is the goal state. The DNN is trained to minimize the mean squared error between its estimate of the cost-to-go and the estimate obtained from equation (1). Every C iterations, the algorithm checks whether the error falls below the threshold \(\epsilon\); if so, \(\theta_{\mathrm{e}}\) is set to θ. The entire DAVI process is shown in Algorithm 1. Although we tried updating \(\theta_{\mathrm{e}}\) at each iteration, we found that the performance saturated after a certain point and sometimes became unstable. Updating \(\theta_{\mathrm{e}}\) only after the error falls below the threshold \(\epsilon\) yields better, more stable performance.
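A condensed PyTorch sketch of Algorithm 1 is given below; get_scrambled_states and cost_to_go_targets are assumed helpers (the latter computes the targets \(y_i\) with the output of \(j_{\theta_{\mathrm{e}}}\) fixed to 0 at the goal state), and details such as the data pipeline and multi-GPU training are omitted.

```python
import copy
import torch
import torch.nn as nn

def davi(j_theta, get_scrambled_states, cost_to_go_targets,
         B=10_000, K=30, M=1_000_000, C=5_000, eps=0.05):
    """Condensed sketch of Algorithm 1 (DAVI). `get_scrambled_states`
    returns a batch of encoded scrambled states as a tensor;
    `cost_to_go_targets(j_e, X)` computes
    y_i = min_a (g^a(x_i, A(x_i, a)) + j_e(A(x_i, a))).
    """
    j_e = copy.deepcopy(j_theta).eval()       # theta_e: frozen target network
    opt = torch.optim.Adam(j_theta.parameters())
    mse = nn.MSELoss()
    for m in range(1, M + 1):
        X = get_scrambled_states(B, K)        # states scrambled 1..K times
        with torch.no_grad():
            y = cost_to_go_targets(j_e, X)    # improved cost-to-go estimates
        loss = mse(j_theta(X).squeeze(-1), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if m % C == 0 and loss.item() < eps:  # convergence check
            j_e.load_state_dict(j_theta.state_dict())   # theta_e <- theta
    return j_theta
```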

Training set state distribution

For learning to occur, we must train on a state distribution that allows information to propagate from the goal state to all the other states seen during training. Our approach for achieving this is simple: each training state \(x_i\) is obtained by randomly scrambling the goal state \(k_i\) times, where \(k_i\) is uniformly distributed between 1 and K. During training, the cost-to-go function first improves for states that are only one move away from the goal state. The cost-to-go function then improves for states further away as the reward signal is propagated from the goal state to other states through the cost-to-go function. This can be seen as a simplified version of prioritized sweeping35. Exploring in reverse from the goal state is a well-known technique and has been used in means-end analysis36 and STRIPS37. In future work we will explore different ways of generating a training set distribution.
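A minimal sketch of this sampling scheme, assuming a helper apply_random_move that applies one uniformly random legal move:

```python
import random

def get_scrambled_states(goal_state, apply_random_move, B, K):
    """Sample B training states by scrambling the goal state k_i times,
    with k_i uniform on {1, ..., K}. `apply_random_move` is an assumed
    helper that applies one uniformly random legal move to a state.
    """
    states = []
    for _ in range(B):
        k_i = random.randint(1, K)   # scramble depth for this sample
        s = goal_state
        for _ in range(k_i):
            s = apply_random_move(s)
        states.append(s)
    return states
```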

Distributed training

In the Rubik’s cube environment, there are 12 possible actions that can be applied to every state. Using equation (1) to update the cost-to-go estimate of a single state thus requires applying the DNN to 12 states. As a result, equation (1) takes up the majority of the computational time. However, as is the case with methods such as ExIt38, this is a trivially parallelizable task that can easily be distributed across multiple GPUs.
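A sketch of this batched evaluation follows; the expand helper, which encodes all successors of a batch of states, is an assumption, and torch.nn.DataParallel is used purely for illustration rather than as the distribution mechanism of our implementation.

```python
import torch

def expand_and_evaluate(j_theta, states, expand, n_actions=12):
    """Evaluate the cost-to-go of every successor of a batch of states
    in a single DNN call. `expand` is an assumed helper returning a
    (batch * n_actions, 324) tensor of encoded successor states;
    DataParallel splits the forward pass across the available GPUs.
    """
    model = torch.nn.DataParallel(j_theta)
    succ = expand(states)                         # all successors, encoded
    with torch.no_grad():
        vals = model(succ).view(-1, n_actions)    # (batch, n_actions)
    return vals
```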

BWAS

A* search17 is a heuristic-based search algorithm that finds a path between a starting node \(x_s\) and a goal node \(x_g\). A* search maintains a set, OPEN, from which it iteratively removes and expands the node with the lowest cost. The cost of each node x is determined by the function f(x) = g(x) + h(x), where g(x) is the path cost (the distance between \(x_s\) and x) and h(x) is the heuristic function, which estimates the distance between x and \(x_g\). After a node is expanded, that node is then added to another set, CLOSED, and its children that are not already in CLOSED are added to OPEN. The algorithm starts with only the starting node in OPEN and terminates when the goal node is removed from OPEN.

In this application, each node corresponds to a state of the Rubik’s cube and the goal node corresponds to the goal state shown in Fig. 1. The path cost of every child of a node x is set to g(x) + 1. The path cost of \(x_s\) is 0. The heuristic function h(x) is obtained from the learned cost-to-go function shown in equation (2).

A variant of A* search, called weighted A* search19, trades potentially longer solutions for potentially less memory usage. In this case, the function f(x) is modified to f(x) = λg(x) + h(x), with weight λ ∈ [0, 1]. While decreasing the weight λ will not necessarily decrease the number of nodes generated39, in practice our experiments show that decreasing λ generally reduces the number of nodes generated and increases the length of the solutions found. In our implementation, if we encounter a node x that is already in CLOSED but has a lower path cost than the stored node, we remove the stored node from CLOSED and add x to OPEN.

The most time-consuming aspect of the algorithm is the computation of the heuristic h(x). The heuristic of many nodes can be computed in parallel across multiple GPUs by expanding the N best nodes from OPEN at each iteration. Our experiments show that larger values of N generally lead to shorter solutions and evaluate more nodes per second than searches with smaller N. We call the combination of A* search with a path-cost weight λ and a batch size of N batch-weighted A* search (BWAS).
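A compact Python sketch of BWAS follows; h_batch is an assumed helper that evaluates the DNN heuristic on a list of states in one batch (this is where the GPU parallelism pays off), and for brevity the sketch returns only the solution cost, whereas keeping parent pointers would recover the move sequence itself.

```python
import heapq

def bwas(start, goal, neighbours, h_batch, lam=0.6, N=10_000):
    """Sketch of batch-weighted A* search. States are assumed to be
    hashable, comparable objects (for example, tuples); f = lam*g + h.
    """
    OPEN = [(h_batch([start])[0], 0, start)]   # heap of (f, g, state)
    best_g = {start: 0}                        # CLOSED: best path cost seen
    while OPEN:
        batch = [heapq.heappop(OPEN) for _ in range(min(N, len(OPEN)))]
        children, child_gs = [], []
        for f, g, x in batch:
            if x == goal:
                return g                       # goal removed from OPEN
            for c in neighbours(x):
                if c not in best_g or g + 1 < best_g[c]:
                    best_g[c] = g + 1          # re-open cheaper duplicates
                    children.append(c)
                    child_gs.append(g + 1)
        if children:                           # one batched heuristic call
            for c, gc, hc in zip(children, child_gs, h_batch(children)):
                heapq.heappush(OPEN, (lam * gc + hc, gc, c))
    return None
```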

To satisfy the theoretical bounds on how much the length of a solution will deviate from the length of an optimal solution, the heuristic used in the weighted A* search must be admissible. That is, the heuristic must never overestimate the cost to reach the goal. Although DeepCubeA’s value function is not admissible, we empirically evaluate by how much DeepCubeA overestimates the cost to reach the goal. To do this, we obtain the length of a shortest path to the goal for 100,000 Rubik’s cube states scrambled between 1 and 30 times. We then evaluate those same states with DeepCubeA’s heuristic function \(j_\theta\). We find that DeepCubeA’s heuristic function does not overestimate the cost to reach the goal 66.8% of the time, and 97.4% of the time it overestimates the cost by at most one move. The average overestimation of the cost is 0.24.

Neural network architecture

The first two hidden layers of the DNNs have sizes of 5,000 and 1,000, respectively, with full connectivity. These are then followed by four residual blocks27, where each residual block has two hidden layers of size 1,000. Finally, the output layer consists of a single linear unit representing the cost-to-go estimate (Supplementary Fig. 3). We used batch normalization40 and rectified linear activation functions41 in all hidden layers. The DNN was trained with a batch size of 10,000, optimized with ADAM42, and did not use any regularization. The maximum number of random moves applied to any training state, K, was set to 30. The error threshold \(\epsilon\) was set to 0.05. We checked if the loss fell below the error threshold every 5,000 iterations. Training was carried out for 1 million iterations on two NVIDIA Titan V GPUs, with six other GPUs used in parallel for data generation. In total, the DNN saw 10 billion examples during training. Training was completed in 36 h. When solving scrambled cubes from the test set, we use four NVIDIA X Pascal GPUs in parallel to compute the cost-to-go estimate. For the 15 puzzle, 24 puzzle and Lights Out we set K to 500. For the 35 puzzle, 48 puzzle and Sokoban we set K to 1,000. For the 24 puzzle we use six residual blocks instead of four.
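A PyTorch sketch of this architecture is shown below; details not fully specified above, such as the exact placement of batch normalization within the residual blocks, are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two 1,000-unit layers with batch norm, ReLU and a skip connection
    (placement of the normalization relative to the skip is a guess)."""
    def __init__(self, dim=1000):
        super().__init__()
        self.fc1, self.bn1 = nn.Linear(dim, dim), nn.BatchNorm1d(dim)
        self.fc2, self.bn2 = nn.Linear(dim, dim), nn.BatchNorm1d(dim)

    def forward(self, x):
        y = F.relu(self.bn1(self.fc1(x)))
        y = self.bn2(self.fc2(y))
        return F.relu(x + y)

class CostToGoNet(nn.Module):
    """5,000- and 1,000-unit fully connected layers, residual blocks,
    and a single linear output unit: the cost-to-go estimate."""
    def __init__(self, in_dim=324, n_blocks=4):
        super().__init__()
        self.fc1, self.bn1 = nn.Linear(in_dim, 5000), nn.BatchNorm1d(5000)
        self.fc2, self.bn2 = nn.Linear(5000, 1000), nn.BatchNorm1d(1000)
        self.blocks = nn.Sequential(*[ResidualBlock() for _ in range(n_blocks)])
        self.out = nn.Linear(1000, 1)

    def forward(self, x):
        x = F.relu(self.bn1(self.fc1(x)))
        x = F.relu(self.bn2(self.fc2(x)))
        return self.out(self.blocks(x))
```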

Comparison to multi-step lookahead update strategies

Instead of using equation (1), which may be seen as a depth-1 breadth-first search (BFS), to update the estimated cost-to-go function, we experimented with a depth-2 BFS. To put DeepCubeA’s learning procedure into perspective, we also implemented an update strategy that directly imitates the optimal cost-to-go function calculated by the handmade optimal solver25, minimizing the mean squared error between the output of the DNN and the oracle value provided by the optimal solver. We demonstrate that the DNN trained with DAVI achieves the same performance as a DNN with the same architecture trained with these update strategies. The performance obtained with the depth-2 BFS update strategy is shown in Supplementary Fig. 1. Although the final performance obtained with depth-2 BFS is similar to that obtained with depth-1 BFS, its computational cost is significantly higher: even when using 20 GPUs in parallel for data generation (instead of six), the training time is five times longer for the same number of iterations. Supplementary Fig. 2 shows that the DNN trained to imitate the optimal cost-to-go function predicts the optimal cost-to-go more accurately than DeepCubeA for states scrambled 20 or more times. The figure also shows that the performance on solving puzzles using a greedy best-first search with this imitated cost-to-go function suffers for states scrambled fewer than 20 times. We speculate that this is because imitating the optimal cost-to-go function causes the DNN to overestimate the cost to reach the goal for such states.

Hyperparameter selection for BWAS

To choose the hyperparameters of BWAS, we carried out a grid search over λ and N. Values of λ were 0.0, 0.2, 0.4, 0.6, 0.8 and 1.0 and values of N were 1, 100, 1,000 and 10,000. The grid search was performed on 100 cubes that were generated separately from the test set. The GPU machines available to us had 64 GB of RAM. Hyperparameter configurations that reached this limit were stopped early and thus not included in the results. Supplementary Fig. 4 shows how λ and N affect performance in terms of average solution length, average number of nodes generated, average solve time and average number of nodes generated per second. The figure shows that as λ increases, the average solution length decreases; however, the time to find a solution typically increases as well. The results also show that larger values of N lead to shorter solution lengths, but generally also require more time to find a solution; however, the number of nodes generated per second also increases due to the parallelism provided by the GPUs. Because λ = 0.6 and N = 10,000 resulted in the shortest solution lengths, we use these hyperparameters for the Rubik’s cube. For the 15 puzzle, 24 puzzle and 35 puzzle we use λ = 0.8 and N = 20,000. For the 48 puzzle we use λ = 0.6 and N = 20,000. We increased N from 10,000 to 20,000 because we saw a reduction in solution length. For Lights Out we use λ = 0.2 and N = 1,000. For Sokoban we use λ = 0.8 and N = 1.

PDBs

PDBs26 are used to obtain a heuristic using lookup tables. Each lookup table contains the number of moves required to solve all possible combinations of a certain subgoal. For example, we can obtain a lookup table by enumerating all possible combinations of the edge cubelets on the Rubik’s cube using a BFS. These lookup tables are then combined through either a max operator or a sum operator (depending on independence between subgoals)7,8 to produce a lower bound on the number of steps required to solve the problem. Features from different PDBs can be combined with neural networks for improved performance43.
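A minimal sketch of how such lookups are combined; the (project, table) pairs are assumed interfaces, with project mapping a full state onto its subgoal pattern.

```python
def pdb_heuristic(state, pdbs, additive=False):
    """Combine pattern database lookups into a single heuristic value.
    `pdbs` is a list of (project, table) pairs: `project` maps a state
    onto its subgoal pattern and `table` stores the number of moves
    needed to solve that pattern (both assumed interfaces). Disjoint
    additive PDBs are summed for a tighter bound; otherwise the
    maximum is taken, which preserves admissibility.
    """
    costs = [table[project(state)] for project, table in pdbs]
    return sum(costs) if additive else max(costs)
```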

For the Rubik’s cube, we implemented the PDB that Korf uses to find optimal solutions to the Rubik’s cube7. For the 15 puzzle, 24 puzzle and 35 puzzle, we implemented the PDBs described in the work of Felner and colleagues on additive PDBs9. To the best of our knowledge, no one has created a PDB for the 48 puzzle. We create our own by partitioning the puzzle into nine subgoals of size 5 and one subgoal of size 3. For all the n puzzles, we also save the mirror of each PDB to improve the heuristic and map each lookup table to a representation of size \(p^k\), where p is the total number of puzzle pieces and k is the size of the subgoal. Although this uses more memory, it increases the speed of the lookup9. For the n puzzle, the optimal solver algorithm (IDA*23) adds a further optimization: it computes the location of the starting state in the lookup table once and then computes only offsets for each subsequently generated state.

Web server

We have created a web server, located at http://deepcube.igb.uci.edu/, to allow anyone to use DeepCubeA to solve the Rubik’s cube. In the interest of speed, the hyperparameters for BWAS are set to λ = 0.2 and N = 100 in the server. The user can initiate a request to scramble the cube randomly or use the keyboard keys to scramble the cube as they wish. The user can then use the ‘solve’ button to have DeepCubeA compute and post a solution, and execute the corresponding moves. The basic web server’s interface is displayed in Supplementary Fig. 5.

Conjugate patterns and symmetric states

Because the operation of the Rubik’s cube is deeply rooted in group theory, solutions produced by an algorithm that learns how to solve this puzzle should contain group theory properties. In particular, conjugate patterns of moves of the form \(aba^{-1}\) should appear relatively often when solving the Rubik’s cube. These patterns are necessary for manipulating specific cubelets while not affecting the positions of other cubelets. Using a sliding window, we gathered all triplets in all solutions to the Rubik’s cube and found that \(aba^{-1}\) accounted for 13.11% of all triplets (significantly above random), while aba accounted for 8.86%, aab accounted for 4.96% and abb accounted for 4.92%. To put this into perspective, for the optimal solver, \(aba^{-1}\), aba, aab and abb accounted for 9.15, 9.63, 5.30 and 5.35% of all triplets, respectively.
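These triplet statistics can be gathered with a simple sliding-window count, as in the sketch below; the move-string conventions are illustrative, and each pattern is counted independently.

```python
from collections import Counter

def inverse(move):
    """Inverse of a move in face notation, for example R <-> R'."""
    return move[:-1] if move.endswith("'") else move + "'"

def triplet_pattern_counts(solutions):
    """Sliding-window count of triplet patterns over a collection of
    solutions, each given as a list of moves such as ['R', "U'", 'R'].
    """
    counts, total = Counter(), 0
    for sol in solutions:
        for a, b, c in zip(sol, sol[1:], sol[2:]):
            total += 1
            if c == inverse(a):
                counts["aba^-1"] += 1   # conjugate pattern
            if c == a:
                counts["aba"] += 1
            if a == b:
                counts["aab"] += 1
            if b == c:
                counts["abb"] += 1
    return counts, total
```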

In addition, we found that DeepCubeA often found symmetric solutions to symmetric states. One can produce a symmetric state for the Rubik’s cube by mirroring the cube from left to right, as shown in Fig. 4. The optimal solutions for two symmetric states have the same length; furthermore, one can use the mirrored solution of one state to solve the other. To see if this property was present in DeepCubeA, we created mirrored states of the Rubik’s cube test set and solved them using DeepCubeA. The results showed that 58.30% of the solutions to the mirrored test set were symmetric to those of the original test set. Of the solutions that were not symmetric, 69.54% had the same solution length as the solution length obtained on the original test set. To put this into perspective, for the handmade optimal solver, the results showed that 74.50% of the solutions to the mirrored test set were symmetric to those of the original test set.