Owing to the diverse range of settings under which the generative agents framework can be examined (with different action sets, rendering environments, brush types, episode lengths, agent and discriminator hyper-parameters, and so on), we first provide highlights of our main findings in section 3. In figures, we show selected agents to illustrate phenomena that can be observed with the framework; this is not to imply that the settings those agents were trained with are necessary or sufficient for the observed results. In the appendix we provide a systematic analysis of the major components of SPIRAL++ via controlled ablations.

We show highlights of our experiments on CelebA-HQ (Karras et al., 2017) in Figure 4. Amongst all the framework’s settings, the one we observed to have the most profound impact on the agents’ behaviour was the number of brush strokes (steps) they were allowed in each episode to generate an image. Agents constrained to short episodes learned qualitatively different policies from those that could afford numerous interactions, producing images with a degree of visual abstraction, not unlike how humans do when similarly constrained (Selim, 2018; Fan et al., 2019). For this reason we structure the results into two sections: short episodes in subsection 3.1 and long episodes in subsection 3.2.

Figure 4: Random unconditional samples generated by agents trained on CelebA-HQ. Separate agents with varying hyper-parameters were trained to generate samples (a, b, c) in 17 steps with various brushes and action spaces, (d) in 19 steps using an oil paint simulator, (e) in 31 steps, (f) in 400 steps, and (g) in 1000 steps. As a baseline, (h) shows unconditional samples generated in 19 steps by our best reimplementation of SPIRAL (WGAN-GP + single discriminator per population + hyper-parameter tweaks). Our improved method scales with episode length, from relatively abstract images to near-realistic-looking results.

3.1 Short episodes

We encourage the reader to take a few moments to inspect the samples in Figure 4(a-c). Each row was generated by a different agent, differing in the settings of their environments (e.g. brush type, action space, episode length).

It was surprising for us to see the aesthetically pleasing way in which the agent draws faces: e.g. using a large circle to delineate the outline of the face, dots for each of the eyes, a line for the nose and a line for the mouth. Note that the agent has never been exposed to human drawings of human faces, only to realistic photographs of them. Also note that the agents choose bright colours and thin strokes to depict salient elements of faces, despite no element of the framework encouraging such behaviour. In all cases the architecture of the agent is constant, and it is the variation in the characteristics of the agent’s environment that creates the diversity of styles. These results serve as an existence proof for the conjecture that some aspects of human drawings of faces can emerge automatically from a learning framework as simple as that of generative agents (namely agents equipped with brushes working against a discriminator), without the need for supervision, imitation or social cues.

Unlike most modern GANs, which directly output pixels, in this setting there is a large discrepancy between what the generator can produce and what it should produce. The brushes are too constrained and there is simply not enough time in the episodes for the generator to produce a completely photo-realistic image; indeed, in our experiments the discriminators can always distinguish between generated and real images with high confidence. Nevertheless, we observe that when sufficiently regularised, discriminators still provide enough signal for meaningful learning to take place, suggesting that they remain capable of ranking generated images in a useful way. We explore this further in Figure 5, by training agents on colour photos but only with various grayscale brushes.
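The reward structure this relies on can be sketched as follows. This is a toy illustration, not our training code: the discriminator here is a stand-in stub (a real setup would use a regularised critic network), but it shows the essential point that the agent receives no reward for intermediate strokes and only the discriminator’s score on the final canvas, so the discriminator only needs to rank canvases, not to be fooled.

```python
def discriminator_score(canvas):
    """Stand-in for a regularised discriminator (e.g. a WGAN-GP critic).
    In practice this is a trained network; here we simply score canvases
    by mean darkness so that the sketch is runnable. Higher = 'better'."""
    return -sum(canvas) / len(canvas)

def episode_rewards(canvases):
    """Sparse, terminal-only reward: zero at every intermediate step,
    the discriminator's score of the final canvas at the last step."""
    rewards = [0.0] * (len(canvases) - 1)
    rewards.append(discriminator_score(canvases[-1]))
    return rewards

# A toy 3-step episode over a 4-pixel canvas (1.0 = white, 0.0 = ink).
episode = [[1.0, 1.0, 1.0, 1.0],   # blank canvas
           [1.0, 0.0, 1.0, 1.0],   # after one stroke
           [0.0, 0.0, 1.0, 0.0]]   # final canvas
print(episode_rewards(episode))    # only the last entry is non-zero
```

Because only relative scores matter for ranking, the discriminator can confidently separate real from generated images and still usefully order the generated ones.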

(a) gray with gray (diverse) (b) colour with gray (mode-collapse) (c) gray with black (mode-collapse) Figure 5: Tasked with the impossible, these agents make do. Samples are selected from three 20-step models trained on CelebA-HQ with modified environments. As a baseline, (a) was tasked with generating grayscale photos using a black brush with variable opacity; (b) was tasked with generating colour photos using the same variable-opacity black brush; (c) was tasked with generating grayscale photos using an opaque black pen. Models (b) and (c) often manage to draw recognizable faces despite the huge gap between what they can and should produce; however, both experience severe mode collapse: each agent in these populations generates minor variations on a single image rather than a full distribution of images.

It is informative to examine how agents interact with the simulated canvas to produce these images. In Figure 6 we show a number of episodes as sequences. We see that agents can learn to manipulate the location, colour and thickness of the brush with precision, constructing the final images stroke by stroke. It is worth noting that, because RL maximises potentially delayed rewards, agent policies often deviate from being purely greedy: agents often take actions that appear to reduce the quality of the image, especially early in the episode, only for those actions to prove important to the final drawing. We revisit this point in subsection 3.2 and Figure 17 in the appendix.
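One way to see why greedy, per-step behaviour is not optimal here: with a terminal-only reward, the discounted return at every step is driven by the final canvas, so an early stroke that temporarily "worsens" the image is still credited with (almost) the full terminal reward. A minimal sketch of the standard discounted-return computation, under the assumption of a single reward at the last step:

```python
def discounted_returns(rewards, gamma=0.99):
    """Standard discounted return G_t = sum_k gamma^k * r_{t+k},
    computed with a single backward pass over the reward sequence."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# Terminal-only reward, as in the painting setup: the action at step 0
# receives return gamma**3 * 1.0, nearly the full terminal reward, even
# though its immediate reward is zero.
print(discounted_returns([0.0, 0.0, 0.0, 1.0]))
```

With this credit assignment, an agent is free to lay down scaffolding strokes early in the episode as long as they pay off by the final step.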

Figure 6: 10-step canvas sequences show precise control. Agents interact with the canvas to produce the final image, manipulating the location, colour and thickness of the brush with precision.

As described in subsection 2.4, we use population-based training (PBT; Jaderberg et al., 2017). In Figure 7 (and Figure 11 in the appendix) we show samples from three different agents that were trained as part of the same population. We see that the different generators each specialize in a subset of the modes of the full distribution, each producing images with a perceptibly different style. Note that in all three cases the architecture of the agents, their action spaces, and the settings of their environments were identical; they differ only in their weights and their evolved learning rate and entropy cost. Finally, we show conditional samples in Figure 8. The agent is able to match the higher-level statistics of the target images, and even appears to capture the faces’ smiles. In Figure 12 in the appendix we visualise the agent’s stochasticity by producing multiple samples for the same target image.
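The exploit/explore loop that evolves these hyper-parameters can be sketched as follows. This is a deliberately simplified, hypothetical version of PBT (the member fields and the truncation-by-worst rule here are illustrative, not our exact implementation): a poorly performing population member copies the weights of a well-performing one, then perturbs its own learning rate and entropy cost.

```python
import copy
import random

def pbt_step(population, perturb=1.2):
    """One exploit/explore step in the spirit of PBT (Jaderberg et al., 2017).
    Simplified: the worst member copies the best member's weights and
    hyper-parameters, then perturbs its learning rate and entropy cost."""
    ranked = sorted(population, key=lambda m: m["fitness"])
    worst, best = ranked[0], ranked[-1]
    worst["weights"] = copy.deepcopy(best["weights"])        # exploit
    for hp in ("learning_rate", "entropy_cost"):             # explore
        worst[hp] = best[hp] * random.choice((1 / perturb, perturb))
    return population

# A toy population of two members with hypothetical fields.
population = [
    {"weights": [0.1], "learning_rate": 1e-4, "entropy_cost": 0.01, "fitness": 0.2},
    {"weights": [0.5], "learning_rate": 3e-4, "entropy_cost": 0.03, "fitness": 0.9},
]
pbt_step(population)
```

Because each member's perturbations accumulate over training, members of the same population drift to different hyper-parameter settings and, as Figure 7 shows, to perceptibly different styles.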

Figure 7: Within-population and within-agent diversity (short episodes). We show three sets of samples for three different agents in the same population, illustrating both the diversity of samples from each agent and the diversity of samples across agents in the same population. The first agent attempts more figurative line drawings, whilst the last agent uses more realistic shading. We do occasionally observe ‘mode collapse’, where certain samples repeat themselves, for instance in the middle agent, though there is still slight variation in the way each image is drawn.