Imagine a data set of images labeled “suffering” or “no suffering”. For instance, suppose the “suffering” category contains documentation of war atrocities or factory farms, and the “no suffering” category contains innocuous images – say, a library. We could then train a neural network or another machine learning algorithm to detect suffering based on that data. In contrast to many AI safety proposals, this is feasible to at least some extent with present-day methods.
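To make the idea concrete, here is a minimal sketch of such a detector. It assumes that each image has already been reduced to a small numeric feature vector, and it uses randomly generated two-dimensional points as a stand-in for real labeled data. The names `train_classifier` and `suffering_probability` are hypothetical, and plain logistic regression stands in for the neural network. It illustrates the training loop only, not a usable detector.

```python
import math
import random


def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_classifier(examples, labels, epochs=200, lr=0.5):
    """Fit a logistic regression model with stochastic gradient descent."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - y  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def suffering_probability(model, x):
    """Return the model's estimated probability that x shows suffering."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Placeholder data (an assumption, not real image features):
# label 1 = "suffering", label 0 = "no suffering".
rng = random.Random(0)
suffering = [[rng.gauss(1.0, 0.3), rng.gauss(1.0, 0.3)] for _ in range(50)]
no_suffering = [[rng.gauss(-1.0, 0.3), rng.gauss(-1.0, 0.3)] for _ in range(50)]

model = train_classifier(suffering + no_suffering, [1] * 50 + [0] * 50)
```

A real system would replace the toy vectors with features extracted from actual images and would need held-out data to check how well the detector generalizes.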

The neural network could monitor the AI’s output, its predictions of the future, or any other data streams, and possibly shut the system off to make it fail-safe. Additionally, it could scrutinize the AI’s internal reasoning processes, which might help prevent mindcrime.
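The fail-safe arrangement can be sketched as a wrapper around the AI’s output stream. Everything here is an assumption made for illustration: the `SufferingMonitor` class, the `run_with_monitor` helper, the threshold of 0.9, and the lookup table of scores that stands in for a trained detector.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class SufferingMonitor:
    """Halts the monitored system once the detector's score crosses a threshold."""
    detector: Callable[[object], float]  # returns a score in [0, 1]
    threshold: float = 0.9               # hypothetical cut-off
    halted: bool = False
    flagged: List[object] = field(default_factory=list)

    def check(self, item) -> bool:
        """Return True if the item may pass; halt permanently otherwise."""
        if self.halted:
            return False
        if self.detector(item) >= self.threshold:
            self.flagged.append(item)
            self.halted = True
            return False
        return True


def run_with_monitor(stream: Iterable[object], monitor: SufferingMonitor) -> list:
    """Release items from the stream until the monitor shuts the system off."""
    released = []
    for item in stream:
        if not monitor.check(item):
            break
        released.append(item)
    return released


# Placeholder scores standing in for a trained detector's output.
scores = {"plan_a": 0.05, "plan_b": 0.10, "plan_c": 0.97, "plan_d": 0.02}
monitor = SufferingMonitor(detector=lambda item: scores[item])
released = run_with_monitor(["plan_a", "plan_b", "plan_c", "plan_d"], monitor)
```

In this sketch the monitor stays halted after the first flagged item, so `plan_d` is never released even though its own score is low. Whether a shutdown should be permanent or reviewable is a design choice the sketch does not settle.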

The naive form of this idea is bound to fail. The neural network would merely learn to detect suffering in images, not the abstract concept. It would fail to recognize alien or “invisible” forms of suffering, such as the suffering of digital minds. The crux of the matter is the definition of suffering, which raises a plethora of philosophical issues.

An ideal formalization would be comprehensive and at the same time easy to implement in machine learning systems. I suspect that reaching that ideal is difficult, if not impossible, which is why we should also look into heuristics and approximations. Crucially, suffering is “simpler” in a certain sense than the entire spectrum of complex human values, which is why training neural networks – or applying other machine learning methods – is more promising for suffering-focused AI safety than for the goal of loading human values.

If direct implementations turn out to be infeasible, we could look into approaches based on preference inference. AI systems can potentially learn the preference to avoid suffering, like any other preference, via (cooperative) inverse reinforcement learning. Alternatively, we might program AI systems to infer suffering from the actions and expressions of others. That way, if the AI observes an agent struggling to prevent an outcome, the AI should conclude that the realization of that outcome may constitute suffering.
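The second variant, inferring suffering from observed struggle, can be illustrated with a simple Bayesian update. This is a deliberately simplified stand-in and not inverse reinforcement learning. The function names, the prior of 0.1, and the two likelihoods are all assumptions chosen for illustration.

```python
def update_suffering_belief(prior, struggled,
                            p_struggle_if_suffering=0.9,
                            p_struggle_if_not=0.2):
    """Bayes' rule: revise P(outcome is suffering) after one observation."""
    if struggled:
        like_suffering = p_struggle_if_suffering
        like_not = p_struggle_if_not
    else:
        like_suffering = 1.0 - p_struggle_if_suffering
        like_not = 1.0 - p_struggle_if_not
    evidence = like_suffering * prior + like_not * (1.0 - prior)
    return like_suffering * prior / evidence


def infer_suffering(observations, prior=0.1):
    """Map each outcome to the belief that its realization constitutes suffering.

    `observations` is a sequence of (outcome, struggled) pairs, where
    `struggled` records whether an agent was seen trying to prevent the outcome.
    """
    beliefs = {}
    for outcome, struggled in observations:
        current = beliefs.get(outcome, prior)
        beliefs[outcome] = update_suffering_belief(current, struggled)
    return beliefs


# Hypothetical observations: agents consistently resist "injury",
# but only occasionally try to avoid "rain".
observations = [("injury", True)] * 3 + [
    ("rain", False), ("rain", True), ("rain", False),
]
beliefs = infer_suffering(observations)
```

With these assumed numbers, repeated struggle pushes the belief for “injury” far above the belief for “rain”. The conclusion remains probabilistic, in line with the claim that the outcome “may” constitute suffering.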

This requires sufficiently accurate models of other minds, which contemporary machine learning systems lack. It is, however, closer to the technical language of real-world AI systems than purely philosophical descriptions such as “a conscious experience with subjectively negative valence”.

Further research on the idea could focus on three areas: