This paper is an extended version of our previous paper [BSB*18], published at Expressive 2018. As in the original paper, we first discuss various image filtering and stylization techniques that can be used to dampen the negative affect elicited by surgery pictures. We then report on an interview session with four surgeons, who helped us differentiate between techniques that can preserve important information and techniques that are unusable because they obfuscate too much. We then report on an experiment where the most promising techniques were tested on ordinary subjects. We found that all techniques can reduce the repulsiveness of surgery pictures as judged by participants, although spatial‐domain techniques appear to be more potent than colour manipulations. We then extend this work with an experiment to find suitable parameters for a given stylization technique to support the intended degree and style of abstraction, a discussion of how to extend the technique to video, results from a second set of interviews with surgeons based on the improved implementation, as well as an implementation of the technique for use in the browser to filter offensive content. We conclude with a discussion and opportunities for future work.

In the past, some researchers have already examined such effects of stylization. Mandryk et al. [MML11] and Mould et al. [MML12], for example, studied people's emotional responses to stylized images and found that these responses were generally muted and concentrated around neutral feelings. Others [DWI13] argue (motivated by the results of earlier studies [SSRL96]) that stylization may affect people's attitude towards a data visualization and lead them to spend more time looking at the visuals. Here, however, we are less interested in potentially positive effects of stylization than in how much it can diminish the negative effects caused by unpleasant pictures. Such pictures are involved when surgeons inform their patients before surgical procedures—because many people find surgery pictures repellent (e.g. [TLSL97, SLW*02]), effective communication can suffer. This context would seem like an ideal application case for expressive rendering [Ise16]. Nonetheless, the creation of effective illustrative visualizations of a wide variety of surgical procedures is still beyond our abilities. We thus study whether it is possible to use existing stylization techniques for 2D images—applied to real surgical images—to achieve a similar effect and diminish the negative affect that surgery pictures can elicit. Applications go beyond patient information and include student training (some medical students will not become surgeons), media communication and public education.

Many non‐photorealistic and expressive rendering techniques deal with the stylization of two‐dimensional (2D) images or videos [KCWI13 , RC13] . While much of this work was initially motivated by the desire to replicate artistic techniques and was only guided by a subjective visual comparison to existing artwork, researchers have begun to empirically evaluate the effects of stylization [Ise13 , Sal02 , GLJ*10] . Some researchers argue, however, that controlled experiments are difficult in the context of expressive rendering [Mou14] , and that we should rather concentrate on subjective evaluation [Mou14] and on the appreciation of resulting graphics [HL13] . While such forms of evaluation arguably have their place in the context of the often art‐inspired field of expressive rendering, the goal of creating expressive graphics is increasingly understood to incorporate more than the ‘support of artists (or illustrators)’ or the ‘creation of tools for visual expression for non‐artists’ and to include, e.g. also ‘illustrations […] to inform […] patients’ in a medical context [Ise16] . In this latter case, it is then essential that we understand how different stylistic filters are perceived and experienced by real people, and that we thus study them empirically, through controlled experiments [GLJ*10] .

On a societal level, offensive imagery has been addressed in two major ways: legal censorship and de facto (or self‐) censorship. While there appears to be close to no legal restriction on what visual content can be published in newspapers [Too14] or on Wikipedia [Wik10a, Wik10b], films and video games are usually regulated by rating systems that classify the media with regard to their suitability for different audiences. While movies cannot be easily customized, the video game industry has explored a wide range of ‘adjustable censorship’ techniques. Some old video games had violent and sexual content disabled by default, while giving the option to reactivate it through the use of secret codes. More elaborate adjustable censorship techniques were also developed: some video games (e.g. Silent Hill, Resident Evil, House of the Dead, later releases of Ocarina of Time) offer the option to change the colour of blood to various tones such as blue, dark or green, depending on the game [TVT18a, Zel18]. While most manga feature black blood due to the constraints of black‐and‐white printing, some colour anime and animated films employ a different blood colour to suit all audiences (e.g. Dragon Ball Kai, Pokémon, Bleach, The Little Mermaid) [TVT18b]. Similarly, in movies, black and white has occasionally been used to censor scenes with excessive bloodshed [Kil17]. All such practices suggest that blood is considered to epitomize violence, but once deprived of its characteristic red colour, it seems to suddenly become inoffensive in people's minds. These practices provide motivation for considering simple colour manipulation techniques in our work.

In parallel to this body of work focusing on blood‐injury disgust, there has been work in psychology and the neurosciences where various types of emotionally salient stimuli were used to study emotion and cognition. Such stimuli were used, for example, to study cultural differences in emotion processing [WKG*03] , and emotion regulation [EVW*07] . Some of the stimuli involved surgery and injury photos, but again, affective neutralization through image processing has not been a focus. Nevertheless, this area of research has produced standardized stimuli sets which we will use for our own study, as explained in Section 5.1 .

Despite all this previous work, human reactions to the sight of surgery scenes remain poorly understood. Our goal is not to further this understanding, but simply to find out whether processing surgery photos can dampen their affective potency. As far as we know, all studies on blood‐injury disgust have either assessed aversive stimuli in isolation or compared them with neutral stimuli, and none of them has studied the effect of processing aversive stimuli using filters or stylization. When conducting our study, we drew from the experience accumulated in this research area, but simplified the methods to directly answer our research question.

Most of these studies were conducted to untangle the emotions involved when people witness surgeries or injuries, sometimes in the hope of better treating BII phobia. This has proven hard to study, as reactions seem to involve various emotions such as anxiety, fear, disgust and vicarious pain [COL09, BLD*08]. In particular, the relative role of fear versus disgust has long been a subject of debate, although the consensus now seems to be that disgust is the main emotion involved [COL09, OCMP10]. To understand why, it helps to recall that fear has evolved for organisms to run away from threats such as spiders, but for static content like body injuries, no such response is necessary [COL09]. More likely, body injuries are experienced as repellent to prevent risks of disease or contagion following physical contact, which requires a disgust response [COL09]. Chapman and Anderson [CA12] introduced a taxonomy of disgust in which blood‐injury disgust is a subtype of physical disgust whose evolutionary function is to avoid infection. Olatunji et al. [OHMD08], however, distinguish contamination disgust from animal‐reminder disgust, the latter being elicited by ‘attitudes and practices surrounding sex, injury to the body or violations of its outer envelope, and death’, which all act as ‘reminders of our own mortality and inherent animalistic nature’ [OHMD08].

Researchers have employed various measurements to quantify subject reactions, the most common being heart rate (e.g. [KWW77, OHMD08]). Others include facial expression using videotaping [LM92] or electromyography [LGBH93, OHMD08], skin conductance [LGBH93], neural activation using fMRI [SlSW*02], attentional avoidance using eye tracking [AHO13] and visuomotor processing using a response priming task [HS14]. Researchers have also used subjective measures, asking subjects to report to what extent they felt fear and disgust [TLSL97, SLW*02], avoided watching [OHMD08] or experienced vasovagal (i.e. pre‐fainting) symptoms [GD12]. A strong reaction to a body injury depiction is often marked by a decrease in heart rate, or by an increase followed by a rapid decrease called a ‘diphasic response’ [COL09]. It also often involves activation of the corrugator supercilii (the ‘frowning muscle’) and the levator labii (which lifts the upper lip) [COL09]. However, studies are inconsistent, and there appears to be no perfectly reliable measure that consistently captures the response [COL09].

For several decades, researchers have studied human responses to repellent images to uncover the physiological and psychological mechanisms involved. Studies have used various types of aversive stimuli such as homicide scenes [HWBS70], spiders [TLSL97], vomit [OHMD08], maggots, cadavers and dirty toilets [SlSW*02]. Closer to our concerns, many studies have examined responses to scenes depicting a body envelope violated by an injury or a surgery. Examples include photos of body mutilation (e.g. [KWW77]) and of surgery procedures (e.g. [TLSL97, SLW*02]), as well as videos of medical interventions such as a blood draw [GD12], open‐heart surgery [OHMD08] or a surgical amputation [RH08]. Studies have involved ordinary subjects (e.g. [HWBS70]), BII‐phobic 1 subjects (e.g. [ÖSL84]) and often a combination of both (e.g. [HS14]).

We note that researchers have also examined the opposite path: changing the stylization of images based on emotions detected in a video feed. Shugrina et al. [SBC06], for example, presented their ‘empathic painting’ technique that recognizes a person's emotional state based on features of their facial expression, which they then use to adjust the parameters of a painterly rendering technique. Here, Shugrina et al. borrow from the psychological literature and create a mapping from the detected emotional state to rendering parameters such as stroke path and colour. Yet, it is not clear whether the resulting images also change the emotional state of the viewer or, if so, in what way this process can be controlled.

Most relevant to our own work within the expressive rendering literature is the work of Mandryk et al. [MML11] and Mould et al. [MML12], who demonstrated that stylization can affect the emotional interpretation of images. Similar to what we do in our experiment, they applied a range of styles (stippling, line art, painterly rendering and blur) to a set of images with different affective content from the International Affective Picture System (IAPS), and analysed people's feelings of arousal, valence, dominance and aesthetics. Stylization generally muted participants' emotional responses towards a neutral point, yet emotions were never completely suppressed. Their negative stimuli (e.g. a gun pointed at the camera, or a cemetery), however, did not have the repulsive potency that surgery photos can have. This study thus inspires our own, but we specifically target surgery pictures that many people cannot easily look at.

Already early work on non‐photorealistic rendering, however, discussed this very effect. Duke et al. [DBHM03] and Halper et al. [HMH*03], for example, described how the (non‐photorealistic) depiction style can affect people's assessment of danger and safety as well as strength and weakness, and can change their participation and interaction behaviour (for study details, see section 2 of Halper's thesis [Hal03]). Even before this work, Schumann et al. [SSRL96] provided evidence that stylization increases people's willingness to interact with visuals. More recently, McDonnell et al. [MBB12] showed that an increased abstraction of virtual characters (according to their participants' classification of ‘realism’) decreases appeal, friendliness and trustworthiness up to a point; for highly abstracted depictions, people again feel similarly about the stylized virtual characters as they do about realistic depictions—similar to what the Uncanny Valley theory predicts. Like the perceptual studies discussed before, however, these approaches do not shed light on whether stylization changes people's negative emotions towards disturbing images.

In the past, researchers have studied how stylization can influence how people perceive images. Gooch and Willemsen [GW02], for example, showed that a line‐based rendering of a virtual scene leads participants to underestimate distances by about a third, quite similar to what happens in ‘photorealistic’ virtual reality (VR) settings. Later, Gooch et al. [GRG04] showed that non‐photorealistic illustrations and caricatures of people's portraits could be learned faster than real photographs. We cannot deduce from these results, however, that stylized images would lead people to feel differently about what is shown.

Another common characteristic of many illustrative techniques and also traditional illustration styles—for medical application and otherwise—is the use of abstraction and emphasis. These aspects have been discussed in the visualization and expressive rendering literature, such as in the contributions by Rautek et al . [RBGV08] and Viola and Isenberg [VI18] . Here, abstraction is ‘a transformation which preserves one or more key concepts and removes detail that can be attributed to natural variation, noise, or other aspects that one intentionally wants to disregard from consideration’ [VI18] —to allow viewers of a visualization to focus on major or important aspects. In this work, however, we explore the abstracting qualities of image filters for the removal of details such that the images are perceived as less offensive—potentially because they no longer depict surgery situations in all their details.

Medical illustration has long been among the primary motivations for non‐photorealistic and expressive rendering [GG01 , SS02] , and consequently, many researchers have developed rendering techniques for this purpose. Several surveys and tutorials cover the field in detail (e.g. [CSESS05 , ECS06 , VCSE*06 , PB14 , LVPI18] ), and we thus refrain from citing specific techniques. Common among them is, however, that they are inspired by traditional, usually hand‐made illustration techniques, styles and examples—they thus focus on clarification and explanation, rather than on emotion or on reducing the negative affect that certain content could induce in people.

Here, we mainly seek to replicate the level of abstraction targeted by the respective works: for Oilpaint, smoothing parameters with a light paint texture that fall into the medium range described by Semmo et al. [SLKD16]; for Watercolor, an implementation using the flow‐based bilateral filtering of FlowAbs but with wider filter kernels to achieve a level of abstraction similar to Bousseau et al. [BKTS06] and Wang et al. [WWF*14]; the ‘colourist wash’ preset of BrushStrokes defined by Hertzmann [Her98] to produce semi‐transparent layered brush strokes; the default art map used by Praun et al. [PHWF01], linearly mapped to luminance in CIE‐Lab colour space; and the default parameters for Stippling described by Martín et al. [MALI11] at the highest resolution, with a normal distribution and black‐and‐white thresholding.

We set the ApparentGrey filter to use default settings with uniform spatial control of four subbands to locally adjust chromatic contrasts. For HueShift, we use a shift of −120.0 degrees on the hue channel mapped to the hue‐saturation‐lightness (HSL) colour wheel. We align the Bilateral filter to obtain a soft Gaussian smoothing over spatial distance, with additional filtering in the CIE‐Lab colour space using an increased weight. For FlowAbs, we use default parameters for edge enhancement with a doubled distance for bilateral filtering to account for the 512 × 512 pixel images used by Kyprianidis and Döllner [KD08]. We use the Kuwahara filter in a typical configuration with a radius of six pixels aligned to eight sectors, a slightly smoothed structure tensor, and multi‐scale estimation [Kyp11]. For deliberate smoothing across shape boundaries using ShapeSimpl and CoherenceEnh, we configure these filters to perform a single step of shock filtering after every iteration of mean curvature flow—i.e. four [KK11] and eight [KL08] steps in total, respectively. Finally, the edge enhancement uses XDoG to output fine coherent lines with high detail and a two‐tone thresholding to sparsely obtain negative edges [WKO12].

The parameterization of the selected techniques is crucial to strike a balance between filtering information and retaining significant image structures. We thus based our default configurations on the presets reported by the original authors of each technique, under the assumption that their findings on general photos also apply to surgical images if details and properties such as contrast are balanced. In addition, we followed the approach by Tomasi and Manduchi [TM98] of applying multiple iterations of the bilateral filter—and of Gaussian‐based filters in general—to preserve edges at a better scale while still providing strong simplification effects. The parameters and configurations summarized in Table 1 are based on the reported results and default presets, and were used in our experiments for medical images at a resolution of 1024 × 768 pixels. For images whose resolution differs from 1024 × 768, those parameters that relate to spatial distances can be linearly scaled to obtain a stable level of abstraction. In the following, we briefly explain the rationale behind these configurations and refer the reader to the respective original works for in‐depth discussions.

Artistic image stylization has been suggested to dampen emotional responses [MML11]. We considered stylization techniques that simulate traditional media and painting techniques found in illustrative visualization, i.e. watercolour, oil paint, pen‐and‐ink and stippling. Image filters are prominently used as building blocks of complex stylization effects, such as the bilateral filter and DoG to obtain toon renderings (FlowAbs), and flow‐based Gaussian smoothing for more abstract filtering that simulates oil paint [SLKD16] (Oilpaint). In addition, we use a Watercolor technique that simulates effects such as pigment density variation, edge darkening, wet‐in‐wet and wobbling [BKTS06, WWF*14]. For stroke‐based rendering, a popular method is to iteratively align brush strokes of varying colour, size and orientation according to the input image, for which we use Hertzmann's [Her98] approach (BrushStrokes). Techniques for tonal depiction typically select tonal art maps based on luminance, for which we use a 2D hatching implementation that borrows from Praun et al. [PHWF01], coupled with DoG‐based edges (Hatching). Finally, we consider the example‐based stippling technique described by Martín et al. [MALI10, MALI11] (Stippling), as it is able to provide scale‐dependent results.

A real‐time approach that is less sensitive to noise is to approximate the Laplacian of Gaussian [MH80] using the difference of Gaussians (DoG). This approach has been shown to provide smooth edges for delicate structures, e.g. with respect to human faces [GRG04], since flow‐based variants adapt to the local orientation of the input image to create smooth, coherent outputs for line and curve segments. To counterbalance the strong simplifications of the bilateral filter, we thus combine it with the enhanced separable flow‐based implementations of the DoG [KLC07, KD08, WKO12] for FlowAbs, to reintroduce filtered structures as enhanced visual cues and obtain a cartoon‐like effect. We also retain the XDoG filter [WKO12] as a generalized approach that is able to produce two‐tone black‐and‐white images, which relate to drawings found in illustrative visualization.
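To make the DoG construction concrete, here is a minimal 1D sketch (the names `gauss_kernel` and `dog_1d` are ours, not from the cited implementations; the flow‐based variants additionally align the kernels with the local image orientation): subtracting a wide Gaussian blur from a narrow one yields a response whose zero‐crossings mark edges.

```python
import numpy as np

def gauss_kernel(sigma, radius):
    # Normalized 1D Gaussian sampled on [-radius, radius]
    xs = np.arange(-radius, radius + 1)
    k = np.exp(-(xs ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def dog_1d(signal, sigma_e=1.0, sigma_r=1.6):
    """Difference of Gaussians: narrow blur minus wide blur.
    A sigma ratio of ~1.6 approximates the Laplacian of Gaussian;
    zero-crossings of the response mark edges."""
    radius = int(3 * sigma_r) + 1
    padded = np.pad(np.asarray(signal, dtype=float), radius, mode="edge")
    narrow = np.convolve(padded, gauss_kernel(sigma_e, radius), mode="same")
    wide = np.convolve(padded, gauss_kernel(sigma_r, radius), mode="same")
    return (narrow - wide)[radius:-radius]
```

On a step signal, the response forms a negative lobe on the dark side of the edge and a positive lobe on the bright side, with the zero‐crossing at the edge itself, while flat regions give no response.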

Winnemöller et al. [WKO12] distinguish between gradient‐based edge detection, which thresholds the gradient magnitude of an image, and Laplacian‐based edge detection, which identifies zero‐crossings in the second derivative. Popular gradient‐domain approaches identify image gradients with high magnitudes by using convolution filters, such as the Prewitt and Sobel filters [Pra07], with subsequent thresholding of the magnitude. The approach is popular with medical images to ease object recognition but produces results that are sensitive to noise. The Canny edge detector [Can86], as a multi‐stage algorithm, provides several enhancements by combining smoothing and differentiation operators. Although popular with magnetic resonance imaging (MRI) and computed tomography (CT) images, however, it is more directed at semantic segmentation and may produce disconnected edge segments.
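The gradient‐based branch of this distinction fits in a few lines. The sketch below (`sobel_edges` is our illustrative name, not a function from the cited works) convolves with the two Sobel kernels and thresholds the gradient magnitude:

```python
import numpy as np

def sobel_edges(img, thresh=0.5):
    """Gradient-based edge detection: convolve with the Sobel kernels
    for horizontal and vertical derivatives, then threshold the
    gradient magnitude."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    padded = np.pad(np.asarray(img, dtype=float), 1, mode="edge")
    h, w = padded.shape[0] - 2, padded.shape[1] - 2
    mag = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            win = padded[y:y + 3, x:x + 3]
            gx, gy = (win * kx).sum(), (win * ky).sum()
            mag[y, x] = np.hypot(gx, gy)
    return mag >= thresh
```

The noise sensitivity mentioned above follows directly from this construction: any isolated bright pixel produces a large local gradient and thus a spurious edge response.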

Many filters focus on image decompositions by solving optimization problems to separate detail from base information, e.g. based on weighted least squares [FFLS08] , histograms [KS10] and gradient minimization [XLXJ11] . While they have strengths in applications requiring complementary global optimizations such as tone mapping and colourization, they are typically not suited for interactive applications.

Contrary to the previous filters, a further category of techniques weights colours across feature boundaries for higher levels of abstraction, from which we retain methods with shock filtering, i.e. in conjunction with a constrained mean curvature flow [KL08] (ShapeSimpl) and with diffusion tensors for coherence‐enhancing abstraction [KK11] (CoherenceEnh). Morphological filtering based on dilation and erosion, and geodesic filtering using distance transforms, are also popular choices to obtain results of high abstraction [CSRP10, Mou12], but were found to require local control to effectively adjust the level of abstraction.

A popular approach that works accurately even with high‐contrast images—contrary to the bilateral filter—and provides smoothed outputs at curved boundaries is the Kuwahara filter [KHEK76] and its generalized [PPC07] and anisotropic [KKD09, Kyp11] variants. The kernel of the anisotropic Kuwahara filter is divided into shape‐aligned overlapping subregions, and the response is defined as the mean of the subregion with minimal variance. We retain the multi‐scale variant [Kyp11] and refer to it as Kuwahara. It maintains a uniform level of abstraction due to local area flattening and can scale with the image resolution.
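For intuition, a minimal sketch of the classic isotropic Kuwahara filter [KHEK76] follows (the function name `kuwahara` is ours; the anisotropic, multi‐scale variant we actually use replaces the four fixed quadrants with shape‐aligned sectors):

```python
import numpy as np

def kuwahara(img, r=2):
    """Classic (isotropic) Kuwahara filter: each output pixel is the
    mean of the (r+1)x(r+1) corner subregion with the smallest
    variance, which flattens local areas while keeping edges sharp."""
    img = np.asarray(img, dtype=float)
    padded = np.pad(img, r, mode="edge")
    out = np.empty_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            cy, cx = y + r, x + r
            quads = [padded[cy - r:cy + 1, cx - r:cx + 1],  # top-left
                     padded[cy - r:cy + 1, cx:cx + r + 1],  # top-right
                     padded[cy:cy + r + 1, cx - r:cx + 1],  # bottom-left
                     padded[cy:cy + r + 1, cx:cx + r + 1]]  # bottom-right
            out[y, x] = min(quads, key=np.var).mean()
    return out
```

Because each pixel takes the mean of its most homogeneous neighbourhood, pixels next to an edge average only over their own side of it, which is why the filter flattens regions without blurring boundaries.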

Mean‐shift filtering is a popular approach for edge‐preserving smoothing [CMM02] and saliency‐guided image abstraction [DS02]. It provides a non‐parametric filter that estimates probability density functions by iteratively shifting colour values towards the averaged colour values of a local neighbourhood. However, the approach typically produces rough region boundaries and is thus better suited to image segmentation.

The bilateral filter is a popular choice to approximate anisotropic diffusion by weight‐averaging pixel colours in a local neighbourhood based on their distances in space and range [TM98]. Unlike a Gaussian filter, it gives less weight to pixels with a large difference in intensity and thus preserves image structures at a better scale. We retain it and refer to it as Bilateral. Most relevant applications apply the bilateral filter in a multi‐stage process for real‐time rendering with a cartoon look [WOG06], and enhance it with flow‐based implementations adapted to the local image structure [KD08, KLC09], in particular to reduce smoothing across falsely detected edges. Because it provides smooth outputs at the curved boundaries of delicate structures, we also consider the flow‐based variant [KD08], and name it FlowAbs. As a generalized variant, the guided filter [HST13] may provide similar characteristics with fewer unwanted gradient‐reversal artefacts, but only provides a non‐feature‐aligned implementation.
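A brute‐force 1D sketch may help to see the two weighting terms at work (`bilateral_1d` is our illustrative name; practical 2D implementations are separable or flow‐based for speed):

```python
import numpy as np

def bilateral_1d(signal, sigma_s=2.0, sigma_r=0.2, radius=4):
    """Brute-force bilateral filter on a 1D signal: each sample becomes
    a weighted average of its neighbours, with weights falling off with
    both spatial distance (sigma_s) and intensity difference (sigma_r),
    so large steps survive while small detail is smoothed away."""
    signal = np.asarray(signal, dtype=float)
    out = np.empty_like(signal)
    for i in range(len(signal)):
        lo, hi = max(0, i - radius), min(len(signal), i + radius + 1)
        idx = np.arange(lo, hi)
        w = (np.exp(-((idx - i) ** 2) / (2 * sigma_s ** 2)) *
             np.exp(-((signal[idx] - signal[i]) ** 2) / (2 * sigma_r ** 2)))
        out[i] = np.sum(w * signal[idx]) / np.sum(w)
    return out
```

With a small range sigma, a step edge survives almost untouched; letting `sigma_r` grow makes the range term constant, and the filter degenerates into a plain Gaussian blur that smears the edge.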

While colour manipulation may reduce the emotional impact carried by blood, it preserves the details of the original photo. A black‐and‐white photo, in particular, may still appear too crude. We thus consider other types of filters, starting with edge‐aware image smoothing as a building block for abstraction, artistic stylization and tone mapping. Numerous automatic filter‐based techniques have been proposed for these applications, typically by approximating an anisotropic diffusion [Wei99], i.e. smoothing details without filtering out significant image structures. The balance between these two aspects varies between filters and is thus critical for us. Consequently, we favoured filter enhancements that derive local image structures for improved feature‐aware processing by adapting the filter kernels to the shape, scale and orientation of the local image structure.

A simple yet effective method of recolourization is to shift the hue of the colours in hue‐saturation‐value (HSV) space. We consider a uniform hue shift, which makes blood appear in a different colour, and refer to it as HueShift. The hue shift shown in Figure 1 uses different settings, which we discuss in Section 5.2. A more sophisticated approach could involve colour transfer between source and target images or colour palettes, which typically relies on image statistics to globally and locally control colour distributions [FPC*14].
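A uniform hue shift takes only a few lines; the sketch below uses Python's standard `colorsys` module and HSV (our experimental parameterization applies a −120° shift on an HSL wheel instead, but the idea is the same; `hue_shift` is our illustrative name):

```python
import colorsys

def hue_shift(rgb, degrees):
    """Uniform hue shift: rotate the hue channel in HSV space while
    keeping saturation and value untouched, so image detail survives
    but blood-like reds move to another colour family."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    return colorsys.hsv_to_rgb((h + degrees / 360.0) % 1.0, s, v)
```

Applied per pixel, a −120° shift turns pure red into blue while leaving lightness and saturation, and hence all luminance detail, intact.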

Greyscale conversion is a popular method for image decolourization, where the main challenge is to preserve and make use of the chrominance components so that perceptual image features are retained [Čad08, MZZW15]. Most algorithms cast the problem as an optimization to preserve salient features, e.g. by quantifying colour differences between image locations [GOTG05] or prevailing chromatic contrasts [GD07], optimizing colour and luminance contrasts [NvN07] or considering the Helmholtz–Kohlrausch colour appearance effect [SLTM08]. The latter localized apparent greyscale algorithm performed best in a previous experiment [Čad08]; we thus retained it and name it ApparentGrey in this paper. The method may, however, suffer from non‐homogeneity artefacts near region boundaries, which can be addressed with a global mapping scheme [KJDL09].
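For contrast with these perceptually motivated methods, the naive baseline they improve upon is a plain luminance weighting (here Rec. 709 luma; `luminance_grey` is our illustrative name, not part of ApparentGrey):

```python
def luminance_grey(rgb):
    """Naive decolourization: Rec. 709 luma, a fixed weighted sum of
    the linear RGB channels. Unlike ApparentGrey, it ignores chromatic
    contrast, so two differently coloured but equally bright regions
    can collapse to the same grey."""
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b
```

This collapsing of chromatic contrast is exactly the loss that the apparent‐greyscale family of algorithms is designed to compensate for.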

We intentionally did not include techniques that require user input such as locations of particular focus because the need for manual input would limit the possible application domains of our work. We thus restricted our exploration to methods applied to the whole image equally. This means we excluded, for example, focus‐and‐context methods such as lens‐based distortion (e.g. [CCF97] ) or interactive stylization (e.g. [Hae90 , CAS*97 , SIMC07] ). We also excluded techniques that would drastically reduce the amount of detail included in the images such as global pixelization (e.g. [GDA*12 , IK12] )—local pixelization again would require user input. Such approaches could be examined as future work.

In this paper, we use the term image processing technique or simply technique to refer to any procedure that transforms an image into another image while keeping it recognizable. We considered four classes of techniques of varying complexity: colour manipulation, edge‐preserving smoothing, edge detection and enhancement, and image‐based artistic rendering. We first outline relevant work and provide rationales for the techniques we retained, and then provide parameter settings yielding reasonable levels of abstraction. For the selection of the techniques, we strove for classical or state‐of‐the‐art methods that cover the taxonomy proposed by Kyprianidis et al. [KCWI13]. Further, the techniques need to process surgery pictures in a content‐preserving way, because we target applications where negative effects are diminished but the abstracted pictures can still be used for patient information or media communication. The selected 13 techniques and their settings are summarized in Table 1 and illustrated in Figures 1 and 2 with a 1024 × 768 photo.

To begin with, the Bilateral technique was found to be usable for patients or in a book (× 2 photos, 1 surgeon). In addition to its good ranking, the FlowAbs technique's resemblance to cartoons or comic strips was pointed out, and with it the possibility to remove a bit of ‘violence’ from the photo (× 2, 1). It was also praised for the high visibility it gave to contours (× 3, 2). In contrast, the Hatching technique was reported to remove too many details (× 5, 3). Similarly, while HueShift only manipulated colours, it was reported to cause loss of information (× 3, 1), and made it hard to find anatomical correspondences (× 2, 1). The Oilpaint technique generated mixed reactions. On the one hand, it was praised for its artistic look (× 3, 2) and could potentially be used in books or with patients (× 2, 1). On the other hand, it was reported to remove useful information (× 2, 2). BrushStrokes was also reported to be artistic and possibly useful with patients (× 2, 2). The same qualities were reported for the ShapeSimpl technique. Stippling was explicitly reported as not usable (× 3, 1) and causing too much loss of information (× 7, 3). The Watercolor technique also removed too many details (× 5, 2). Finally, XDoG was also found to cause too much information loss (× 2, 1) as it makes it difficult to distinguish colours and contours (× 5, 2).

We encouraged surgeons to voice comments both during the classification and in the debriefing interview. In general, they reported that some of our processed images could be used for textbooks or classes (× 2 surgeons), to communicate with patients (× 2), and surprisingly even to communicate with other experts (× 1), as ‘drawings can be augmented’ with notes, for instance. One surgeon reported that they could be particularly useful for plastic surgery and that it would be interesting to see how automatic processing could help communicate with child patients, as they are more sensitive to surgery images. We now report on the comments that were made for more than one photo and/or by more than one surgeon.

We processed self‐reported preferences as follows: first, for each combination of surgeon × photo, we assigned a number to each technique according to its pile: one for the most preferred pile, two for the second pile, etc. We then normalized these ranks using the halfway accumulative distribution [JS04]. This method gives each rank a score between 0 and 1 and corrects for possible differences in the way ranks are assigned (e.g. when some surgeons make more piles than others). We then derived preference scores by inverting the normalized ranks (preference = 1 − normalized rank). We show all preference scores in Figure 3. Finally, we averaged preference scores across pictures and surgeons to derive a single aggregated preference score per filter, shown at the top of Figure 3. As we can see, CoherenceEnh was the most preferred technique, followed by FlowAbs, Kuwahara and Bilateral. We discuss the surgeons' comments next.
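To make the normalization concrete, here is a small sketch of the halfway accumulative score as we read [JS04] (the function `halfway_scores` is illustrative code of our own, not from that paper; we assume piles are ordered from most to least preferred):

```python
def halfway_scores(pile_sizes):
    """Normalized rank per pile: the number of items in all better
    piles plus half of the pile's own count, divided by the total
    number of items. Each judge's scores then lie in (0, 1) regardless
    of how many piles they made."""
    total = sum(pile_sizes)
    scores, seen = [], 0
    for n in pile_sizes:
        scores.append((seen + n / 2) / total)
        seen += n
    return scores
```

The item‐weighted mean of these scores is always 0.5, which is what makes judges with different numbers of piles comparable; inverting each score (1 − score) then yields the preference values averaged in Figure 3.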

We asked each surgeon to send us three of their own surgery photos that could help them explain a specific procedure to non‐experts. All photos had landscape orientation; we cropped each to an aspect ratio of 4:3 and then resized them to 1600 × 1200 pixels. Having all photos share the exact same pixel dimensions allowed us to better control the experiment, as the effect of a filter typically depends on the resolution of the input image. We then processed the photos as described in Section 3.5, and printed each on a separate A4 sheet. At the start of each session, we asked the surgeon to compare and classify the processed images by making piles based on how useful each image would be as a support for communication and explanation, especially in terms of how much important information is preserved. Two images perceived as equally useful would go onto the same pile. We repeated this procedure three times, once for each photo. Each surgeon saw 3 photos × 13 techniques = 39 images, in addition to the three original photos. To limit ordering effects, we randomized the order in which processed images were presented. After the classification, we asked the surgeons to comment further on the techniques and asked them about possible applications.

To determine which of these 13 techniques are useful in practice, we interviewed four surgeons: two otolaryngologists ( S1 and S2 ; 13 and 35 years experience, respectively), one orthopaedic surgeon ( S3 ; 10 years experience) and one reconstructive surgeon ( S4 ; 10 years experience). While surgeons naturally have a high tolerance for surgery images and thus may not be able to assess whether a technique can reduce the affective response to an image, their expertise is needed to study which of the techniques can preserve information well. Although all four surgeons are co‐authors of this paper, none of them was involved in the research at the time of the interviews.
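For illustration, one plausible reading of the rank‐normalization step used in this section can be sketched as follows (the function name and the exact midpoint formula are our assumptions, not taken from [JS04] or from the paper's analysis code):

```python
def normalize_pile_ranks(ranks):
    """Map pile ranks (1 = most preferred pile) to scores in (0, 1).

    Each pile gets the midpoint of its cumulative share of items, which
    corrects for surgeons making different numbers of piles; the preference
    score is then one minus this normalized rank.
    """
    n = len(ranks)
    counts = {}
    for r in ranks:
        counts[r] = counts.get(r, 0) + 1
    cumulative, scores = 0, {}
    for r in sorted(counts):
        midpoint = (cumulative + counts[r] / 2) / n  # halfway point of the pile
        scores[r] = 1 - midpoint                     # invert: preferred -> high
        cumulative += counts[r]
    return scores

# Four techniques sorted into two piles of two:
print(normalize_pile_ranks([1, 1, 2, 2]))  # {1: 0.75, 2: 0.25}
```

Note how a surgeon who puts everything into a single pile yields a neutral score of 0.5 for all techniques, so coarse and fine pile structures remain comparable.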

5 Experiment with Lay People

Our interviews with surgeons helped us understand which processing techniques preserve key information from surgery photos. However, it is hard for surgeons to accurately predict the affective impact of processed photos on lay people. Thus, we conducted an experiment where we presented surgery photos, both unprocessed and processed, to 30 participants and asked them to rate how repulsive each was. This experiment was approved by Inria's ethics committee (COERLE, approval number 2017‐015).

5.1 Pictures

Throughout this section, we use the term picture to refer to an original photo, and the term stimulus to refer to a processed or unprocessed picture that is meant to be presented to participants.

We selected five pictures from two research catalogues: the Nencki Affective Picture System (NAPS) [MŻJG14] and the IAPS [LBC97]. These catalogues contain a range of emotionally evocative photos that have been validated to elicit a positive, neutral or negative affect. We selected five surgery photos among the negative pictures:

People 202 (NAPS)—a leg surgery,

People 216 (NAPS)—a leg surgery or autopsy,

People 221 (NAPS)—a surgery in the eye area,

3212 (IAPS)—a surgery performed on a dog and

3213 (IAPS)—a finger surgery.

These photos were selected as follows. For the NAPS catalogue, we selected all photos in the ‘People’ category that had a horizontal orientation, leaving us with 204 photos out of the 1356 initial ones. Ten of them were of surgical procedures, of which we selected three that involved large incisions on recognizable body parts (legs and eye). The other photos either had small incisions or unrecognizable body parts. For the IAPS catalogue, which only had horizontal photos, four photos out of the 1182 were tagged ‘surgery’, of which we selected two with large incisions on recognizable body parts (chest and hand). We reasoned that photos with recognizable body parts were emotionally more potent. Both catalogues came with data on average subject ratings across different emotional scales including disgust, but since data were missing for some photos in the NAPS catalogue, we decided not to base our selection on those ratings.

In addition to those five surgery pictures, we selected five neutral pictures from the NAPS catalogue, all with consistent ratings of 1 on the disgust scale. These photos include, e.g. a surfer riding a wave and a man walking on the beach with his son. The picture resolution is 1600 × 1200 for NAPS and 1024 × 768 for IAPS. For consistency, we rescaled all NAPS pictures to 1024 × 768. Thus, the effect of spatial filters will be slightly stronger than in the interview sessions, which used 1600 × 1200 pictures.

5.2 Techniques

We first selected the six most usable techniques according to the surgeons (see Figure 3). We observed that Kuwahara and Bilateral yielded almost identical results on our experimental stimuli, and were therefore likely to elicit the same response. Thus, to reduce the number of experimental conditions, we decided to remove Bilateral, since it is already used as a building block of FlowAbs. We further decided to add HueShift to cover a broader spectrum of approaches, even though it was ranked poorly.

Looking back at the surgeons' pictures, we realized that HueShift often gives skin and blood the same green tones. This might be the reason why surgeons found that HueShift suppressed important information. Thus, we changed the shift angle from 120 to −120 in the HSL space, which maps blood and skin to blue/purple tones, where colour discrimination is superior [BDAB12]. We refer to this technique as HueShift2. Finally, by examining results on the experimental stimuli, we realized that the higher number of iterations used for ShapeSimpl eliminated significantly more details than the other techniques did. We therefore tuned the settings to get more comparable levels of abstraction. We refer to this technique as ShapeSimpl2.

The changes made to ShapeSimpl could only improve information preservation, and the changes made to HueShift do not affect how the technique essentially works (they simply test whether another colour could be better). Consequently, the new techniques were not evaluated or validated by surgeons in a new set of interviews. This leaves us with six techniques, summarized in Table 2.

Table 2. Techniques tested in the experiment

  Abbreviation    Parameter settings
  ApparentGrey    Same as Table 1
  HueShift2       Same as Table 1 except shift angle = −120
  ShapeSimpl2     Same as Table 1 except …
  CoherenceEnh    Same as Table 1
  Kuwahara        Same as Table 1
  FlowAbs         Same as Table 1
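As an illustrative sketch (not the paper's implementation), the HueShift2 colour manipulation, i.e. a −120° hue rotation in HSL space, can be written per pixel using only the standard library (note that Python's `colorsys` uses the HLS channel ordering):

```python
import colorsys

def hue_shift(pixels, degrees):
    """Rotate the hue of each (r, g, b) pixel (0-255 channels) by `degrees`
    in HSL space, leaving lightness and saturation untouched."""
    out = []
    for r, g, b in pixels:
        h, l, s = colorsys.rgb_to_hls(r / 255, g / 255, b / 255)
        h = (h + degrees / 360.0) % 1.0  # wrap the hue angle around the circle
        r2, g2, b2 = colorsys.hls_to_rgb(h, l, s)
        out.append((round(r2 * 255), round(g2 * 255), round(b2 * 255)))
    return out

# A pure red (typical of blood) maps to blue under a -120 degree shift:
print(hue_shift([(255, 0, 0)], -120))  # [(0, 0, 255)]
```

Applying the same function with `degrees=120` reproduces the original HueShift, which maps reds to green instead.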

5.3 Metrics

A variety of psycho‐physiological measures exist to quantify emotional response (see Section 2.3), but they tend to be noisy and require specialized equipment. We thus decided to simply measure self‐reported subjective experience. Psychology has adopted standardized scales to assess affect, such as valence, arousal and disgust scales [LBC97, MŻJG14]. However, the difficulty of looking at a repellent surgery image may not directly map to valence, arousal or disgust as they are understood by participants. Therefore, we chose to ask a more direct question, i.e. ‘how difficult was it for you to look at this image?’, on a nine‐point scale from very easy to very difficult (see Figure 4). The question was framed in the past tense because, as we will see later on, participants did not see the stimulus when they answered the question.

Figure 4: Experiment screen after stimulus presentation.

As a complement to the expert interviews, we also asked participants to estimate to what extent the content of the scene had been obfuscated by the filter. The question was ‘how difficult was it to recognize the image's content?’, again on a nine‐point scale from very easy to very difficult. The meaning of both questions was explained in the preliminary instructions with examples.

5.4 Design and procedure

Because between‐subject designs typically suffer from low statistical power [BBK14], we used a within‐subject design. Each participant saw all combinations of picture and technique. With 10 pictures (five surgery, five neutral) and seven techniques (six plus unfiltered), a total of 70 stimuli were presented to each participant.

We expected strong ordering effects, as a participant is likely to become less sensitive to surgery pictures after repeated exposure. Furthermore, a participant is more likely to recognize the content of a picture if it had previously been presented unfiltered or with a weaker filter. We addressed this in four ways. Firstly, we fully randomized the order of stimulus presentation across participants, with the constraint that two presentations of the same picture had to be separated by at least two other pictures. Secondly, our experiment instructions warned participants that they would see the same picture multiple times, but asked them to try to answer questions as if they saw each picture for the first time. Thirdly, we presented each stimulus for 2 s only, after which we displayed a mask and invited the participant to answer the two questions (see Figure 4). Limiting exposure to each picture was expected to slow down habituation; at the same time, informal tests we conducted confirmed that a two‐second exposure was more than enough to fully scan and recognize a picture. Fourthly, we interleaved surgery and neutral pictures, which we expected to further slow down habituation and incite participants to stay focused; due to the random sequence of surgery and neutral pictures, participants could not predict the type of the next image.

We also used the responses to neutral images to look for participants with poor data. With Likert items, a common and damaging mistake is inversion (e.g. participants replying ‘easy’ instead of ‘hard’ by mistake [SL11, vSSC13]), and we explicitly warned against this mistake in the instructions. Still, we decided (before collecting data) (i) to consider a response of 5 or more to the first question for a neutral image as an obvious inversion, and (ii) to discard the data from all participants who made two or more obvious inversions.

The experiment unfolded as follows. Firstly, participants were given an information sheet and a consent form to sign. We explained that the purpose of the experiment was ‘to understand people's levels of aversion in response to processed and unprocessed surgery images’. We also stated that the participant ‘will be asked to review a series of potentially aversive stimuli and report [their] reactions. The stimuli will include a mix of neutral photos and photos of surgical procedures’.3 Thus, we tried to be as neutral as possible, and did not explicitly state that the purpose of the study was to see whether applying filters to surgical images can reduce negative affect. The information sheet provided an example of a surgery image (not used in the study), and asked participants not to participate if they thought they were hypersensitive. It also informed them that they would be free to stop at any time, should they feel too uncomfortable. Then, participants read instructions on a computer (MacBook Pro 2015 with a 2880 × 1800 retina display and a mouse) and completed the 70 trials. Finally, they were given a research debriefing sheet and a brief questionnaire, and were invited to comment on the experiment. The entire experiment lasted between 15 and 20 min.
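The randomization constraint (two presentations of the same picture separated by at least two other stimuli) can be sketched as a greedy shuffle with random restarts; this is an illustrative approach, not the paper's actual randomization code:

```python
import random

def constrained_order(stimuli, min_gap=2, seed=None, max_restarts=1000):
    """Shuffle (picture, technique) pairs so that two stimuli showing the
    same picture are separated by at least `min_gap` other stimuli."""
    rng = random.Random(seed)
    for _ in range(max_restarts):
        pool = list(stimuli)
        rng.shuffle(pool)
        order = []
        while pool:
            recent = {pic for pic, _ in order[-min_gap:]}  # pictures to avoid
            candidates = [s for s in pool if s[0] not in recent]
            if not candidates:
                break  # dead end near the end of the sequence: restart
            pick = rng.choice(candidates)
            pool.remove(pick)
            order.append(pick)
        if not pool:
            return order
    raise RuntimeError("no valid ordering found")

# 10 pictures x 7 techniques = 70 stimuli, as in the experiment.
stimuli = [(p, t) for p in range(10) for t in range(7)]
order = constrained_order(stimuli, seed=1)
```

Plain rejection sampling of full shuffles would almost always violate the constraint somewhere in a 70‐item sequence, which is why the sketch builds the order incrementally and only restarts on a dead end.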

5.5 Participants

We recruited 30 unpaid participants (nine females, age 21–49, mean = 29, med = 26, SD = 8.6), in conformity with our planned sample size. One additional participant made two obvious inversions as defined in Section 5.4 and was therefore discarded from the analysis. Participants were recruited through email postings to our institution and to students we teach, as well as by word of mouth at neighbouring institutions.4 Four participants were left‐handed and one was ambidextrous. All had normal or corrected‐to‐normal vision, none were colourblind and all had higher‐education diplomas. Twenty‐four reported having seen surgery images or videos before (on TV, the Internet and in books), including real surgeries or traumatic injuries (× 3). Eighteen participants had taken part in a perception study before. Reported sensitivity to surgery images was on average 4.1 (SD = 2.1) on a scale from 1 (not sensitive at all) to 9 (extremely sensitive).
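The exclusion rule from Section 5.4 that led to discarding this participant can be sketched as follows (function names and the trial representation are illustrative):

```python
def obvious_inversions(trials):
    """Count obvious inversions for one participant: a first-question
    (difficulty-to-look-at) response of 5 or more on a neutral picture,
    given (picture_type, response) pairs on a 1-9 scale."""
    return sum(1 for picture_type, response in trials
               if picture_type == "neutral" and response >= 5)

def keep_participant(trials):
    """Keep a participant unless they made two or more obvious inversions."""
    return obvious_inversions(trials) < 2

# One stray high rating on a neutral picture is tolerated...
print(keep_participant([("neutral", 6), ("surgery", 8), ("neutral", 1)]))  # True
# ...but two lead to exclusion, as for the discarded participant above.
print(keep_participant([("neutral", 6), ("neutral", 5), ("surgery", 2)]))  # False
```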

5.6 Quantitative results

We report and interpret all our results using interval estimation instead of p‐values [Cum14, Dra16]. The complete analysis was planned before collecting the data and was preregistered [CGD18] with the Open Science Framework (https://osf.io/34vzj). We report on two dependent variables: repulsiveness, which is the response to the first question in Figure 4, and recognizability, which is the complement of the response to the second question (i.e. 10 minus the response). For each participant and technique, we averaged responses across all five surgery pictures (neutral pictures were not analysed). Then, for each technique, we derived a point estimate using the mean response across participants, and an interval estimate using the 95% BCa bootstrap confidence interval [KG13]. Figure 5(a) shows point estimates (dots) and interval estimates (grey boxes) within the full space of possible responses. Roughly speaking, each interval indicates the range of plausible values for the population mean, with the point estimate being about seven times more plausible than the interval endpoints [Cum13].

The mean response for unfiltered surgery images is about 5 on the repulsiveness scale and 8 on the recognizability scale. There is strong evidence that all six techniques yield smaller average recognizability as well as smaller repulsiveness, except for ApparentGrey and HueShift2. Unsurprisingly, repulsiveness tends to correlate with recognizability. An ideal processing technique would be an outlier located at the top left of the regression line, but the figure provides no conclusive evidence for such an outlier.

Figure 5: Mean repulsiveness and recognizability ratings for each technique (left); mean within‐subject reduction in repulsiveness and recognizability (right). Boxes are 95% CIs.
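For illustration, the per‐technique point and interval estimates can be sketched with a basic percentile bootstrap; the paper uses the BCa variant, which additionally applies bias and acceleration corrections, and the ratings below are made up:

```python
import random
import statistics

def bootstrap_ci_mean(data, n_boot=10000, alpha=0.05, seed=0):
    """Point estimate (mean) and percentile-bootstrap 95% CI for the mean:
    resample with replacement, then take quantiles of the resampled means."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(data, k=len(data))) for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return statistics.mean(data), (lo, hi)

# Hypothetical per-participant mean repulsiveness ratings for one technique:
ratings = [5.2, 4.8, 6.0, 3.4, 5.6, 4.9, 5.1, 4.2, 5.8, 4.5]
point, (lo, hi) = bootstrap_ci_mean(ratings)
print(round(point, 2))  # 4.95
```

Running this once per technique yields the dots and boxes of Figure 5(a).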
The results of our within‐subject study are best examined via the average within‐subject reduction in repulsiveness and in recognizability [Cum13], summarized in Figure 5(b). We show the same data in Figure 6, with a separate plot for each dependent variable. There is overwhelming evidence that, on average, all techniques have a within‐subject effect on both repulsiveness and recognizability. The overlap between error bars [KA13] for repulsiveness reduction suggests that ApparentGrey has the weakest effect, followed by HueShift2, followed by the remaining four. The results for recognizability reduction are less clear, but it is likely that ApparentGrey yields higher recognizability than the other techniques.

Figure 6: Mean within‐subject reduction in repulsiveness and recognizability. Error bars are 95% CIs.

Overall, all six techniques are effective, but colour manipulation appears to be outperformed by space‐domain filtering (ShapeSimpl2, CoherenceEnh, Kuwahara, FlowAbs). HueShift2, in particular, is not as effective at making the surgery pictures easier to look at, despite being comparable in terms of preserving informational content. ApparentGrey, on the other hand, simply appears to be a weaker filter: while it does not reduce repulsiveness dramatically, it better preserves image legibility. Among all techniques, FlowAbs and CoherenceEnh offer the best trade‐offs if we only consider point estimates, but given the large overlaps in interval estimates, the evidence is very weak.
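The within‐subject aggregation differs from simply comparing group means: the unfiltered‐minus‐filtered difference is computed per participant first and only then averaged. A minimal sketch with made‐up numbers:

```python
import statistics

# Hypothetical per-participant mean repulsiveness on surgery pictures,
# unfiltered vs. processed with one technique (not the paper's data):
unfiltered = [5.4, 6.0, 4.8, 5.2, 6.4]
processed  = [3.2, 4.0, 3.0, 3.6, 4.4]

# Within-subject reduction: one difference per participant, then the mean.
reductions = [u - p for u, p in zip(unfiltered, processed)]
print(round(statistics.mean(reductions), 2))  # 1.92
```

Bootstrapping these per‐participant differences (rather than the raw ratings) yields the error bars of Figure 6.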

5.7 Participant feedback

Among the participants who offered feedback, two found FlowAbs to be the best technique. Several mentioned that it looked like a cartoon or comic strip (× 8), making surgery photos easier to look at (× 8). A single participant found that it could make them harder to look at because of the more salient contours, while three participants said that it makes content easier to recognize. HueShift2 was said to make photos easier to look at (× 3) but the content harder to recognize (× 3). Four participants found that it made content appear ‘unnatural’ or ‘disturbing’, while one mentioned that it makes patients look like aliens. ApparentGrey was deemed easier to look at (× 3) and easier to recognize (× 1), while two participants reported that it was harder to recognize the content without colours. Additionally, one participant thought that ApparentGrey's effectiveness depends on the input picture; another similarly mentioned that the best technique could be picture‐dependent. CoherenceEnh, Kuwahara and ShapeSimpl2 were each reported to be the best technique by one participant, and to make content harder to recognize by three participants.