TL;DR I argue effective altruists can and should use happiness surveys to determine cost-effectiveness and show how doing this generates some substantially different charity recommendations from those given by GiveWell.

I argue that, despite some long-standing doubts, happiness can be measured through self-reports and therefore happiness surveys should be used to determine how much happiness different outcomes produce. Specifically, I recommend life satisfaction (LS), found by asking “Overall, how satisfied are you with your life nowadays?” (0 - 10), as a suitable, though imperfect, measure of happiness. As, presumably, everyone values happiness to some extent, it follows that everyone should incorporate information from LS scores when determining how to do the most good. I show how the LS approach could be applied to a novel area: assessing which charities produce the most happiness. According to GiveWell, which does not assess charities solely (or even primarily) by LS scores, certain life-saving and poverty-alleviating charities in the developing world do the most good per dollar. I show that, if we understand good in terms of maximising self-reported LS, alleviating poverty is surprisingly unpromising whereas mental health interventions, which have so far been overlooked, seem more effective. Various philosophical and methodological issues leave it unclear whether GiveWell’s life-saving recommendations are more cost-effective than the mental health charity I discuss. I conclude by explaining the implications of this analysis for effective altruists who want to increase human happiness. 1

Table of contents

1. Introduction

2. The relationship between happiness and measures of subjective well-being

3. Can happiness really be measured?

4. Can we compare individuals’ happiness scores?

5. Using life satisfaction as the common currency for well-being adjusted life years (WALYs)

6. Evaluating the life satisfaction impact of (GiveWell) charities

7. What should effective altruists do next?

Annex A: How good are QALYs and DALYs as proxies for happiness?

1. Introduction

How should we compare the impact of various outcomes, such as the treatment of a health condition or poverty reduction, in terms of how much good they do? The current method used by effective altruists (EAs) is, ultimately, to rely on their own subjective judgements; the value of these outcomes is weighed in the mind. EAs have often used health metrics, such as QALYs and DALYs - which rely on other people’s aggregated subjective judgements - but health is not the only item of interest, so subjective judgements are still needed to decide how to compare the value of health outcomes to other outcomes. As they make clear in their cost-effectiveness analysis, GiveWell determine the value of different interventions by polling their staff members. The main question in their analysis is “how many years of doubled consumption are as morally valuable as saving the life of an under-5-year old child?” GiveWell then use the median answer to establish the trade-off ratio between these two outcomes.

We might think relying on our subjective judgements is both unobjectionable and unavoidable on the grounds we are making moral evaluations, and there is simply no other way to do this. However, claiming these are moral judgements is only partly true and, indeed, may not be true at all. Some of what appear to be moral evaluations are judgements about facts and so, in principle, empirical questions. Here’s an analogy. Suppose you and I are trying to determine which of two oddly-shaped jars contains more water. What sort of assessment are we making here? Not a moral judgement, but a subjective judgement of fact. We could go on to say the best jar is the one that holds the most water, which would be a moral judgement.2 Yet, regardless of whether we’ve made that moral evaluation, the question of how much water the jars hold is still a factual one. Now, suppose you and I are trying to assess which of two outcomes - curing some health condition or alleviating poverty by some amount - increases happiness by more. Suppose we agree on our concept of happiness: a positive balance of enjoyment over suffering. Are we making a moral evaluation when we state which outcome we think would increase happiness more? Clearly, as the jar analogy showed, we are not. Note that we need not assume happiness is of any moral importance - perhaps we conclude only liberty has value - in order to compare the outcomes.

Deciding (a) which thing or things are intrinsically valuable and thus constitute a good, and (b) how to aggregate goods to determine overall value - in philosophical jargon, choosing an axiology, a method of ranking states of affairs in terms of their ultimate value - is, certainly, a moral judgement.3 Suppose, for example, we decide the value of an outcome is the sum total of happiness in it. Once you’ve done that, however, determining how much good an outcome contains is, in principle, an empirical question. Any subjective assessments one makes thereafter will be judgements of facts, not of value. I hope it’s obvious that we will, where possible, want to measure the good(s) directly, and that objective measurement of the facts ‘trumps’ our subjective evaluations of the facts. If we could measure the water capacity of the jars, that would settle the question of which held more and render our guesswork obsolete.

If we want to have productive disagreements with one another about which outcomes do more good, it’s important to make it clear whether disagreements arise from claims about value or from claims about facts. Suppose Singer and MacAskill disagree about whether option A is better than option B. Given Singer and MacAskill have (I think) the same views on value, the disagreement is a factual one.4 By comparison, because GiveWell’s analysis combines the opinions of multiple people who have different views about value, it’s not possible to tell whether one disagrees with GiveWell (supposing one does) because one differs about facts or about values. To make this plain, as noted, the key question in GiveWell’s cost-effectiveness analysis is “how many years of doubled consumption are as morally valuable as saving the life of an under-5-year old child?” Here is a non-exhaustive list of factors two people could disagree about when answering that question, and whether each factor is a question of fact or value.

| Factor | Value or fact question? |
| --- | --- |
| How much does doubling consumption for a year increase well-being? | Fact, for a given account of well-being |
| What is well-being? | Value |
| The well-being the child would have if it lived? | Fact, given expected well-being |
| How many years the child will live for? | Fact |
| What the badness of death is (i.e. is it better, all else equal, to save a 2-year old or a 20-year old?) | Value |



This document aims to illuminate one factor that, presumably, will be of interest to everyone: how much different outcomes improve happiness. While this has generally been judged subjectively, it is a question of fact, not of value (for a given account of happiness). I argue that, despite long-standing doubts, happiness can be measured through population surveys and therefore we should use data from happiness surveys, rather than relying on our own subjective judgements, to determine what increases happiness and by how much. Specifically, I recommend life satisfaction (‘LS’) scores, which are found by asking “How satisfied are you with your life nowadays?” (on a scale from 0 “not at all” to 10 “completely”), as the most suitable (although not ideal) proxy measure of happiness. While these claims may be unfamiliar and contentious within philosophy and effective altruism, they are common knowledge within (certain corners of) economics and psychology. This part of the document largely restates claims made by others.5 Having argued that we should use LS scores to determine what maximises happiness in general, I apply this approach to the specific case of determining which charities are most cost-effective at producing happiness, which has not yet been done. I show how the LS-approach works differently to, and generates partially different recommendations from, GiveWell, who do not primarily use LS scores to determine cost-effectiveness.

The rest of the document proceeds as follows. Section 2 explains how ‘subjective well-being’ (SWB), as it is normally called in social science, is measured, how SWB relates to happiness and why LS is a suitable proxy for happiness. Section 3 argues SWB measures are valid and reliable. Section 4 argues SWB measures can be used to make interpersonally cardinal comparisons. Section 5 outlines how LS can be used to determine what increases happiness in general. Section 6 applies the LS approach to charity evaluation. Section 7 sets out future work for effective altruists interested in increasing happiness.

2. The relationship between happiness and measures of subjective well-being

Social scientists (mostly economists and psychologists) talk about measures of ‘subjective well-being’ (‘SWB’), which are, to quote Dolan and Metcalfe (two social scientists), ‘ratings of thoughts and feelings about life’ (Dolan and Metcalfe 2012). SWB is typically thought to have three components (OECD 2013):

Evaluation (sometimes called ‘cognitive’) - reflective assessment on a person’s life or some specific aspect of it. Life satisfaction is a life evaluation question but not the only one.

Experience (sometimes called ‘affective’ or ‘hedonic’) - a person’s feelings or emotional states, typically measured with reference to a particular point in time.

Eudaimonia - a sense of meaning and purpose in life, or psychological functioning.

For a list of example SWB questions, see OECD (2013, Annex A).

What is the relationship between the SWB measures and happiness?6 Often, measures of SWB are referred to as measures of ‘happiness’. This is technically incorrect and also misleading. On the earlier definition of happiness - a positive balance of enjoyment over suffering - the experience component of SWB is identical with happiness. Evaluations measure how people feel about their lives, rather than how happy they feel during them. Eudaimonic measures may tap into psychological states - ones related to meaning, whatever this is - that, presumably, feel enjoyable to experience and thus comprise happiness, but do not capture all the psychological states relevant to happiness. Hence, SWB is not only a measure of happiness.

The ‘gold standard’ for measuring happiness is the experience sampling method (ESM), where participants are prompted to record their feelings and possibly their activities one or more times a day.7 While this is an accurate record of how people feel, it is expensive to implement and intrusive for respondents. A more viable approach is the day reconstruction method (DRM), where respondents use a time-diary to record and rate their previous day. DRM produces comparable results to ESM, but is less burdensome to use (Kahneman et al. 2004).

Given we are interested in measuring happiness, we might think we should ignore the non-experience components altogether. Practically, however, this is unfeasible and we are forced to rely on life satisfaction measures as the main proxy measure for happiness (a ‘proxy’ measure is an indirect measure of the phenomenon of interest). It is much easier to collect LS data as it requires just one quick question that takes subjects around 30 seconds to answer, whereas the DRM takes approximately 40 minutes to fill out. As a result of this ease of use, LS is the SWB measure on which most data has been collected and most analysis done. It is now possible to say (a point I return to later) to what extent various outcomes cause an absolute increase in life satisfaction on a 0-10 scale, which is what we need to determine cost-effectiveness (see Layard et al. 2018). By contrast, to the best of my knowledge, there is insufficient research on experience measures to draw the same conclusions.

How much of a problem is it to use evaluative measures in lieu of experience ones? Experience and evaluative measures are conceptually different and answered in somewhat different ways. As Deaton and Stone (2013) explain:

Hedonic [i.e. experience] measures are uncorrelated with education, vary over the days of the week, improve with age, and respond to income only up to a threshold. Evaluative measures remain correlated with income even at high levels of income, are strongly correlated with education, are often U-shaped in age, and do not vary over the days of the week (Stone et al. 2010; Kahneman and Deaton 2010)

This doesn’t mean evaluative measures can’t be used as proxies for happiness. The evaluative and experience measures do correlate, suggesting evaluative judgements are, in part if not in whole, determined by how happy people are (OECD 2013, p32-34). While Deaton and Stone identify some cases where they come apart, it’s hard to think of instances where there would be different priorities, either for governments or for effective altruists, if the goal was trying to maximise life satisfaction scores rather than happiness scores. The sensible approach seems to be to use happiness data where it’s available, but LS data where it isn’t and, when using LS data to determine cost-effectiveness, to keep in mind how the two might differ. Further work to investigate if and when using one measure over another would generate different priorities seems valuable.

While eudaimonia measures are regarded as a component of SWB, I will not refer to them again. Not only are they not the most relevant component, little data has been collected on them and it’s not conceptually clear what they capture.

3. Can happiness really be measured?

There are long-standing doubts (in economics) that happiness either can be or needs to be measured. Some historical context may be helpful here. According to Layard (2003):

In the eighteenth century Bentham and others proposed that the object of public policy should be to maximise the sum of happiness in society. So economics evolved as the study of utility or happiness, which was assumed to be in principle measurable and comparable across people. It was also assumed that the marginal utility of income was higher for poor people than for rich people, so that income ought to be redistributed unless the efficiency cost was too high.

All these assumptions were challenged by Lionel Robbins in his famous book on the Nature and Significance of Economic Science published in 1932. Robbins argued correctly that, if you wanted to predict a person’s behaviour, you need only assume he has a stable set of preferences. His level of happiness need not be measurable nor need it be compared with other people. Moreover economics was, as Robbins put it, about “the relationship between given ends and scarce means”, and how the “ends” or preferences came to be formed was outside its scope.

Interest in measuring happiness has returned in recent decades.8 This seems to be caused by 1) the Easterlin paradox, the (contested) finding that while richer people are more satisfied with their lives than poor people, an increase in average wealth does not raise average life satisfaction; 2) the behavioural economic work of Tversky, Kahneman and others suggesting individuals do not, when left to their own devices, seem to maximise their own utility;9 and 3) growing dissatisfaction with GDP as a measure of progress.10

The idea governments should measure SWB and use it to guide policy has started to take root. In 2013 the OECD issued guidelines recommending its member-nations collect SWB data:

There is now widespread acknowledgement that measuring subjective well-being is an essential part of measuring quality of life alongside other social and economic dimensions [...] The Guidelines also outline why measures of subjective well-being are relevant for monitoring and policy making.

The UK’s Office for National Statistics has been collecting data on SWB since 2012 and currently polls 158,000 people a year. (Readers unfamiliar with SWB measures may find their FAQs helpful.)

Now, some scholars who argue we shouldn’t use SWB measures, such as Fleurbaey, Schokkaert and Decancq (2009), nevertheless accept such measures are meaningful:

With the mass of data accumulated on happiness and satisfaction and the development of their econometric exploitation, subjective utility seems more measurable than ever. There now seem to be good reasons to trust the existence of sufficient regularity in human psychology, so that interpersonal comparisons appear feasible in principle. These new developments have triggered a revival of welfarism as well. If utility can be measured after all, why not take it as the metric of social welfare? Several authors have taken this line (Kahneman et al. 2004b, Layard 2005). However, none of the recent developments in the field of measurement directly undermine the arguments that were raised against welfarism in the philosophical debates of the previous decades. The fact that something becomes easier to measure does not give any new normative reason to rely on it.

While the above explains that there has been a shift in opinion regarding the measurability of SWB, I have not yet said why we should think the SWB measures are accurate, i.e. that they succeed in measuring what they set out to measure.

The accuracy of a measure is usually assessed in terms of its validity and reliability. Validity refers to whether the measure captures the underlying concept that it purports to measure. Suppose I try to measure your height by weighing you on a set of bathroom scales. The scales might be a valid measure of weight but it’s clear, I hope, they are not a valid measure of height. Reliability is about whether the measure gives consistent results in identical circumstances (i.e. it has a high signal-to-noise ratio). If my scales produce a random number every time I step on them, they are not reliable. Reliability is necessary but not sufficient for validity; if I used a normal, non-broken set of scales to measure your height it would give me the same score each time, and so be reliable (assuming your weight doesn’t fluctuate), but it still wouldn’t be valid. As the reliability and validity of SWB scales have been covered at great length in OECD (2013) and elsewhere, I will largely confine myself to explaining the key ideas and providing several illustrative quotations from that document.

Reliability can be assessed in two ways: by internal consistency - whether the items within a multi-item scale correlate, or different scales of the same measure correlate - and by test-retest reliability, where the same question is given to the same respondent more than once at different times. Note that if the item in question genuinely does change between measurements, we would expect the test-retest reliability to be lower.

Regarding life evaluations, quoting OECD (2013, p47):

Bjørnskov (2010), for example, finds a correlation of 0.75 between the average Cantril Ladder measure of life evaluation from the Gallup World Poll and life satisfaction as measured in the World Values Survey for a sample of over 90 countries. [...] Test-retest results for single item life evaluation measure tend to yield correlations of between 0.5 and 0.7 for time period of 1 day to 2 weeks (Krueger and Schkade, 2008). Michalos and Kahlke (2010) report that a single-item measure of life satisfaction had a correlation of 0.65 for a one year period and of 0.65 for a two-year period.

And regarding affect/experience measures:

There is less information available on the reliability of measure of affect and eudaimonic well-being than is the case for measures of life evaluation. However, the available information is largely consistent with the picture for life satisfaction. In terms of internal consistency reliability, Diener et al. (2009) report [...] the positive, negative and affective balance subscale of their Scale of Positive and Negative Experience (SPANE) have alphas of 0.84, 0.88, and 0.88 respectively. [...] In the case of test-retest reliability, [...] Krueger and Schkade (2008) report test-retest scores of 0.5 and 0.7 for a range of different measures of affect over a 2-week period.

The authors conclude the life evaluation and affect measures exhibit sufficient correlation, by the standards of social science, to be deemed acceptably reliable.
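To make the two reliability checks above concrete, here is a minimal Python sketch of how test-retest reliability (a Pearson correlation between two waves of answers from the same respondents) and internal consistency (Cronbach's alpha across the items of a multi-item scale) could be computed. The scores and the function names are invented for illustration; they are not taken from the OECD guidelines or from any of the studies quoted.

```python
from statistics import mean, pvariance


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return cov / (ss_x * ss_y) ** 0.5


def cronbach_alpha(items):
    """Cronbach's alpha for a multi-item scale.

    `items` holds one list of scores per item, all over the same respondents.
    """
    k = len(items)
    totals = [sum(answers) for answers in zip(*items)]
    sum_item_var = sum(pvariance(scores) for scores in items)
    return (k / (k - 1)) * (1 - sum_item_var / pvariance(totals))


# Hypothetical 0-10 life satisfaction answers from six people, two weeks apart.
wave_1 = [7, 5, 8, 6, 9, 4]
wave_2 = [6, 5, 8, 7, 9, 5]
test_retest = pearson_r(wave_1, wave_2)

# Hypothetical three-item affect scale answered by the same six people.
affect_items = [
    [4, 2, 5, 3, 5, 1],
    [5, 2, 4, 3, 5, 2],
    [4, 3, 5, 2, 4, 1],
]
alpha = cronbach_alpha(affect_items)

print(f"test-retest r = {test_retest:.2f}, Cronbach's alpha = {alpha:.2f}")
```

Values closer to 1 indicate more consistent answers. With real survey data one would also want confidence intervals, which this sketch omits.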

Validity, by contrast, is somewhat harder to test than reliability because the underlying phenomena SWB measures attempt to capture are subjective, hence there is no objective way to demonstrate success. Nevertheless, there are three ways to assess validity. All of these ultimately rely on whether the measures conform to our expectations about the item we are intending to measure.

The first is face validity - do respondents judge the questions as an appropriate way to measure the concept of interest? If not, it’s likely the measures aren’t valid. In the case of SWB measures, it’s somewhat obvious this is the case, e.g. that asking people whether they felt happy yesterday is a good way to assess whether they felt happy yesterday. Participants aren’t generally asked about face validity, but this can be tested by (a) response speed and (b) non-response rates: if people take a long time to answer, or don’t answer at all, that suggests they don’t understand the question. Median response times for SWB questions are around 30 seconds for a single-item measure, suggesting the questions are not conceptually difficult (ONS, 2011). Quoting from OECD (2013, p49): “in a large analysis by Smith (2013) covering three datasets [...] and over 400,000 observations, item-specific non-response rates for life evaluation and affect were found to be similar for those for [the straightforward] measures of educational attainment, marital and labour force status” which, again, supports the face validity of the questions.

The second is convergent validity - does the item correlate with other proxy measures for the same concept? Kahneman and Krueger (2006) list the following as correlates of both high life satisfaction and happiness: smiling frequency; smiling with the eyes (“unfakeable smile”); rating of one’s happiness made by friends; frequent verbal expressions of positive emotions; happiness of close relatives; self-reported health. In addition, OECD (2013) states “Diener (2011), summarising the research in this area, notes that life satisfaction predicts suicidal ideation (r=0.44) and the low life satisfaction scores predicted suicide 20 years later in a later epidemiological survey from Finland (after controlling for other risk factors [...])”. Such items allow us to assess the measures from the perspective of falsifiability: if we expect that (say) those with low life satisfaction would commit suicide more often, but our measure of life satisfaction found those with high LS commit suicide more often, that would suggest the measure lacked validity. As it stands, the results support the validity of the experience and evaluation measures of SWB.

The third is construct validity - while convergent validity assesses how closely the measure correlates with other proxy measures of the same concept, construct validity concerns itself with whether the measure performs in the way we expect it to. From OECD (2013, p51):

Measures of [SWB] broadly show the expected relationship with other individual, social and economic determinants. Among individuals, higher incomes are associated with higher levels of life satisfaction and affect, and wealthier countries have higher average levels of both types of subjective well-being than poorer countries (Sacks, Stevenson and Wolfers, 2010). At the individual level, health status, social contact and education and being in a stable relationship with a partner are all associated with higher levels of life satisfaction (Dolan, Peasgood and White, 2008), while unemployment has a large negative impact on life satisfaction (Winkelmann and Winkelmann, 1998). Kahneman and Krueger (2006) report intimate relations, socialising, relaxing, eating and praying are associated with higher levels of positive affect; conversely, commuting, working and childcare and housework are associated with low level of net positive affect. Boarini et al. (2012) find that affect measures have the same broad set of drivers as measures of life satisfaction, although the relative importance of some factors changes.

Major life events, such as unemployment, marriage, divorce and widowhood, are shown to result in long-term, substantial changes to SWB, just as one would expect them to. The time-series in figure 1, from Clark, Diener, Georgellis and Lucas (2007), displays the LS-impact of such events for males (controlling for other variables) before, during and after they occur (the y-axis records the change in LS on a 0-10 scale; the results are similar for females). Note the time series shows anticipation of the event. We can see, for example, a decrease in LS leading up to a divorce, whereas widowhood is barely anticipated and comes as a huge shock.

Figure 1. The dynamic effects of life and labour market events on life satisfaction (male) (Clark, Diener, Georgellis and Lucas 2007). The y-axis represents the absolute change in life satisfaction on a 0-10 scale.

Figure 2, from Clark, Fleche, Layard, Powdthavee and Ward (2017, p100), shows a similar time-series, this time for disability, from three different data-sets. Individuals seem to partially, rather than fully, adapt to disability.11 This is what we might suppose would happen: becoming disabled is very bad, but being disabled is somewhat less bad as one’s lifestyle and mindset adjust. It’s worth noting here that one major potential objection to the use of SWB measures is that people do not really adapt to changes in circumstances, they simply change how they use their scales. However, if scale re-norming did take place, we would expect to see adaptation to all conditions. Yet we do not see this: the LS scores in figure 1 above show people adapt to some things and not others. Further, Oswald and Powdthavee (2008) find there is less adaptation to severe disability than to mild or moderate disability, suggesting scale re-norming is not occurring and that the SWB scores are reflecting reality.

Figure 2. Adaptation to disability in different country data-sets (Clark et al. 2017, p100)

As mentioned before, if the SWB measures had produced counter-intuitive results (got the ‘wrong answers’) that could lead us to conclude they were not valid. The above seems to match our expectations.

One finding that might, at least at first, seem counterintuitive is the relationship between SWB and income. While there is little disagreement that richer people within a given country report higher SWB (both on experience and evaluation measures), and richer countries report higher SWB, there is less consensus over whether SWB increases over time as countries become wealthier. This is the so-called ‘Easterlin Paradox’, displayed in figure 3 below (Clark et al. 2018, p203). A critical response to SWB measures could be made as follows: “the Easterlin Paradox shows increasing overall economic prosperity doesn’t increase SWB. But it’s obvious increasing overall economic prosperity should raise SWB. Therefore, the SWB measures must be wrong”.

Such a response is too quick. First, the debate still rages over whether the Easterlin Paradox holds - Stevenson and Wolfers (2008) argue it does not, to which Easterlin et al. (2016) reply. Second, as Clark (2016) notes, a large body of research finds individual SWB depends not just on the individual’s own income, but also on their income relative to that of the reference group they compare themselves to. Thus, if I am wealthier than you, I should expect to have higher SWB. However, if my income rises but the income of those I compare my income to also rises, these effects cancel out, leaving my SWB unchanged. Hence the Easterlin paradox can be explained in large part by the phenomenon of social comparison: we judge our lives against those of others.



Figure 3. Change in subjective well-being and GDP/head over time

In a particularly insightful study by Solnick and Hemenway (2005), individuals were asked to choose between different states of the world, as follows.

A: Your current yearly income is $50,000; others earn $25,000

B: Your current yearly income is $100,000; others earn $200,000

Absolute income is higher in B than in A, while relative income is higher in A than in B. Individuals express a clear preference for A, suggesting the importance of relative income. Hence, on further analysis, the Easterlin paradox is not as counter-intuitive as it might seem.
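The social-comparison explanation can be illustrated with a toy model. The sketch below assumes, purely for illustration, that life satisfaction depends on the logarithm of one's own income and on the logarithm of one's income relative to a reference group. The functional form, the weights and the function name are my assumptions, not estimates from the literature cited above.

```python
import math


def toy_ls_from_income(own, reference, absolute_weight=0.0, relative_weight=1.0):
    """Toy life satisfaction index (arbitrary units, not a 0-10 score).

    absolute_weight scales the effect of own income; relative_weight scales
    the effect of own income relative to the reference group's income.
    Both weights are illustrative assumptions.
    """
    return absolute_weight * math.log(own) + relative_weight * math.log(own / reference)


# The two worlds from Solnick and Hemenway's question.
world_a = toy_ls_from_income(own=50_000, reference=25_000)
world_b = toy_ls_from_income(own=100_000, reference=200_000)

# With purely relative concerns, A beats B even though absolute income is lower.
prefers_a = world_a > world_b

# If everyone's income doubles, the purely relative index does not move.
before = toy_ls_from_income(own=50_000, reference=50_000)
after = toy_ls_from_income(own=100_000, reference=100_000)
unchanged = math.isclose(before, after)
```

Setting `absolute_weight` above zero and `relative_weight` to zero reverses the ranking of the two worlds, which shows that the preference for A is being driven by the relative term.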

Overall, the evaluation and experience SWB measures seem both reliable and valid.

4. Can we compare individuals’ happiness scores?

We now move on from whether SWB measures are accurate, to whether they are comparable between individuals. Here’s a potential concern: for a given scale, say life satisfaction, is going from 7 to 8 for one person equivalent to another person going from 2 to 3? In jargon, this is the question of whether the scales exhibit interpersonal cardinality. Readers who are not interested in or concerned by this problem are welcome to skip to section 5.

I unpack this concern in stages.

The first question to ask is: does the underlying phenomenon of interest - the thing the SWB scales are trying to measure - have a cardinal structure, or is it merely ordinal? That is, does it represent something that can be quantified - like length, height, weight, etc. - or does it merely represent an ordering - like ‘A is taller than B’? (1st, 2nd, 3rd, … are the ordinal numbers; 1, 2, 3, … are the cardinal numbers.)

It is intuitively obvious happiness is cardinal, as revealed by our linguistic use. It is entirely sensible to say “X hurt twice as much as Y” or “I feel 10 times better than I did yesterday”.12 If happiness were ordinal, the most we could say would be “X hurts worse than Y” and “I feel better today than I did yesterday”.

If life satisfaction scales capture a psychological state of satisfaction, then this would be cardinal; as above, intuitively, one can feel twice as satisfied about X vs Y.13

Given the underlying phenomenon of interest has a cardinal structure, the next question is whether individuals’ reporting on the scale is equal-interval (another term for this is linear), i.e. whether going from 5/10 to 6/10 is an improvement equivalent to going from 7/10 to 8/10. One worry is that individuals interpret SWB scales as logarithmic, like the Richter scale, where the magnitude of going from 6/10 to 7/10 is 10 times that of going from 5/10 to 6/10, rather than as linear/equal-interval.14

While possible, non-linear reporting seems unlikely.15 Experimental evidence from Van Praag (1993) suggests that when presented with a number of (non SWB-related) points, respondents automatically treat the difference between points as roughly equal-interval. Further, it is intuitively much harder for ordinary people (i.e. non-mathematicians) to report how happy/satisfied they feel on a logarithmic scale than a linear one. If I ask myself “how happy am I right now on a 0-10 logarithmic scale?”, to try to answer this question I first have to think “how happy am I on a linear 0-10 scale?” I then try to remember how logarithms work and convert from there. This is so much harder to do that I assume scale use must be equal-interval.
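To see what is at stake between the two readings of the scale, here is a small Python sketch comparing the size of a one-point step under a linear and a logarithmic interpretation. The function name and the choice of base 10 (by analogy with the Richter scale) are illustrative assumptions, not a claim about how respondents actually behave.

```python
def step_size(lower, upper, interpretation="linear", base=10):
    """Size of the underlying change implied by moving from `lower` to `upper`.

    "linear": each reported point is one equal unit of the underlying state.
    "logarithmic": each reported point multiplies the underlying state by
    `base`, as on the Richter-style reading described above.
    """
    if interpretation == "linear":
        return upper - lower
    if interpretation == "logarithmic":
        return base ** upper - base ** lower
    raise ValueError(f"unknown interpretation: {interpretation}")


# Compare the step from 6 to 7 with the step from 5 to 6 under each reading.
linear_ratio = step_size(6, 7) / step_size(5, 6)
log_ratio = step_size(6, 7, "logarithmic") / step_size(5, 6, "logarithmic")
```

Under the linear reading the two steps are the same size (ratio 1); under the base-10 logarithmic reading the higher step is 10 times larger, so averaging raw scores across people would badly misstate the total change.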

Given the scales have intrapersonal cardinality, the final question is whether they have interpersonal cardinality: is one person’s reported one point increase on a 0 to 10 scale equivalent to a one point increase for someone else?

There are two different concerns here. First, individuals could correctly report where they are between the minimum and maximum points of the scales, but have different capacities for SWB. There could be ‘utility monsters’ who experience 1000 times more happiness than others. Second, individuals could have the same maximum and minimum capacities, but use the scales differently. Suppose almost everyone reports a given sensation as 6/10, but a few people report the same feeling as an 8/10; keeping the same terminology, this latter group are ‘language monsters’.

One reply applies to both concerns. First, so long as these differences are randomly distributed, they will wash out as ‘noise’ across large numbers of people: there will be as many people with a greater capacity for SWB as those with less, and as many who use the scale too conservatively as use it too generously. Second, in response to utility monsters, it seems unlikely, given our shared biology, that the utility capacities of humans will, in practice, vary by very much.16 Third, regarding language, I observe that we do tend, in general, to regulate one another’s language use. For instance, if I say “I’m having a terrible day: I stubbed my toe” you are likely to say “Hold on. That’s not a terrible day. That’s a mildly bad day”. A hypothesis, which could conceivably be tested, is that this language regulation pushes us towards using SWB scales in a similar way. If language did not have a shared meaning, it would be of no use at all.
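To make the ‘noise’ reply concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration rather than data: the function name, the idea that each person multiplies the same true gain by a personal factor, and the parameter values (a true gain of 0.3 points, a spread of 0.5) are all hypothetical.

```python
import random
import statistics

def mean_reported_gain(true_gain, n_people, bias_mean=0.0, bias_sd=0.5, seed=0):
    """Average reported LS gain when each person rescales the same true gain.

    Person i reports true_gain * (1 + e_i), with e_i drawn from a normal
    distribution. bias_mean = 0 is the 'randomly distributed' case; a non-zero
    bias_mean models a group that systematically over- or under-reports.
    All parameter values are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    reports = [true_gain * (1 + rng.gauss(bias_mean, bias_sd))
               for _ in range(n_people)]
    return statistics.mean(reports)

# Idiosyncratic differences only: the average should sit close to the true 0.3.
unbiased = mean_reported_gain(0.3, n_people=100_000)

# A systematic 20% over-report: the average should sit close to 0.36 instead.
skewed = mean_reported_gain(0.3, n_people=100_000, bias_mean=0.2)

print(round(unbiased, 2), round(skewed, 2))
```

The contrast between the two runs is the point: individual differences in capacity or scale use cancel in the average only if they are not systematically skewed, which is why the question of group-level differences in scale use still needs a separate answer.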

We might object to this last point that, even if groups regulate their members’ language use, different groups could still use scales differently. As an empirical test of this, a study by Helliwell et al. (2016) of immigrants moving from 100 different countries to Canada found that, regardless of country of origin, the average levels and distributions of life satisfaction among immigrants mimic those of Canadians, suggesting LS reports are primarily driven by life circumstances. If there really were substantial cultural differences in LS scale use, this result would not occur.

Therefore, it seems reasonable to interpret SWB data as interpersonally cardinal. However, as this point seems important, more work here would be welcome.

5. Using life satisfaction as the common currency for well-being adjusted life years (WALYs)

Suppose we accept we can use LS scores to measure happiness. What next? One straightforward option would be to measure LS (and other SWB metrics) impacts directly in RCTs. If we know the costs of a programme, we could then establish how much it costs to produce one ‘life satisfaction point-year’ or ‘LSP’ - equivalent to increasing life satisfaction for one person by one point on a 10 point scale for a year. This method is structurally similar to assessing cost per Quality-Adjusted Life Year (QALY) - which effective altruists are already familiar with and I won’t go into - except QALYs are measured on a 0-1 scale whereas LS is on a 0-10 scale.

Table 1. How adult life satisfaction (0-10) is affected by current circumstances (BHPS) (cross-section) (Clark et al. 2018, p199)

I expect many effective altruists will welcome the idea of using LSPs instead of QALYs. Most EAs already accept that, in principle, we need a measure of ‘well-being adjusted life-years’ (WALYs).17 QALYs capture health and, as I noted at the start, not only is health not all that matters, but we will still need a common currency that allows health and non-health outcomes to be traded off against one another, and a non-arbitrary method to determine the value of outcomes in this currency. LSPs could partially or fully fulfill the role of being the WALY metric. For those who think happiness is the only intrinsic good, LSPs should be sufficient - unless and until a better measure of happiness can be found. Those who value goods other than happiness will, presumably, value happiness to some extent, and inasmuch as they do, LSPs will be one aspect of WALYs they need to consider alongside other goods.18

RCT data using LS will not always be available. Where it is not, an alternative way to determine how different outcomes affect LS is to rely on data from large population surveys. Using a multivariate regression analysis that controls for different circumstances, researchers can then estimate the strength of the correlations between LS and various other factors. Table 1 from Clark et al. (2018, p199) contains the results of such an analysis, covering both the impact a given change has on an individual’s own LS and the impact it has on the LS of others.

This information can be used to make inferences about the expected LS effect of a given outcome without requiring an RCT, at least if it’s straightforward to measure the outcome, as it is in cases of unemployment. In other cases the relationship between life satisfaction and other measures, such as particular health metrics, will need to be established so those metrics can be converted into LS scores. Some of this work has been done: see Layard (2016) for a table converting LS scores into both other SWB measures and various health metrics.

Two sets of comments are worth mentioning before we turn to charity analysis. First, three remarks on the results in the table that will be relevant again shortly: 1) doubling income is associated with a constant increase in life satisfaction; 2) the gain one individual receives from a doubled income comes with a nearly equal loss in LS to others; 3) mental health, employment and partnership have a much bigger per-person impact than a doubling of income does.

Second, while it is already possible to estimate the LS effect of many outcomes, if effective altruists want to use SWB data to assess effectiveness, they should encourage researchers - most obviously those working in global development - to collect it alongside other variables. This only requires quickly surveying individuals at the start and end of an impact assessment. This generates extra work, but also allows direct measurement of the outcome that is (presumably) of most interest.

6. Evaluating the life satisfaction impact of (GiveWell) charities

GiveWell has identified the charities it considers do the most good per dollar. We can use the LS lens to assess how good these charities are at increasing life satisfaction. GiveWell’s top charities can be divided into (1) life-saving charities, such as the Against Malaria Foundation, and (2) life-improving charities, those that increase individuals’ well-being during their lives, such as GiveDirectly and the Schistosomiasis Control Initiative (SCI). According to GiveWell, the vast majority of the benefit of their recommended life-improving charities arises from eventual income and consumption gains, rather than gains to health.19 We’ve already seen that making people richer is a surprisingly unpromising way of increasing LS in developed countries (as increasing the wealth of some reduces the happiness of others). I show this also seems to happen even at low levels of income, such that treating mental health via a charity like StrongMinds looks much more cost-effective. Then I illustrate how to compare life-saving to life-improving charities. I claim it’s unclear, due to some methodological issues, which of the interventions increases LS more effectively.

6.1 Life-improving charities

Let’s start with GiveDirectly, a charity which provides unconditional cash transfers to Kenyan farmers, as there have now been three studies conducted using life satisfaction data (alongside other SWB metrics). Research suggests GiveDirectly’s cash transfers increase life satisfaction by about 0.3 life satisfaction points - LSPs - on a 10 point scale.20 This was measured after 4.3 months on average, but let’s assume that this effect lasts a whole year, that it applies to everyone in the recipient household, and that there are 5 people per household on average. The average cash transfer is $750, which generates 1.5 LSPs with our assumptions (0.3 x 1 x 5), implying a cost-effectiveness of 2 LSPs/$1000.
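The arithmetic above can be sketched in a few lines. The function name and parameter names are hypothetical, introduced only for illustration; the inputs are the figures assumed in the text (a 0.3 point gain, lasting 1 year, for 5 household members, per $750 transfer).

```python
def lsp_per_1000_dollars(ls_gain, years, people, cost_usd):
    """LSPs produced per $1,000 spent.

    Total LSPs = point gain x duration in years x number of people affected;
    dividing by cost and scaling by 1,000 gives the cost-effectiveness figure.
    """
    total_lsps = ls_gain * years * people
    return total_lsps / cost_usd * 1000

# GiveDirectly, using the assumptions stated in the text.
give_directly = lsp_per_1000_dollars(ls_gain=0.3, years=1, people=5, cost_usd=750)
print(round(give_directly, 2))  # 2.0
```

Writing the calculation this way makes each assumption an explicit input, so it is easy to see how the estimate moves if, for example, the effect lasts only the 4.3 months over which it was measured rather than a full year.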

However, this estimate is likely to substantially overstate the effectiveness of cash transfers. It only accounts for the life satisfaction increase of recipients. Research into GiveDirectly has suggested that their cash transfers, while making some people wealthier (and so more satisfied with life), have negative spillovers: they make non-recipients less satisfied. As Haushofer, Reisinger and Shapiro (2015, p1) state:

The decrease in life satisfaction induced by transfers to neighbors more than offsets the direct positive effect of transfers, and is largest for individuals who did not receive a direct transfer themselves.

This might seem surprising, but the finding that increasing wealth leaves aggregate LS relatively unchanged is entirely consistent with the findings in Table 1 and the results mentioned in Clark (2016) above.

We might hope these negative spillovers would dissipate eventually and, over the long run, cash transfers would be effective in increasing life satisfaction. However, a new 2018 study on the long-term (3 year) effects of GiveDirectly by Haushofer and Shapiro (2018, p. 22) finds recipients, compared to non-recipients in distant villages, have 40% more assets but that recipients do no better on a psychological well-being index. GiveWell discuss this study, note it suggests cash transfers are less effective than they thought, but state they are awaiting the results of GiveDirectly’s “general equilibrium” study, which aims to assess spillover effects, before updating their cost-effectiveness assessment.21

I also look forward to further research, but for the moment I think the evidence suggests it’s far from obvious cash transfers have a robust, positive effect on happiness (measured as life satisfaction) in either the short or the long-term. Many people assume cash transfers must increase happiness, but it’s unclear what evidence someone could produce to support this intuition.

There are a few objections one could make to defend the effectiveness of cash transfers here.

First, one could simply ignore the negative spillovers. It’s unclear how this would be justified. Even if we did do this, as I will shortly show, StrongMinds, a mental health charity, looks more cost-effective anyway.

Second, we could say that LS scores have got the intuitively wrong answer here and therefore they are not a valid measure of happiness (specifically, the claim is they lack construct validity). This response seems unconvincing: the critic would need to explain how LS seemed to get the wrong result in this case whilst getting the right result in many other areas. If LS measures succeed in measuring individuals’ life satisfaction, presumably they do so in general.

Third, one could claim that happiness is not all that matters. Yet, presumably, it is one of the things that matters. Hence, most (if not all) people will want to consider the impact on happiness. Plausibly, cash transfers could be justified on non-happiness grounds, such as autonomy promotion. Someone who pressed this would need to determine how to trade off happiness against autonomy and come to an overall decision about which charity did the most good; this is not a concern I can address here.22

Fourth, we might hope there are long-term, societal effects of increasing wealth, even if it doesn’t increase the aggregate life satisfaction of the immediate recipients and their neighbours over the first 3 or so years. Note this would be a very different justification for donating to GiveDirectly from the usual one given, which is that it benefits the recipients. It would require substantially altering the cost-effectiveness analysis. Further, it’s not obviously true that increasing a country’s wealth will increase aggregate life satisfaction. I’ve already mentioned the Easterlin paradox, which refers to developed countries, but a particularly arresting case of development from a low level failing to increase happiness is China: its SWB seems to have gone down between 1990 and 2015, even though per capita GDP increased fivefold (Easterlin, Wang and Wang 2017).23

Fifth, we might think this problem could be avoided if money were given to everyone in the village, rather than just to some. This ignores the concerns about social comparisons. If social comparisons occur, then we would expect making everyone in village A richer to reduce life satisfaction in village B, an adjacent non-recipient village. We could respond to this with “Fine. But what if we made everyone richer?” which is a restatement of the previous objection.

To be clear, the concern about GiveDirectly isn’t that it is an ineffective way of alleviating poverty. Rather, the concern is that alleviating poverty is surprisingly ineffective at increasing happiness (measured as LS). Thus, from the LS perspective, it is unsatisfactory to object that other top-rated GiveWell charities are more effective than GiveDirectly at alleviating poverty. What needs to be shown is that alleviating poverty increases happiness, and it’s unclear what evidence supports this thesis.

We should now be concerned about the happiness-increasing effectiveness of all of GiveWell’s life-improving charities, as those charities are deemed effective on the assumption they increase wealth and increasing wealth does good overall. To illustrate, the Schistosomiasis Control Initiative (SCI), a charity which treats children for intestinal worms, is a top-rated GiveWell charity that, in fact, GiveWell consider more cost-effective at doing good than GiveDirectly. We might think the worries about the ineffectiveness of poverty alleviation would not apply here as SCI provides a physical health treatment. Yet, although SCI provides a health intervention, GiveWell claim that only 2% of SCI’s impact comes from ‘short-term health gains’. The remaining 98% arises from ‘eventual income and consumption gains’: dewormed children earn more in later life, and their well-being rises as a result of this additional income. Hence, the same doubts extend to SCI as well.

An objection here is that we should expect interventions which help people earn their own money - such as by improving their health - to increase happiness by more than cash transfers, which simply give money to them.24 It seems unlikely this would be true: it’s generally argued the merit of cash transfers is people can invest the money and use it to earn more for themselves later. Hence the long-term value of cash transfers is from earned income too.

Now we turn to StrongMinds, a mental health charity that provides interpersonal group therapy to women in Uganda. As the LS analysis of Clark et al. (2018) suggests treating mental health is among the most cost-effective ways for developed-world governments to increase happiness, it is the natural first place to look when searching for an effective charitable intervention. There is no research which has directly measured the LS impact of treating mental health, so I estimate its effectiveness using other data.25 To save space I have put this analysis into a spreadsheet. I infer that the treatment effect is 0.2 LSPs per year for 4 years.[41] StrongMinds say their per-participant costs are $102 (StrongMinds Q1.2018 report). That suggests the impact is 0.8 LSPs (4 years x 0.2 LS gain) per $102, or 8 LSPs/$1000 (rounding up from 7.84). There is not space here to go into the details of mental health treatments or argue they are effective.26

Through the LS lens, StrongMinds is more effective than GiveDirectly (and other poverty-alleviating charities) simply because it seems to clearly increase net happiness.27 It’s possible another charity or organisation will be much more effective at increasing happiness than StrongMinds, but that is a topic I plan to cover elsewhere.28

For those interested in increasing happiness, this result is important. Using the empirical data on happiness illuminates a new category of intervention - treating mental health - that effective altruists have so far overlooked and that now appears more cost-effective than alleviating poverty.29 Part of this must be due to effective altruists’ historic reliance on health metrics - QALYs and DALYs - which seem to underrate the happiness impact of mental health conditions relative to physical ones. I discuss this in Annex A.

6.2 Life-saving charities

As noted in the introduction, the key question in GiveWell’s cost-effectiveness evaluation is “how many years of doubled consumption are as morally valuable as saving the life of an under-5-year old child?” They need an answer to this in order to compare the cost-effectiveness of life-saving and life-improving charities; as I stated, answering it combines judgements about facts with judgements about value.

Using LS scores we can take a different approach to making this comparison, which is to work out the cost-effectiveness of life-saving interventions in LSPs. First we need to know the cost to save a life. According to GiveWell’s estimates, the Against Malaria Foundation (AMF) saves a life for around $3,500 (i.e. prevents a premature death).30

Second, we need the number of years that person would have lived for. Suppose AMF grants 60 counterfactual years of life.

Third, we need to establish how many net LSPs the person gains per year - how much better their lives are than the ‘neutral point’ equivalent to being dead. Average life satisfaction in Kenya, where AMF operates, is 4.4/10 (Helliwell, Layard and Sachs 2017, p28). Now we run into a problem. Life satisfaction surveys don’t ask people to specify what point on the 0 to 10 scale they would consider equivalent to not being alive. 0 is labelled ‘not at all’ and 10 ‘completely satisfied’. Intuitively, the midpoint of the scale, 5, would be the neutral point. Yet, if that’s true, then saving lives through AMF would, in fact, be bad. 4.4 is below the neutral point, so AMF would be prolonging bad lives, lives not worth living.31

Let’s suppose instead the neutral point is 4. If this is so, saving the child is worth 0.4 life satisfaction points a year for 60 years, thus 24 LSPs (0.4 x 60).

Given the $3,500 cost, we can calculate cost-effectiveness as 6.9 LSPs/$1,000. Earlier, I estimated StrongMinds’ cost-effectiveness was around 8 LSPs/$1,000. Hence, we can now compare our life-saving and life-improving interventions in the same units of cost-effectiveness.

A problem for our analysis is that these cost-effectiveness numbers are highly dependent on a (so far) arbitrary decision about where the neutral point goes. If someone instead set the neutral point at 3, which intuitively seems too low, then each year of life would be worth 1.4 points rather than 0.4, AMF’s cost-effectiveness would leap to about 24 LSPs/$1,000, and it would be more cost-effective than StrongMinds.

How could we settle where the neutral point is? Two strategies seem possible. First, we could find out at what LS scores individuals report their lives are neutral on experience measures of SWB. Second, we could poll people to ask at what LS score out of 10 they would be indifferent between living with that score for the rest of their life and dying. I am unaware of any analysis which has been conducted along either line. Thus, for the moment, this unfortunately remains a point of armchair conjecture.

Thus, using the LS scores, we can establish the net happiness of different outcomes. This is a question of fact, and while we are able to make much of the calculation without relying on our subjective judgements, we have had to do so regarding the location of the neutral point.

However, what is a moral judgement, and must be made implicitly or explicitly, is what the badness of death is. On the ‘life comparative’ account of the badness of death, the value of saving a life is the total well-being the person would have had if they’d lived. The numbers I’ve produced above implicitly assumed this was the correct view.

One popular alternative view about the badness of death, and the view GiveWell staffers take, is the Time-Relative Interest Account (TRIA).32 According to TRIA, it’s more important to save (say) 20-year olds than 2-year olds even though saving the 2-year old would, we suppose, cause around 18 more years of life to be lived. The reason to count the 2-year old for less is that very young children, being relatively underdeveloped, will have a weaker interest in continuing to live than a fully-developed 20-year old. We can see from the ‘moral weights’ tab of GiveWell’s cost-effectiveness analysis that GiveWell staffers seem to adopt TRIA:33 the value of saving an over-5 year old is, depending on the staff member, 100% to 400% of that of saving an under-5 year old; the median weight is 200%.34

Note that advocates of TRIA will still need to know what total well-being the person would have had: TRIA takes this as an input, which is then discounted according to the age of the person to determine the value of saving the life (ascertaining this discount is a moral judgement).

Let’s suppose we are TRIA advocates and now reduce the cost-effectiveness of AMF by half. If the neutral point is 3, then the cost-effectiveness of AMF is about 12 LSPs/$1,000, and only about 50% more cost-effective than the estimate for StrongMinds.35 If the neutral point is 4, AMF is at about 3.4 LSPs/$1,000 and thus less cost-effective than StrongMinds.
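Since the comparison turns on two contested inputs, the neutral point and the TRIA discount, a small sensitivity sketch may help. The function names, the `tria_weight` parameter and the break-even helper are my own illustrative assumptions; the inputs are the figures used in the text (average LS of 4.4, 60 counterfactual years, $3,500 per life saved, and the StrongMinds estimate of 0.2 LSPs for 4 years per $102).

```python
def amf_lsp_per_1000(neutral_point, tria_weight=1.0,
                     avg_ls=4.4, years=60, cost_per_life=3500):
    """LSPs per $1,000 from saving a life, for a given neutral point.

    tria_weight scales the value of the saved life: 1.0 corresponds to the
    life-comparative account, 0.5 to the halving assumed in the text.
    """
    lsps_per_life = (avg_ls - neutral_point) * years * tria_weight
    return lsps_per_life / cost_per_life * 1000

def break_even_neutral_point(benchmark, tria_weight=1.0,
                             avg_ls=4.4, years=60, cost_per_life=3500):
    """Neutral point at which life-saving exactly matches the benchmark."""
    return avg_ls - benchmark * cost_per_life / (1000 * years * tria_weight)

# StrongMinds benchmark from the text: 0.2 LSPs x 4 years per $102.
STRONGMINDS = 0.2 * 4 / 102 * 1000

# Sensitivity grid: neutral point x TRIA weight.
for neutral in (3, 4, 5):
    for weight in (1.0, 0.5):
        print(neutral, weight, round(amf_lsp_per_1000(neutral, weight), 1))

# Neutral points at which AMF and StrongMinds tie, with and without halving.
print(round(break_even_neutral_point(STRONGMINDS), 2))
print(round(break_even_neutral_point(STRONGMINDS, tria_weight=0.5), 2))
```

Under these assumptions the two charities tie at a neutral point of about 3.9 on the life-comparative account and about 3.5 with the halved weight, and AMF’s figure turns negative once the neutral point exceeds 4.4. The ranking therefore flips within a range of neutral points that nobody has yet measured.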

Further complications arise when we try to account for the ‘social value’ of saving lives - the impact saving a life has on everyone apart from the saved individual. We can divide this up into: (1) the effects of the death on friends and family; (2) concerns about under- or overpopulation;36 (3) the meat eater problem, the impact saving lives has on increasing animal suffering due to meat consumption.37 (2) and (3) are important but too much of a diversion from our discussion here on using LS measures. Figures for (1) could be derived using LS data. Oswald and Powdthavee (2008) estimate that, shortly after the event, the death of a child causes a 0.6 loss (on a 0-10 LS scale), a spouse a 1.3 loss, and a parent a 0.4 loss. Further work would be needed to assess what the total counterfactual LS impact of saving a life on friends and family is over time, and to adjust these figures for the developing country setting.38

As I’ve explained the LS approach and shown how it reaches different results from GiveWell, perhaps the natural thing to do would be to explain what GiveWell’s method is and whether the differences emerge from disagreements about value or disagreements about facts. Unfortunately, this is not straightforward to do as GiveWell’s key metric (again - trading off years of doubled consumption against the value of saving an under-5-year old) combines judgements of facts and value. I understand some GiveWell staff incorporate SWB surveys into their judgements, but I am not aware of any staff member who bases their analysis solely on them. What’s more, as GiveWell take the median answer of their staff, the ‘GiveWell view’ is a composite held by no one in particular. Hence it’s impossible to know exactly where the disagreements lie. My approach has been to make it clear, on the best available evidence about happiness, what we can say about the happiness impact of various outcomes, information everyone, I expect, will need to take into account. Readers who do not believe the best outcome is the one with the largest sum of happiness will need to adjust the analysis accordingly.

To summarise this section: it is now practically possible to use self-reported life satisfaction scores to determine how much various outcomes increase happiness. I applied the LS approach to evaluating the cost-effectiveness of GiveWell’s top charities and showed poverty-alleviating charities are unpromising compared to mental health interventions; empirical and philosophical questions remain over comparing the value of life-saving to life-improving interventions.

7. What should effective altruists do next?

Relying on life satisfaction (and other SWB measures) to tell us how to maximise human happiness is a new approach for effective altruists. If - as I have argued we should - we think this is the correct method and we accept its results, then it challenges the current assumptions within EA about how to increase happiness. Most obviously, it suggests looking at the best ways to improve mental health, which is currently not regarded as a priority. The above analysis, comparing developing world charities, is just the first step in ascertaining how individuals can maximise happiness with their time and money. Much more work is required. We will need new evaluations of interventions and charities. We will need to identify the relevant and useful players within the happiness-increasing space and build a community around it. We will need to think about what high-impact careers look like for those who want to maximise human welfare.39 We will need to develop a research strategy and determine what the priorities on it are.

This is a large challenge and, after EAGxNetherlands in 2018, a small ‘Human Welfare Task Force’ (HWTF) formed to think about how we might take this forward. It currently consists of myself, Alex Lintz, Denisa Pop, Robin van Dalen, Siebe Rozendal, Peter Brietbart and Jessica van Haften. If you wish to be involved, please email siebe[at]eagroningen.org and michael.plant[at]philosophy.ox.ac.uk.

If you agree with the Manifesto, here are some other concrete actions you can take:

● Arrange a local EA gathering where you discuss this issue, possibly watching my talk on Maximising World Happiness from EA Global London 2017.

● Get yourself up to speed on the latest research. I’ve produced a Reading List: Happiness for Effective Altruists and Other Humans.

● We have started working on a Human Welfare Research Agenda of questions we think need answering. You are welcome to make suggestions or pick something from the list to begin investigating. If you would like to investigate something, please get in touch so we can coordinate more effectively. We are also collecting some other research documents in our Google Drive folder.

● Join the Facebook group Effective Altruism, Mental Health and Happiness.

● If you currently donate to anti-poverty charities, you could switch your donation to an effective mental health charity instead. That said, given the uncertainty about what the best way to promote human welfare is, you may want to wait for further information. We are considering the possibility of setting up a research organisation focused on human happiness. If you’re interested in funding research into this area, please email me.



Annex A: How good are QALYs and DALYs as proxies for happiness?

Effective altruists have tended to use health metrics - QALYs and DALYs - as the proxy for WALYs. However, these standard health metrics are misleading proxies for happiness. For ease, I quote at length from Clark et al. (2018, p85):

In the QALY system, the impact of a given illness in reducing the quality of life is measured using the replies of patients to a questionnaire known as the EQ5D. Patients with each illness give a score of 1, 2, or 3 to each of five questions (on Mobility, Self-care, Usual Activities, Physical Pain, and Mental Pain). To get an overall aggregate score for each illness a weight has to be attached to each of the scores. For this purpose members of the public are shown 45 cards on each of which an illness is described in terms of the five EQ5D dimensions. For each illness members of the public are then asked, “Suppose you had this illness for ten years. How many years of healthy life would you consider as of equivalent value to you?” The replies to this question provide 45N valuations, where there are N respondents. The evaluations can then be regressed on the different EQ5D dimensions. These “Time Trade-Off” valuations measure the proportional Quality of Life Lost (measured by equivalent changes in life expectancy) that results from each EQ5D dimension.

As can be seen, these QALY values reflect how people who have mostly never experienced these illnesses imagine they would feel if they did so. A better alternative is to measure directly how people actually feel when they actually do experience the illness.

The result would be very different. Figure [4] contrasts the outcomes from these two different approaches. The existing QALY weights are shown by the shaded bars of Figure [4]. This scale has been normalized so that the bars can be compared with those from a regression of life-satisfaction on the same variables. This latter regression is shown in the black bars in the figure—the magnitudes here are not β-statistics but the absolute impact of each variable on life-satisfaction (0–1). As can be seen from the lower part of the figure, the public hugely underestimated by how much mental pain (compared with physical pain) would reduce their satisfaction with life.

Figure 4. How life satisfaction (0-1) is affected by the EQ5D, compared with weights used in QALYs

QALYs are not a very good guide to what makes people satisfied because they are based on people’s preferences over how bad they imagine various health states are, rather than how bad those states are when they experience them, and as noted earlier, we are not very good at imagining what makes us or others happy. To highlight a particularly outstanding discrepancy, Dolan and Metcalfe (2012, from whom the above Figure 4 is derived) report subjects agreed to hypothetically give up as many years of their remaining life, about 15%, to be cured of ‘some difficulty walking’ as they would to be cured of ‘moderate anxiety or depression.’ However, on SWB measures ‘moderate anxiety or depression’ is associated with a loss to life satisfaction 10 times greater, and a loss to daily affect 18 times greater, than ‘some difficulty walking’ is. This seems compelling evidence, if we need any, that if we rely on people’s preferences about imagined futures we will get the wrong answers about what makes individuals happy. The explanation here is that, when imagining the future, we fail to anticipate that our ‘psychological immune system’ will ‘kick in’ and cause us to adapt to some circumstances but not others: what Gilbert et al. (2009) call ‘immune neglect’. Conditions such as mobility impairment are things we stop paying attention to, whereas mental illnesses are comparatively ‘full-time’ and continue to affect our subjective experiences.

I’m unaware of any studies comparing DALYs and SWB measures directly, but given how DALYs are constructed - typically by asking experts for ratings - we would expect the same problems to occur. See Sassi (2006) for a comparison of the methodologies for QALYs and DALYs.

The implication of this analysis is that we should substantially reduce our estimates of how cost-effective physical health interventions are compared to mental health interventions, assuming we’d previously judged them by QALYs and DALYs as Giving What We Can did in their reports into mental health (GWWC 2015, 2016). Thus, unless we find physical health interventions that are incredibly cheap compared to mental health treatments, we should be sceptical that physical health interventions will turn out to be comparatively more cost-effective. The possible exception would be using opiates to treat severe pain: pain is clearly very bad for well-being and opiates can be very cheap (Knaul et al. 2018).

Perhaps effective altruism has largely overlooked mental illness because of the movement’s early reliance on QALYs/DALYs as an approximation of well-being. Given how much QALYs underrate the badness of mental illness, it’s not much of a surprise that individuals using those metrics would be led to the (false) conclusion that mental health is comparatively unimportant.