Abstract

Objective
The assessment of response to lithium maintenance treatment in bipolar disorder (BD) is complicated by variable length of treatment, unpredictable clinical course, and often inconsistent compliance. Prospective and retrospective methods of assessment of lithium response have been proposed in the literature. In this study we report the key phenotypic measures of the “Retrospective Criteria of Long-Term Treatment Response in Research Subjects with Bipolar Disorder” scale currently used in the Consortium on Lithium Genetics (ConLiGen) study.

Materials and Methods
Twenty-nine ConLiGen sites took part in a two-stage case-vignette rating procedure to examine inter-rater agreement [Kappa (κ)] and reliability [intra-class correlation coefficient (ICC)] of lithium response. Annotated first-round vignettes and rating guidelines were circulated to expert research clinicians for training purposes between the two stages. Further, we analyzed the distributional properties of the treatment response scores available for 1,308 patients using mixture modeling.

Results
Substantial and moderate agreement was shown across sites in the first and second sets of vignettes (κ = 0.66 and κ = 0.54, respectively), without significant improvement from training. However, definition of response using the A score as a quantitative trait and selecting cases with B criteria of 4 or less showed an improvement between the two stages (ICC 1 = 0.71 and ICC 2 = 0.75, respectively). Mixture modeling of score distribution indicated three subpopulations (full responders, partial responders, non-responders).

Conclusions
We identified two definitions of lithium response, one dichotomous and the other continuous, with moderate to substantial inter-rater agreement and reliability. Accurate phenotypic measurement of lithium response is crucial for the ongoing ConLiGen pharmacogenomic study.

Citation: Manchia M, Adli M, Akula N, Ardau R, Aubry J-M, Backlund L, et al. (2013) Assessment of Response to Lithium Maintenance Treatment in Bipolar Disorder: A Consortium on Lithium Genetics (ConLiGen) Report. PLoS ONE 8(6): e65636. https://doi.org/10.1371/journal.pone.0065636

Editor: Kazutaka Ikeda, Tokyo Metropolitan Institute of Medical Science, Japan

Received: January 24, 2013; Accepted: April 26, 2013; Published: June 19, 2013

This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Funding: The work on assessment of lithium response has been supported by a grant from Canadian Institutes of Health Research #64410 to MA. MG-S was supported by Unitatea Executiva pentru Finantarea Invatamantului Superior, a Cercetarii, Dezvoltarii si Inovarii (UEFISCDI), Bucharest, Romania (grant no. 89/2012). JMA and AN were supported by a grant from the Swiss National Foundation (#32003B_125469/1 to JM Aubry). ConLiGen is in part supported by funds from the Intramural Research Program of the National Institute of Mental Health (NIMH) at the National Institutes of Health (NIH), Department of Health and Human Services, United States Government. It is further supported by a grant from the Deutsche Forschungsgemeinschaft to MR, MB, and TGS (RI 908/7-1). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: Co-authors James B. Potash, Andreas Reif, and Bernard T. Baune are PLOS ONE Editorial members. This does not alter the authors’ adherence to all the PLOS ONE policies on sharing data and materials.

Introduction

Bipolar disorder (BD) is a lifelong and severe psychiatric illness characterized by recurrences of episodes of depression and hypomania/mania [1]. Lithium is among the first-line maintenance treatments for BD [2], [3], preventing relapses and recurrences of opposite polarity. In addition, lithium decreases the risk of suicidal behaviour and all-cause mortality in mood disorders [4]–[6]. Naturalistic analyses show that approximately one third of BD patients achieve complete remission on lithium [7]–[14]. Lithium-responsive BD patients have distinct clinical features, such as episodicity of clinical course [15], absence of rapid cycling [16], and a family history of BD [17], corresponding to the BD “core phenotype” [18]. Despite a significant genetic component for lithium-responsive BD [12], [19], pharmacogenetic studies have not produced replicated results [20], [21]. One possible explanation for the lack of conclusive pharmacogenetic findings is the varying definition of lithium response across studies. Indeed, the assessment of lithium maintenance treatment response, and consequently the definition of the phenotype under study, is complicated by factors inherent to the natural history of BD. The irregular clinical course of BD [22] and variable treatment adherence [23] are only a few of the factors that contribute to the complexity of assessing the response to lithium maintenance treatment. To reduce the impact of the clinical heterogeneity of BD in pharmacogenetics (and possibly to define genetically more homogeneous subgroups of BD patients), researchers have proposed selecting prospectively followed patients on lithium monotherapy with unequivocal clinical response [24], [25]. However, this may not be practical if large patient samples are needed. In such cases, we need to rely on retrospective evaluation of treatment response.
Several such methods have been described in the literature, including the Affective Morbidity Index (AMI) [26] and the Illness Severity Index [27]. The AMI takes into account the duration and the severity of an episode, the latter scored on a 4-point scale (0 = no conspicuous affective disturbance, 1 = mild depression or mania, 2 = moderate depression or mania, 3 = severe depression or mania). The area under the curve can be calculated from these two variables and compared between defined treatment periods. Similarly, the Illness Severity Index measures the efficacy of lithium treatment in controlling mood episodes. It is defined as the frequency of affective episodes prior to starting lithium, adjusted for age at the time lithium was started [27]. However, changes in affective morbidity might result not only from the treatment but also from other factors.

In the Consortium on Lithium Genetics (ConLiGen, www.ConLiGen.org) study [28], we adopted the “Retrospective Criteria of Long-Term Treatment Response in Research Subjects with Bipolar Disorder” as the principal method of evaluation of the response to lithium [12], [13]. In addition to measuring the degree of clinical improvement, this scale weighs clinical factors considered relevant in determining whether the observed clinical change is in fact due to the lithium treatment. Since ConLiGen is an international multi-centre collaboration, it has been crucial to assess the key phenotypic measures and the reliability of the assessment of response to long-term lithium treatment across the participating research groups. Here we present: 1) the results of the reliability analysis of response to lithium treatment across the participating centres, and 2) the distributional properties of the scale scores. These two sets of findings have been instrumental in obtaining stringent phenotypic definitions of lithium response.
These analyses are of particular importance in light of the genome-wide association study (GWAS) currently being undertaken by ConLiGen.

Materials and Methods

Assessment of Clinical Response to Lithium Treatment
The response to lithium treatment was measured using a previously published and validated rating scale: the “Retrospective Criteria of Long-Term Treatment Response in Research Subjects with Bipolar Disorder” [12], [28]. Briefly, this scale quantifies the degree of improvement in the course of treatment (A criterion or A score), expressed as a composite measure of change in frequency and severity of mood symptoms. The A score is weighed against five factors (B criteria), which allow one to determine whether the observed improvement is a result of the treatment rather than a spontaneous improvement or an effect of additional medication. Specifically, the B criteria consider: the number of episodes before/off the treatment (B1), the frequency of episodes before/off the treatment (B2), the duration of the treatment (B3), the compliance during period(s) of stability (B4) and the use of additional medication during the period of stability (B5). The total score (TS) is obtained by subtracting the B score from the A score.

Analysis of the Inter-rater Agreement and Reliability of the Assessment of Lithium Response
The agreement and reliability of the assessment of lithium response among raters from 29 ConLiGen participating centres were measured using a two-stage case-vignette rating procedure (Table 1). Specifically, the study protocol had three phases: 1) twelve standardized case vignettes prepared by investigators (M.A., J.G., C.S.) at Dalhousie University were circulated and rated by 70 investigators; 2) annotated first-round vignettes and rating guidelines were circulated for training purposes after the first stage; 3) sixteen additional, more complex vignettes prepared by senior researchers at Dalhousie University, Johns Hopkins University School of Medicine, National Institute of Mental Health (NIMH) and Academia Sinica of Taiwan (M.A., J.G., J.P., T.G.S., F.M., A.C.) were circulated and rated by 48 investigators at the participating sites. The first set of vignettes was based exclusively on BD patients who had been prospectively followed in a specialty program and for whom detailed clinical information on the course of illness and treatment history was available. The second set of vignettes was heterogeneous and included patients treated in various settings, some with limited clinical details assessed cross-sectionally. Since raters had no prior knowledge of the rating scale, this design allowed us to estimate the impact of training on agreement and reliability of lithium response assessment. The rating procedure was performed from April 2009 to October 2012.
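The scoring rule given under "Assessment of Clinical Response to Lithium Treatment" (total score = A score minus the summed B criteria) can be illustrated with a minimal Python sketch. The class name `Rating`, its field names, and the example item values are hypothetical and chosen only for illustration; the thresholds B score ≤4 ("valid cases") and TS ≥7 (dichotomous response) are the ones named elsewhere in this report. The sketch does not model how individual items are scored or whether the scale bounds the total.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Rating:
    """One rater's scoring of one case (hypothetical container, not ConLiGen code)."""
    a_score: float                 # A criterion: degree of clinical improvement
    b_items: Tuple[float, ...]     # (B1, B2, B3, B4, B5) confounder items

    def __post_init__(self):
        if len(self.b_items) != 5:
            raise ValueError("expected five B criteria (B1..B5)")

    @property
    def b_score(self) -> float:
        # Total B score: sum of the five B criteria
        return sum(self.b_items)

    @property
    def total_score(self) -> float:
        # TS = A score minus B score, as described in the text
        return self.a_score - self.b_score

    def is_valid_case(self, b_max: float = 4) -> bool:
        # "Valid cases" are selected through a total B score <= 4
        return self.b_score <= b_max

    def is_responder(self, ts_cutoff: float = 7) -> bool:
        # Dichotomous definition of response: TS >= 7
        return self.total_score >= ts_cutoff


# Illustrative values only
example = Rating(a_score=9, b_items=(1, 0, 1, 0, 0))
print(example.b_score, example.total_score,
      example.is_valid_case(), example.is_responder())
```

Keeping the A score, the B score and the TS as separate properties mirrors the way the three quantities are analyzed separately in the reliability analysis.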

Table 1. Number of raters from the Consortium on Lithium Genetics (ConLiGen) centres participating in the two-stage case-vignette rating procedure for inter-rater reliability and agreement. https://doi.org/10.1371/journal.pone.0065636.t001

The degree of concordance of lithium response definition was assessed with Cohen’s kappa (κ) [29] and the intra-class correlation coefficient (ICC) [30]. These analytical methods were applied to the dichotomous and continuous definitions of lithium response, respectively. The κ statistics (multiple raters with two outcomes) were calculated with 95% confidence intervals (CI) for each cut-off point of the TS scale in the range from 3 (non-response to lithium) to 8 (full response to lithium). Interpretation of the strength of agreement was made according to Landis and Koch: poor (κ <0.00), slight (0.00–0.20), fair (0.21–0.40), moderate (0.41–0.60), substantial (0.61–0.80), almost perfect (0.81–1.00) [31]. The quantitative scores of the treatment response scale were analyzed in the first (ICC 1) and second (ICC 2) stage of ratings. Specifically, we analyzed the TS (weighted clinical improvement), the A score (uncorrected clinical improvement), the B score (quantification of confounders), and the A score when B score ≤4. The latter measure allows the identification of “valid cases” through selection on the B criteria: subjects with B score ≤4 are likely to have a clinical improvement causally related to lithium treatment. The ICC was estimated with two models: the two-way random effects model, which assumes that a random sample of K investigators is selected from a larger population and that each investigator rates all N targets (i.e., case vignettes); and the two-way mixed effects model, in which each target is rated by each of the same K investigators, who are the only ones of interest. For both models we calculated the single and average measure reliability.
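The agreement and reliability statistics described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' analysis code: the text names Cohen's κ applied to multiple raters, and the sketch uses Fleiss' κ as one common multi-rater generalization; the ICC functions use the standard ANOVA mean-square formulas for the single-measure two-way random effects (absolute agreement) and two-way mixed effects (consistency) models. All function names and the input layout (one row per vignette, one column per rater, every rater scoring every vignette) are hypothetical, and confidence intervals are omitted.

```python
def fleiss_kappa_binary(ts_ratings, cutoff):
    """Fleiss' kappa after dichotomizing TS ratings at `cutoff` (TS >= cutoff).

    ts_ratings: list of rows, one per vignette, each with the same number of raters.
    """
    n_raters = len(ts_ratings[0])
    n_cases = len(ts_ratings)
    positives = 0
    agreement = 0.0
    for row in ts_ratings:
        if len(row) != n_raters:
            raise ValueError("every vignette needs the same number of ratings")
        pos = sum(1 for score in row if score >= cutoff)
        neg = n_raters - pos
        positives += pos
        # Proportion of agreeing rater pairs for this vignette
        agreement += (pos * pos + neg * neg - n_raters) / (n_raters * (n_raters - 1))
    p_observed = agreement / n_cases
    p_pos = positives / (n_cases * n_raters)
    p_expected = p_pos ** 2 + (1 - p_pos) ** 2
    if p_expected == 1.0:
        return float("nan")  # no variation in ratings: kappa undefined
    return (p_observed - p_expected) / (1 - p_expected)


def landis_koch(kappa):
    """Strength-of-agreement label using the Landis and Koch bands quoted in the text."""
    if kappa < 0:
        return "poor"
    for upper, label in ((0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"


def icc_single(ratings):
    """Return (two-way random, two-way mixed) single-measure ICCs."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between vignettes
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    random_effects = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
    mixed_effects = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    return random_effects, mixed_effects


# The kappa values reported in this study map onto the quoted bands as follows
print(landis_koch(0.66), landis_koch(0.54))  # substantial moderate
```

To reproduce the per-cut-off analysis described in the text, `fleiss_kappa_binary` would be called once for each cut-off from 3 to 8 on the same matrix of TS ratings.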
Analysis of the Distributional Properties of the Treatment Response Scale
For the analysis of the distributional properties, we accessed TS data of 1,308 BD patients from the NIMH centralized ConLiGen phenotypic dataset.

Mixture analysis: frequentist and Bayesian approach. We used mixture analysis to test whether we could identify subgroups of patients according to the degree of response to lithium as expressed by TS. The choice of the mixture model that best fit the distribution of TS was made according to the Akaike and Schwarz Bayesian information criteria (AIC and BIC, respectively). Lower values of these two criteria indicated the most parsimonious model that best fit the empirical function of total score distribution. The analysis was performed using the “NMixEM” function implemented in the mixAK package [32] of R software (version 2.13.2). To verify the findings from the frequentist mixture analysis, we performed a Bayesian mixture analysis employing a minimum message length (MML) approach [33]. Specifically, we used the Snob software [34] to test whether the distribution results from a union of a number of “classes”, where the distributions “within-classes” are homogeneous and have a simple form, but vary significantly “between-classes”. The best fitting model was indicated as the most parsimonious model (i.e., the one with the lowest cost expressed in nits, the unit conventionally used to express message length). The analysis was performed using a measurement error equal to 2.5, empirically estimated by plotting the distribution of TS.

Cut-off point calculation. Cut-off points were derived using the theoretical TS function and calculating each data point’s probability of belonging to each class. Specifically, once the mixture model parameters were estimated, we calculated the posterior probability of any data point x belonging to the i-th class as

$$P(i \mid x) = \frac{\omega_i \, \mathcal{N}(x \mid \mu_i, \sigma_i)}{\sum_{j} \omega_j \, \mathcal{N}(x \mid \mu_j, \sigma_j)}$$

where ω is the weight, μ is the mean, and σ is the standard deviation of each class, and 𝒩 denotes the normal density. The resulting probabilities were then compared in order to establish which class the data point belonged to.
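The class-assignment step can be sketched as follows: for a normal mixture, the posterior probability of class i at a score x is the weighted normal density of that class divided by the sum of the weighted densities over all classes, and cut-off points fall where the most probable class changes. This Python sketch is illustrative only. The component parameters in `demo_components` are invented for the example and are not the fitted ConLiGen estimates; the function names and the grid of integer scores from 0 to 10 are likewise assumptions. The `aic` and `bic` helpers use the standard definitions of the two criteria and take a log-likelihood that would come from whatever fitting routine is used.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def posteriors(x, components):
    """Posterior probability of each class at x.

    components: list of (weight, mean, sd) tuples, one per mixture class.
    """
    joint = [w * normal_pdf(x, mu, sd) for w, mu, sd in components]
    total = sum(joint)
    return [j / total for j in joint]


def assign_class(x, components):
    """Index of the class with the highest posterior probability at x."""
    post = posteriors(x, components)
    return max(range(len(post)), key=post.__getitem__)


def cutoff_points(components, scores):
    """Scores at which the most probable class differs from that of the previous score."""
    ordered = sorted(scores)
    cuts = []
    previous = assign_class(ordered[0], components)
    for score in ordered[1:]:
        current = assign_class(score, components)
        if current != previous:
            cuts.append(score)
        previous = current
    return cuts


def aic(log_likelihood, n_params):
    """Akaike information criterion (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Schwarz Bayesian information criterion (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical three-class mixture (weights, means, SDs are made up for illustration)
demo_components = [(0.40, 2.0, 1.2), (0.35, 5.5, 1.0), (0.25, 8.5, 0.8)]
print(cutoff_points(demo_components, range(0, 11)))
```

Comparing models with different numbers of classes would then amount to fitting each model, passing its log-likelihood and parameter count to `aic` and `bic`, and retaining the model with the lowest values.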

Discussion

The purpose of this study was to assess the key phenotypic measures of response to lithium treatment in the large international collaborative Consortium on Lithium Genetics. To this end, two main analyses were carried out: the inter-rater agreement and reliability of lithium response definition across the ConLiGen participating sites, and the analysis of the distributional properties of the lithium treatment response scale [12]. We found that two definitions of lithium response, one dichotomous and the other continuous, had moderate to substantial inter-rater agreement and reliability. Specifically, the two-stage case-vignette inter-rater reliability analysis pointed to the measure of clinical improvement under lithium treatment expressed by the A score, with selection of “valid cases” through a total B score ≤4. This phenotypic definition of lithium response had a substantial inter-rater reliability in the first stage of ratings (ICC 1 = 0.71) with further improvement in the second stage (ICC 2 = 0.75). Regarding the dichotomous definition of lithium response, a scale TS ≥7 was identified as the best cut-off, as shown by inter-rater agreement κ scores in the first (κ = 0.66) and second (κ = 0.54) stages of case vignette ratings. The analysis of the distributional properties of the treatment response scale further supported this dichotomous definition. In addition, this same measure of lithium response has been previously proposed in several clinical and genetic papers [12], [13], [35], [36]. Some methodological considerations need to be made. For the analysis of the distributional properties, we applied mixture modeling, a method that has been extensively used in psychiatry for the identification of patient subgroups, reducing phenotypic heterogeneity and ultimately helping genetic research [37]–[39].
It should be noted that this method is exploratory and it does not identify the factors determining the differences between the identified subgroups [40]. A validation of the model can be obtained by comparison of the characteristics of each subgroup. In the ConLiGen study, we plan to use the clinical correlates of lithium response as external validators of the phenotypic measure suggested by the mixture modeling. Such analysis will test and compare the direction and magnitude of the association of a number of clinical variables with lithium response in its dichotomous and continuous definition. Notably, the analysis of inter-rater reliability and agreement has involved investigators belonging to different research groups with different clinical backgrounds and training. Nevertheless, the use of standardized case vignettes and the training procedures has produced moderate to substantial agreement in the assessment of lithium response. These findings are of importance, given the evidence that even in the context of inpatient unit settings the inter-rater agreement can be unsatisfactory [41]. We performed a two-stage case-vignettes procedure aimed at testing the effect of training on the assessment of lithium response. Contrary to our expectations, we only detected improvement in the inter-rater reliability of lithium response expressed by the A score and with selection of “valid cases” through a total B score ≤4, but not in that expressed by TS or A score. Arguably, the second set of vignettes described more complicated clinical cases with comorbidities, lack of compliance and multiple treatments, all factors that could have influenced the scoring of the B criteria. Indeed, the ICC for the total B score decreased noticeably in the second stage of ratings, implying an increased variability in rating that impacted the discrimination among cases [42]. This explanation is corroborated by the finding of the higher ICC 2 of A score with total B score ≤4. 
By applying this cut-off, we decreased the assessment variability, ultimately increasing the discrimination among cases. Further, these findings confirm that patients with short duration of lithium treatment, poor compliance, and concomitant medications are unlikely to be assessed reliably. This argues against the inclusion of such complex, non-standard cases in pharmacogenomic studies of lithium response. Finally, the higher inter-rater agreement and reliability found in the first set of vignettes suggest that the assessment of lithium response is reliable if sufficient clinical details are available. On the other hand, if the information is limited, additional rater training will be of little help. In conclusion, our findings support the use of two definitions of lithium response for the pharmacogenomic GWAS currently being performed by ConLiGen. Accurate phenotypic definitions of treatment response are crucial in pharmacogenomic studies [43], [44]. Heterogeneity in the phenotype definition of treatment response can be a problem, especially in the context of psychiatric disorders. In the absence of other reliable clinical measures of response to lithium, this study has suggested two plausible phenotypic definitions that await application and validation in other samples.

Author Contributions Conceived and designed the experiments: M. Alda TGS FJM MB MR. Performed the experiments: M. Manchia M. Alda RA JMA LB CEMB BTB FB SB CBP EB CVC ATAC CC S. Clarke PMC CD MDZ JRD BE PF LF MAF JF SG JG FSG PG OG R. Hashimoto JH R. Hoban SJ JPK LK TK JRK SKS SK PHK IK GL CL ML SGL CALJ M. Maj AM LM TM PBM FM PM AN MN TN CO UO NO RHP AP JBP DRE AR ER SR JKR MS PRS OKS BS FS MGS GS LRS CS JWS AS TS PS SKT AT AW DZ MB MR TGS M. Adli. Analyzed the data: M. Manchia M. Alda. Contributed reagents/materials/analysis tools: NA JMB S. Cichon SDD UH LH GAR GT JS NRW PPZ. Wrote the paper: M. Manchia M. Alda.