Conclusions Substantial variation was observed in the speed with which individual NHS general practices responded to warranted changes in clinical practice. Changes in prescribing behaviour were detected automatically and robustly. Detection of structural breaks using indicator saturation methods opens up new opportunities to improve patient care through audit and feedback by moving away from cross sectional analyses, and automatically identifying institutions that respond rapidly, or slowly, to warranted changes in clinical practice.

Results Substantial heterogeneity was found between institutions in both timing and steepness of change. The range of time delay before a change was implemented was large (interquartile range 2-14 months (median 8) for Cerazette, and 5-29 months (18) for UTI). Substantial heterogeneity was also seen in slope following a detected change (interquartile range 2-28% absolute reduction per month (median 9%) for Cerazette, and 1-8% (2%) for UTI). When changes were implemented, the magnitude of change showed substantially less heterogeneity (interquartile range 44-85% (median 66%) for Cerazette and 28-47% (38%) for UTI).

Main outcome measures In each practice the following were measured: the timing of the largest changes, steepness of the change slope (change in proportion per month), and magnitude of the change for two example time series (expiry of the Cerazette patent in 2012, leading to cheaper generic desogestrel alternatives becoming available; and a change in antibiotic prescribing guidelines after 2014, favouring nitrofurantoin over trimethoprim for uncomplicated urinary tract infection (UTI)).

Objectives To determine how clinicians vary in their response to new guidance on existing or new interventions, by measuring the timing and magnitude of change at healthcare institutions.

We therefore set out to determine how clinicians vary in the timing of their response to new guidance. To achieve this objective, we repurposed and adapted statistical break detection techniques based on indicator saturation for use in medical time series data. Here, we report the deployment of these methods to assess variation in speed of adoption for two examples of warranted change in clinical practice: firstly, the move from branded to generic versions of the oral contraceptive, desogestrel, in 2012, saving the health service about £10m (€11.1m; $12.3m) a year 23 ; and secondly, the change from trimethoprim to nitrofurantoin as the firstline antibiotic for treating uncomplicated urinary tract infection (UTI) at various time points after 2014.

Assessing variation between institutions in timing of implementation for new clinical behaviours requires a systematic and robust method to identify when institutions have made a change. As it is not feasible to manually review thousands of time series charts to determine when meaningful change has occurred, this review must be done computationally. Statistical methods for the detection of structural change (known as break detection) provide a robust method of detecting the timing of changes in time series data without imposing an intervention or change date a priori. 20 These techniques have previously been applied to a diverse range of applications, including economic and climate modelling. 21 22

Diffusion of innovations has received some attention in healthcare 4 but research so far has primarily focused on case studies, 5 narrative descriptions of clinicians’ responses to change in guidance, 6 interviews, 7 8 9 and theoretical frameworks. 10 Previous quantitative work assessing implementation of new practices has typically relied on measuring change at the level of a whole population, using techniques such as interrupted time series analysis 11 12 or static measures of variation in care at one point in time through atlases of variation and regression analyses. 13 14 15 16 17 18 19

Medicine is characterised by the development of new interventions, and new information on existing interventions. This progress requires that clinical practice changes in response to updated evidence on effectiveness, safety, and cost. The diffusion of innovation is a longstanding area of research, originating with 1950s work on agriculture 1 and antibiotics. 2 Previous work has largely focused on narrative descriptions, discussing the nature of the innovation (its relative advantage, compatibility, and complexity to implement); the channels through which the innovation is communicated; and the so-called social system that is involved in implementing the innovation. 3 Previous quantitative work has relied on the manual characterisation of individuals and organisations as either adopting, or not adopting, a new intervention. 1 2 Typically, the rate of adoption is variable over time, starting with a small number of initial early adopters followed by a large number of institutions rapidly adopting the change, and then followed by a slower rate while so-called laggards adopt the change over a longer period. 3

We run OpenPrescribing.net, an openly accessible data explorer for all NHS England primary care prescribing data, which receives a large volume of user feedback from professionals, patients, and the public. This feedback is used to refine and prioritise our informatics tools and research activities. Patients were not formally involved in developing this specific study design.

Data management was carried out using SQL (in Google BigQuery), Python, and R. Break detection was implemented using the R package gets. 20 Complete code and data are provided online on GitHub ( https://github.com/ebmdatalab/change_detection/releases/tag/0.1 ), and code is also available as a Python library ( https://pypi.org/project/change_detection/ ).

Multiple break points might be detected in one practice: we therefore limited the model to report the steepest contiguous segment contributing at least 50% to the total level change.

Slope: the steepness of the detected changes measures the pace of change per month within a practice (sudden or gradual) once change has begun.

Magnitude: the magnitude of change describes the extent to which each practice reduces the prescribing of the non-favoured drug treatment. This measure is calculated by subtracting the proportion of unfavourable prescribing at the end of the study time series from the proportion of unfavourable prescribing at the start time of the first detected change.

Timing: the timing of a change in behaviour is measured as the start of the steepest negative (downward) shift in a time trend of prescribing behaviour during the time series. This measure captures how long it takes a practice to begin to show a substantial change in behaviour in relation to a stimulus (in these examples, a medicine patent expiry and a change in clinical guidance).
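The three measures defined above can be illustrated with a minimal sketch that reads them off a fitted piecewise-linear series. This is not the authors' published code: the function name `describe_change` is hypothetical, each run of constant month-on-month change is treated as one segment, and the 50% rule is assumed to compare a segment's drop with the summed drop of all falling segments.

```python
from itertools import groupby


def describe_change(fitted, min_share=0.5):
    """Return (timing, slope, magnitude) for a fitted series, or None.

    `fitted` holds the fitted proportion for each month (piecewise linear).
    timing    = index of the month at which the reported decline starts
    slope     = absolute reduction in proportion per month on that segment
    magnitude = level at the start of the first decline minus the final level
    """
    # Month-on-month changes, rounded so equal slopes compare as equal.
    diffs = [round(b - a, 9) for a, b in zip(fitted, fitted[1:])]

    # Group contiguous months with the same change into linear segments.
    segments, start = [], 0
    for slope, run in groupby(diffs):
        length = len(list(run))
        segments.append((start, length, slope))
        start += length

    falling = [seg for seg in segments if seg[2] < 0]
    if not falling:
        return None  # no downward change detected

    # Assumption: a segment qualifies if it supplies at least `min_share`
    # of the total downward level change.
    total_drop = sum(-slope * length for _, length, slope in falling)
    eligible = [seg for seg in falling
                if -seg[2] * seg[1] >= min_share * total_drop]
    if not eligible:
        return None

    # Steepest qualifying segment = most negative slope.
    seg_start, _, seg_slope = min(eligible, key=lambda seg: seg[2])
    magnitude = fitted[falling[0][0]] - fitted[-1]
    return seg_start, -seg_slope, magnitude
```

For a practice that stays at 100% branded prescribing for six months, falls by 20 percentage points a month for four months, and then declines slowly, the sketch reports the steep segment rather than the slow tail.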

To assess whether the methods for break detection were operating as expected, graphs of the time series for each individual practice were manually inspected, and plotted along with the fitted regression model and detected changes. One hundred randomly sampled graphs from each time series were inspected in detail by two blinded researchers independently to ensure that the automatically detected break points overall reflected a true change in prescribing behaviour, with each giving a narrative description of any issues raised. All remaining graphs were rapidly reviewed to check for gross errors in automated detection.

We used trend indicator saturation, 20 a modified version of indicator saturation, 25 in each practice’s time series to determine any statistically significant change in prescribing behaviour. 25 We formulated the detection of breaks as a model selection problem where a time series regression model of the prescribing behaviour is saturated with a full set of step functions interacted with a linear time trend. We selected over these break functions at every point in time, removing all non-significant breaks at a chosen level of significance (in this case, P=0.000001) to tightly control the false positive rate. Step shift (or cliff-like) changes in behaviour can be approximated by a single breaking trend with a high coefficient on the slope while gradual, smooth transition behaviour 26 can be approximated through a series of multiple broken linear trends with smaller slope coefficients.
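To make the idea of saturating a model with broken trends concrete, the following sketch builds one candidate trend per month and shows how a small number of retained trends reproduces a cliff-like change. It does not reproduce the selection step performed by the gets package; the function names are hypothetical, and the parameterisation (a regressor that is zero up to the break month and then rises by one per month) is an assumption made for illustration.

```python
def broken_trend(start, n_months):
    """Regressor that is 0 up to month `start`, then rises by 1 per month."""
    return [max(0, t - start) for t in range(n_months)]


def saturating_set(n_months):
    """One candidate broken trend for every possible start month."""
    return [broken_trend(start, n_months) for start in range(n_months)]


def fitted_path(intercept, retained, n_months):
    """Fitted values from an intercept plus the retained broken trends.

    `retained` maps a break month to the change in slope at that month,
    standing in for the trends that survive model selection.
    """
    path = [intercept] * n_months
    for start, coefficient in retained.items():
        trend = broken_trend(start, n_months)
        path = [value + coefficient * x for value, x in zip(path, trend)]
    return path


# A steep fall starting after month 5, cancelled by an opposite break at month 9.
cliff = fitted_path(1.0, {5: -0.2, 9: 0.2}, n_months=12)
```

In this toy case two retained trends with large coefficients describe a steep fall followed by a plateau; a gradual transition would instead need several trends with smaller coefficients.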

We measured trimethoprim prescriptions as a proportion of total trimethoprim and nitrofurantoin prescriptions. A decrease in this proportion would correspond to an improvement in this measure. The time series for this measure ran from June 2013 to June 2018. This timing was chosen to centre the data on the time period surrounding the following interventions: the change in antibiotic prescribing guidance in October 2014; followed by the introduction of a “quality premium” financial incentive, which was announced in October 2016 and implemented in April 2017.

We measured the total proportion of desogestrel prescriptions that were prescribed as the branded Cerazette. A decrease in this proportion would correspond to an improvement in this measure. The time series for this measure ran from October 2010 to December 2015. This timing was chosen to centre the data on the time period surrounding the expiry of the Cerazette patent in December 2012. Before the patent expiry, it was still possible to prescribe desogestrel generically but all dispensing would be of branded Cerazette.
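Both prescribing measures share the same form: prescriptions of the less favoured product divided by prescriptions of both products in that month. A minimal sketch follows, with a hypothetical function name and argument names, and assuming that a month with no prescriptions of either product is recorded as missing.

```python
def monthly_proportion(unfavoured_items, favoured_items):
    """Proportion of prescribing that is the less favoured product.

    For example, Cerazette versus generic desogestrel, or trimethoprim
    versus nitrofurantoin. Returns None when the denominator is zero,
    so the month is treated as missing rather than as zero.
    """
    total = unfavoured_items + favoured_items
    if total == 0:
        return None
    return unfavoured_items / total
```

A falling value of this proportion over time corresponds to an improvement on either measure.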

The monthly prescribing datasets, published by the NHS Business Services Authority, contain one row for each treatment and dose, in each prescribing organisation in NHS primary care in England, describing the number of prescriptions issued and the total cost. To extract data on standard general practices, we limited them to institutions with setting code 4: general practices, 24 excluding all other organisations, such as dentists, prisons, and walk-in centres. We excluded data for a measure in any practice where the time series had more than half of its values missing. Missing values were either caused by small numbers leading to months where the denominator was 0 or by the practice not being open for part of the time series (eg, due to closing). We also excluded practices where prescribing did not vary during follow-up time because practices where the proportion of prescriptions stayed constant throughout the sample cannot have any change points.
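The two exclusion rules described above can be expressed as a short filter. This is a sketch only: the function name `keep_practice` is hypothetical, and missing months are assumed to be represented as None.

```python
def keep_practice(series):
    """Return True if a practice's monthly series should be analysed.

    `series` is a list of monthly proportions, with None for missing months.
    """
    missing = sum(1 for value in series if value is None)
    if missing > len(series) / 2:
        return False  # more than half of the values are missing

    observed = [value for value in series if value is not None]
    if len(set(observed)) <= 1:
        return False  # constant prescribing cannot contain a change point

    return True
```

A series with exactly half of its values missing is kept under this reading of "more than half".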

The level of heterogeneity in the magnitude of change was less than that for timing or slope of the change (third panels in fig 2 and fig 3 ). Heterogeneity was variable over time for the desogestrel measure ( fig 2 ) but uniform over time for the trimethoprim/nitrofurantoin measure ( fig 3 ).

The slope of the detected change was also highly variable between general practices (second panels in fig 2 and fig 3 ), especially for the desogestrel measure, which showed a greater than 10-fold difference in the slope of change between the practice at the 25th centile and the 75th centile ( table 1 ). For the desogestrel measure, the steepness of the change was substantially greater following expiry of the Cerazette patent, indicating that those practices changing later typically did so more rapidly. The mean slope of the detected change for the trimethoprim/nitrofurantoin measure was generally much lower, indicating slower change in practice; the mean slope only substantially increased following implementation of the quality premium financial incentive in April 2017. The relation between timing and slope of change is illustrated in supplementary figures S1 and S2.

Response of general practices to a change in antibiotic prescribing guidance (from trimethoprim to nitrofurantoin for uncomplicated urinary tract infection). Top panel=number of practices with their largest detected downward change in each month. Second panel=mean slope of the detected change for all practices changing in that month. Third panel=mean magnitude of detected change for all practices changing in that month. Bottom panel=median trimethoprim prescribing as a proportion of total trimethoprim and nitrofurantoin prescribing (solid line), along with deciles (dashed lines) and extreme percentiles (dotted lines)

Response of general practices to the patent expiry of Cerazette and subsequent price change for desogestrel. Top panel=number of practices with their largest detected downward change in each month. Second panel=mean slope of the detected change for all practices changing in that month. Third panel=mean magnitude of the detected change for all practices changing in that month. Bottom panel=median Cerazette prescribing as a proportion of all desogestrel prescribing (solid line), along with deciles (dashed lines) and extreme percentiles (dotted lines)

For both measures, heterogeneity was considerable between practices in the timing of their largest response to the warranted change in practice. The top panels of figure 2 and figure 3 show the distribution of the largest detected changes for each measure. Changes were detected across the whole range of the time series. Practices tended to respond more quickly, and with less variation, for the desogestrel measure than the UTI antibiotic measure. For the desogestrel measure, the largest peak in detected changes occurred a few months after expiry of the Cerazette patent. In contrast, relatively few changes were detected in the months following the UTI antibiotic guidance change, with the peak in detected changes not occurring until after the announcement of the quality premium financial incentive.

Summary of detected changes in prescription behaviour for two prescribing measures, across all general practices (move from branded to generic versions of desogestrel in 2012 and change from trimethoprim to nitrofurantoin as the firstline antibiotic for treating uncomplicated urinary tract infection (UTI) at various time points after 2014)

Table 1 summarises the detected heterogeneity in prescribing behaviour across all practices, for both measures, with summary statistics over the three estimated measures. For the desogestrel and UTI antibiotic measures, 1711 (22%) and 1380 (18%) practices showed no significant downward changes, respectively.

During the process of manual inspection of 200 randomly selected graphs, a bug was found and fixed whereby if the initial variance of the time series was very low (eg, if a practice prescribed 100% branded Cerazette for many months initially) the technique would become hypersensitive to change, leading to inappropriate detection. We fixed this problem by tweaking one of the parameters of the change detection algorithm away from the default (the maximum size of the block partitioning 20 ). The algorithm was otherwise found to be operating as expected: of 200 time series reviewed, we found two cases of suboptimal detection and four cases of arguable/borderline suboptimal detection. The time series examined, and manual checking datasheet, can be seen in supplementary files A and B.

Examples of practice time series and measured attributes for desogestrel change (from branded to generic). Proportion of Cerazette relative to total desogestrel prescribed of four representative general practices is shown in solid purple lines. Fitted model and detected breaks using trend indicator saturation are shown in pink dashed lines. Commencement of the largest negative shift is marked with a vertical dashed blue line; additional breaks are indicated by changes in the slope of the pink dashed line. The measured slope is highlighted in pink shaded areas, and the pre-break level and final level at the end of the sample are indicated by horizontal orange dotted lines

Figure 1 shows examples of practice time series for the desogestrel measure, illustrating the three indicators of change. The timings of detected breaks (the steepest substantial negative shift) are marked as a vertical dashed blue line. The segments over which the average slope is calculated are shaded in the figure. The magnitude of change is calculated as the difference between the horizontal dotted orange lines. Figure 1A shows a practice where a steep, cliff-like change is detected, followed by a change to a more gradual decline while figure 1B shows a single gradual detected change. Figure 1C shows a practice where an early gradual change is detected followed by a steeper change: as above, for our descriptive analysis, we report timing, slope, and magnitude for the break point contributing to the largest change in practice. For the practice in figure 1D , no changes were detected that reached the necessary significance level (P=0.000001).

A total of 8078 practices were included in the study overall; 259 practices were excluded from the desogestrel analyses and 398 from the UTI antibiotics analysis because of incomplete time series. One practice was removed from the desogestrel measure because every value was 1.0. Practices were dropped mainly because of missing values, which are typically a consequence of low prescribing volume. Excluded practices were typically much smaller: mean patient list size for excluded practices was 1861 for the desogestrel measure and 3408 for the UTI antibiotic measure (while the national mean list size was 7078).

Discussion

Summary The indicator saturation method was successfully implemented to detect meaningful changes in clinical practice. Among general practices in the English health system, we described substantial heterogeneity in the timing and slope of warranted changes in clinical practice following changes in price and clinical guidance on two commonly prescribed treatments: an oral contraceptive and the choice of antibiotic for UTI.

Interpretation The changes measured in this study were highly warranted from a cost effectiveness or clinical perspective, as illustrated by the fact that most practices eventually showed a substantial change in clinical practice. However, the distribution of the measures of timing, slope of change and, to a lesser extent, magnitude, showed high variation and skewness. While a large proportion of practices showed a significant shift away from branded Cerazette in early 2013, a quarter did not show their most substantial change for 14 months (February 2014), with the slowest 10% changing at least a further 6 months later (September 2014), exposing the health system to substantial avoidable costs. The spread of timing of changes was more pronounced for the trimethoprim/nitrofurantoin measure, with a quarter of practices not making their largest change until 29 months after the guidance was released and 10% not changing until at least 32 months after the release, exposing patients to suboptimal care. The slower dissemination of the antibiotic guidance could be because the guidance was less clear, with some clinical judgment involved, rather than “always prescribe the generic,” as was the case with desogestrel. This variation between individual general practices in how they responded to a new warranted change in clinical practice was not limited to the timing of when the change began; variation was also seen in the slope of the change, or how rapidly that change was implemented after the change began. For example, the highest quarter of practices for slope of response reduced their proportion of branded Cerazette prescribing swiftly, by at least 26% in one month while the lowest quarter of practices for slope of response reduced branded prescribing gradually, by less than 2% per month. We also saw some indication (fig 2, top and second panels) that practices implementing a change late tended to do so more rapidly than those who noticed the need for change earlier. 
This effect is perhaps due to an increased sense of urgency for practices that have noticed later. Regardless of the heterogeneity in timing and slope of change, the relative uniformity in the magnitude of change suggests that once practices implement a change, they are able to do so effectively, with most practices ultimately implementing a large change in practice.

Strengths and weaknesses Our data cover the complete prescribing data for all practices in England, not just a sample. The underlying data are highly accurate as they are based on prescription pharmacy claims used for very high tariff transactions within the health service, with all parties motivated to ensure complete and correct information. We accounted for variation in the prevalence of underlying conditions by measuring the proportion of “all” prescribing that is “undesirable” rather than, for example, the crude volume of “undesirable” prescribing (that is, we measured Cerazette as a proportion of all Cerazette and generic desogestrel prescribing and trimethoprim as a proportion of total trimethoprim and nitrofurantoin prescribing). The indicator saturation approach to detect breaks successfully detected change in prescribing behaviour and appears to be flexible across two different applications: the desogestrel measure had one unambiguous time point, after which prescribing generically was simply preferable; the nitrofurantoin/trimethoprim guidance, in contrast, was communicated to clinicians through various different routes at different times, and was a change in practice that required ongoing clinical judgment, because prescribing nitrofurantoin rather than trimethoprim might not always be correct for all patients.

Findings in context To our knowledge, this is the largest study conducted on diffusion of change in medical practice, by a substantial margin. The largest previous study monitored 95 practitioners in Denmark, covering a population of 490 000 citizens, compared with our study covering a population of 55 million. 5 This Danish study assessed only one crude outcome metric (time to first prescription of a new antibiotic) whereas we were able to harness novel computational methods to automatically detect more detailed changes in clinical practice, across many institutions (about 8000 practices), and for more complex and generalisable clinical behaviours than a first ever prescription of a new medicine. The previous absence of computational techniques, such as indicator saturation, explains why most previous work on diffusion of change is either small scale or focused purely on narrative descriptions (as discussed in the Introduction): without automation, it is extremely labour intensive to manually categorise whether, and when, a large number of institutions have modified their clinical practice in response to a warranted change, across a large number of patients.

Policy implications We can identify two sets of policy implications from this work: the fact that substantial heterogeneity was detected in response to warranted changes in clinical practice; and the potential for better metrics and feedback to clinicians through the application of break detection methodology to clinical data.

Variation in speed of implementation For both of the prescribing measures studied in this analysis, we observed substantial heterogeneity in timing and slope of warranted change but almost all practices ultimately showed substantial changes in clinical practice. In lay terms, most practices changed their behaviour but some changed much later than others; and some practices showed rapid, coordinated change, while others changed only gradually. This heterogeneity is problematic: it exposes health systems to substantial avoidable costs and exposes patients to suboptimal clinical care. Although expecting all practices to respond immediately and adopt optimal prescription behaviour might be unrealistic, the fact that some practices changed both early and rapidly suggests that rapid timely change is possible. Further work is required to explore the reasons for some practices being slow to implement prescribing changes. We have previously written on the importance, and comparative neglect, of systems to disseminate knowledge to clinicians and patients, and social structures to audit and assess the implementation of warranted changes in practice. 6 27

Novel applications of indicator saturation The automation of change detection also presents new opportunities for better use of data in audit and feedback on clinical practice, which has been shown in systematic review data to elicit modest but cost effective improvements in clinical practice. 28 Such audits currently rely on a static snapshot of clinical practice. Indicator saturation methods raise the potential for more sophisticated metrics, for example, describing whether an individual clinician or institution tends to respond rapidly or slowly to changes in price, evidence, or safety across a range of different elements of clinical practice. This in turn could improve the targeting of resources to support those who are responding slowly across a range of warranted changes. Automated change detection also permits new approaches to interrogate which interventions are most impactful at eliciting change in clinical practice, both in terms of timing for initial change and rapid coordinated change. For example, in figure 3, the financial incentive is clearly associated with the largest number of practices initiating change in one month. These new methods might also help to distinguish between warranted and unwarranted variation in care, itself an ongoing challenge for all work on variation in clinical practice: specifically, whether an observed variation is driven by variation in patients’ clinical needs and preferences (warranted variation) or variation in their clinicians’ knowledge, preferences, and service availability (unwarranted variation). A clinician presented with evidence that they are currently an outlier for a new desired change in clinical practice might argue that their patients are unusual and warrant clinical decisions that deviate from best practice guidelines.
However, if indicator saturation methods show previous warranted changes in clinical practice that were ultimately implemented by this clinician, but three years later than their peers, then this is stronger evidence that current deviation from best practice is driven by the clinician’s knowledge or choices, rather than their patients’ needs or preferences. Lastly, the potential to automate detection in timing and slope of change using indicator saturation presents an immediate opportunity to produce automated metrics on timing of change for individual clinicians and institutions. OpenPrescribing.net is an openly accessible service for detailed exploration of NHS England prescribing data by practice and by month, run by our team, with 14 000 unique users each month. We are currently developing novel measures driven by indicator saturation to describe whether practices and clinical commissioning groups overall tend to implement warranted changes in clinical practice earlier or later than their peers, for deployment and impact evaluation in our large pool of users.

Redeploying this method elsewhere This method is highly flexible in terms of the type and quality of data that it can be applied to. Both noisy data and data with missing values can be used, as shown by the examples described in this study. Here, we showed time series with a data point once per month, with a total of about 60 data points, but data of any length and frequency can also be used. In this study, we excluded practices where more than half of the data points were missing because it was unlikely that meaningful changes would be detected and because the false positive rate could be higher, but missing data can be handled flexibly according to the specific use case. We chose a conservative P value of 0.000001 to ensure a low false positive rate and to increase confidence in the detected breaks actually reflecting underlying changes. In much longer time series (eg, in a sample of 1000 observations), we would expect 1000×0.000001=0.001 changes to be detected spuriously on average. However, simulation with small samples (<200) shows that the false positive rate of trend indicator saturation can lie above the chosen P value. 20 Consequently, for time series with more data points, a higher P value might be more appropriate. We implemented break detection in this study using the R package gets; it can also be implemented in the econometric package PcGive. 29 Although not demonstrated in our examples, because the break detection approach is based on regression modelling, it is straightforward to include additional covariates such as seasonal cycles, autoregressive lags to capture persistence, or additional static explanatory variables. 20 We chose to present summary statistics for the largest detected change because it represented the most important and coordinated change.
However, our break detection approach could be used in different ways for different clinical and research problems—for example, focusing on the first detected change in clinical practice or the first change to reach a prespecified threshold, depending on specific needs.
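The false positive arithmetic in this section can be checked with a short sketch. The function names are hypothetical. The calculation assumes that each time point is tested at the nominal P value and, for the second function, that tests are independent; the simulation results cited above suggest that the realised rate in small samples can be higher than this nominal figure.

```python
def expected_spurious_breaks(n_observations, p_value):
    """Expected number of falsely retained breaks at the nominal level."""
    return n_observations * p_value


def chance_of_any_spurious_break(n_observations, p_value):
    """Probability of at least one false break, assuming independent tests."""
    return 1 - (1 - p_value) ** n_observations


# Roughly 60 monthly points per practice in this study, at P=0.000001.
per_practice = expected_spurious_breaks(60, 0.000001)
# The longer series discussed in the text: 1000 observations.
long_series = expected_spurious_breaks(1000, 0.000001)
```

Multiplying the per-practice figure by the number of practices analysed gives a rough nominal expectation for spurious breaks across a whole dataset, under the same assumptions.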