Readers’ note This article is a living systematic review that will be updated to reflect emerging evidence. Updates may occur for up to two years from the date of original publication. This version is update 2 of the original article published on 7 April 2020 (BMJ 2020;369:m1328), and previous updates can be found as data supplements (https://www.bmj.com/content/369/bmj.m1328/related#datasupp).

Conclusion Prediction models for covid-19 are quickly entering the academic literature to support medical decision making at a time when they are urgently needed. This review indicates that proposed models are poorly reported, at high risk of bias, and their reported performance is probably optimistic. Hence, we do not recommend any of these reported prediction models for use in current practice. Immediate sharing of well documented individual participant data from covid-19 studies and collaboration are urgently needed to develop more rigorous prediction models, and validate promising ones. The predictors identified in included models should be considered as candidate predictors for new models. Methodological guidance should be followed because unreliable predictions could cause more harm than benefit in guiding clinical decisions. Finally, studies should adhere to the TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) reporting guideline.

Results 14 217 titles were screened, and 107 studies describing 145 prediction models were included. The review identified four models for identifying people at risk in the general population; 91 diagnostic models for detecting covid-19 (60 of which were based on medical imaging and nine of which aimed to diagnose disease severity); and 50 prognostic models for predicting mortality risk, progression to severe disease, intensive care unit admission, ventilation, intubation, or length of hospital stay. The most frequently reported predictors of diagnosis and prognosis of covid-19 are age, body temperature, lymphocyte count, and lung imaging features. Flu-like symptoms and neutrophil count are frequently predictive in diagnostic models, while comorbidities, sex, C reactive protein, and creatinine are frequent prognostic factors. C index estimates ranged from 0.73 to 0.81 in prediction models for the general population, from 0.65 to more than 0.99 in diagnostic models, and from 0.68 to 0.99 in prognostic models. All models were rated at high risk of bias, mostly because of non-representative selection of control patients, exclusion of patients who had not experienced the event of interest by the end of the study, high risk of model overfitting, and vague reporting. Most reports did not include a description of the study population or the intended use of the model, and calibration of the model predictions was rarely assessed.

Objective To review and appraise the validity and usefulness of published and preprint reports of prediction models for diagnosing coronavirus disease 2019 (covid-19) in patients with suspected infection, for prognosis of patients with covid-19, and for detecting people in the general population at increased risk of becoming infected with covid-19 or being admitted to hospital with the disease.

We aimed to systematically review and critically appraise all currently available prediction models for covid-19, in particular models to predict the risk of developing covid-19 or being admitted to hospital with covid-19, models to predict the presence of covid-19 in patients with suspected infection, and models to predict the prognosis or course of infection in patients with covid-19. We included model development and external validation studies. This living systematic review, with periodic updates, is being conducted by the COVID-PRECISE (Precise Risk Estimation to optimise covid-19 Care for Infected or Suspected patients in diverse sEttings) group in collaboration with the Cochrane Prognosis Methods Group.

To mitigate the burden on the healthcare system, while also providing the best possible care for patients, efficient diagnosis and information on the prognosis of the disease is needed. Prediction models that combine several variables or features to estimate the risk of people being infected or experiencing a poor outcome from the infection could assist medical staff in triaging patients when allocating limited healthcare resources. Models ranging from rule based scoring systems to advanced machine learning models (deep learning) have been proposed and published in response to a call to share relevant covid-19 research findings rapidly and openly to inform the public health response and help save lives. 5 Many of these prediction models are published in open access repositories, ahead of peer review.

The novel coronavirus disease 2019 (covid-19) presents an important and urgent threat to global health. Since the outbreak in early December 2019 in the Hubei province of the People’s Republic of China, the number of patients confirmed to have the disease has exceeded 8 963 350 in 188 countries, and the number of people infected is probably much higher. More than 468 330 people have died from covid-19 (up to 22 June 2020). 1 Despite public health responses aimed at containing the disease and delaying the spread, several countries have been confronted with a critical care crisis, and more countries could follow. 2 3 4 Outbreaks lead to important increases in the demand for hospital beds and shortage of medical equipment, while medical staff themselves could also get infected.

It was not possible to involve patients or the public in the design, conduct, or reporting of our research. The study protocol and preliminary results are publicly available on https://osf.io/ehc47/ and medRxiv.

Data extraction of included articles was done by two independent reviewers (from LW, BVC, GSC, TPAD, MCH, GH, KGMM, RDR, ES, LJMS, EWS, KIES, CW, AL, JM, TT, JAAD, KL, JBR, LH, CS, MS, MCH, NS, NK, SMJvK, JCS, PD, CLAN, RW, GPM, IT, JYV, DLD, JW, FSvR, PH, VMTdJ, and MvS). Reviewers used a standardised data extraction form based on the CHARMS (critical appraisal and data extraction for systematic reviews of prediction modelling studies) checklist 14 and PROBAST (prediction model risk of bias assessment tool) for assessing the reported prediction models. 15 We sought to extract each model’s predictive performance by using whatever measures were presented. These measures included any summaries of discrimination (the extent to which predicted risks discriminate between participants with and without the outcome), and calibration (the extent to which predicted risks correspond to observed risks) as recommended in the TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) statement. 16 Discrimination is often quantified by the C index (C index=1 if the model discriminates perfectly; C index=0.5 if discrimination is no better than chance). Calibration is often quantified by the calibration intercept (which is zero when the risks are not systematically overestimated or underestimated) and calibration slope (which is one if the predicted risks are not too extreme or too moderate). 17 We focused on performance statistics as estimated from the strongest available form of validation (in order of strength: external (evaluation in an independent database), internal (bootstrap validation, cross validation, random training test splits, temporal splits), apparent (evaluation by using exactly the same data used for development)). Any discrepancies in data extraction were discussed between reviewers, and remaining conflicts were resolved by LW and MvS. The online supplementary material provides details on data extraction. We considered aspects of PRISMA (preferred reporting items for systematic reviews and meta-analyses) 18 and TRIPOD 16 in reporting our article.
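To make these two measures concrete, here is a minimal sketch (simulated data; the library calls and values are illustrative assumptions, not taken from any reviewed study) of estimating a C index and a calibration intercept and slope in Python:

```python
# Illustrative only: simulated outcomes and deliberately "too moderate" predictions.
import numpy as np
import statsmodels.api as sm
from scipy.special import expit, logit
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500
lp_true = rng.normal(0, 1, n)               # true linear predictor
y = rng.binomial(1, expit(lp_true))         # observed binary outcomes
p = expit(0.5 * lp_true)                    # predicted risks (too moderate on purpose)

# Discrimination: for a binary outcome the C index equals the area under
# the ROC curve (1 = perfect discrimination, 0.5 = no better than chance).
c_index = roc_auc_score(y, p)

# Calibration: regress the outcome on the logit of the predicted risk;
# ideally the intercept is ~0 and the slope ~1 (here the slope will exceed 1).
fit = sm.Logit(y, sm.add_constant(logit(p))).fit(disp=0)
intercept, slope = fit.params
print(round(c_index, 2), round(intercept, 2), round(slope, 2))
```

Because the simulated predictions are less extreme than the true risks, the estimated calibration slope comes out above 1, illustrating the "too moderate" pattern described above.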

We searched databases repeatedly up to 5 May 2020 (supplementary table 1). All studies were considered, regardless of language or publication status (preprint or peer reviewed articles; updates of preprints are only included and reassessed after publication in a peer reviewed journal). We included studies if they developed or validated a multivariable model or scoring system, based on individual participant level data, to predict any covid-19 related outcome. These models included three types of prediction models: diagnostic models for predicting the presence or severity of covid-19 in patients with suspected infection; prognostic models for predicting the course of infection in patients with covid-19; and prediction models to identify people at increased risk of covid-19 in the general population. No restrictions were made on the setting (eg, inpatients, outpatients, or general population), prediction horizon (how far ahead the model predicts), included predictors, or outcomes. Epidemiological studies that aimed to model disease transmission or fatality rates, diagnostic test accuracy, and predictor finding studies were excluded. Starting with the second update, retrieved records were initially screened by a text analysis tool developed by artificial intelligence to prioritise sensitivity (supplementary material). Titles, abstracts, and full texts were screened for eligibility in duplicate by independent reviewers (pairs from LW, BVC, MvS) using EPPI-Reviewer, 13 and discrepancies were resolved through discussion.

We searched PubMed and Embase through Ovid, bioRxiv, medRxiv, and arXiv for research on covid-19 published after 3 January 2020. We used the publicly available publication list of the covid-19 living systematic review. 6 This list contains studies on covid-19 published on PubMed and Embase through Ovid, bioRxiv, and medRxiv, and is continuously updated. We validated whether the list is fit for purpose (online supplementary material) and further supplemented it with studies on covid-19 retrieved from arXiv. The online supplementary material presents the search strings. Additionally, we contacted authors for studies that were not publicly available at the time of the search, 7 8 and included studies that were publicly available but not on the living systematic review 6 list at the time of our search. 9 10 11 12

One study presented a small external validation (27 participants) that reported satisfactory predictive performance of a model originally developed for avian influenza H7N9 pneumonia. However, patients who had not recovered by the end of the study period were excluded, which again led to selection bias. 23 Another study was a small scale external validation study (78 participants) of an existing severity score for lung computed tomography images, with satisfactory reported discrimination. 54 Three studies validated existing early warning or severity scores to predict in-hospital mortality or deterioration. 85 96 108 These had satisfactory discrimination but fewer than the recommended number of events for validation 137 138 or unclear sample sizes, excluded patients who remained in hospital at the end of the study period, or had an unclear study design.

Twenty five models were developed and externally validated in the same study (in an independent dataset, excluding random training test splits and temporal splits). 7 12 26 42 43 51 52 59 67 77 81 83 84 91 95 100 102 110 112 113 116 119 However, for 11 of these models, the datasets used for the external validation were probably not representative of the target population, 7 12 26 42 59 91 100 102 116 and one study used data from before the covid-19 crisis. 113 Consequently, predictive performance could differ if the models are applied in the targeted population. One study did not report the performance statistics commonly used for prognosis (discrimination, calibration). 42 Gozes, 52 Fu, 51 Chassagnon, 77 Hu, 84 Kurstjens, 95 and Vaid 112 reported satisfactory predictive performance on an external validation set, but it is unclear how the data for the external validation were collected (eg, whether the patients were consecutive) and whether they are representative. Wang, 43 Barda, 67 Guo, 83 Tordjman, 110 and Gong 119 obtained satisfactory discrimination on probably unbiased validation datasets, but each of these had fewer than the recommended number of events for external validation (100). 137 138 Diaz-Quijano externally validated a diagnostic model in a large registry with reasonable discrimination, but many patients had to be excluded because no polymerase chain reaction (PCR) testing was performed. 81

All but one of these studies 50 were at high risk of bias for the analysis domain ( table 4 ). Many studies had small sample sizes ( table 1 , table 2 , table 3 ), which led to an increased risk of overfitting, particularly if complex modelling strategies were used. Three studies did not report the predictive performance of the developed model, and four studies reported only the apparent performance (the performance with exactly the same data used to develop the model, without adjustment for optimism owing to potential overfitting). Only 13 studies assessed calibration, 7 12 22 43 50 67 69 78 83 108 116 117 119 but the method to check calibration was probably suboptimal in two studies. 12 119
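To illustrate what adjustment for optimism involves, the following is a minimal sketch (simulated data and an ordinary logistic model; an illustrative assumption, not the approach of any particular included study) of bootstrap optimism correction of an apparent C index:

```python
# Illustrative only: bootstrap optimism correction for the apparent C index.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))                      # hypothetical predictors
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))  # hypothetical outcome

model = LogisticRegression().fit(X, y)
apparent = roc_auc_score(y, model.predict_proba(X)[:, 1])  # apparent performance

optimism = []
for _ in range(200):
    idx = rng.integers(0, n, n)                  # bootstrap resample
    m = LogisticRegression().fit(X[idx], y[idx])
    auc_boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
    auc_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])
    optimism.append(auc_boot - auc_orig)         # resample minus original performance

corrected = apparent - np.mean(optimism)         # optimism adjusted C index
print(round(apparent, 3), round(corrected, 3))
```

The corrected estimate is lower than the apparent one, which is exactly the optimism that studies reporting only apparent performance fail to account for.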

Fifty three of the 107 studies had a high risk of bias for the participants domain ( table 4 ), which indicates that the participants enrolled in the studies might not be representative of the models’ targeted populations. Unclear reporting on the inclusion of participants prohibited a risk of bias assessment in 26 studies. Fifteen of the 107 studies had a high risk of bias for the predictor domain, which indicates that predictors were not available at the models’ intended time of use, not clearly defined, or influenced by the outcome measurement. One diagnostic imaging study used a simple scoring rule and was rated at low risk of bias for the predictor domain. The diagnostic studies that used medical images as predictors in artificial intelligence models were all scored as unclear on the predictor domain. The publications often lacked clear information on the preprocessing steps (eg, cropping of images). Moreover, complex machine learning algorithms transform images into predictors in a complex way, which makes it challenging to fully apply the PROBAST predictors section to such imaging studies. Most studies used outcomes that are easy to assess (eg, death, presence of covid-19 by laboratory confirmation). Nonetheless, there was cause for concern about bias induced by the outcome measurement in 19 studies, for example owing to the use of subjective or proxy outcomes (eg, non covid-19 severe respiratory infections).

Study participants were often excluded because they did not develop the outcome at the end of the study period but were still in follow-up (that is, they were in hospital but had not recovered or died), yielding a highly selected study sample. 7 21 22 23 44 96 98 100 Additionally, only six studies accounted for censoring by using Cox regression 20 42 70 83 88 or competing risk models. 62 Some studies used the last available predictor measurement from electronic health records (rather than measuring the predictor value at the time when the model was intended for use). 22 67 100
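For illustration, here is a minimal sketch (simulated data; the lifelines package and all parameter values are assumptions for the example) of accounting for administrative censoring with a Cox model rather than excluding patients who are still in follow-up:

```python
# Illustrative only: Cox regression with administrative censoring at day 21,
# instead of dropping patients who have neither died nor recovered by then.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 100
age = rng.normal(60, 10, n)                                   # hypothetical predictor
event_time = rng.exponential(30 / np.exp(0.03 * (age - 60)))  # hypothetical time to death
admin_end = 21                                                # end of study follow-up (days)

df = pd.DataFrame({
    "age": age,
    "time": np.minimum(event_time, admin_end),                # observed follow-up time
    "died": (event_time <= admin_end).astype(int),            # 0 = administratively censored
})

CoxPHFitter().fit(df, duration_col="time", event_col="died").print_summary()
```

Patients still in hospital at day 21 contribute follow-up time as censored observations rather than being excluded, avoiding the selection bias described above.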

Generally, studies did not clearly report which patients had imaging during routine clinical care, and it was unclear whether the selection of controls was made from the target population (that is, patients with suspected covid-19). Often studies did not clearly report how regions of interest were annotated. Images were sometimes annotated by only one scorer without quality control. 26 28 47 52 55 91 92 93 Careful descriptions of model specification and subsequent estimation were lacking, which challenges the transparency and reproducibility of the models. Studies used different deep learning architectures, some established and others specifically designed, without benchmarking the chosen architecture against others.

Controls are probably not representative of the target population for a diagnostic model (eg, controls for a screening model had viral pneumonia). 12 41 45 78 102 The test used to determine the outcome varied between participants, 12 41 95 or one of the predictors (eg, fever) was part of the outcome definition. 10

These models were based on proxy outcomes to predict covid-19 related risks, such as the presence of, or hospital admission for, severe respiratory disease, in the absence of data from patients with covid-19. 8 90

All studies were at high risk of bias according to assessment with PROBAST ( table 1 , table 2 , and table 3 ), which suggests that their predictive performance when used in practice is probably lower than that reported. Therefore, we have cause for concern that the predictions of the proposed models are unreliable when used in other people. Box 2 gives details on common causes for risk of bias for each type of model.

Reported C indices for other outcomes varied between 0.72 and 0.96. Singh et al and Zhang et al also evaluated calibration externally (in new patients). Singh et al showed that the Epic Deterioration Index overestimated the risk of a poor outcome, while the poor outcome model by Zhang et al underestimated the risk of a poor outcome. 108 116

The studies that developed models to predict progression to a severe or critical state reported C indices between 0.73 and 0.99. Three of these studies also reported good calibration, but this was evaluated internally (eg, bootstrapped) 88 or in an unclear way. 83 119

Studies that predicted mortality reported C indices between 0.68 and 0.98. Some studies also evaluated calibration. 7 67 116 When applied to new patients, the model by Xie et al yielded probabilities of mortality that were too high for low risk patients and too low for high risk patients (calibration slope >1), despite excellent discrimination. 7 The mortality model by Zhang et al also showed miscalibrated (overfitted and underestimated) risks at external validation, 116 while the model by Barda et al showed underfitting. 67

Of these models, 23 estimated mortality risk and eight aimed to predict progression to a severe or critical state ( table 3 ). The remaining studies used other outcomes (single or as part of a composite) including recovery, length of hospital stay, intensive care unit admission, intubation, (duration of) mechanical ventilation, and acute respiratory distress syndrome. One study used data from 2015 to 2019 to predict mortality and prolonged assisted mechanical ventilation (as a non-covid-19 proxy outcome). 113

We identified 50 prognostic models ( table 3 ) for patients with a diagnosis of covid-19. The intended use of these models (that is, when to use them, and for whom) was often not clearly described. Prediction horizons varied between one and 30 days, but were often unspecified.

Sixty prediction models were proposed to support the diagnosis of covid-19 or covid-19 pneumonia (and some also to monitor progression) based on images. Most studies used computed tomography images or chest radiographs. Others used spectrograms of cough sounds 53 and lung ultrasound. 73 The predictive performance varied widely, with estimated C index values ranging from 0.81 to more than 0.99.

Nine studies aimed to diagnose severe disease in patients with covid-19: eight in adults with covid-19, with reported C indices between 0.80 and 0.99, and one in paediatric patients, with reported perfect performance. 25 Predictors of severe covid-19 used more than once were comorbidities, liver enzymes, C reactive protein, imaging features, and neutrophil count.

We identified 22 multivariable models to diagnose covid-19. Most models targeted patients with suspected covid-19. Reported C index values ranged between 0.65 and 0.99. A few models also evaluated calibration and reported good results. 69 78 117 The most frequently used diagnostic predictors (at least 10 times) were flu-like signs and symptoms (eg, shiver, fatigue), imaging features (eg, pneumonia signs on computed tomography scan), age, body temperature, lymphocyte count, and neutrophil count ( table 2 ).

We identified four models that predicted risk of covid-19 in the general population. Three models from one study used hospital admission for non-tuberculosis pneumonia, influenza, acute bronchitis, or upper respiratory tract infections as proxy outcomes in a dataset without any patients with covid-19. 8 Among the predictors were age, sex, previous hospital admissions, comorbidity data, and social determinants of health. The study reported C indices of 0.73, 0.81, and 0.81. A fourth model used deep learning on thermal videos from the faces of people wearing facemasks to determine abnormal breathing (not covid related) with a reported sensitivity of 80%. 90

Several studies made their code available on GitHub. 8 11 34 35 38 47 55 65 66 67 68 70 73 86 92 98 101 104 105 109 Seventy four studies did not include any usable equation, format, code, or reference for use or validation of their prediction model.

To assist in the prognosis of mortality, a nomogram, 7 a decision tree, 22 a score system, 70 online tools, 80 84 96 98 131 132 133 134 and a computed tomography based scoring rule 23 are available in the articles. Other online tools predict in-hospital death and the need for prolonged mechanical ventilation, 113 135 or in-hospital death and a composite of poor outcomes. 116 136 Additionally, nomograms, 88 119 sum scores, 83 88 and a model equation 60 are available to predict progression to severe covid-19.

Five artificial intelligence models to assist with diagnosis based on medical images are available through web applications. 24 27 30 73 91 126 127 128 129 130 One model is deployed in 16 hospitals, but the authors do not provide any usable tools in their study. 33 Two papers include a severity scoring system to classify patients based on images. 54 72

Several sum scores 31 95 110 117 and model equations 81 102 are available to support the diagnosis. Graphical diagnostic aids include nomograms 43 78 117 and a decision tree. 74 The “COVID-19 diagnosis aid” app is available on iOS and Android devices to diagnose covid-19 in asymptomatic patients and those with suspected disease. 12 Additionally, online tools are available. 10 45 74 95 123 124 125 Classification in terms of disease severity can be done using a published equation. 114 A decision tree to detect severe disease in paediatric patients with confirmed covid-19 is also available in an article. 25

Several studies presented their models in a format for use in clinical practice. However, because all models were at high risk of bias, we do not recommend their routine use before they are properly externally validated.

Table 1 , table 2 , and table 3 give an overview of the 145 prediction models reported in the 107 identified studies. Supplementary table 2 provides modelling details and box 1 discusses the availability of models in a format for use in clinical practice.

Among the studies that developed prognostic models to predict mortality risk in people with confirmed or suspected infection, the percentage of deaths varied between 1% and 59% ( table 3 ). This wide variation is partly because of substantial sampling bias caused by studies excluding participants who still had the disease at the end of the study period (that is, they had neither recovered nor died). 7 21 22 23 44 96 98 100 Additionally, length of follow-up could have varied between studies (but was rarely reported), and there might be local and temporal variation in how people were diagnosed as having covid-19 or were admitted to the hospital (and therefore recruited for the studies). Among the diagnostic model studies, only nine reported on the prevalence of covid-19 and used a cross sectional or cohort design; the prevalence varied between 17% and 79% ( table 2 ). Because 58 diagnostic studies used either case-control sampling or an unclear method of data collection, the prevalence in these diagnostic studies might not have been representative of their target population.

Based on 59 studies that reported study dates, data were collected between 8 December 2019 and 21 April 2020. Four studies reported median follow-up time (4.5, 8.4, 15, and 18 days), 20 37 83 108 while another study reported a follow-up of at least five days. 42 Some centres provided data to multiple studies, and several studies used open GitHub 120 or Kaggle 121 data repositories (version or date of access often unspecified), so it was unclear how much these datasets overlapped across our identified studies (supplementary table 2). One study 25 developed prediction models for use in paediatric patients. The median age in studies on adults varied from 34 to 68 years, and the proportion of men varied from 35% to 75%, although this information was often not reported at all (supplementary table 2).

Forty five studies used data on patients with covid-19 from China (supplementary table 2), six from Italy, 32 39 72 74 76 79 three from Brazil, 69 81 109 three from France, 71 77 110 three from the United States, 96 108 112 two from South Korea, 63 80 one from Belgium, 82 one from the Netherlands, 95 one from the United Kingdom, 75 one from Israel, 67 one from Mexico, 70 and one from Singapore. 40 Twenty two studies used international data (supplementary table 2) and two studies used simulated data. 35 41 Three studies used proxy data to estimate covid-19 related risks (eg, Medicare claims data from 2015 to 2016). 8 90 113 Twelve studies were not clear on the origin of covid-19 data (supplementary table 2).

Discussion

In this systematic review of prediction models related to the covid-19 pandemic, we identified and critically appraised 107 studies that described 145 models. These prediction models can be divided into three categories: models for the general population to predict the risk of having covid-19 or being admitted to hospital for covid-19; models to support the diagnosis of covid-19 in patients with suspected infection; and models to support the prognostication of patients with covid-19. All models reported moderate to excellent predictive performance, but all were appraised to have high risk of bias owing to a combination of poor reporting and poor methodological conduct for participant selection, predictor description, and statistical methods used. Models were developed on data from different countries, but the majority used data from China or public international data repositories. With few exceptions, the available sample sizes and number of events for the outcomes of interest were limited. This is a well known problem when building prediction models and increases the risk of overfitting the model. 139 A high risk of bias implies that the performance of these models in new samples will probably be worse than that reported by the researchers. Therefore, the estimated C indices, often close to 1 and indicating near perfect discrimination, are probably optimistic. The majority of studies developed new models; only 27 carried out an external validation, and calibration was rarely assessed.

We reviewed 57 studies that used advanced machine learning methodology on medical images to diagnose covid-19, covid-19 related pneumonia, or to assist in segmentation of lung images. The predictive performance measures showed a high to almost perfect ability to identify covid-19, although these models and their evaluations also had a high risk of bias, notably because of poor reporting and an artificial mix of patients with and without covid-19. Therefore, we do not recommend any of the 145 identified prediction models to be used in practice.

Challenges and opportunities The main aim of prediction models is to support medical decision making. Therefore, it is vital to identify a target population in which predictions serve a clinical need, and a representative dataset (preferably comprising consecutive patients) on which the prediction model can be developed and validated. This target population must also be carefully described so that the performance of the developed or validated model can be appraised in context, and users know which people the model applies to when making predictions. Unfortunately, the studies included in our systematic review often lacked an adequate description of the study population, which leaves users of these models in doubt about the models’ applicability. Although we recognise that all studies were done under severe time constraints, we recommend that any studies currently in preprint and all future studies should adhere to the TRIPOD reporting guideline 16 to improve the description of their study population and their modelling choices. TRIPOD translations (eg, in Chinese and Japanese) are also available at https://www.tripod-statement.org.

A better description of the study population could also help us understand the observed variability in the reported outcomes across studies, such as covid-19 related mortality and covid-19 prevalence. The variability in prevalence could partly reflect different diagnostic standards across studies. Note that the majority of diagnostic models use viral nucleic acid test results as the gold standard, which can have unacceptable false negative rates.

Covid-19 prediction problems will often not present as a simple binary classification task, and complexities in the data should be handled appropriately. For example, a prediction horizon should be specified for prognostic outcomes (eg, 30 day mortality). If study participants have neither recovered nor died within that time period, their data should not be excluded from analysis, as was done in most reviewed studies. Instead, an appropriate time to event analysis should be considered to allow for administrative censoring. 17 Censoring for other reasons, for instance because of quick recovery and loss to follow-up of patients who are no longer at risk of death from covid-19, could necessitate analysis in a competing risk framework. 140

A prediction model applied in a new healthcare setting or country often produces predictions that are miscalibrated 141 and might need to be updated before it can safely be applied in that new setting 17 (a recalibration sketch is shown at the end of this section). This requires data from patients with covid-19 to be available from that system. Instead of developing and updating predictions in their local setting, individual participant data from multiple countries and healthcare systems might allow better understanding of the generalisability and implementation of prediction models across different settings and populations. This approach could greatly improve the applicability and robustness of prediction models in routine care. 142 143 144 145 146

The evidence base for the development and validation of prediction models related to covid-19 will quickly increase over the coming months. Together with the increasing evidence from predictor finding studies 147 148 149 150 151 152 153 and open peer review initiatives for covid-19 related publications, 154 data registries 120 121 155 156 157 are being set up. To maximise the new opportunities and to facilitate individual participant data meta-analyses, the World Health Organization has released a new data platform to encourage sharing of anonymised covid-19 clinical data. 158 To leverage the full potential of these evolutions, international and interdisciplinary collaboration in terms of data acquisition, model building, and validation is crucial.
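As an illustration of such updating, here is a minimal sketch (simulated data; an illustrative assumption rather than a method from any included study) of logistic recalibration, which re-estimates only the intercept and calibration slope of an existing model on data from the new setting:

```python
# Illustrative only: recalibrating an existing model's risk predictions
# for a new setting with a lower baseline risk.
import numpy as np
import statsmodels.api as sm
from scipy.special import expit, logit

rng = np.random.default_rng(2)
n = 300
lp = rng.normal(0, 1.5, n)                  # linear predictor of the original model
p_old = expit(lp)                           # original predicted risks
y_new = rng.binomial(1, expit(-0.8 + lp))   # outcomes in the new, lower risk setting

# Refit intercept and slope on the new data, keeping the original predictors fixed.
fit = sm.Logit(y_new, sm.add_constant(logit(p_old))).fit(disp=0)
a, b = fit.params
p_updated = expit(a + b * logit(p_old))     # recalibrated predictions
print(round(a, 2), round(b, 2))
```

Re-estimating two parameters requires far less local data than refitting the whole model, which is why updating is usually preferred over developing yet another new model.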

Study limitations With new publications on covid-19 related prediction models rapidly entering the medical literature, this systematic review cannot be viewed as an up-to-date list of all currently available covid-19 related prediction models. Also, 87 of the studies we reviewed were only available as preprints. These studies might improve after peer review, when they enter the official medical literature; we will reassess these peer reviewed publications in future updates. We also found other prediction models that are currently being used in clinical practice without scientific publications, 159 and web risk calculators launched for use while the scientific manuscript is still under review (and unavailable on request). These unpublished models naturally fall outside the scope of this review of the literature. 160 As we have argued extensively elsewhere, 161 transparent reporting that enables validation by independent researchers is key for predictive analytics, and clinical guidelines should only recommend publicly available and verifiable algorithms.

Implications for practice All 145 reviewed prediction models were found to have a high risk of bias, and evidence from independent external validation of the newly developed models is currently lacking. However, the urgent need for diagnostic and prognostic models to assist in quick and efficient triage of patients in the covid-19 pandemic might encourage clinicians and policymakers to prematurely implement prediction models without sufficient documentation and validation. Earlier studies have shown that models were of limited use in the context of a pandemic, 162 and they could even cause more harm than good. 163 Therefore, we cannot recommend any model for use in practice at this point.

The current oversupply of insufficiently validated models is not useful for clinical practice. Future studies should focus on validating, comparing, improving, and updating promising available prediction models, rather than developing new ones. 17 For example, Diaz-Quijano developed and externally validated a diagnostic model using Brazilian surveillance data with reasonable discrimination, but many patients had to be excluded because no PCR testing was performed, so this model needs further validation. 17 Two other models to diagnose covid-19 also showed promising discrimination at external validation in small unselected cohorts. 43 110 An externally validated model that used computed tomography based total severity scores showed good discrimination between patients with mild, common, and severe-critical disease. 54 Two models to predict progression to severe covid-19 within two weeks showed promising discrimination when validated externally on unselected cohorts. 83 119 Another model discriminated well between survivors and non-survivors among confirmed cases, but the prediction horizon was not specified, and the study had many missing values for key parameters. 67 Because reporting in each of these studies was insufficiently detailed and the validation was in datasets with fewer than 100 events in the smallest outcome category, validation in larger, international datasets is needed. Such external validations should assess not only discrimination, but also calibration and clinical utility (net benefit; see the sketch below). 141 146 163 Owing to differences between healthcare systems (eg, Chinese and European) in when patients are admitted to and discharged from hospital, as well as the testing criteria for patients with suspected covid-19, we anticipate most existing models will be miscalibrated, but this can usually be solved by updating and adjustment to the local setting.

When creating a new prediction model, we recommend building on previous literature and expert opinion to select predictors, rather than selecting predictors in a purely data driven way. 17 This is especially important for datasets with limited sample size. 164 Based on the predictors included in multiple models identified by our review, we encourage researchers to consider incorporating several candidate predictors. Common predictors include age, body temperature, lymphocyte count, and lung imaging features. Flu-like signs and symptoms and neutrophil count are frequently predictive in diagnostic models, while comorbidities, sex, C reactive protein, and creatinine are frequently reported prognostic factors.
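For reference, net benefit at a risk threshold t is TP/n − (FP/n) × t/(1 − t); the minimal sketch below (hypothetical data, an illustrative assumption) shows the calculation that underlies decision curve analysis:

```python
# Illustrative only: net benefit of treating patients whose predicted risk
# meets a threshold t, the quantity plotted in decision curve analysis.
import numpy as np

def net_benefit(y, p, t):
    treat = p >= t
    tp = np.sum(treat & (y == 1))            # true positives among those treated
    fp = np.sum(treat & (y == 0))            # false positives among those treated
    n = len(y)
    return tp / n - fp / n * t / (1 - t)

y = np.array([0, 1, 0, 1, 1, 0, 0, 1])                   # hypothetical outcomes
p = np.array([0.2, 0.7, 0.4, 0.9, 0.6, 0.1, 0.3, 0.8])   # hypothetical predicted risks
print(net_benefit(y, p, t=0.5))
```

The threshold t encodes how a clinician weighs a missed case against unnecessary treatment, so net benefit links a model's predictions directly to the triage decisions discussed above.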
By pointing to the most important methodological challenges and issues in design and reporting of the currently available models, we hope to have provided a useful starting point for further studies aiming to develop new models, or to validate and update existing ones. This living systematic review has been conducted in collaboration with the Cochrane Prognosis Methods Group. We will update this review and appraisal continuously to provide up-to-date information for healthcare decision makers and professionals as more international research emerges over time.