Abstract Building on developments in machine learning and prior work in the science of judicial prediction, we construct a model designed to predict the behavior of the Supreme Court of the United States in a generalized, out-of-sample context. To do so, we develop a time-evolving random forest classifier that leverages unique feature engineering to predict more than 240,000 justice votes and 28,000 case outcomes over nearly two centuries (1816-2015). Using only data available prior to decision, our model outperforms null (baseline) models at both the justice and case level under both parametric and non-parametric tests. Over nearly two centuries, we achieve 70.2% accuracy at the case outcome level and 71.9% at the justice vote level. More recently, over the past century, we outperform an in-sample optimized null model by nearly 5%. Our performance is consistent with, and improves on, the general level of prediction demonstrated by prior work; however, our model is distinctive because it can be applied out-of-sample to the entire past and future of the Court, not a single term. Our results represent an important advance for the science of quantitative legal prediction and portend a range of other potential applications.

Citation: Katz DM, Bommarito MJ II, Blackman J (2017) A general approach for predicting the behavior of the Supreme Court of the United States. PLoS ONE 12(4): e0174698. https://doi.org/10.1371/journal.pone.0174698 Editor: Luís A. Nunes Amaral, Northwestern University, UNITED STATES Received: January 17, 2017; Accepted: March 13, 2017; Published: April 12, 2017 Copyright: © 2017 Katz et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability: Data and replication code are available on Github at the following URL: https://github.com/mjbommar/scotus-predict-v2/. Funding: The author(s) received no specific funding for this work. Competing interests: All authors are members of LexPredict, LLC, which provides consulting services to various legal industry stakeholders. We received no financial contributions from LexPredict or anyone else for this paper. This does not alter our adherence to PLOS ONE policies on sharing data and materials.

Introduction As the leaves begin to fall each October, the first Monday marks the beginning of another term for the Supreme Court of the United States. Each term brings with it a series of challenging, important cases that cover legal questions as diverse as tax law, freedom of speech, patent law, administrative law, equal protection, and environmental law. In many instances, the Court’s decisions are meaningful not just for the litigants per se, but for society as a whole. Unsurprisingly, predicting the behavior of the Court is one of the great pastimes for legal and political observers. Every year, newspapers, television and radio pundits, academic journals, law reviews, magazines, blogs, and tweets predict how the Court will rule in a particular case. Will the Justices vote based on the political preferences of the President who appointed them or form a coalition along other dimensions? Will the Court counter expectations with an unexpected ruling? Despite the multitude of pundits and vast human effort devoted to the task, the quality of the resulting predictions and the underlying models supporting most forecasts is unclear. Not only are these models not backtested historically, but many are difficult to formalize or reproduce at all. When models are formalized, they are typically assessed ex post to infer causes, rather than used ex ante to predict future cases. As noted in [1], “the best test of an explanatory theory is its ability to predict future events. To the extent that scholars in both disciplines (social science and law) seek to explain court behavior, they ought to test their theories not only against cases already decided, but against future outcomes as well.” Luckily, the Court provides a new opportunity to test each year. Thousands of petitioners annually appeal their cases to the Supreme Court. In most situations, the Court decides to hear a case by granting a petition for a writ of certiorari. 
If that petition is granted, the parties then submit written materials supporting their position and later provide oral argument before the Court. After considering the case, each participating Justice ultimately casts his or her vote on whether to affirm or reverse the status quo (typically seen through the lens of a decision by the lower court or special master). Over the last decade, the Court has issued between 70-90 opinions per term for an average of approximately 700 Justice votes per term. While many questions could be evaluated, the Court’s decisions offer at least two discrete prediction questions: 1) will the Court as a whole affirm or reverse the status quo judgment and 2) will each individual Justice vote to affirm or reverse the status quo judgment? In this paper, we describe a prediction model answering these two questions as guided by three modeling goals: generality, consistency, and out-of-sample applicability. Building on developments in machine learning and the prior work of [1], [2] and [3], we construct a model to predict the voting behavior of the Court and its Justices in a generalized, out-of-sample context. As inputs, we rely upon the Supreme Court Database (SCDB) and some derived features generated through feature engineering. Our model is based on the random forest method developed in [4]. We predict nearly two centuries of historical decisions (1816-2015) and compare our results against multiple null (baseline) models. Using only data available prior to decision, our model outperforms all baseline models at both the Justice and Court observation level under both parametric and non-parametric tests. This performance is consistent with, and improves on, the general level of prediction demonstrated by prior work; however, our model is distinctive because it can be applied out-of-sample to the entire past and future of the Court, not just a single term. Finally, our conclusion suggests areas for future improvement and collaboration. 
Our results represent a significant advance for the science of quantitative legal prediction and portend a range of potential applications, such as those described in [5].

Research principles and prior work In this section, we describe the principles guiding our model construction and how we conducted our testing in light of prior work on the topic. Generality Leveraging the early work of [6], both [1] and [3] developed a classification tree model which was designed to predict the behavior of Supreme Court Justices for the 2002-2003 Supreme Court term. Their work represents a seminal contribution to the science of legal forecasting as their classification tree models not only performed well in absolute terms, but also matched or outperformed a number of subject matter experts. Despite its contribution to the field, however, the approach undertaken in [1] and [3] was limited in several important ways. For example, their model construction is only applicable to a single “natural court” with full participation, i.e., cases where all of a specific set of Justices are sitting. The natural court tested in their paper, following Justice Stephen G. Breyer’s appointment in 1994, was one of the longest periods without personnel changes on the Court, providing their models with an unusually large training sample. It is not possible, however, to evaluate their model in periods prior to 1994 or after 2005 following the replacements of Chief Justice William H. Rehnquist and Justices Sandra Day O’Connor, David H. Souter, and John Paul Stevens. As a result of these issues, the performance and nature of the model cannot necessarily be generalized to all Supreme Court cases during their test period, let alone cases before or after their tested natural court. Our first principle, generality, is based on these observations. As the composition of the Court changes case-by-case or term-by-term, either through recusal, retirement, or death, a prediction model should continue to generate predictions. 
The properties and performance of a prediction model should also be able to be studied across time and “abnormal” circumstances (e.g., cases with original jurisdiction or fewer than nine Justices). Therefore, our goal is to construct a model that is general—that is, a model that can learn online, in a manner similar to online learning models described in [7] and [8]. Consistency Second, we prefer the model to have consistent performance across time, case issues, and Justices. Similar to our motivation for generality, existing models have had significantly varying performance over time and across Justices. To support the case for a model’s future applicability, it should consistently outperform a baseline comparison. Both legal scholars and practicing lawyers have had difficulty leveraging prediction models [5]. Among other difficulties, qualitatively-oriented legal experts tend to suggest model improvements based on anecdote or their own untested mental model. However, if these ostensible improvements cannot be systematically inferred from data, or if their impact on the model is detrimental in other periods or for other Justices, then they ought not be included in a model engineered for consistency. While prediction models can be applied in many contexts, consistency can also be related to a risk preference in a repeated betting scenario. For example, instead of preferring the highest per-wager expected value (i.e., maximum accuracy), a bettor might prefer a wager with less volatility or long-term downside risk. Both consistency and generality can be seen as related to overfitting and the bias-variance trade-off. But in addition to the typical learning problems under a stationary system, we are faced with a more complex reality. 
Court outcomes are potentially influenced by a variety of dynamics, including public opinion as in [9], inter-branch conflict [10], both changing membership and shifting views of the Justices as explored in [11] [12], and judicial norms and procedures [13]. The classic adage “past performance does not necessarily predict future results” is very much applicable. For example, likely due to changes in norms, the number of cases per term has fallen from approximately 150 between 1950-1990 to fewer than 90 between 1990-2015. Consider another famous historical example, as explored in [14] and [15], when the aftermath of President Franklin D. Roosevelt’s attempted Court-packing plan in 1937 resulted in a significant turnover of Justices in years that followed. Each of these and other changes represents a challenge to a model engineered with consistency as a goal. Out-of-sample applicability Our third model principle is out-of-sample applicability. Namely, all information required for the model to produce an estimate should be knowable prior to the date of decision. This is in contrast with models like [2], which require partial knowledge about the outcome to predict the full outcome. This principle is arguably the most important, as it allows for the model to generate predictions in advance, i.e., predictions that can be applied usefully in the real world. While existing approaches like [1, 2] and [3] may honor one or two of these principles, none simultaneously achieve all three above, severely limiting their general applicability. Both [1] and [3] are predictive out-of-sample but fail to be general enough to apply widely or consistent when tested. By contrast, [2] is general across terms and consistent, but not predictive out-of-sample since it requires knowledge of some votes to predict others. 
As detailed further below, our approach is the first that satisfies all three of these criteria, and thus represents a significant advance in the science of quantitative legal prediction.

Data and feature engineering SCDB In order to build our model, we rely on data from the Supreme Court Database (SCDB) [16]. SCDB features more than two hundred years of high-quality, expertly-coded data on the Court’s behavior. Each case contains as many as two hundred and forty variables, including chronological variables, case background variables, justice-specific variables, and outcome variables. Many of these variables are categorical, taking on hundreds of possible values; for example, the issue variable can take 384 distinct values. These SCDB variables form the basis for both our features and outcome variables. SCDB is the product of years of dedication from Professor Harold Spaeth and many others. The database has been consistently subjected to reliability analysis and has been used in hundreds of academic studies (e.g., [11], [17], [18], [19], [20], [21], [22], [23]). While there are serious and important limits to SCDB, as detailed in [24], SCDB is the highest-quality and longest-duration database for Supreme Court decisions. There are currently two releases of SCDB: SCDB Modern and SCDB Legacy. The SCDB Modern release contains terms beginning in 1946, while the SCDB Legacy release contains terms beginning in 1791. When [25], an earlier pre-print version of this paper, was released, SCDB Legacy had not yet been released. As SCDB Legacy represents more than a threefold increase in the length of simulation history and size of training data, we have re-run all model construction and analysis for the new data release; methods and results from [25] are thus superseded by this paper. Targets To model Supreme Court decisions, we need to define an outcome variable from SCDB corresponding to a decision. Typically, Court-watchers frame decisions as either affirming or reversing a lower court’s decision. This, however, is only consistent with cases heard on appeal.
In some circumstances, the United States Supreme Court is the court of original jurisdiction, and there is therefore no lower court against which to frame reversal. In these cases, decisions are typically framed as either siding with the plaintiff(s) or defendant(s). In addition, the Court and its members may take technically-nuanced positions or the Court’s decision might otherwise result in a complex outcome that does not map onto a binary outcome. In order to build a general model that can handle all cases, we created a disposition coding map that defines a Justice vote as (i) Reversed, (ii) Affirmed, or (iii) Other, depending on a Justice’s vote and the SCDB’s caseDisposition variable. This disposition coding map is outlined in our Github repository [26]. Our mapping displays Justice vote values by column and Court caseDisposition values by row. The case outcome is defined as Reverse if there are more total Reverse votes than Affirm votes; notably, Other votes, which may include recusals or non-standard form decisions, are excluded from the vote aggregation. Table 1 below displays the distribution of Reverse, Affirm, and Other coding by Justice outcome and case outcome.
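The aggregation rule described above can be sketched in a few lines of Python. The function and constant names below are hypothetical, and the tie-handling choice (ties fall to Affirm) is one possible reading; the authoritative disposition coding map is in our Github repository [26].

```python
# Illustrative sketch of the case-outcome aggregation rule; names are
# hypothetical, and the authoritative coding map is in the Github repository.
from collections import Counter

REVERSE, AFFIRM, OTHER = "Reverse", "Affirm", "Other"

def case_outcome(justice_codes):
    """Aggregate per-Justice codes into a case outcome, excluding Other
    votes (e.g., recusals or non-standard dispositions) from the tally."""
    tally = Counter(code for code in justice_codes if code != OTHER)
    # Reverse wins only with strictly more Reverse than Affirm votes.
    return REVERSE if tally[REVERSE] > tally[AFFIRM] else AFFIRM

# A 5-3 reversal with one Justice coded Other (e.g., recused):
print(case_outcome([REVERSE] * 5 + [AFFIRM] * 3 + [OTHER]))  # Reverse
```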


Table 1. Outcome distribution (1816-2015). https://doi.org/10.1371/journal.pone.0174698.t001 Features and feature engineering With the outcome variable specified, we proceed next to describe the SCDB features used and feature engineering we performed. SCDB contains a wide range of potential features, and the majority of these are categorical variables. In our study, we begin with the following features available from SCDB: Justice (ID), term, natural court, month of argument, petitioner, respondent, manner in which Court took jurisdiction, administrative action, court of origin and source of the case, lower court disagreement, reason for granting cert, lower court disposition, lower court direction, issue, and issue area. For each of these variables, we follow standard practice and convert the categorical variables into binary or indicator variables. For example, in the case of reason for granting cert, there are 13 categories used in SCDB. Therefore, the single certReason variable is converted to 13 binary or indicator variables—one for each possible option. In addition to simple feature encoding, we also engineer features that do not occur in SCDB as released. The first set of features that we engineer are related to the Circuit Court of Appeals from which the dispute arose. SCDB codes this data in the form of the case source and case origin, where the source corresponds to the opinion under review and the origin corresponds to the location of original filing. While these variables can take over 130 unique court values, scholars primarily group them by Circuit; Circuits have been shown to be a strong predictor of reversal during certain periods, as shown in [27]. Based on this guidance, we therefore developed a translation from each SCDB court ID to the corresponding Circuit.
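The categorical-to-indicator conversion described above can be sketched with pandas; the category values below are toy stand-ins, not actual SCDB certReason codes.

```python
# Sketch of converting a categorical SCDB variable into binary indicator
# columns; the values here are toy stand-ins, not real certReason codes.
import pandas as pd

cases = pd.DataFrame({"certReason": [1, 4, 1, 12]})
indicators = pd.get_dummies(cases["certReason"], prefix="certReason")

# One column per observed category; each row has exactly one indicator set.
print(list(indicators.columns))  # ['certReason_1', 'certReason_4', 'certReason_12']
```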
The coding maps from these origin and source courts to a new set of 16 categorical values, which are then binarized as the raw features above. The features engineered above can both be described as coarsened or collapsed. We move on next to features that are derived through arithmetic or interaction of one or more features. The first of this class is a set of chronologically-oriented features related to oral argument and case timing. These features include (i) whether or not oral arguments were heard for the case, (ii) whether or not there was a rehearing, and (iii) the duration between when the case was originally argued and a decision was rendered. These features are based on the qualitative observation that the length of time between argument and decision is related to the unanimity of the Court; for example, in the past three terms, the ten “fastest” decisions of each term have nearly all been unanimous 9-0. Item (iii) may seem at first to include future or out-of-sample knowledge. However, in practice, the predictions for a case may evolve as new information about the case is acquired prior to the decision being rendered. For example, when the Court announces that a case will have arguments heard, the delay feature may be set to zero initially. Once the argument date passes, the delay feature is then incremented periodically. After each time step that passes, the feature matrix for undecided cases is updated, and the resulting predictions may therefore change. Consistent with “online” learning approaches such as [7] and [8], this does not require out-of-sample information; it only requires that the data and algorithm be re-run at a specified frequency for any undecided cases in a term. Lastly, we engineer features that summarize the “behavior” of a Justice, the Court, the lower court, and differences between them. 
These features fall into three categories: (i) features related to the rate of reversal, (ii) features related to the left-right direction of a decision, and (iii) features related to the rate of dissent. These features can be thought of as conditional empirical probabilities. For example, (i) includes, at a given term and for a given justice, the historically-observed proportion of votes to reverse. Importantly, in addition to calculating these values for each justice, we also include difference terms between the Court as a whole and the individual justice. These difference terms are, qualitatively, the relative inclination of a Justice to reverse compared to the Court. We repeat these calculations for other justice-specific features including direction and agreement features, providing quantitative measures of left-right political preference and rate of dissent. In addition, we include a difference term between the lower court’s decision direction and the Justice’s historically-observed mean direction; this provides a measure of how far apart, ideologically, the Justice is from the lower court’s opinion on review (excepting original jurisdiction cases). Together, these features provide relative information about Courts’ and Justices’ political and procedural leanings; for example, we find that reversal rates vary significantly even in the last 35 years at both the Court and Justice level.
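A minimal sketch of these behavioral features with toy data follows; column names are illustrative, not SCDB fields, and a production version would aggregate at the term level rather than the row level. The key property is that each rate uses only strictly prior votes, so no future information leaks into the feature.

```python
# Toy sketch of the "behavioral" difference features described above: a
# Justice's historically observed reversal rate, the Court-wide rate, and
# their difference. Column names are illustrative, not SCDB fields.
import pandas as pd

votes = pd.DataFrame({
    "term":    [2000, 2000, 2001, 2001, 2002, 2002],
    "justice": ["A", "B", "A", "B", "A", "B"],
    "reverse": [1, 0, 1, 1, 0, 1],
}).sort_values("term")

# shift() ensures each rate is computed from strictly prior votes only.
votes["justice_rev_rate"] = votes.groupby("justice")["reverse"].transform(
    lambda s: s.expanding().mean().shift()
)
votes["court_rev_rate"] = votes["reverse"].expanding().mean().shift()
votes["rev_rate_diff"] = votes["justice_rev_rate"] - votes["court_rev_rate"]
```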

Model construction With features and outcome data defined, we proceed to discuss the construction of our model. While this section provides a general overview of modeling procedures, readers interested in the technical details should review the Github repository accompanying the paper, [26]; all source code and data required to reproduce the results presented are freely available there. The model is developed in Python and all methods described below, unless otherwise indicated, are from scikit-learn 0.18 [28]. The modeling process begins by selecting a term T*; in order to satisfy our three principles above, no information from term T* or after should be available during the training phase. If we let each docket-vote feature vector d_i and docket-vote outcome v_i have term T(d_i), then our training feature set for model term T* is D_T* = {d_i : T(d_i) < T*} and our training target set V_T* corresponds to the matching v_i records. While some information may be known intra-term, i.e., for {d_i : T(d_i) = T*}, this modeling procedure only retrains at the outset of each term. For example, while some decisions in term T* may have been observed by December, cases in January are predicted using only information prior to October. Other than the incremental delay feature discussed above, no information derived from the current court term is incorporated into the model until the start of the following term. While we represent D and V above as sets of vectors, we can equally consider D to be a feature matrix with each docket-vote in a row and each feature in a column. As of 2015, D_2015 based on SCDB Legacy (beta) has 249,793 docket-votes; under our feature engineering approach described above, D_2015 has 1,501 columns. In many machine learning approaches, we might pre-process D by rescaling, rotating, interacting, or removing columns. Random forest classifiers, especially when applied to binarized or indicator variables, do not generally require pre-processing.
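The strictly-prior-term training rule above reduces to a simple mask over terms; the term values below are toy data.

```python
# Sketch of the per-term training split: a model for term T* trains only
# on docket-votes from strictly earlier terms. Toy term values.
import numpy as np

terms = np.array([1999, 2000, 2000, 2001, 2001, 2002])  # term of each docket-vote
T_star = 2001

train_mask = terms < T_star     # training set: docket-votes with term < T*
predict_mask = terms == T_star  # docket-votes to predict this term

print(int(train_mask.sum()), int(predict_mask.sum()))  # 3 2
```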
Furthermore, random subspace methods like random forests implicitly remove or “select” features by subsetting the feature space for each sub-learner tree. One weakness of the scikit-learn implementation of random forests relative to alternatives like xgboost, however, is its treatment of missing data. In most cases, this is handled by mapping missing values to a separate “missing” indicator column during encoding; in some cases, however, a historical mean imputation may be used. However, no additional feature selection or pre-processing methods are applied to D prior to learning. We next apply a learning algorithm to D and V. As noted previously, we selected a random forest classifier [4]. Random forests are part of the family of ensemble methods. Ensemble methods leverage the wisdom of the statistical crowds. In the case of random forest classifiers, we construct a forest of statistically diverse trees using bootstrap aggregation on random subsets of our training data. To cast predictions, we simply calculate predictions for each of our individual trees and then average across the entire forest. While an individual statistical learner (a single tree) might offer an unrepresentative prediction of a given phenomenon, the crowd-sourced average of a larger group of learners is often better able to forecast outcomes. Not only have random forests proven to be “unreasonably effective” in a wide array of supervised learning contexts [29], but in our testing, random forests outperformed other common approaches including support vector machines (LibLinear, LibSVM) and feedforward artificial neural network models such as multi-layer perceptron models implemented with [30]. For details of the implementation, interested readers are directed to the scikit-learn documentation [28] and [31] and the keras documentation [30]. Of some note, however, is our experimentation with the warm_start parameter to “grow” the forest online.
Recall that at the beginning of each term, the model is retrained to incorporate newly observed data. In [25], we built a “fresh” forest model each term with number of trees selected by cross-validated hyperparameter search. In this updated research, however, we have simulated performance using both “fresh” forests and “growing” forests, in which trees are added to an existing forest. Only under certain circumstances, such as the changing of the natural court, following the addition or loss of a Justice, does the model build a “fresh forest”. For example, the models used to produce this paper’s results were trained with 125 initial trees beginning in 1816 (5 ∗ 25 trees, five for each term between 1791-1816). Each term, in the absence of a natural court change, an additional five trees were trained and added to the prior term’s forest. Our implementation of this “growing” approach allows for substantially faster simulation times and more stable predictions, as it only need train a small number of trees per step. Equally important is that most trees in the forest are stable for most years, and so the same inputs in year T and T + 1 are likely to produce the same predictions. Generally speaking, most learners benefit from joint cross-validation and hyperparameter search. For the “fresh” forest approach, in which a new random forest is built each term, we performed a number of experiments by grid-searching the number of trees, minimum number of leaves per node, maximum depth per tree, heuristic used to select the number of features per tree (e.g., log, sqrt), and split criterion (e.g., Gini vs. entropy) for each model retraining, i.e., for each term. This approach allows the parameters to adapt over the nearly 200 years of change in historical sample composition and size. However, we found that the marginal improvement in accuracy and F1 were not worth the substantial increase in computational requirement and decreased stability of predictions. 
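The per-term grid search described above can be sketched with scikit-learn, searching over the same hyperparameters named in the text; the grid values and toy data below are illustrative, not those used in the paper.

```python
# Sketch of the per-term hyperparameter grid search over the parameters
# named in the text; grid values and toy data are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

grid = {
    "n_estimators": [10, 25],
    "min_samples_leaf": [1, 5],
    "max_depth": [None, 8],
    "max_features": ["sqrt", "log2"],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

As discussed above, the marginal gains from repeating this search each term did not justify its computational cost or the resulting prediction instability in our setting.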
In the simple examples included in the Github repository, cross-validated hyperparameter search does not have a noticeable impact on accuracy over a “default” random forest configuration. As a whole, our model construction applies standard pre-processing and learning approaches within each step, but experiments with purposeful and atypical design around longitudinal model application. For simplicity of subsequent presentation and replication, only the “growing” forest approach described above with five trees per step is presented. All source code and results are available at [26] for readers interested in the details of model specification and implementation.
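The “growing” forest update can be sketched with scikit-learn’s warm_start parameter: increasing n_estimators and refitting trains only the newly added trees while keeping the existing ones. The data below is a fixed toy set; in the actual simulation the training data also grows by one term per step.

```python
# Sketch of the "growing" forest: warm_start=True preserves existing trees
# across refits, so each step trains only the five new trees. Toy data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=125, warm_start=True, random_state=0)
forest.fit(X, y)  # initial 125-tree forest, as at the start of the simulation

for _ in range(3):  # three subsequent terms without a natural court change
    forest.n_estimators += 5
    forest.fit(X, y)  # trains only the five new trees

print(len(forest.estimators_))  # 140
```

Because the first 125 trees are untouched by later steps, the same inputs produce largely stable predictions from term to term, which is the stability property noted above.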

Conclusion and future research Building upon prior work in the field of judicial prediction [1–3], we offer the first generalized, consistent and out-of-sample applicable machine learning model for predicting decisions of the Supreme Court of the United States. Casting predictions over nearly two centuries, our model achieves 70.2% accuracy at the case outcome level and 71.9% at the justice vote level. More recently, over the past century, we outperform an in-sample optimized null model by nearly 5%. Among other things, we believe such improvements in modeling should be of interest to court observers, litigants, citizens and markets. Indeed, with respect to markets, given that judicial decisions can impact publicly traded companies, as highlighted in [32], even modest gains in prediction can produce significant financial rewards. We believe that the modeling approach undertaken in this article can also serve as a strong baseline against which future science in the field of judicial prediction might be cast. While a researcher seeking to optimize performance for a given case or a given time period might pursue an alternative approach, our effort undertaken herein was directed toward building a general model—one that could stand the test of time across many justices and many distinct social, political and economic periods. Beyond predicting U.S. Supreme Court decisions, our work contributes to a growing number of articles which either highlight or apply the tools of machine learning to some class of prediction problems in law or legal studies (e.g., [5], [33], [34], [35], [36], [37], [38], [39], [40]). We encourage additional applied machine learning research directed to these areas and new areas where the application of predictive analytics might be fruitful. At its core, our effort relies upon a statistical ensemble method used to transform a set of weak learners into a strong learner.
We believe a number of future advancements in the field of legal informatics will likely rely on elements of that basic approach. Namely, our focus on statistical crowdsourcing foreshadows future developments in the field. Future research will seek to find the optimal blend of experts, crowds [41], and algorithms, as some ensemble of these three streams of intelligence will likely produce the best-performing model for a wide class of prediction problems [42].

Acknowledgments We would like to thank our reviewers and all of those who provided comments on prior drafts of this paper.

Author Contributions Conceptualization: DMK MJB JB. Data curation: DMK MJB. Formal analysis: DMK MJB. Project administration: DMK MJB. Software: DMK MJB. Validation: DMK MJB. Visualization: DMK MJB. Writing – original draft: DMK MJB JB. Writing – review & editing: DMK MJB JB.