This was a preliminary analysis that I presented by way of contrast. As soon as predictive algorithms produced by machine learning are involved, the issues become more complicated. This mainly has to do with the character of the algorithms. The variables are no longer constitutive of rankings; they are indicators supposed to predict target variables. User reviews of restaurant food score the quality of the food itself; histories of credit card use merely serve to predict whether someone will repay a loan in the future.

This difference has huge consequences for accountability: for all phases of decision-making, it becomes much more intricate. What would it amount to if my default position of introducing total transparency were applied? What shape would accounting take?

Before going into the issue, a preliminary remark is due. In machine learning, the stages of data collection, model construction, and model use are tightly interrelated. When calling an organization to account concerning, say, the use of a profile, an account of what the profile means and how it has been constructed is inescapable. And any account of how an algorithm has been constructed cannot do without an account of how datasets have been used in the process (say, as concerns possibly biased data). So accounting for machine learning models can only make sense if all phases are taken into account. [Footnote 8] With this conception, I take aim at Zarsky, who painstakingly tries to relate the rationales for transparency to disclosure of specific stages of decision-making (Zarsky 2013: 1533 ff.). I think that in the domain of machine learning, in which all phases are tightly interrelated, this is a futile exercise. Transparency only makes sense if all aspects of machine learning are laid bare. [Footnote 9]

Now let me return to the task at hand and analyze the meaning of accountability once full transparency is observed. [Footnote 10]

Phase 1: Data Collection

At first, datasets have to be collected that are to serve as input to the process of machine learning proper. Their quality is absolutely essential, since any shortcomings risk being built into the very model to be developed later. An obvious requirement is that the data are appropriate to the questions being asked. Frictions of this kind may arise in particular when data from one context are imported into another context; the data then need to be reinterpreted, which is a precarious task. As a case in point, consider the sale of prescription data from pharmacies to insurance companies, where these data were to be used in a machine learning effort to predict health risks (Pasquale 2015: 27). More generally, this points to the importance of careful scrutiny of the practices of data brokers: companies that buy as many datasets as possible from everywhere and resell them to any interested party.

A particular concern that has of late attracted a lot of attention is whether the datasets are free of bias. [Footnote 11] For the sake of illustration, consider model construction for admission decisions in university education. Apart from the issue that the target variable (fitness for education) is a subjective affair, the process of labeling applicants with one of its possible values ("class variables", in this particular case either "fit" or "unfit" for education) may be influenced by prejudice, say against women. So from the very start, some of the training data points may carry wrong labels (Barocas and Selbst 2016: 681 ff.). Furthermore, the dataset to be used for training may be biased against specific groups that society wants to protect from discrimination. Along lines of, say, race or gender, records may contain more errors or fewer details, or a group may simply be underrepresented in the sample as a whole. Unavoidably, skewed data will produce a skewed model later on (Barocas and Selbst 2016: 684 ff.).
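
Accounting for this phase may be supported by a simple audit of the training data. The following is a minimal sketch in Python, with made-up data and hypothetical column names, that checks two of the symptoms just mentioned: underrepresentation of a group and diverging label rates between groups.

```python
import pandas as pd

# Hypothetical admissions data: a sensitive attribute and the
# assigned class label (1 = "fit", 0 = "unfit" for education).
df = pd.DataFrame({
    "gender": ["f", "m", "m", "f", "m", "m", "f", "m"],
    "label":  [0,   1,   1,   0,   1,   0,   1,   1],
})

# Representation: is one group underrepresented in the sample?
print(df["gender"].value_counts(normalize=True))

# Labeling: does the rate of "fit" labels differ strongly by group?
print(df.groupby("gender")["label"].mean())
```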

Moreover, only very coarse features may be used in model construction (for the sake of cost efficiency), while these features demonstrably correlate with sensitive dimensions like race or gender. Compare the infamous practice of "redlining" (simply excluding neighborhoods as a whole) in deciding about granting credit. Ultimate decision-making will reproduce this bias (Barocas and Selbst 2016: 688 ff.). Finally, the dataset may contain variables that serve well for predictive purposes but at the same time correlate with one or more sensitive categories. This so-called proxy problem is a hard one: how does one distinguish between the discriminatory and the non-discriminatory part (Barocas and Selbst 2016: 691 ff.)?
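
Screening for proxies can at least start with a simple association measure. The sketch below (the arrays are made up, and correlation is just one of many possible measures) estimates how strongly a candidate feature, say a neighborhood indicator, correlates with a sensitive attribute.

```python
import numpy as np

# Hypothetical data: a coarse feature (e.g., a neighborhood
# indicator) and a sensitive attribute, both encoded 0/1.
feature   = np.array([1, 1, 0, 1, 0, 0, 1, 0], dtype=float)
sensitive = np.array([1, 1, 0, 1, 0, 1, 1, 0], dtype=float)

# A high absolute correlation flags the feature as a potential
# proxy for the sensitive dimension.
r = np.corrcoef(feature, sensitive)[0, 1]
print(f"correlation with sensitive attribute: {r:.2f}")
```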

For all of these defects in training data, one has to find remedies to be applied in the subsequent phase of model construction (see below: “discrimination-aware” modeling).

Phase 2: Model Construction

Subsequently, the available data are used as training material for machine learning. The techniques employed are various: classification and decision trees, support vector machines (SVMs), ensemble methods, neural networks, and the like. In inductive fashion, an appropriate model gets constructed that best fits the data. Such a model evolves step by step, its error ever diminishing. [Footnote 12] Models are made for purposes of prediction: think of predicting who deserves a loan, what insurance premium to set, or whom to inspect for tax evasion or for suspicious activities at the airport (cf. above).

By way of illustration, take the construction of a decision tree. In recursive fashion, the training data are split again and again into subsets (nodes) along a single attribute. At every step, one chooses the attribute that best separates the data at hand. What criterion should one employ for splitting? A common measure for determining the best separation is the "information gain": the difference between the amount of "entropy" before and after the contemplated split (summed with proper weights). The highest information gain indicates where the next split should take place. While proceeding in this fashion, the amount of entropy decreases with every step. The procedure stops when all subsets are pure (all elements belonging to a single class), and hence entropy has become zero for all of them.
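
For the reader who wants to see the criterion at work, here is a minimal sketch in Python (the function names are mine) of the entropy and information-gain computations just described.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy before a contemplated split, minus the weighted
    sum of the entropies of the resulting subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

# Example: splitting eight labeled applicants along one attribute.
parent = ["fit"] * 4 + ["unfit"] * 4
split = [["fit", "fit", "fit", "unfit"], ["fit", "unfit", "unfit", "unfit"]]
print(information_gain(parent, split))  # the higher, the better the split
```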

In the process of modeling, several pitfalls have to be avoided. A dominant concern is "overfitting": one goes on training (say) the classifier until the very end. The end product surely fits the training data, but only those; it is unfit to generalize to other, new data. [Footnote 13] One recipe against overfitting (among many) is to divide the training data into a training set (80%) and a test set (20%). The classifier is trained on the first set, its error diminishing with every iteration. Simultaneously, one keeps an eye on the classifier's error as applied to the test set. When the latter error starts to increase, it is time to stop and be satisfied with the classifier as it was a few steps back (early stopping). In another approach, one fully grows the classifier but subsequently, working bottom-up, prunes it back until the so-called generalization error (on a test set) no longer improves.
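
The 80/20 recipe can be sketched as follows in Python with scikit-learn; the dataset and the use of tree depth as the knob that is gradually increased are my own illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the data as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

best_depth, best_error = None, 1.0
for depth in range(1, 15):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    test_error = 1.0 - clf.score(X_test, y_test)
    # Growing the tree deeper keeps lowering the training error,
    # but once the held-out error starts to rise we have overfitted;
    # remember the depth with the lowest held-out error instead.
    if test_error < best_error:
        best_depth, best_error = depth, test_error

print(best_depth, best_error)
```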

More generally, one may try out several classifiers simultaneously. For this purpose, divide the available data into a training set, a validation set, and a test set. Then train the classifiers on the training set, choose between them by comparing their performance on the validation set, and characterize the performance of the chosen classifier by applying it to the test set.
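
A sketch of that protocol, again with scikit-learn (the two candidate classifiers and the split proportions are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Carve off 20% as the final test set; split the remainder into
# a training set (60% overall) and a validation set (20% overall).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

candidates = [DecisionTreeClassifier(random_state=0), SVC()]
for clf in candidates:
    clf.fit(X_train, y_train)

# Choose by validation performance; report on the untouched test set.
chosen = max(candidates, key=lambda clf: clf.score(X_val, y_val))
print(type(chosen).__name__, chosen.score(X_test, y_test))
```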

Note that most procedures in machine learning as just described are what its practitioners call "greedy": they select the local optimum at each step of construction; hence, global optimality is not guaranteed. Therefore, machine learning is not so much a science in search of the unique solution. It more resembles the art of engineering, which tries to find a workable solution; good judgment and intuition are needed to steer toward a good-enough model.

A further problem that needs to be taken into account is the "class imbalance problem." In many areas, the class values of the target variable are represented very unequally in the population. Think of transactions that amount to tax evasion, monetary fraud, or terrorist intentions: these make up only a tiny fraction of all transactions. Training on such an imbalanced dataset may produce a model that overfits to the majority of data representing bona fide transactions. In order to circumvent the problem, a conditio sine qua non is choosing an appropriate performance measure, ensuring that false negatives are given more weight than false positives. [Footnote 14] Besides, the main approach is to adjust the available training set in order to obtain a more balanced set. Either one deletes data points from the overrepresented class (undersampling) or adds data points from the underrepresented class (oversampling); for a recent overview, cf. Chawla (2010). The latter alternative, oversampling, can also be implemented in a more sophisticated fashion by artificially creating new data points located near the available minority data points (SMOTE, as originally proposed by Nitesh Chawla in Chawla et al. 2002).
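
Both remedies can be sketched briefly; the example below assumes the third-party imbalanced-learn package (imblearn) and uses made-up data in which fraudulent cases form a small minority.

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Made-up data: 95 bona fide cases (0) versus 5 fraudulent ones (1).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 95 + [1] * 5)
print("original:", Counter(y))

# Undersampling: delete data points from the overrepresented class.
X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("undersampled:", Counter(y_u))

# SMOTE: create synthetic data points near existing minority points.
X_o, y_o = SMOTE(random_state=0, k_neighbors=3).fit_resample(X, y)
print("oversampled:", Counter(y_o))
```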

A final accounting task that needs mentioning here relates back to my discussion of bias in underlying datasets. If such data are used inadvertently for model construction, chances are that the biases involved will be built straight into the model. Concerns of this kind have generated efforts towards "discrimination-free" or "discrimination-aware" modeling. [Footnote 15] At first sight, it would seem that simply deleting any sensitive dimension from the datasets would be a step forward. However, some of the remaining model variables may actually be correlated with it, allowing discrimination to continue. In consistent fashion, one may take the next step and eliminate all correlated dimensions as well. But this comes at a price: every deletion of a variable also deletes information valuable for the task of prediction.

In order to prevent this loss of information, practitioners of discrimination-aware modeling prefer to keep biased datasets intact. Instead, the models themselves and the way the training data are used are reconsidered. How can models be trained with a view to obtaining unbiased results? In the pre-processing stage, one may change the set of training data involved. Options to be considered are locally "massaging" the data in such a way that borderline cases are relabeled, and/or locally introducing "preferential sampling" that deletes and/or duplicates training instances (a sketch of massaging follows below). In the processing stage, one may develop models under non-discrimination constraints. In the post-processing phase, finally, one may suitably alter the classification rules obtained. Such deliberations about the circumvention of bias in modeling should become part and parcel of the accounting process concerning model construction.
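
By way of illustration, here is a much-simplified sketch of the "massaging" option: a preliminary ranker scores the training instances, after which the labels of two borderline cases are swapped so that the positive-label rates of the two groups move closer together. The data and all names are made up, and the sketch assumes both groups contain both labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: features, labels, and a sensitive
# attribute (1 = disadvantaged group).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=40) > 0).astype(int)
sensitive = rng.integers(0, 2, size=40)

# Rank all instances by the score of a preliminary model.
scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

y_massaged = y.copy()
# Promote the highest-scoring negative of the disadvantaged group ...
cand = np.where((sensitive == 1) & (y == 0))[0]
y_massaged[cand[np.argmax(scores[cand])]] = 1
# ... and demote the lowest-scoring positive of the advantaged group,
# nudging the groups' positive-label rates toward each other.
cand = np.where((sensitive == 0) & (y == 1))[0]
y_massaged[cand[np.argmin(scores[cand])]] = 0
```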

Notice that accounting on this score may benefit from developments in the fresh field of "algorithmic transparency." These include procedures to test, after the fact, whether models of machine learning suffer from "algorithmic discrimination." Testing consists of randomly changing the values of the sensitive attribute in the dataset (say, from male to female and vice versa). "Quantitative Input Influence" measures then allow an estimation of whether or not group properties (like race or gender) have undue influence on outcomes (Datta et al. 2016).
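
The full Quantitative Input Influence machinery is beyond a short example, but the underlying randomization idea can be sketched as follows (model, data, and encoding are made up): replace the sensitive column with random values and measure how often the model's decisions flip.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data whose column 0 is the sensitive attribute (0/1),
# with labels deliberately contaminated by that attribute.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 0] = rng.integers(0, 2, size=200)
y = (X[:, 1] + 0.8 * X[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X, y)

# Randomize the sensitive column and count changed decisions; a
# large flip rate signals undue influence of the sensitive attribute.
baseline = model.predict(X)
X_rand = X.copy()
X_rand[:, 0] = rng.integers(0, 2, size=200)
flip_rate = np.mean(model.predict(X_rand) != baseline)
print(f"fraction of decisions changed: {flip_rate:.2%}")
```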

Phase 3: Model Use

Upon completion, the model is ready to be used for making decisions: decisions about granting that loan or charging that insurance premium, or about whom to inspect for tax evasion or for suspicious activity, and the like. Such decisions can now be made with the assistance of the algorithm developed; by no means is it implied that they are to be taken in fully automated fashion. As a rule, there is an array of possibilities, from mainly human to fully automated decision-making. Depending on the particular context at hand, one or another solution may be optimal. In the context of camera surveillance (CCTV), for example, Macnish devotes a whole article to weighing the various options; in conclusion, he recommends a combination of manual and automated modes of making decisions (Macnish 2012). So I want to argue that, at the end of the day, proper accounting should provide a reasoned report on, and justification of, the chosen level of automation in decision-making.