The main graph (top) shows the changes in the ratio between the number of publications cited in the connectionist corpus (orange) and the corresponding number in the symbolic corpus (blue), both adjusted by the total number of publications in WoS. The additional graphs (bottom) represent the number of publications cited during a given period for each corpus.

To retrace this history, we present a very simple analytical framework which, within a vast array of heterogeneous and highly complex technologies, isolates a number of reference points allowing us to account simultaneously for the transformation of calculation infrastructures and for the different ways of critically analysing their performance. To look at the design of technical systems and their epistemic aims together, we posit that an “intelligent” machine must articulate a world, a calculator, and a target, based on different configurations. These notions refer to the functional framework within which the design of intelligent artefacts is typically broken down, under varied terminologies: “environment”/“inputs”/“data”/“knowledge base” (world), “calculation”/“program”/“model”/“agent” (calculator), and “objectives”/“results”/“outputs” (target). Predictive machines can thus be said to establish a calculator in the world by granting it a target. The devices designed throughout the history of AI equip the world, the calculator, and the target with varied and changing entities. They thus propose radically different ways of interrelating the architecture of these sets. The shift in AI research from symbolic machines towards connectionist machines is therefore the result not of a change in the history of ideas or of the validity of one scientific model over another, but of a controversy that led actors to profoundly shift, transform, and redefine the form given to their artefacts. The process that this analytical model allows us to be attentive to is a long historical reconfiguration of alliances and paradigms between competing scientific communities. This affects calculation techniques, but also and above all the form given to these machines, their objectives, the data that they process, and the questions that they address (Latour, 1987). To put it in a way that will become clearer throughout the article: while the designers of symbolic machines sought to insert both the world and the target into the calculator, the current success of connectionist machines is related to the fact that, almost in contrast, their creators empty the calculator so that the world can adopt its own target.

CYBERNETICS AND THE BEGINNINGS OF CONNECTIONISM

The origins of neural networks are found in the pioneering history of computer science and early cybernetics. Even though the term was only coined later, early cybernetics can indeed be considered “connectionist”⁷: it pursued the goal of mathematically modelling a neural network, set by neurophysiologist Warren McCulloch and logician Walter Pitts in 1943. To this day, that seminal article continues to be cited as the starting point of the connectionist journey, including in current deep learning articles. The chronology of scientific activity in AI (Figure 3) clearly demonstrates the pre-eminence of the connectionist approach during the early cybernetic period. McCulloch and Pitts’ first article proposed a formal model (Figure 4) in which a neuron takes variables as inputs, weights them, and sums the result; the neuron is activated if this sum exceeds a certain threshold.

Figure 4. Formal model of an artificial binary threshold neuron

This proposition was not formulated as pertaining to artificial intelligence — the term did not yet exist — but rather as a neurophysiology experimentation tool consistent with the biological knowledge of the time regarding the brain’s neural processes. It was rapidly associated with the idea of learning through the work of neuropsychologist Donald O. Hebb (1949), who showed that the repeated activation of one neuron by another via a given synapse increases its conductivity, and that this can be considered a form of learning. Biologically inspired, the formal neural model constituted one of the main points of reflection for cyberneticians at the time, and was to become the cornerstone of the calculator of the first “intelligent” machines (Dupuy, 2005).
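
To make the preceding description concrete, the following sketch (a minimal illustration, not code from the period) implements a binary threshold unit of the McCulloch–Pitts type together with a Hebbian-style weight update; the weights, inputs, threshold, and learning rate are invented for the example.

```python
import numpy as np

def threshold_neuron(inputs, weights, threshold):
    """McCulloch-Pitts style unit: fire (1) if the weighted sum exceeds the threshold."""
    return 1 if np.dot(inputs, weights) > threshold else 0

def hebbian_update(weights, inputs, output, rate=0.1):
    """Hebb-style rule (sketch): strengthen the connections of inputs that are
    active at the same time as the output neuron."""
    return weights + rate * output * np.asarray(inputs, dtype=float)

# Illustrative values, chosen only for demonstration
weights = np.array([0.4, 0.6, -0.2])
inputs = [1, 1, 0]
out = threshold_neuron(inputs, weights, threshold=0.5)   # -> 1 (0.4 + 0.6 > 0.5)
weights = hebbian_update(weights, inputs, out)           # active connections reinforced
print(out, weights)
```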

The close coupling between the world and the calculator

The characteristic feature of the architecture of these machines is that their coupling with the environment (the world) is so organic that it is not necessary to grant the calculator its own agency. The goal of cybernetics was to create nothing more than a black box of learning and association, whose target is regulated by measuring the deviation (i.e. the error) between the world and the machine’s behaviour. This representation of intelligent machines was initially based on a materialistic conception of information that differed from the symbolic conception which would prevail at the time of the emergence of artificial intelligence (Triclot, 2008). As a form of order opposed to entropy, information is a signal rather than a code. With information theory as developed by Shannon (1948), information did not have to be associated with a given meaning; it was conceived of as pure form, independent of all other considerations, limited to “expressing the magnitude of the order or structure in a material arrangement” (Triclot, 2008).

Cybernetic machines defined the target of their calculation based only on a comparison of inputs from and outputs towards the world. Norbert Wiener’s (1948) predictive device applied to anti-aircraft fire control was based on continuously updating the trajectory estimate, comparing the real trajectory of the target with prior estimates. The device had to converge towards the best solution on the basis of the available data; this data informed, corrected, and oriented the calculator. Negative feedback — i.e. incorporating the measurement of output error as a new input into an adaptive system — would thus constitute the main axiom of cybernetics. It allowed technical systems to be considered in a strictly behaviourist form, echoing the behaviourist psychology of the time (Skinner, 1971). Just like living organisms, machines inductively adapted to signals from the environment, with a coupling so tight that it required no internal representations or intentions; in short, no “intelligence” of their own. When Arturo Rosenblueth, Norbert Wiener, and Julian Bigelow (1943) formulated the founding principles of cybernetics, they imagined a self-correcting machine capable, through probabilistic operators, of modifying or adopting end goals that were not “internal” but rather produced by adapting its behaviour according to its own mistakes. Rigorously “eliminativist”, the design of cybernetic machines could do away with the notions of intention, plans, or reasoning (Galison, 1994). Theorizing the functioning of one of the most famous of these machines, the Homeostat, Ross Ashby (1956: 110) described the calculating portion of the environment/machine system as a “black box”⁸. The configuration of cybernetic prediction machines so tightly coupled the world and the calculator that their target was to optimize the adaptive operation of the system that they formed together. The cybernetic machines of the 1950s (Homeostat, Adaline, etc.) were no more than laboratory artefacts with very limited aims and capacity; by contrast, deep learning calculators would eventually and much more efficiently come to offer a black box around a world of data, turning outputs into inputs.
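
A minimal sketch of this negative-feedback principle, assuming a simple scalar tracking task with an invented gain: at each step the output error is re-injected to correct the next estimate, so the device follows the target without holding any internal representation of it.

```python
def feedback_tracker(signal, gain=0.5):
    """Negative feedback (sketch): each new estimate is corrected by a
    fraction of the previous output error, as in an adaptive servo loop."""
    estimate = 0.0
    history = []
    for observed in signal:
        error = observed - estimate   # deviation between the world and the machine
        estimate += gain * error      # the error becomes a new input to the system
        history.append(round(estimate, 3))
    return history

# Invented target trajectory, for illustration only
trajectory = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
print(feedback_tracker(trajectory))   # the estimates progressively close the gap
```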

The Perceptron and connectionist machines

Particularly in the field of visual recognition, McCulloch and Pitts’ neural networks provided a highly suitable solution for equipping the calculator of the first adaptive machines. At the end of the 1950s, these machines underwent an important development that contributed to the first wave of public interest in brain machines⁹. The connectionist approach inspired the work of Bernard Widrow (Adaline) and Charles Rosen at Stanford (Shakey), as well as the Pandemonium, Oliver Selfridge’s hybrid device (1960). However, it was the Perceptron initiative (1957–1961) of Frank Rosenblatt, a psychologist and computer scientist at Cornell University, that embodied the first true connectionist machine and became the emblem of another way of endowing a calculation artefact with intelligent behaviour. This device, designed for the purpose of image recognition, received much attention and obtained a large amount of financing from the US Navy (ONR). Frank Rosenblatt’s machine was inspired by McCulloch and Pitts’ formal neural networks, but added a machine learning mechanism. In the Perceptron’s superimposed layers, the input neurons simulated retinal activity and the output neurons classified the “features” recognized by the system; only the hidden, intermediate layers were capable of learning. Contrary to McCulloch and Pitts’ logical — and “top-down” — organization, Frank Rosenblatt advocated a “bottom-up” approach that let the learning mechanism statistically organize the network structure. Following an initial software-based implementation, Frank Rosenblatt undertook the construction of the sole hardware version of the Perceptron: the Mark I, which consisted of 400 photoelectric cells connected to neurons. The synaptic weights were encoded in potentiometers, and changes in weight during learning were made by electric motors. However, the concrete implementation of these learning machines remained very rare due to the technical limitations of the time and, above all, was halted by the development of an AI exploring an entirely different direction of research: the “symbolic” school.
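
As a rough illustration of the error-driven weight adjustment performed by Rosenblatt’s machine, the sketch below trains a single threshold unit with the classical perceptron learning rule on a toy, linearly separable problem (the logical AND); the data and learning rate are invented for the example and bear no relation to the Mark I hardware.

```python
import numpy as np

def perceptron_train(X, y, epochs=20, rate=0.1):
    """Classical perceptron rule (sketch): nudge the weights whenever the
    predicted class differs from the desired one."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if np.dot(w, xi) + b > 0 else 0
            w += rate * (target - pred) * xi     # error-driven correction
            b += rate * (target - pred)
    return w, b

# Toy, linearly separable data: learn the logical AND (illustrative only)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, y)
print([1 if np.dot(w, xi) + b > 0 else 0 for xi in X])   # -> [0, 0, 0, 1]
```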

SYMBOLIC AI

When the main instigators of the founding Dartmouth meeting, John McCarthy and Marvin Minsky, coined the term “artificial intelligence” (AI) in 1956, their intention was to oppose the connectionism of early cybernetics (Dupuy, 2005)¹⁰. They very explicitly wanted to give machines a goal other than adaptively adjusting inputs and outputs. The purpose of “symbolic”¹¹ AI was to implement rules in computers via programs, so that high-level representations could be manipulated. The emergence of AI thus constituted a veritable “anti-inductive” movement in which logic had to counter the “chimera” of the connectionist approach, which was accused of refusing to define information processing independently of physical processes and, in so doing, to propose a theory of the mind (Minsky, 1986)¹². As the chronology shows (Figure 3), the symbolic approach prevailed in scientific production in the AI field from the mid-1960s up until the early 1990s.

It was initially informed by the work of Herbert Simon, carried out alongside Allen Newell at RAND in the 1950s. In 1956 they wrote the first program intended to simulate machine decision-making, the Logic Theorist, with the announcement — which would become a typical habit among AI researchers — that “over Christmas, Allen Newell and I invented a thinking machine” (McCorduck, 2004: 168). Modelling reasoning was the central feature of this first wave of AI, which spanned the period from 1956 until the early 1970s. This field of research soon consisted of a small group from MIT (Minsky, Papert), Carnegie Mellon (Simon, Newell), and Stanford University (McCarthy). Despite internal differences, this closed circle established a monopoly over defining AI issues and obtained the majority of the (large) funds and access to huge computer systems. From 1964 to 1974 they received 75% of the funding for AI research granted by ARPA and the Air Force (Fleck, 1982: 181), and benefited from the rare calculation capacities needed for their projects. At ARPA, they enjoyed the unfailing support of Joseph Licklider, who funded symbolic projects while justifying them in terms of their hypothetical military applications.

This seizure of power by the symbolic school over the then fuzzy and very open definition of intelligent machines took on the form of an excommunication, pronounced in the book that Marvin Minsky and Seymour Papert (1969) dedicated to demonstrating the ineffectiveness of neural networks. At the beginning of the 1960s, the connectionist approaches inherited from early cybernetics were enjoying a certain vogue, driven by the media success of Frank Rosenblatt’s Perceptron. Even though Marvin Minsky had himself developed neural networks as a student (Snarc, 1951), he wished to establish the mathematical pre-eminence of symbolic AI over the “mystical” nature, “surrounded by a romantic atmosphere”, of the connectionists’ self-organized and distributed systems (Minsky and Papert, 1969, note 13). Targeting a limited and simplified single-layer version of the Perceptron, he and Seymour Papert demonstrated that such neural networks were incapable of calculating the XOR (exclusive OR) function and concluded that they therefore had no future. As Mikel Olazaran (1996) shows, Minsky and Papert’s strategy was to write the pre-eminence of the symbolic school into the very definition of artificial intelligence. Even though the book’s effects likely went beyond its authors’ intentions, its consequences would prove lasting. Following the premature death of Frank Rosenblatt in 1971, neural networks were abandoned, their funding was cut, and the work that would keep them alive was carried out outside of the AI field.
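
The limitation that Minsky and Papert exploited can be made explicit in a few lines. Assuming the threshold-unit convention used earlier (output 1 when the weighted sum exceeds the threshold, 0 otherwise), no pair of weights can reproduce XOR, as the following standard derivation shows:

```latex
\begin{aligned}
\mathrm{XOR}(0,0)=0 &\;\Rightarrow\; 0 \le \theta \\
\mathrm{XOR}(1,0)=1 &\;\Rightarrow\; w_1 > \theta \\
\mathrm{XOR}(0,1)=1 &\;\Rightarrow\; w_2 > \theta \\
\mathrm{XOR}(1,1)=0 &\;\Rightarrow\; w_1 + w_2 \le \theta
\end{aligned}
```

Adding the second and third lines gives w₁ + w₂ > 2θ ≥ θ (the first line imposes θ ≥ 0), which contradicts the fourth; introducing a hidden layer removes the obstacle, which is precisely what multi-layer networks would later exploit.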

A space to manipulate symbols

The main feature of the architecture of symbolic machines is that they break ties with the world and open up an independent space of reasoning within their calculator. The so-called “von Neumann” configuration of the new computers implemented in the 1950s established this very space. Whereas the ENIAC (1946) was designed to calculate ballistic tables by “programming” the machine directly in the hardware, the EDVAC project (1952) separated the logical operations carried out on symbols (software) from the physical structure of the machines (hardware) (von Neumann, 1945). The program was thus granted its own space, independent of the physical operation of the computer. The machine became a “universal automatic computer with a centralized program” (Goldstine, 1972: 198–199), and programming, independent of hardware processes, could henceforth be done “on paper”, as Alan Turing (2004: 21) put it. Paul Edwards (1996) shows how, with the appearance of sophisticated programming languages close to human language and subsequently compiled into machine language made of 0s and 1s, the physical machine could be separated from the symbolic machine. Artificial intelligence could thus be considered the science of the mind in the machine. One of AI’s first contributions to computer science was precisely the design of programming languages, the most famous of which was LISP, developed by John McCarthy in 1958, which became fully identified with AI research due to its capacity for logical abstraction¹³.

As soon as it was created in the calculator, this programming space was available to manipulate symbols. AI was born in the same year as cognitive science (1956), and together the two fields would shape the efforts to give computers a capacity for reasoning (Gardner, 1985). Contrary to behaviourist psychology, which inspired the adaptive “black boxes” of cybernetics, cognitive science’s aim was to bestow logical and abstract capabilities on machines. And unlike connectionism, these fields showed no interest in human physiology and behaviour, paying attention only to reasoning. The computational theory of the mind established a duality, positing that mental states could be described both in a physical form, as a set of physical information-processing instances, and in a symbolic form, as mechanically executable operations of comparing, ranking, or inferring meaning (Andler, 2016). The “physical symbol system” hypothesis states that the mind does not directly access the world but rather consists of internal representations of the world, which can be described and organized in the form of symbols inserted in programs.

A “toy” world

The founders of AI did their utmost to separate data from the sensory world and human behaviour¹⁴. The world of symbolic machines was a theatre backdrop created by the machine in order to project the syntax of its logical rules onto it: chess or checkers games (Arthur Samuel), geometry theorems (Herbert Gelernter’s Geometry Theorem Prover), video game settings. The emblematic projects of this first wave of AI were characterized by the invention of simplified spaces of forms to be recognized and moved around, such as Marvin Minsky’s MicroWorlds (MAC) or Terry Winograd’s famous SHRDLU program. Just like the limited space with a few rooms and objects in which the Shakey robot was supposed to move around, these were fictitious, “toy”¹⁵ spaces in which objects could easily be associated with the syntax of the rules, calculated to produce relevant system behaviour.

If the calculator projects its own world, this is also because its goal is to contain its own target. This is how this AI was able to claim to be “strong”: the goals given to the system were its own and could be deduced from a sort of reasoning incorporated into the logical inferences made by the models. The highly ingenious languages invented to shape the syntax of these systems are all inferential. They organize into stages the elementary processing operations transforming entities, each of which is an inference of a correct calculation (Andler, 1990: 100): a decision tree, an intermediate chain of reasoning, a breakdown of goals and sub-goals, a means-ends analysis. The rational target of the calculation is enclosed within the program’s syntax. The machine must solve the problem, find the true or correct solution, and make the right decision¹⁶. It was therefore not necessary to give it the correct response (as learning techniques would later do with examples), because the rules were supposed to lead it there, following the inferences of the calculator. Because the syntax of the reasoning and the semantics of the manipulated objects were both constructed within the calculator, it was possible to conflate them in correct and more or less deterministic reasonings — but at the expense of an artificial design in which the “intelligent” world was the one implemented by the designer: a regulated, precise, and explicit world, so that reasoning could be its target. While these machines were capable of genuine performances within a closed environment, they quickly proved to be blind and stupid as soon as they were faced with an external world.

The first AI winter

At the beginning of the 1970s AI entered its first winter, which froze both the symbolic and the connectionist projects. The two streams had both made many promises, and the results were far from meeting expectations. On the connectionist side, Frank Rosenblatt’s Perceptron had been harmed by the media exposure in which its proponent — with the complicity of the US Navy — had liberally participated. Among a plethora of media headlines enthusiastic about the imminent arrival of intelligent machines, the New York Times announced: “The Navy last week demonstrated the embryo of an electronic computer named the Perceptron which, when completed in about a year, is expected to be the first non-living mechanism able to ‘perceive, recognize and identify its surroundings without human training or control’”¹⁷. However, it was especially within symbolic AI, led by Herbert Simon and Marvin Minsky, that the exaggerated prophecies and announcements quickly proved disappointing. Giddy with the researchers’ promises, the army and DARPA had thought that they would soon have machines to translate Russian texts, robots for infiltrating enemy lines, or voice command systems for tank and plane pilots, but discovered that the “intelligent” systems announced were only artificial games played in synthetic environments. In 1966 the National Research Council cut the funding for automated translation, a decision foreshadowing the cascade of withdrawals by AI’s financial and academic supporters. At the beginning of the 1970s, Minsky and Papert’s MicroWorlds project at MIT experienced difficulties and lost its support. At Stanford, the Shakey robot no longer received military financing, and the DARPA SUR speech recognition program benefiting Carnegie Mellon was abruptly shut down. In England, the highly critical Lighthill report of 1973 also played a role in stopping public funding for AI (Crevier, 1997: 133–143).

With the funding crisis, increasingly visible criticism began to be levelled at the very undertaking of logically modelling reasoning. In 1965, RAND commissioned Hubert Dreyfus to write a report on AI, which he entitled “Alchemy and Artificial Intelligence” and in which he made a vigorous argument that he later elaborated on in the first edition of his successful book What Computers Can’t Do (Dreyfus, 1972). Bitter and intense, the controversy between the AI establishment and Hubert Dreyfus considerably undermined the idea that rational rules could make machines “intelligent”. The explicit definition of logical rules entirely ignored the corporeal, situated, implicit, embodied, collective, and contextual dimensions of human perception, orientation, and decision-making¹⁸. Criticism was also put forward by the first generation of “renegades”, who became significant opponents of the hopes that they themselves had expressed; for example Joseph Weizenbaum (1976), the creator of ELIZA, and Terry Winograd, the disappointed designer of SHRDLU (Winograd and Flores, 1986). “Intelligent” machines reasoned according to elegant rules of logic, a deterministic syntax, and rational objectives, but their world did not exist.

THE SECOND WAVE OF AI: A WORLD OF EXPERTS

AI nevertheless experienced a second spring during the 1980s, when it proposed a significant modification to the architecture of symbolic machines under the name of “expert systems”¹⁹. This renaissance was made possible by access to more powerful calculators allowing far bigger volumes of data to be loaded into computer memory. The “toy” worlds could thus be replaced with a repertoire of “specialized knowledge” taken from expert knowledge²⁰. The artefacts of second-generation AI interacted with an external world that had not been designed and shaped by programmers. It was now composed of knowledge that had to be obtained from specialists in different fields, transformed into a set of declarative propositions, and formulated in a language that was as natural as possible (Winograd, 1972), so that users could interact with it by asking questions (Goldstein and Papert, 1977). This externality of the world to be calculated led to a modification in the structure of symbolic machines, separating what would henceforth constitute the calculator, the “inference engine”, from a series of possible worlds, the knowledge bases or “production systems”, according to the terminology proposed by Edward Feigenbaum for DENDRAL, the first expert system, which could identify the chemical components of materials. The data that supplied these knowledge bases consisted of long, easily modifiable and revisable lists of rules of the type “IF … THEN” (for example: “IF FEVER, THEN [SEARCH FOR INFECTION]”), which were dissociated from the mechanism that decided when and how to apply the rules (the inference engine). MYCIN, the first implementation of a knowledge base of 600 rules aimed at diagnosing infectious blood diseases, was the starting point, in the 1980s, of the development of knowledge engineering, which would essentially be applied to scientific and industrial contexts: XCON (1980) helped the clients of DEC computers configure them; DELTA (1984) identified locomotive breakdowns; PROSPECTOR detected geological deposits, etc. (Crevier, 1997, starting at p. 233). Large-scale industries developed AI teams within their organizations; researchers embarked on the industrial adventure; investors rushed towards this new market; companies grew at an exceptional rate (Teknowledge, Intellicorp, Inference), always with the faithful support of ARPA (Roland and Shiman, 2002); and the media seized on the phenomenon, once again announcing the imminent arrival of “intelligent machines” (Waldrop, 1987).
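
A minimal sketch of the separation described above between a declarative knowledge base and an inference engine, with invented medical-style rules in the spirit of the “IF FEVER, THEN [SEARCH FOR INFECTION]” example; the rules and facts are illustrative and are not taken from MYCIN or DENDRAL.

```python
# Knowledge base: declarative IF ... THEN rules, kept separate from the engine
rules = [
    ({"fever"}, "suspect_infection"),
    ({"suspect_infection", "blood_culture_positive"}, "bacterial_infection"),
    ({"bacterial_infection"}, "recommend_antibiotics"),
]

def forward_chain(facts, rules):
    """Inference engine (sketch): keep firing any rule whose conditions hold
    until no new fact can be derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

print(forward_chain({"fever", "blood_culture_positive"}, rules))
# Deduces: suspect_infection, bacterial_infection, recommend_antibiotics
```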

The sanctuaries of the rules

Faced with criticism of the rigid computationalism of the first era, which had invented an abstract universe without realistic ties to the world, AI research undertook a top-down process to complete, intellectualize, and abstract the conceptual systems intended to manipulate the entities of these new knowledge bases. The symbolic movement thus strengthened its goal of rationalization by putting excessive emphasis on modelling in order to encompass a variety of contexts, imperfections in reasoning, and the multiplicity of heuristics, thus moving closer to the user’s world through the intermediary of experts. This dedication to programming the calculator was characterized by greater flexibility in logical operators (syntax) and the densification of the conceptual networks used to represent knowledge (semantics). The movement observed in AI research sought to de-unify the central, generic, and deterministic mechanism of computational reasoning in order to multiply, decentralize, and probabilize the operations carried out on knowledge. Borrowing in particular from discussions around the modularity of the mind (Fodor, 1983), the systems implemented in calculators broke down the reasoning process into elementary blocks of interacting “agents”, which could independently have different ways of mobilizing knowledge and inferring consequences from it²¹. It was thus within the semantic organization of the meanings of the heuristics taken from knowledge bases that the main innovations of the second wave of symbolic AI were designed. They used languages (PROLOG, MICROPLANNER, CYCL) and intellectual constructions of a rare degree of sophistication, for example the principle of lists, the notion of “conceptual dependency” detailed by Roger Schank, Ross Quillian’s semantic networks, and so on. The unfinished masterpiece of these multiple initiatives was Douglas Lenat’s Cyc, a general common-sense knowledge ontology based on an architecture of “fundamental predicates”, “truth functions” and “micro-theories”, which everyone in the AI community admired but no one used.

The growing volume of incoming knowledge and the increasing complexity of the networks of concepts intended to manipulate it drove another large-scale shift: logical rules became conditional and could be “probabilized”. In contrast to the rational and logicist approach represented by John McCarthy, from the 1970s Marvin Minsky and Seymour Papert defended the idea that “the dichotomy right/wrong is too rigid. In dealing with heuristics rather than logic the category true/false is less important than fruitful/sterile. Naturally, the final goal must be to find a true conclusion. But, whether logicians and purists like it or not, the path to truth passes mainly through approximations, simplifications, and plausible hunches which are actually false when taken literally” (Minsky and Papert, 1970: 41). Among the thousands of rules formulated by the experts, it is possible, based on a fixed premise (IF…), to estimate the probability that the second proposition (THEN…) is true. The probabilization of knowledge rules meant that the deterministic form of inferential reasoning, which had experienced its moment of glory during the first age of AI, could be relaxed. By becoming more realistic, diverse, and contradictory, the knowledge entering prediction machines also introduced probability into them (Nilsson, 2010: 475). When the “fruitful/sterile” pair replaced the “true/false” pair, the target providing the goal for the calculator appeared to be less a logical truth than an estimate of the correctness, relevance, or verisimilitude of the responses provided by the system. However, this estimate could no longer be handled essentially by the rules of the calculator; it had to be externalized towards a world composed of experts, who were mobilized to provide examples and counterexamples for machine learning mechanisms²².
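
To illustrate this relaxation of true/false inference, the following sketch attaches a confidence weight to each rule and combines concurring pieces of evidence in a simple, MYCIN-like fashion; the rules, weights, and combination formula are illustrative assumptions, not the historical MYCIN certainty-factor calculus.

```python
# Each rule now carries a confidence in (0, 1] instead of being strictly true or false
weighted_rules = [
    ({"fever"}, "infection", 0.6),
    ({"fever", "elevated_white_cells"}, "infection", 0.8),
]

def combine(confidences):
    """Combine independent pieces of evidence (sketch): belief = 1 - prod(1 - c)."""
    belief = 0.0
    for c in confidences:
        belief = belief + c * (1 - belief)
    return belief

def estimate(facts, rules):
    """Return the degree of belief in each conclusion supported by the observed facts."""
    support = {}
    for conditions, conclusion, confidence in rules:
        if conditions <= facts:
            support.setdefault(conclusion, []).append(confidence)
    return {concl: round(combine(cs), 2) for concl, cs in support.items()}

print(estimate({"fever", "elevated_white_cells"}, weighted_rules))
# {'infection': 0.92}: neither certainly true nor false, only more or less plausible
```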

With the probabilization of inferences, machine learning techniques penetrated deeper into the AI field in order to carry out tasks that had become impossible for programmers to perform “by hand” (Carbonell et al., 1983). Following the work of Tom Mitchell (1977), learning methods could be described as the search for the best model within a space of hypotheses — or “versions” — automatically generated by the calculator. With expert systems, this space of hypotheses was highly structured by the nature of the input data, i.e. the “knowledge”. The learning mechanism “explores” the multiple versions of models produced by the calculator in search of a consistent hypothesis, making use of logical inferences to build reasonings (concept generalization, subsumption, inverse deduction). The statistical methods used to eliminate candidate hypotheses also matured and developed, producing inference-based reasoning such as decision trees (which subsequently gave rise to random forests), “divide and conquer” techniques, or Bayesian networks, which served to order dependencies between variables with a causalist formalism (Domingos, 2015). Even when automated, the discovery of a target function retained the idea that models are hypotheses and that, even though machines no longer applied a given type of deductive reasoning, they chose the best possible reasoning from among a set of potential reasonings. However, starting in the early 1990s, a change in the nature of the data constituting the calculator’s input world led to a shift in the field of machine learning. There was more data, it was no longer organized in the form of labelled variables or interdependent concepts, and it soon lost its intelligibility as it became numerical vectors (infra). No longer possessing a structure, data could only be apprehended in the form of statistical proximities. There was consequently a shift in the machine learning field from “exploration-based” methods to “optimization-based” methods (Cornuéjols et al., 2018, p. 22), which would tear down the sanctuaries of the rules to the benefit of mass statistical calculations.
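
As a concrete instance of the “exploration-based” family evoked above, the short example below fits a small decision tree on invented, expert-style data using the scikit-learn library; the features, labels, and depth limit are illustrative choices.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented, structured data in the spirit of expert rules: [fever, cough] -> diagnosis
X = [[1, 0], [1, 1], [0, 1], [0, 0], [1, 1], [0, 0]]
y = ["infection", "infection", "cold", "healthy", "infection", "healthy"]

# The tree searches over splits on intelligible variables, yielding rule-like branches
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["fever", "cough"]))
print(tree.predict([[1, 0]]))   # -> ['infection']
```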

As the volume and realism of the data to be calculated steadily expanded, the inductive mechanism changed direction within the calculator. If the data no longer provided information on the relationships between one another (categories, dependencies between variables, conceptual networks), then in order to identify the target function, the inductive mechanism had to rely on the final optimization criterion to arrive at the correct partitioning (Cornuéjols et al., 2018: 22). The transformation in the composition of the world to be learned led researchers to modify the inductive method implemented and, by doing so, to propose an entirely different architecture for predictive machines. This shift accelerated with neural networks (infra), but the turn had already been prepared within the world of machine learning. Because data were increasingly less “symbolic”, the inductive mechanism no longer searched for the model in the structure of the initial data, but rather in the optimization factor (Mazières, 2016). The calculation target was no longer internal to the calculator but rather a value that the world assigned to it from outside — and which was very often “human”, as demonstrated by all of the manual work to label data: does this image contain a rhinoceros (or not)? Did this user click on this link (or not)? The answer (the optimization criterion) must be supplied to the calculator along with the data so that it can discover an adequate “model”. The new machine learning methods (SVMs, neural networks) thus proved to be more effective at the same time as they became unintelligible, as Leo Breiman (2001), the inventor of decision trees, emphasized in a provocative article on the two cultures of statistical modelling.

The magnificent sanctuaries erected by the builders of expert systems did not fulfil their promises. They soon proved to be extremely complex and very limited in their performance. The highly dynamic market that had developed in the mid-1980s suddenly collapsed and promising AI companies went bankrupt, in particular because to sell expert systems, they also had to sell specialized workstations called “LISP machines” at exorbitant prices, at a time when the PC market was on the rise (Markoff, 2015: 138 onwards). The decrease in cost and increase in calculation capacity during the 1980s made powerful calculators accessible to the heterodox and deviant schools of thought that had been excluded from the funding of large computer science projects as a result of the symbolic school’s monopoly (Fleck, 1987: 153). The control of the small circle of influential universities over the “symbolic” definition of AI became weaker, given that expert systems produced only very limited results in the fields of voice synthesis, shape recognition, and other sectors. Symbolic AI was so weak at the beginning of the 1990s that the term almost disappeared from the research lexicon. Creating infinite repositories of explicit rules to convey the thousands of subtleties of perception, language, and human reasoning was increasingly seen as an impossible, unreasonable, and inefficient task (Collins, 1992; Dreyfus, 2007).

THE DISTRIBUTED REPRESENTATIONS OF DEEP LEARNING

It was in this context, at the end of the depression phase that had begun in the late 1960s, that connectionist approaches experienced a comeback in the 1980s and 1990s, with an immense amount of theoretical and algorithmic creativity. Following a meeting in June 1979 in La Jolla (California), organized by Geoff Hinton and James Anderson, an interdisciplinary research group composed of biologists, physicists, and computer scientists once again turned its attention to the massively distributed and parallel nature of mental processes in order to find an alternative to classic cognitivism. This group acquired real visibility in 1986 with the publication of two volumes of research under the name Parallel Distributed Processing (PDP), the term chosen to avoid the negative reputation of “connectionism” (Rumelhart et al., 1986b). As opposed to the sequential approaches of computational and symbolic reasoning, PDP explored the micro-structures of cognition, once again using the metaphor of neurons to design a counter-model with original properties: elementary units were linked together via a vast network of connections; knowledge was not statically stored but resided in the strength of the connections between units; these units communicated with one another via a binary activation mechanism (“the currency of our system is not symbols but excitation and inhibition”, p. 132); these activations took place all the time, in parallel, and not following the stages of a process; there was no central control over flows; one sub-routine did not trigger the behaviour of another, but sub-systems instead modulated the behaviour of other sub-systems by producing constraints that were factored into the calculations; and the operations carried out by the machine resembled a relaxation system in which the calculation iteratively proceeded by approximations to satisfy a large number of weak constraints (“the system should be thought of more as settling into a solution than calculating a solution”, p. 135). The connectionists’ device did create internal representations, and these representations could be high-level, but they were “sub-symbolic”, statistical, and distributed (Smolensky, 1988). As this brief summary conveys, the connectionist approach was not a simple method but rather a highly ambitious intellectual construction intended to totally overturn computational cognitivism:

I think in the early days, back in the 50s, people like von Neumann and Turing didn’t believe in symbolic AI. They were far more inspired by the brain. Unfortunately, they both died much too young and their voice wasn’t heard. In the early days of AI, people were completely convinced that the representations you needed for intelligence were symbolic expressions of some kind, sort of cleaned-up logic where you can do non-monotonic things, and not quite logic, but like logic, and that the essence of intelligence was reasoning. What has happened now is that there’s a completely different view, which is that what a thought is, is just a great big vector of neural activity. So, contrast that with a thought being a symbolic expression. I think that the people who thought that thoughts were symbolic expressions just made a huge mistake. What comes in is a string of words and what comes out is a string of words, and because of that, strings of words are the obvious way to represent things. So, they thought what must be in between was a string of words, or something like a string of words. And I think what’s in between is nothing like a string of words. […] Thoughts are just these great big vectors and these big vectors have causal powers; they cause other big vectors, and that’s utterly unlike the standard AI view.²³

While these epistemic references have lost their edge for today’s pragmatic users of neural networks, who never experienced the exclusion and mockery to which their predecessors were subjected, they were a constant driver of the unrelenting pursuit of the connectionist project. What had to be inserted between the strings of words coming in and those going out was not a model programmed by a logician’s mind but a network of elementary entities that adapted its coefficients to the inputs and outputs. To the extent possible, the network had to “do this on its own”, and that required many artefacts.

Reconfiguring connectionism based on algorithms

In the early 1980s, in line with the work of John Hopfield, who proposed a revised version of the Perceptron model giving each neuron the possibility of updating its values independently, the physicist Terry Sejnowski and the English psychologist Geoff Hinton developed new multi-layered architectures for neural networks (called Boltzmann machines). They also designed NETtalk, a system with three layers of neurons and 18,000 synapses that succeeded in transforming text into spoken phrases. However, the true turning point in this re-emergence was the creation of an algorithm called stochastic gradient back-propagation (“backprop” for short), which made it possible to compute the networks’ weights (Rumelhart et al., 1986a). Contradicting the criticism of Minsky and Papert (1969), the authors showed that when networks are given multiple layers they can in fact be trained, as the additional layers of neurons make it possible for them to learn non-linear functions. The algorithm takes the derivative of the network’s loss function and “propagates” the error backwards to correct the coefficients in the lower layers of the network²⁴. As in cybernetic machines, the output error is thus “propagated” towards the inputs (Figure 5).

Figure 5. Operation of a simple neural network
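
A compact sketch of back-propagation with stochastic gradient descent on a two-layer network, written in plain NumPy; the architecture, learning rate, number of steps, and the XOR-style toy data are illustrative choices rather than the historical experiments of Rumelhart et al.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
add_bias = lambda a: np.hstack([a, np.ones((a.shape[0], 1))])  # constant input acting as a bias

# Toy task (illustrative): XOR, out of reach of a single-layer perceptron
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(3, 4))   # input (+ bias) -> 4 hidden units
W2 = rng.normal(size=(5, 1))   # hidden (+ bias) -> output

for step in range(20000):
    i = rng.integers(len(X))                       # stochastic: one example at a time
    x, t = add_bias(X[i:i + 1]), y[i:i + 1]
    h = add_bias(sigmoid(x @ W1))                  # hidden activations
    out = sigmoid(h @ W2)                          # network output
    # Back-propagation: the output error is pushed back through the layers
    d_out = (out - t) * out * (1 - out)
    d_h = (d_out @ W2[:-1].T) * h[:, :-1] * (1 - h[:, :-1])
    W2 -= 0.5 * h.T @ d_out                        # gradient step, layer by layer
    W1 -= 0.5 * x.T @ d_h

preds = sigmoid(add_bias(sigmoid(add_bias(X) @ W1)) @ W2)
print(np.round(preds, 2))                          # should approach [0, 1, 1, 0]
```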

With the existence of a general-purpose algorithm serving to optimize any type of neural network, the 1980s and 1990s were a remarkable period of inventiveness that strongly influenced the re-emergence of connectionism. One of the first successes was Yann LeCun’s application of these networks to zip code recognition at AT&T Bell Labs (LeCun et al., 1989), which “invented” the convolution technique. Using the US Postal Service database, he succeeded in training a multi-layer network to recognize the zip code numbers written on packages. His approach became one of the first widespread business applications of neural networks, initially in the banking sector (verification of cheque amounts) and the postal sector. This was followed by a series of proposals to integrate a greater number of hidden layers, to complexify the map of connections (encoders), to diversify activation functions (ReLU), to integrate memory into network layers (recurrent networks and LSTMs), to make learning unsupervised or supervised depending on the part of the network (belief networks), and so on (Kurenkov, 2015). In a highly creative way, numerous architectures wiring the relationships between neurons differently were put to the test in order to explore their properties.

“They might not be convex but they’re more effective!”

Even though these algorithms laid the foundations of the majority of the approaches now referred to as deep learning, their invention was not immediately crowned with success. From 1995 to 2007, institutional support became very rare, papers were refused at conferences, and the results obtained remained limited. “They went through a colossal winter”, a computer vision researcher says. “The truth is that, at the time, nobody could get those machines to work. There were five laboratories in the world that knew how, but we couldn’t manage to train them”²⁵. The researchers maintaining these techniques around Geoff Hinton, Yann LeCun, and Yoshua Bengio were a small, isolated — but cohesive — group, whose exclusive support came from the Canadian Institute for Advanced Research (CIFAR). Their situation became even more difficult in 1992, faced with the emergence of an original learning technique: support-vector machines, also called “kernel methods”, which proved to be very effective on small datasets. Already exiled from the artificial intelligence community, connectionists once again found themselves on the fringes of the machine learning community.

At the time, if you said that you were making a neural network, you couldn’t publish a paper. It was like that up until 2010, a has-been field. I remember that one time, LeCun was at our lab as a guest professor, and we had to make the effort of eating with him. Nobody wanted to go. It was bad luck, I swear. He would cry, his publications were refused at the CVPR, his methods weren’t trendy, it wasn’t sexy. So people gravitated towards what was popular. They gravitated towards kernels, SVM machines. And LeCun would say: “I have a 10-layer neural network and it does the same thing”. Then we would say, “Are you sure? What’s new?” Because once you have a neural network, even though it might have 10 layers this time, it doesn’t work any better than the last one. It sucked! Then he would say, “Yeah, but there isn’t as much data!”.²⁶

One argument constantly appears in the criticism levelled at the rare proponents of neural networks:

They [SVM proponents] would always say, “they [neural networks] aren’t convex, they’re just a shortcut”. That’s all that came out of their mouths. We would submit papers and they’d say, “they’re not convex!” Maths wizards, obsessed with optimization, who’d never seen anything else in their life! It was like that for years. But we didn’t give a damn.²⁷

Due to their non-linear nature²⁸, neural networks could not guarantee that the global minimum had been found during the loss function optimization phase; the optimization could just as well converge towards a local minimum or a plateau²⁹. From 2005 to 2008, a veritable policy of reconquest was initiated by the small group of “neural conspirators” (Markoff, 2015: 150), who set out to convince the machine learning community that it had been the victim of an epidemic of “convexitis” (LeCun, 2007). When their papers were refused at NIPS in 2007, they organized an offshoot conference, shuttling participants to the Hyatt Hotel in Vancouver to defend an approach that the proponents of the then-dominant SVMs considered archaic and alchemistic. Yann LeCun led the way with the title of his paper: “Who Is Afraid of Non-convex Loss Functions?” After presenting multiple results showing that neural networks were more effective than SVMs, he argued that an excessive attachment to the theoretical requisites of linearized models was hindering the creation of innovative calculation architectures and the ability to consider other optimization methods. The very simple technique of stochastic gradient descent could not guarantee convergence towards a global minimum, yet “when empirical evidence suggests a fact for which you don’t have theoretical guarantees, that precisely means that the theory is maladapted […], if that means that you have to throw convexity out the window, then that’s okay!” (LeCun, 2017, 11’19).
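
The point of contention can be illustrated in a few lines: plain gradient descent on a deliberately non-convex one-dimensional loss, where the minimum reached depends entirely on the starting point; the loss function, step size, and starting points are invented for the illustration.

```python
import numpy as np

# A deliberately non-convex loss with several valleys (illustrative)
loss = lambda w: np.sin(3 * w) + 0.1 * w ** 2
grad = lambda w: 3 * np.cos(3 * w) + 0.2 * w    # its derivative

def gradient_descent(w, rate=0.01, steps=2000):
    """Follow the negative gradient; on a non-convex loss the end point depends
    on the initialization, with no guarantee of reaching the global minimum."""
    for _ in range(steps):
        w -= rate * grad(w)
    return w

for start in (-3.0, 0.0, 3.0):
    w = gradient_descent(start)
    print(f"start={start:+.1f} -> w={w:+.3f}, loss={loss(w):.3f}")
    # Different starting points settle in different local minima
```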

Creative people are always crazy. At the beginning, that group, the creative people, were very tumultuous. After that, people from fields other than AI arrived, coming from maths and dismissing gradient descent to tell you about their methods: “my theorem is more elegant than yours”. In optimization, people spent something like ten years searching for a more effective convex method and doing highly sophisticated but very costly things [in terms of calculation capacity]. That does have its advantages, but it had been bled dry, with thousands of papers, and when the big wave of data arrived, all of a sudden, none of their machines worked!³⁰

Transforming the world into vectors

In this way, connectionists shifted the terms of the scientific controversy around convexity, asking the new flows of data knocking at the doors of laboratories to settle the choice of the best calculation method. The architecture of predictive machines was transformed to cater for big data, which bore no resemblance to the small, calibrated, and highly artificial datasets of the traditional competitions between researchers. During this debate, the computerization of society and the development of web services triggered the emergence of new engineering problems based on large data volumes, such as spam detection, collaborative filtering techniques for making recommendations, inventory prediction, information retrieval, or the analysis of social networks. In the industrial context, the statistical methods of the new field of data science borrowed from and developed machine learning techniques (Bayesian methods, decision trees, random forests, etc.) without worrying about positioning themselves with respect to AI concerns (Dagiral and Parasie, 2017). It also became clear that, faced with the volume and heterogeneity of data features, it was necessary to use more “exploratory” and inductive methods, as opposed to “confirmatory” techniques (Tukey, 1962). It was also in contact with industry players (AT&T originally, followed by Google, Facebook, and Baidu) that the neural network conspirators gained access to problems, calculation capacities, and datasets allowing them to demonstrate the potential of their machines and to assert their viewpoint in the scientific controversy. They brought in a new referee: the effectiveness of predictions, in this case when applied to the “real” world.

Neo-connectionists first imposed their own terms on the debate. According to them, it was necessary to distinguish the “width” of the “shallow” architecture of SVMs from the “depth” (the term “deep learning” was coined by Geoff Hinton in 2006) of architectures based on layers of neurons. By doing so, they were able to demonstrate that depth is preferable to width: only the former remains calculable when the data and dimensions increase, and only it is capable of capturing the diversity of data features. However convex SVMs may be, they do not give good results on large datasets: the dimensions increase too quickly and become incalculable; poor examples trigger considerable disturbances in predictions; and the solution consisting of linearizing a non-linear method deprives the system of its capacity to learn complex representations (Bengio and LeCun, 2007). The crusaders of connectionism thus managed to convince people that it was preferable to sacrifice the intelligibility of the calculator and rigorously controlled optimization for a better grasp of the complexity of the dimensions present in this new form of data. When the volume of training data increases considerably, many local minima exist, but there are enough redundancies and symmetries for the representations learned by the network to be robust and tolerant of errors in the learning data. At the heart of the debate with the machine learning community, one thing went without saying: only laboratories used linear models; the world, the “real world” where data are produced by the digitization of images, sounds, speech, and text, is non-linear. It is noisy; the information contained in it is redundant; data flows are not categorized according to the attributes of homogeneous, clear, and intelligibly constructed variables; examples are sometimes false. As Yoshua Bengio et al. wrote, “an AI must fundamentally understand the world around us, and we argue that this can only be achieved if it can learn to identify and disentangle the underlying explanatory factors hidden in the observed milieu of low-level sensory data” (2014, p. 1). This is why a “deep” architecture has more calculation power and is more “expressive” than a “shallow” architecture (LeCun and Bengio, 2007). This controversy around convexity, in which the intelligibility of the calculator was traded for a greater ability to capture the complexity of the world, clearly demonstrates that, far from being an example of naive empiricism, the production of inductive machines was the result of intense work to convince people of the need to fundamentally reformulate the relationship between the calculator and the world.

In order for data to shift the scientific debate, it was therefore necessary to radically increase the volume of research datasets. In a 1988 article on character recognition, Yann LeCun used a database of 9,298 handwritten zip code numbers. The database used as a reference for character recognition since 1998 (MNIST) contains 60,000 labelled examples of 28 × 28-pixel black and white images. It served to demonstrate the effectiveness of neural networks, but did not overcome the support for other techniques such as SVMs. In addition, scientific communities took advantage of the Internet to produce much more voluminous datasets and to build them explicitly for machine learning tasks by creating input/output pairs. This systematic collection of the broadest and most elementary digital data possible gave substance to Hubert Dreyfus’ statement that “the best model of the world is the world itself” (Dreyfus, 2007: 1140). As the heterodox approaches critical of representational AI had long argued, representations are found in data from the world, as opposed to being internal to the calculator (Brooks, 1988). The creation of ImageNet, the dataset used during the challenge presented at the beginning of this article, which was initiated by Li Fei-Fei (Deng et al., 2009), is exemplary of this. Today, this database contains 14 million images, the elements of which were manually annotated into 21,841 categories by using the hierarchical structure of another classic database in natural language processing, WordNet (Miller, 1995). To succeed in this immense task of qualifying elements identified by hand-drawn squares in images, it was necessary to crowdsource the tasks to thousands of annotators via Mechanical Turk (Su et al., 2012; Jaton, 2017). From 9,298 pieces of data to 14 million, such a massive change in the volume of datasets — and therefore in the dimensions of the data — became meaningful only when accompanied by an exponential growth in the power of calculators, offered by parallel computing and the development of GPUs (Figure 6). In 2009, “backprop” was implemented on graphics cards, enabling a neural network to be trained up to 70 times faster (Raina et al., 2009). Today, it is considered good practice to learn a category in a classification task with 5,000 examples per category, which quickly leads datasets to contain several million examples. The exponential growth in datasets accompanied a parallel change in calculator architectures: the number of neurons in a network doubles every 2.4 years (Goodfellow et al., 2016: 27).

However, another transformation in data was also initiated by connectionists, this time to granularize data and transform it into a calculable format through “embedding” operations. A neural network requires the inputs of the calculator to take the form of a vector. Therefore, the world must be coded in advance in the form of a purely numerical vector representation. While certain objects such as images are naturally broken down into vectors, other objects need to be “embedded” within a vector space before it is possible to calculate or classify them with neural networks. This is the case of text, which is the prototypical example. To input a word into a neural network, the Word2vec technique “embeds” it into a vector space that measures its distance from the other words in the corpus (Mikolov et al., 2013). Words thus inherit a position within a space with several hundred dimensions. The advantage of such a representation resides in the numerous operations that this transformation makes possible. Two terms whose inferred positions are near one another in this space are also semantically similar; these representations are said to be distributed: the vector of the concept “apartment” [-0.2, 0.3, -4.2, 5.1…] will be similar to that of “house” [-0.2, 0.3, -4.0, 5.1…]. Semantic proximity is not deduced from a symbolic categorization but rather induced from the statistical proximity between all of the terms in the corpus. Vectors can thus advantageously replace the words that they represent in order to resolve complex tasks, such as automated document classification, translation, or automatic summarization. The designers of connectionist machines thus carried out highly artificial operations to transform data into another representation system and to “rawify” them (Denis and Goëta, 2017). While natural language processing pioneered the “embedding” of words in a vector space, today we are witnessing a generalization of the embedding process, which is progressively extending to all application fields: networks become simple points in a vector space with graph2vec, texts with paragraph2vec, films with movie2vec, meanings of words with sense2vec, molecular structures with mol2vec, etc. According to Yann LeCun, the goal of the designers of connectionist machines is to put the world in a vector (world2vec). Instead of transforming inputs into symbols interrelated via a fabric of interdependent concepts, this vectorization creates neighbourhood proximities between the internal properties of the elements in the learning corpus³¹.
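
A minimal sketch of the kind of operation described above, assuming hypothetical pre-computed word vectors of the sort a Word2vec-style model would produce (the values reuse the illustrative “apartment”/“house” vectors quoted in the text, truncated to four dimensions, plus an invented unrelated term):

```python
import numpy as np

# Hypothetical distributed representations, truncated to 4 dimensions
vectors = {
    "apartment":  np.array([-0.2, 0.3, -4.2, 5.1]),
    "house":      np.array([-0.2, 0.3, -4.0, 5.1]),
    "rhinoceros": np.array([3.1, -2.4, 0.8, -1.7]),
}

def cosine(u, v):
    """Semantic proximity measured as the cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vectors["apartment"], vectors["house"]))        # close to 1: near neighbours
print(cosine(vectors["apartment"], vectors["rhinoceros"]))   # much lower: unrelated terms
```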