ENGINES OF EVIDENCE

In the early 1960s, every company had a huge investment in the future of computers. I fell into a device-oriented research group, where the goal was to find new phenomena that we could turn into computer memory. Magnetic core memories were already becoming a nuisance. Some people worked on photochromic phenomena, others worked on semiconductors, and I worked on superconductivity. In fact, I wrote my thesis on superconducting memories.

Everyone felt the urgency to replace core memories, to get a computer that would be smaller, faster, and cheaper. I remember Bell Labs and IBM working frantically on superconductivity. Every physical phenomenon you can think of was explored as a candidate for a memory device. Eventually, semiconductors won the great race.

It was Fairchild Camera who first came out with semiconductor memories. We all laughed at them, saying: "Who is going to risk losing their memory when the power fails?"

I was essentially fired and had to look for another job, because semiconductors were taking over. Luckily, I had a friend at UCLA, so I gave him a call and he told me there was a position open. At that time I didn't even know what the position was. I was supposed to teach whatever I knew, which was computer memories. But there wasn't much need for teaching memories, so I got into AI. The decision to leave industry and go to academia was the best decision I made in my life, except, of course, for marrying my wife.

I got to UCLA in 1969. I immediately got interested in statistical decision theory and decision analysis. It took me ten years to get into what I've been doing since, namely, automated decision making. The only group that was into this challenge was Ron Howard's, in management, not in computer science.

In the late '70s and early '80s, everybody in AI was working on expert systems for all kinds of applications, from medical diagnosis to mineral exploration. The idea was that, wherever you pay a professional, often called an "expert," you can emulate that professional on a computer. By interviewing the professional, you can extract the basic rules by which he or she operates and, once you have a computer full of rules, you have an engine that can activate the rules in response to the evidence observed. This will tell you, for example, where to dig for oil or what medical test to conduct next.

It didn't quite work out, for many reasons. The time it took to interview the expert was one of the main obstacles. The expert had to sit there for two to three weeks and tell the computer programmer how professionals conduct their everyday business, including their line of reasoning.

The rule-based systems, scientifically speaking, were on the wrong track. They modeled the experts instead of modeling the disease. The problem was that the rules created by the programmers did not combine properly. When you added more rules, you had to undo the old ones. It was a very brittle system. If there was a procedural change in the hospital, for example, you'd have to rewrite the whole system. And we're not just talking about one or two rules here; there were hundreds of them, all interacting in ways that the professional—in this case, the doctor—did not quite understand; once you put in 100 rules, you forgot the first, the fifth, and the seventh.

Another reason I didn't like it was because it wasn't scientifically transparent. I am lazy. So I need to understand what I'm doing and I need to understand it mathematically. Rule-based systems were not mathematically solid to the point where you could prove things about their behavior. Mathematical elegance tells you: "If you do things right you are guaranteed a certain behavior." There's something very pleasing about such guarantees, which was absent from rule-based systems.

A new way of thinking came about in the early '80s, when we changed from rule-based systems to Bayesian networks. Bayesian networks are probabilistic reasoning systems. An expert will put in his or her perception of the domain. A domain can be a disease, or an oil field—the same targets that we had for expert systems. The idea was to model the domain rather than the procedures that were applied to it. In other words, you would put in local chunks of probabilistic knowledge about a disease and its various manifestations and, if you observe some evidence, the computer will take those chunks, activate them when needed, and compute for you the revised probabilities warranted by the new evidence.

It's an engine for evidence. It is fed a probabilistic description of the domain and, when new evidence arrives, the system just shuffles things around and gives you your revised belief in all the propositions, revised to reflect the new evidence.

The problem was getting compactness and speed; these were the two major obstacles. Theoretically, belief revision requires exponential time and exponential memory, and neither could be afforded.

We took advantage of the fact that the knowledge builder understands what is relevant and what is not. That gave us a sparse network, and if you have a sparse network, you can leverage the sparseness to get both speed and compactness. The Bayesian network is a speedy way of computing your revised beliefs after you have told me your initial beliefs. The approach apparently took off because it had all the nice properties of probability calculus, plus the procedural advantages of rule-based systems. And it was transparent.
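As a toy illustration of this kind of evidence engine, here is a minimal Python sketch of Bayesian belief revision on a single disease-test link. All the numbers are invented for illustration; they are not from any real diagnostic model or from the systems described here.

```python
# Toy "engine of evidence": one link of a Bayesian network, Disease -> Test.
# All probabilities below are made-up assumptions for illustration.

prior_disease = 0.01          # P(disease) before any evidence
p_pos_given_disease = 0.95    # P(test positive | disease)
p_pos_given_healthy = 0.10    # P(test positive | no disease)

def revised_belief(test_positive: bool) -> float:
    """Return P(disease | evidence) via Bayes' rule."""
    if test_positive:
        like_d, like_h = p_pos_given_disease, p_pos_given_healthy
    else:
        like_d, like_h = 1 - p_pos_given_disease, 1 - p_pos_given_healthy
    joint_d = like_d * prior_disease          # P(evidence AND disease)
    joint_h = like_h * (1 - prior_disease)    # P(evidence AND no disease)
    return joint_d / (joint_d + joint_h)      # normalize

print(revised_belief(True))   # belief rises well above the 1% prior
print(revised_belief(False))  # belief drops below the prior
```

The point of the sketch is the shape of the computation: local chunks of probabilistic knowledge (the conditional probabilities) stay fixed, and each new piece of evidence merely re-shuffles the beliefs by Bayes' rule.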

The main ingredient that made Bayesian networks popular and useful was "re-configurability." For example, if the problem was to troubleshoot a car engine and they changed the fuel pump, you didn't have to rewrite the whole system; you just had to change one subsystem that was responsible for modeling the pump and all the rest remained intact. So re-configurability and transparency were the main selling points of Bayesian networks.

My contributions were: 1) a religious fanaticism to do things correctly, namely, to do things by the dictates of probability calculus; and 2) the biologically inspired architecture of asynchronous and distributed computation. You start with a collection of simple, stupid modules, as in neural networks, all working autonomously and communicating only with their neighbors. When new evidence comes, it activates several such modules; they send signals to their neighbors, who then wake up and start propagating messages to their neighbors, and so on—eventually, the system relaxes to the correct belief. What do I mean by correct belief? The belief you would compute if you had the time to do things correctly by the dictates of probability calculus.
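The message-passing idea can be sketched on the simplest possible network, a three-node chain. The transition probabilities and evidence values below are made up; real belief propagation generalizes this neighbor-to-neighbor scheme to trees and beyond.

```python
# Sketch of message passing on a 3-node chain X0 -> X1 -> X2 (binary variables).
# Evidence arrives at X2; backward messages revise the belief at X0.
# All numbers are made-up assumptions for illustration.

p0 = [0.6, 0.4]            # prior P(X0)
T = [[0.9, 0.1],           # T[i][j] = P(next = j | previous = i)
     [0.2, 0.8]]
evidence = [0.1, 0.9]      # likelihood of the observation given X2

def backward_message(incoming):
    """One local step: summarize downstream evidence for the upstream neighbor."""
    return [sum(T[i][j] * incoming[j] for j in range(2)) for i in range(2)]

# Each hop is a purely local computation between neighbors.
m_from_x2 = backward_message(evidence)     # message X2 -> X1
m_from_x1 = backward_message(m_from_x2)    # message X1 -> X0

unnorm = [p0[i] * m_from_x1[i] for i in range(2)]
z = sum(unnorm)
posterior_x0 = [u / z for u in unnorm]
print(posterior_x0)   # belief at X0, revised by evidence observed two hops away
```

No module ever sees the whole network; evidence at one end reaches the other only through a relay of local messages, which is the architectural point of the passage above.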

I left probabilistic reasoning when it was in an embryonic state of development because I was excited about causality. I left this area just as many people were finding it to be very useful. Some people tell me that Google and Siri and all these nice applications are using ideas or algorithms that were developed at that time. I'm very happy. I don't know exactly what they're doing, partly because they are very secretive and partly because I went on to greener pastures.

We are losing the transparency now with deep learning. I am talking to users who say "it works well" but they don't know why. Once you unleash it, it has its own dynamics, it does its own repair and its own optimization, and it gives you the right results most of the time. But when it doesn't, you don't have a clue as to what went wrong and what should be fixed. That's something that worries me.

We should be aiming at a different kind of transparency. If something goes wrong, the user should be able to examine the system and spot where the fault is, and when it's working well, the system should give the user meaningful reports of progress. These reports should relate to our experience, and thus to the human perception of the phenomenon.

Some argue that transparency is not really needed. We do not understand the human anatomy and the human neural architecture, and yet it runs well, and we forgive our meager understanding. In the same way, they argue, why not unleash deep learning systems and create intelligence without understanding how they do it? That's fine. I personally don't like this kind of opacity, and that is why I won't spend my time on it. Deep learning has a place. A non-transparent system can do a marvelous job, and our mind is proof of that marvel.

I am trying to understand the theoretical limitations of those systems. We have discovered, for example, that certain basic barriers exist which, unless broken, will keep us from getting a real human kind of intelligence, no matter what we do. This is my current interest.

I admire what people like Michael Jordan and Geoff Hinton do. They created good vision systems that recognize objects and text. It's very impressive. But how far can it go? What are the theoretical limitations, and how can we overcome them? The work we are doing now on causality highlights some of the basic limitations that need to be overcome. One of them is free will; others are counterfactual thinking and causal thinking. Theoretically, you cannot draw any conclusion about cause-and-effect relationships, and certainly not about counterfactuals, from statistical data alone. When I say counterfactual, I mean a statement like, "I should have done things better." If we exclude this kind of statement, we lose a lot of human communication.

That's how we teach our children, with a slap on the wrist and by saying out loud: "You shouldn't have spilled the milk," or "You should have done your homework." What does "You should have done..." mean? It means: go back in history, relive your experience, and modify the software that governed your behavior. This is how we communicate with our children. If we lose that, we lose the ability to form communities of communicating robots. This is a topic that excites me these days.

Regarding cybernetics: you know that I was a physicist. I worked on memory devices, and I got into decision theory because I was interested in cybernetics. We were all absolutely sure that one day we were going to create intelligence. The question was only how. I thought that decision theory was a way to get there. So I studied the papers of Howard Raiffa (who died just recently), Savage on Bayesian statistics, Ron Howard, and Kahneman and Tversky, who were starting to publish their articles on mental heuristics. This was in the late '70s.

Tversky and Kahneman were big players at the time, and they came out with those heuristics that I thought should be emulated, not overridden. Being in AI, I knew the importance and computational role that such heuristics play in problem solving. Recall that I wrote my first book on heuristics, and I had a chess-playing machine as a metaphor for many ideas in decision theory.

Tversky and Kahneman were working on probabilistic and decision-making biases. For example, what is more likely, that a daughter will have blue eyes given that her mother has blue eyes or the other way around—that the mother will have blue eyes given that the daughter has blue eyes? Most people will say the former—they'll prefer the causal direction. But it turns out the two probabilities are the same, because the number of blue-eyed people in every generation remains stable. I took it as evidence that people think causally, not probabilistically—they're biased by having easy access to causal explanations, even though probability theory tells you something different.
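The symmetry can be checked with Bayes' rule directly. The numbers below are invented; the only assumption that matters is the one stated above, that the fraction of blue-eyed people is the same in both generations.

```python
# Checking the blue-eye symmetry with made-up numbers.
# If the blue-eye rate is the same in both generations, Bayes' rule
# forces the two conditional probabilities to coincide.

p_both_blue = 0.05       # P(mother blue AND daughter blue), assumed
p_mother_blue = 0.20     # P(mother blue), assumed
p_daughter_blue = 0.20   # P(daughter blue): same rate, stable generations

p_d_given_m = p_both_blue / p_mother_blue    # causal direction
p_m_given_d = p_both_blue / p_daughter_blue  # diagnostic direction
print(p_d_given_m, p_m_given_d)   # both 0.25: the directions coincide
```

Because both conditional probabilities are the same joint probability divided by equal marginals, the preference most people feel for the causal direction has no basis in the numbers.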

There are many biases in our judgment that are created by our inclination to attribute causal relationships where they do not belong. We see the world as a collection of causal relationships and not as a collection of statistical or associative relationships. Most of the time, we can get by, because they are closely tied together. Once in a while we fail. The blue-eye story is an example of such failure.

The slogan "Correlation doesn't imply causation" leads to many paradoxes. For instance, the size of a child's thumb is highly correlated with their reading ability. So, naively, if you want a bigger thumb, you should learn to read better. This kind of paradoxical example convinces us that correlation does not imply causation. Still, people fall into that trap quite often because they crave causal explanations. The mind is a causal processor, not an association processor. Once you acknowledge that, the question remains: how do we reconcile the discrepancies between the two? How do we organize causal relationships in our minds? How do we operate on and update such a mental representation? This opens up many questions that were not addressed by philosophers, psychologists, computer scientists, and statisticians. They were not addressed because we did not have computational models for these representations. We do have these models today, so there is much excitement in the air, and much work to be done.
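The thumb-reading correlation can be reproduced in a small simulation with age as the hidden common cause. Every number here (the age range, the growth coefficients, the noise levels) is a made-up assumption; the point is only that two variables with no causal link become strongly correlated through a shared cause.

```python
# Simulating the thumb-size / reading-ability correlation via a hidden
# common cause (age). All coefficients are synthetic, for illustration only.
import random

random.seed(0)
ages = [random.uniform(5, 12) for _ in range(10_000)]        # hidden cause
thumb = [0.5 * a + random.gauss(0, 0.3) for a in ages]       # grows with age
reading = [10 * a + random.gauss(0, 5) for a in ages]        # improves with age

def correlation(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

print(correlation(thumb, reading))  # strongly positive, yet neither causes the other
```

Neither simulated variable ever reads the other; the association flows entirely through age, which is exactly why intervening on reading would leave thumb size unchanged.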

__________

The philosophical debate about free will is a very nice debate, but I'm not interested in it. I want to build machines that act as if they had a free will and imagine that I have a free will so that we can communicate with each other as if we both have free will. This is an engineering question. The philosophy of non-determinism and mind-body duality is irrelevant.

The question that we should be asking is: what gives us the illusion, algorithmically, that we have free will? It is an undeniable fact that I have a sensation that, if I wish, I can touch my nose, and if I do not wish, I won't. You have the same sensation. The sensation is undeniable—it exists. It is probably an illusion.

Give me a software model that would explain when I do have that sensation and when I don't have it. Then the question becomes, why did evolution equip us with that illusion? What is the computational advantage of me believing that you have free will and you believing that I have free will?

Forget about whether we do have free will or not. We have here a computational phenomenon that must serve some evolutionary function, some survival function, and some computational function. If the phenomenon didn't offer a computational advantage, it wouldn't have evolved. This is what I'm trying to capture. What computational advantage does it bestow upon us? Indeed, there is experimental evidence that free will is an illusion. Studies have found that people make up their minds, neurally speaking, before they get the sensation that they made a decision. But this, too, doesn't concern me. I want to find out what computational processes take place in creating that illusion. I want to implement it on a machine so that a community of robots will be able to play better soccer.

I'm mainly interested in the computational pragmatics of the phenomenon, as opposed to its philosophical grounding. You and I believe strongly that a robot is a deterministic machine. So all the philosophical impediments about Heisenberg uncertainty principle and mind-body duality do not enter into our problem. A machine is a body, not a mind.

Now take this body, which is undeniably deterministic, and equip it with something (i.e. free will) that is normally attributed only to an organic thinking machine—a human being; it's an engineering problem. I want to understand it so that I can build it. Why do I want to build it? Because it has a computational advantage; for instance, the compactness of our communication language. When a coach tells a player to sit on the bench because he should have passed the ball and didn't, why does he talk in this convoluted way? What does "You ought to have known better" mean to a deterministic machine?

If you assume that free will exists, you can then invoke counterfactual thinking and use it as a communication language to speed up the transfer of information from the coach to you, the player. This, I believe, is the key.

What computational advantage do we get from the assumption that you and I have free will? This is an exciting problem, because once we understand it, we can have robots simulating free will. Never mind the unproductive philosophical question of whether they do indeed have free will. Obviously, they were programmed to follow deterministic rules, and they follow their programming faithfully, so, from a metaphysical viewpoint, they lack free will.

At the same time, if they communicate the way we do, with free-will vocabulary, we can increase the bandwidth of the communication channel between man and machine. This is what counts. And once we build such robots we will be 100 times closer to understanding how human beings do that.

There are many questions that only two decades ago were considered to be metaphysical. Today they are formulated mathematically and are being answered statistically. It is a revolution in the way that scientists conduct science. I'm talking about the causal revolution.

We are living in the midst of this revolution, which is not recognized in any of the popular science writing. It's not as heralded as the revolution we see in AI, because there aren't new gadgets coming out of it. It is a conceptual revolution, with many ramifications for the way scientists see their role and the way many of them are beginning to orient their thinking. This is the causal revolution.

Its effect is noticed mainly in the research circles, in the way scientists find causal explanations, both in experimental and observational studies. The way we think about our ability to establish cause and effect is changing; it's changing by the language that we're using, by what we believe is logically feasible, and by the kind of questions we ask.

Let me ask you a simple question: Was it the aspirin I took that caused my headache to go away? A question like this, about the real cause of things, was considered metaphysical and not part of science. Today, we can answer this question from data and tell you the probability that it was the aspirin. This, in my opinion, is a profound revolution.

"More has been learned about causal inference in the last few decades than the sum total of everything that had been learned about it in all prior recorded history." It's not my quote; it's Gary King's. It will eventually come to the surface, where ordinary folks will feel the difference, and it will come down to education. So far, this excitement has occurred mainly in research circles; it hasn't percolated down to education yet.

I spend a lot of time on education, on the gap between the thinking of the research community and the textbook community. The gap is enormous. To give you an example, if you take any textbook in statistics, you wouldn't find the phrase "cause and effect" in the index. It does not appear there. It is the opposite of what we find in research journals. If you go to any statistical conference, you will find the word "cause" in at least 100 or 200 papers, competing for respectability through the word "cause." So causality flipped itself from being a liability to being an asset, a source of respectability. This flip is unseen in education, which is a terrible gap. I'm working a lot on that.