Drawing on the usual suspects (Google/Google Books/Google Scholar/Libgen/LessWrong/Hacker News/Twitter) in investigating leprechauns, I have compiled a large number of variants of the story; they appear below in reverse chronological order by decade, letting us trace the story back towards its roots:

Like I had a friend in Italy who had a perceptron that looked at a visual… it had visual inputs. So, he… he had scores of music written by Bach of chorales and he had scores of chorales written by music students at the local conservatory. And he had a perceptron—a big machine—that looked at these and those and tried to distinguish between them. And he was able to train it to distinguish between the masterpieces by Bach and the pretty good chorales by the conservatory students. Well, so, he showed us this data and I was looking through it and what I discovered was that in the lower left hand corner of each page, one of the sets of data had single whole notes. And I think the ones by the students usually had four quarter notes. So that, in fact, it was possible to distinguish between these two classes of… of pieces of music just by looking at the lower left… lower right hand corner of the page. So, I told this to the… to our scientist friend and he went through the data and he said: ‘You guessed right. That’s… that’s how it happened to make that distinction.’ We thought it was very funny.

A similar thing happened here in the United States at one of our research institutions. Where a perceptron had been trained to distinguish between—this was for military purposes—It could… it was looking at a scene of a forest in which there were camouflaged tanks in one picture and no camouflaged tanks in the other. And the perceptron—after a little training—got… made a 100% correct distinction between these two different sets of photographs. Then they were embarrassed a few hours later to discover that the two rolls of film had been developed differently. And so these pictures were just a little darker than all of these pictures and the perceptron was just measuring the total amount of light in the scene. But it was very clever of the perceptron to find some way of making the distinction.

It is not yet clear how an artificial neural net could be trained to deal with “the world” or any really open-ended sets of problems. Now some readers may feel that this unpredictability is not a problem. After all, we are talking about training not programming and we expect a neural net to behave rather more like a brain than a computer. Given the usefulness of nets in unsupervised learning, it might seem therefore that we do not really need to worry about the problem being of manageable size and the training process being predictable. This is not the case; we really do need a manageable and well-defined problem for the training process to work. A famous AI urban myth may help to make this clearer.

The story goes something like this. A research team was training a neural net to recognize pictures containing tanks. (I’ll leave you to guess why it was tanks and not tea-cups.) To do this they showed it two training sets of photographs. One set of pictures contained at least one tank somewhere in the scene, the other set contained no tanks. The net had to be trained to discriminate between the two sets of photographs. Eventually, after all that back-propagation stuff, it correctly gave the output “tank” when there was a tank in the picture and “no tank” when there wasn’t. Even if, say, only a little bit of the gun was peeping out from behind a sand dune it said “tank”. Then they presented a picture where no part of the tank was visible—it was actually completely hidden behind a sand dune—and the program said “tank”.

Now when this sort of thing happens research labs tend to split along age-based lines. The young hairs say “Great! We’re in line for the Nobel Prize!” and the old heads say “Something’s gone wrong”. Unfortunately, the old heads are usually right—as they were in this case. What had happened was that the photographs containing tanks had been taken in the morning while the army played tanks on the range. After lunch the photographer had gone back and taken pictures from the same angles of the empty range. So the net had identified the most reliable single feature which enabled it to classify the two sets of photos, namely the angle of the shadows. “AM = tank, PM = no tank”. This was an extremely effective way of classifying the two sets of photographs in the training set. What it most certainly was not was a program that recognizes tanks. The great advantage of neural nets is that they find their own classification criteria. The great problem is that it may not be the one you want!

These facts refute a Neoplatonic argument for the essential immateriality of the soul, viz. that since the mind deals with universal representations, it operates in a specifically immaterial way…So, awareness is not explained by connectionism. The results of neural net training are not always as expected. One team intended to train neural nets to recognize battle tanks in aerial photos. The system was trained using photos with and without tanks. After the training, a different set of photos was used for evaluation, and the system failed miserably—being totally incapable of distinguishing those with tanks. The system actually discriminated cloudy from sunny days. It happened that all the training photos with tanks were taken on cloudy days, while those without were on clear days. What does this show? That neural net training is mindless. The system had no idea of the intent of the enterprise, and did what it was programmed to do without any concept of its purpose. As with Dawkins’ evolution simulation (p. 66), the goals of computer neural nets are imposed by human programmers.

I remember this kind of thing from the 1980s: the US Army was testing image recognition seekers for missiles and was getting excellent results on Northern German tests with NATO tanks. Then they tested the same systems in other environments and their results were suddenly shockingly bad. Turns out the image recognition was keying off the trees with tank-like minor features rather than the tank itself. Putting other vehicles in the same forests got similar high hits but tanks by themselves (in desert test ranges) didn’t register. Luckily a sceptic somewhere decided to “do one more test to make sure”.

Neural networks are designed to learn like the human brain, but we have to be careful. This is not because I’m scared of machines taking over the planet. Rather, we must make sure machines learn correctly. One example that always pops into my head is how one neural network learned to differentiate between dogs and wolves. It didn’t learn the differences between dogs and wolves, but instead learned that wolves were on snow in their picture and dogs were on grass. It learned to differentiate the two animals by looking at snow and grass. Obviously, the network learned incorrectly. What if the dog was on snow and the wolf was on grass? Then, it would be wrong.

However, in his source, “‘Why Should I Trust You?’ Explaining the Predictions of Any Classifier [LIME]”, Ribeiro et al 2016, they specify of their dog/wolf snow-detector NN that they “trained this bad classifier intentionally, to evaluate whether subjects are able to detect it [the bad performance]” using LIME for insight into how the classifier was making its classification, concluding that “After examining the explanations, however, almost all of the subjects identified the correct insight, with much more certainty that it was a determining factor. Further, the trust in the classifier also dropped substantially.” So Nikolaychuk appears to have misremembered. (Perhaps in another 25 years students will be told in their classes of how a NN was once trained by ecologists to count wolves…)

You might think that this is rather like one of the classic optical illusions, but it’s worse than that. If you notice that you look at something this way, and then that way, and it looks different, you’ll notice something is odd. This is not something our deep learner will do. Nor is it able to identify any bias that might exist in the corpus of data it was trained on…or maybe it is. If there is any property of the training data set that is strongly predictive of the training criterion, it will zero in on that property with the ferocious clarity of Darwinism. In the 1980s, an early backpropagating neural network was set to find Soviet tanks in a pile of reconnaissance photographs. It worked, until someone noticed that the Red Army usually trained when the weather was good, and in any case the satellite could only see them when the sky was clear. The medical school at St Thomas’ Hospital in London found theirs had learned that their successful students were usually white.

So What Did the Machines See? Dr. Kosinski and Mr. Wang [Wang & Kosinski 2018] say that the algorithm is responding to fixed facial features, like nose shape, along with “grooming choices,” such as eye makeup. But it’s also possible that the algorithm is seeing something totally unknown. “The more data it has, the better it is at picking up patterns,” said Sarah Jamie Lewis, an independent privacy researcher who Tweeted a critique of the study. “But the patterns aren’t necessarily the ones you think that they are.” Tomaso Poggio, the director of M.I.T.’s Center for Brains, Minds and Machines, offered a classic parable used to illustrate this disconnect. The Army trained a program to differentiate American tanks from Russian tanks with 100% accuracy. Only later did analysts realize that the American tanks had been photographed on a sunny day and the Russian tanks had been photographed on a cloudy day. The computer had learned to detect brightness. Dr. Cox has spotted a version of this in his own studies of dating profiles. Gay people, he has found, tend to post higher-quality photos. Dr. Kosinski said that they went to great lengths to guarantee that such confounders did not influence their results. Still, he agreed that it’s easier to teach a machine to see than to understand what it has seen.

A neural network is useless if it only sees one example of a matching input/output pair. It cannot infer the characteristics of the input data for which you are looking for from only one example; rather, many examples are required. This is analogous to a child learning the difference between (say) different types of animals—the child will need to see several examples of each to be able to classify an arbitrary animal… It is the same with neural networks. The best training procedure is to compile a wide range of examples (for more complex problems, more examples are required) which exhibit all the different characteristics you are interested in. It is important to select examples which do not have major dominant features which are of no interest to you, but are common to your input data anyway. One famous example is of the US Army “Artificial Intelligence” tank classifier. It was shown examples of Soviet tanks from many different distances and angles on a bright sunny day, and examples of US tanks on a cloudy day. Needless to say it was great at classifying weather, but not so good at picking out enemy tanks.

…television programme Horizon; a neural network was trained to attempt to distinguish tanks from trees. Pictures were taken of forest scenes lacking military hardware and of similar but perhaps less bucolic landscapes which also contained more-or-less camouflaged battle tanks. A neural network was trained with these input data and found to differentiate successfully between tanks and trees. However, when a new set of pictures was analysed by the network, it failed to detect the tanks. After further investigation, it was found…

There is a telling story about how the Army recently went about teaching a backpropagating net to identify tanks set against a variety of environmental backdrops. The programmers correctly fed their multi-layer net photograph after photograph of tanks in grasslands, tanks in swamps, no tanks on concrete, and so on. After many trials and many thousands of iterations, their net finally learned all of the images in their database. The problem was that when the presumably “trained” net was tested with other images that were not part of the original training set, it failed to do any better than what would be expected by chance. What had happened was that the input/training fact set was statistically corrupt. The database consisted mostly of images that showed a tank only if there were heavy clouds, the tank itself was immersed in shadow or there was no sun at all. The Army’s neural net had indeed identified a latent pattern, but it unfortunately had nothing to do with tanks: it had effectively learned to identify the time of day! The obvious lesson to be taken away from this amusing example is that how well a net “learns” the desired associations depends almost entirely on how well the database of facts is defined. Just as Monte Carlo simulations in statistical mechanics may fall short of intended results if they are forced to rely upon poorly coded random number generators, so do backpropagating nets typically fail to achieve expected results if the facts they are trained on are statistically corrupt.

Why was this? It turns out that the images they were training on always had glamour-shot type photos of friendly tanks, with an immaculate blue sky, etc. The enemy tank photos, on the other hand, were all spy photos, not very clear, sometimes fuzzy, etc. And it was these characteristics that the neural net was training on, not the tanks at all. On a bright sunny day, the tanks would do nothing. On an overcast, hazy day, they’d start firing like crazy . . .

The choice of the dimensionality and domain of the input set is crucial to the success of any connectionist model. A common example of a poor choice of input set and test data is the Pentagon’s foray into the field of object recognition. This story is probably apocryphal and many different versions exist on-line, but the story describes a true difficulty with neural nets.

As the story goes, a network was set up with the input being the pixels in a picture, and the output was a single bit, yes or no, for the existence of an enemy tank hidden somewhere in the picture. When the training was complete, the network performed beautifully, but when applied to new data, it failed miserably. The problem was that in the test data, all of the pictures that had tanks in them were taken on cloudy days, and all of the pictures without tanks were taken on sunny days. The neural net was identifying the existence or non-existence of sunshine, not tanks.

Let’s consider a more sophisticated example, that of determining whether a tank is hiding in a photograph. A neural net can be configured so that each output value correlates to exactly one pixel. If the pixel is part of the image of a tank, the net should output a one; otherwise, the net should output a zero. The input information would most likely consist of the color of the pixel. The network would be trained by feeding it many pictures with and without tanks. The training would continue until the network correctly identified whether the photos included tanks. The U.S. military conducted a research project exactly like the one we just described. One hundred photographs were taken of tanks hiding behind trees and in bushes, and another 100 photographs were taken of ordinary landscape with no tanks. Fifty photos from each group were kept “secret,” and the rest were used to train the neural network. The network was initialized with random weights before being fed one picture at a time. When the network was incorrect, it adjusted its input weights until the correct output was reached. Following the training period, the 50 “secret” pictures from each group of photos were fed into the network. The neural network correctly identified the presence or absence of a tank in each photo. The real question at this point has to do with the training—had the neural net actually learned to recognize tanks? The Pentagon’s natural suspicion led to more testing. Additional photos were taken and fed into the network, and to the researchers’ dismay, the results were quite random. The neural net could not correctly identify tanks within photos. After some investigation, the researchers determined that in the original set of 200 photos, all photos with tanks had been taken on a cloudy day, whereas the photos with no tanks had been taken on a sunny day.
The neural net had properly separated the two groups of pictures, but had done so using the color of the sky to do this rather than the existence of a hidden tank. The government was now the proud owner of a very expensive neural net that could accurately distinguish between sunny and cloudy days!

This is a great example of what many consider the biggest issue with neural networks. If there are more than 10 to 20 neurons, it is impossible to understand how the network is arriving at its results. One cannot tell if the net is making decisions based on correct information, or, as in the above example, something totally irrelevant. Neural networks have a remarkable ability to derive meaning and extract patterns from data that are too complex to be analyzed by human beings. However, some people trust neural networks to be experts in their area of training. Neural nets are used in such areas as sales forecasting, risk management, customer research, undersea mine detection, facial recognition, and data validation. Although neural networks are promising, and the progress made in the past several years has led to significant funding for neural net research, many people are hesitant to put confidence in something that no human being can completely understand.

While humans have for millennia used what Cariani calls ‘active sensing’—‘poking, pushing, bending’—to extend their sensory range and for hundreds of years have used prostheses to create new sensory experiences (for example, microscopes and telescopes), only recently has it been possible to construct evolving sensors and what Cariani (1998: 718) calls ‘internalized sensing’, that is, “bringing the world into the device” by creating internal, analog representations of the world out of which internal sensors extract newly-relevant properties’.

…Another conclusion emerges from Cariani’s call (1998) for research in sensors that can adapt and evolve independently of the epistemic categories of the humans who create them. The well-known and perhaps apocryphal story of the neural net trained to recognize army tanks will illustrate the point. For obvious reasons, the army wanted to develop an intelligent machine that could discriminate between real and pretend tanks. A neural net was constructed and trained using two sets of data, one consisting of photographs showing plywood cutouts of tanks and the other actual tanks. After some training, the net was able to discriminate flawlessly between the situations. As is customary, the net was then tested against a third data set showing pretend and real tanks in the same landscape; it failed miserably. Further investigation revealed that the original two data sets had been filmed on different days. One of the days was overcast with lots of clouds, and the other day was clear. The net, it turned out, was discriminating between the presence and absence of clouds. The anecdote shows the ambiguous potential of epistemically autonomous devices for categorizing the world in entirely different ways from the humans with whom they interact. While this autonomy might be used to enrich the human perception of the world by revealing novel kinds of constructions, it also can create a breed of autonomous devices that parse the world in radically different ways from their human trainers.

A counter-narrative, also perhaps apocryphal, emerged from the 1991 Gulf War. US soldiers firing at tanks had been trained on simulators that imaged flames shooting out from the tank to indicate a kill. When army investigators examined Iraqi tanks that were defeated in battles, they found that for some tanks the soldiers had fired four to five times the amount of munitions necessary to disable the tanks. They hypothesized that the overuse of firepower happened because no flames shot out, so the soldiers continued firing. If the hypothesis is correct, human perceptions were altered in accord with the idiosyncrasies of intelligent machines, providing an example of what can happen when human-machine perceptions are caught in a feedback loop with one another.

Neural nets and genetic algorithms (including the story of the Russian tanks): Neural nets (or artificial neural networks, to give them their full name) are pieces of software inspired by the way the human brain works. In brief, you can train a neural net to do tasks like classifying images by giving it lots of examples, and telling it which examples fit into which categories; the neural net works out for itself what the defining characteristics are for each category. Alternatively, you can give it a large set of data and leave it to work out connections by itself, without giving it any feedback. There’s a story, which is probably an urban legend, which illustrates how the approach works and what can go wrong with it. According to the story, some NATO researchers trained a neural net to distinguish between photos of NATO and Warsaw Pact tanks. After a while, the neural net could get it right every time, even with photos it had never seen before. The researchers had gleeful visions of installing neural nets with miniature cameras in missiles, which could then be fired at a battlefield and left to choose their own targets. To demonstrate the method, and secure funding for the next stage, they organised a viewing by the military. On the day, they set up the system and fed it a new batch of photos. The neural net responded with apparently random decisions, sometimes identifying NATO tanks correctly, sometimes identifying them mistakenly as Warsaw Pact tanks. This did not inspire the powers that be, and the whole scheme was abandoned on the spot. It was only afterwards that the researchers realised that all their training photos of NATO tanks had been taken on sunny days in Arizona, whereas the Warsaw Pact tanks had been photographed on grey, miserable winter days on the steppes, so the neural net had flawlessly learned the unintended lesson that if you saw a tank on a gloomy day, then you made its day even gloomier by marking it for destruction.

Once upon a time—I’ve seen this story in several versions and several places, sometimes cited as fact, but I’ve never tracked down an original source—once upon a time, I say, the US Army wanted to use neural networks to automatically detect camouflaged enemy tanks. The researchers trained a neural net on 50 photos of camouflaged tanks amid trees, and 50 photos of trees without tanks. Using standard techniques for supervised learning, the researchers trained the neural network to a weighting that correctly loaded the training set—output “yes” for the 50 photos of camouflaged tanks, and output “no” for the 50 photos of forest. Now this did not prove, or even imply, that new examples would be classified correctly. The neural network might have “learned” 100 special cases that wouldn’t generalize to new problems. Not, “camouflaged tanks versus forest”, but just, “photo-1 positive, photo-2 negative, photo-3 negative, photo-4 positive…” But wisely, the researchers had originally taken 200 photos, 100 photos of tanks and 100 photos of trees, and had used only half in the training set. The researchers ran the neural network on the remaining 100 photos, and without further training the neural network classified all remaining photos correctly. Success confirmed! The researchers handed the finished work to the Pentagon, which soon handed it back, complaining that in their own tests the neural network did no better than chance at discriminating photos. It turned out that in the researchers’ data set, photos of camouflaged tanks had been taken on cloudy days, while photos of plain forest had been taken on sunny days. The neural network had learned to distinguish cloudy days from sunny days, instead of distinguishing camouflaged tanks from empty forest. This parable—which might or might not be fact—illustrates one of the most fundamental problems in the field of supervised learning and in fact the whole field of Artificial Intelligence…
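The holdout procedure in the version above can be simulated in a few lines. What follows is a minimal Python sketch and not anything taken from the sources: the brightness values, the `make_photo` generator, and the single-threshold "classifier" standing in for the neural net are all assumptions chosen for illustration. It shows why a held-out half of a confounded photo set can score perfectly while an independently collected set scores near chance.

```python
import random

def make_photo(has_tank, cloudy, rng, n_pixels=64):
    """Fake grayscale photo: a list of pixel intensities in [0, 1] (illustrative only)."""
    base = 0.35 if cloudy else 0.65  # assumed: weather sets the overall brightness
    pixels = [min(1.0, max(0.0, rng.gauss(base, 0.05))) for _ in range(n_pixels)]
    if has_tank:
        # Assumed: a tank is a small dark patch, a weak signal next to the weather.
        for i in range(4):
            pixels[i] = max(0.0, pixels[i] - 0.1)
    return pixels

def mean_brightness(photo):
    return sum(photo) / len(photo)

def fit_threshold(photos, labels):
    """'Train' by placing a brightness cut midway between the two class means."""
    tank = [mean_brightness(p) for p, y in zip(photos, labels) if y]
    empty = [mean_brightness(p) for p, y in zip(photos, labels) if not y]
    tank_mean = sum(tank) / len(tank)
    empty_mean = sum(empty) / len(empty)
    return (tank_mean + empty_mean) / 2, tank_mean < empty_mean

def predict(model, photo):
    threshold, tank_is_darker = model
    return (mean_brightness(photo) < threshold) == tank_is_darker

def accuracy(model, photos, labels):
    hits = sum(predict(model, p) == y for p, y in zip(photos, labels))
    return hits / len(photos)

def run_experiment(seed=0):
    rng = random.Random(seed)
    # Confounded set: every tank photo is cloudy, every empty photo is sunny.
    labels = [True] * 100 + [False] * 100
    photos = [make_photo(y, cloudy=y, rng=rng) for y in labels]
    order = list(range(200))
    rng.shuffle(order)
    train, held_out = order[:100], order[100:]
    model = fit_threshold([photos[i] for i in train], [labels[i] for i in train])
    held_out_acc = accuracy(model, [photos[i] for i in held_out],
                            [labels[i] for i in held_out])
    # Independent set: weather is drawn independently of the label.
    new_labels = [rng.random() < 0.5 for _ in range(1000)]
    new_photos = [make_photo(y, cloudy=rng.random() < 0.5, rng=rng) for y in new_labels]
    return held_out_acc, accuracy(model, new_photos, new_labels)

if __name__ == "__main__":
    held_out_acc, new_acc = run_experiment()
    print(f"held-out half of original photos: {held_out_acc:.2f}")
    print(f"independently collected photos:   {new_acc:.2f}")
```

The held-out half shares the weather confound with the training half, so it cannot reveal the problem; only data collected under different conditions can.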

“Neural Network Follies”, Neil Fraser, September 1998:

In the 1980s, the Pentagon wanted to harness computer technology to make their tanks harder to attack…The research team went out and took 100 photographs of tanks hiding behind trees, and then took 100 photographs of trees—with no tanks. They took half the photos from each group and put them in a vault for safe-keeping, then scanned the other half into their mainframe computer. The huge neural network was fed each photo one at a time and asked if there was a tank hiding behind the trees. Of course at the beginning its answers were completely random since the network didn’t know what was going on or what it was supposed to do. But each time it was fed a photo and it generated an answer, the scientists told it if it was right or wrong. If it was wrong it would randomly change the weightings in its network until it gave the correct answer. Over time it got better and better until eventually it was getting each photo correct. It could correctly determine if there was a tank hiding behind the trees in any one of the photos…So the scientists took out the photos they had been keeping in the vault and fed them through the computer. The computer had never seen these photos before—this would be the big test. To their immense relief the neural net correctly identified each photo as either having a tank or not having one. Independent testing: The Pentagon was very pleased with this, but a little bit suspicious. They commissioned another set of photos (half with tanks and half without) and scanned them into the computer and through the neural network. The results were completely random. For a long time nobody could figure out why. After all nobody understood how the neural had trained itself. Eventually someone noticed that in the original set of 200 photos, all the images with tanks had been taken on a cloudy day while all the images without tanks had been taken on a sunny day. 
The neural network had been asked to separate the two groups of photos and it had chosen the most obvious way to do it—not by looking for a camouflaged tank hiding behind a tree, but merely by looking at the color of the sky…This story might be apocryphal, but it doesn’t really matter. It is a perfect illustration of the biggest problem behind neural networks. Any automatically trained net with more than a few dozen neurons is virtually impossible to analyze and understand.

Tom White attributes (in October 2017) to Marvin Minsky some version of the tank story being told in MIT classes 20 years before, ~1997 (but doesn’t specify the detailed story or version other than apparently the results were “classified”).

Vasant Dhar & Roger Stein, Intelligent Decision Support Methods, 1997 (pg98, limited Google Books snippet):

…However, when a new set of photographs were used, the results were horrible. At first the team was puzzled. But after careful inspection of the first two sets of photographs, they discovered a very simple explanation. The photos with tanks in them were all taken on sunny days, and those without the tanks were taken on overcast days. The network had not learned to identify tank like images; instead, it had learned to identify photographs of sunny days and overcast days.

Royston Goodacre, Mark J. Neal, & Douglas B. Kell, “Quantitative Analysis of Multivariate Data Using Artificial Neural Networks: A Tutorial Review and Applications to the Deconvolution of Pyrolysis Mass Spectra”, 1994-04-29:

…As in all other data analysis techniques, these supervised learning methods are not immune from sensitivity to badly chosen initial data (113). [113: Zupan, J. and J. Gasteiger: Neural Networks for Chemists: An Introduction. VCH Verlagsgesellschaft, Weinheim (1993)] Therefore the exemplars for the training set must be carefully chosen; the golden rule is “garbage in—garbage out”. An excellent example of an unrepresentative training set was discussed some time ago on the BBC television programme Horizon; a neural network was trained to attempt to distinguish tanks from trees. Pictures were taken of forest scenes lacking military hardware and of similar but perhaps less bucolic landscapes which also contained more-or-less camouflaged battle tanks. A neural network was trained with these input data and found to differentiate most successfully between tanks and trees. However, when a new set of pictures was analysed by the network, it failed to distinguish the tanks from the trees. After further investigation, it was found that the first set of pictures containing tanks had been taken on a sunny day whilst those containing no tanks were obtained when it was overcast. The neural network had therefore thus learned simply to recognise the weather! We can conclude from this that the training and tests sets should be carefully selected to contain representative exemplars encompassing the appropriate variance over all relevant properties for the problem at hand.

Fernando Pereira, “neural redlining”, RISKS 16(41), 1994-09-12:

Fred’s comments will hold not only of neural nets but of any decision model trained from data (eg. Bayesian models, decision trees). It’s just an instance of the old “GIGO” phenomenon in statistical modeling…Overall, the whole issue of evaluation, let alone certification and legal standing, of complex statistical models is still very much open. (This reminds me of a possibly apocryphal story of problems with biased data in neural net training. Some US defense contractor had supposedly trained a neural net to find tanks in scenes. The reported performance was excellent, with even camouflaged tanks mostly hidden in vegetation being spotted. However, when the net was tested on yet a new set of images supplied by the client, the net did not do better than chance. After an embarrassing investigation, it turned out that all the tank images in the original training and test sets had very different average intensity than the non-tank images, and thus the net had just learned to discriminate between two image intensity levels. Does anyone know if this actually happened, or is it just in the neural net “urban folklore”?)
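The failure mode described here, two classes that differ in average image intensity, suggests a cheap sanity check before any model is trained. The sketch below is an illustrative assumption rather than a method from any of the quoted sources: the hypothetical helper `best_threshold_accuracy` asks how well a single cut on mean intensity separates the labels. A score near 1.0 hints that a classifier could succeed without looking at the objects at all.

```python
def mean_intensity(image):
    """Average pixel value of an image given as a flat list of numbers."""
    return sum(image) / len(image)

def best_threshold_accuracy(images, labels):
    """Best accuracy of any single cut on mean intensity, in either direction."""
    values = [mean_intensity(im) for im in images]
    best = 0.0
    for t in values:
        hits = sum((v <= t) == y for v, y in zip(values, labels))
        acc = hits / len(values)
        best = max(best, acc, 1 - acc)  # 1 - acc is the same cut with labels flipped
    return best

# Toy data (made up): two dark images and two bright images.
images = [[0.2, 0.3], [0.25, 0.35], [0.7, 0.8], [0.75, 0.9]]

# Confounded labelling: every "tank" image is dark, every "no tank" image is bright.
confounded = best_threshold_accuracy(images, [True, True, False, False])

# Mixed labelling: brightness no longer lines up with the label.
mixed = best_threshold_accuracy(images, [True, False, True, False])
```

On this toy data the confounded labelling is perfectly separable by brightness alone, while the mixed labelling is not. With only four images even the mixed case scores well above one half, so in practice the check would need many more images to be informative.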

Erich Harth, The Creative Loop: How the Brain Makes a Mind, 1993/1995 (pg158, limited Google Books snippet):

…55. The net was trained to detect the presence of tanks in a landscape. The training consisted in showing the device many photographs of scene, some with tanks, some without. In some cases—as in the picture on page 143—the tank’s presence was not very obvious. The inputs to the neural net were digitized photographs;

Hubert L. Dreyfus & Stuart E. Dreyfus, “What Artificial Experts Can and Cannot Do”, 1992:

All the “continue this sequence” questions found on intelligence tests, for example, really have more than one possible answer but most human beings share a sense of what is simple and reasonable and therefore acceptable. But when the net produces an unexpected association can one say it has failed to generalize? One could equally well say that the net has all along been acting on a different definition of “type” and that that difference has just been revealed. For an amusing and dramatic case of creative but unintelligent generalization, consider the legend of one of connectionism’s first applications. In the early days of the perceptron the army decided to train an artificial neural network to recognize tanks partly hidden behind trees in the woods. They took a number of pictures of a woods without tanks, and then pictures of the same woods with tanks clearly sticking out from behind trees. They then trained a net to discriminate the two classes of pictures. The results were impressive, and the army was even more impressed when it turned out that the net could generalize its knowledge to pictures from each set that had not been used in training the net. Just to make sure that the net had indeed learned to recognize partially hidden tanks, however, the researchers took some more pictures in the same woods and showed them to the trained net. They were shocked and depressed to find that with the new pictures the net totally failed to discriminate between pictures of trees with partially concealed tanks behind them and just plain trees. The mystery was finally solved when someone noticed that the training pictures of the woods without tanks were taken on a cloudy day, whereas those with tanks were taken on a sunny day. The net had learned to recognize and generalize the difference between a woods with and without shadows! Obviously, not what stood out for the researchers as the important difference. 
This example illustrates the general point that a net must share size, architecture, initial connections, configuration and socialization with the human brain if it is to share our sense of appropriate generalization.

Hubert Dreyfus appears to have told this story earlier, in 1990 or 1991, since a similar story appears in episode 4 (German) (starting 33m49s) of the BBC documentary series The Machine That Changed the World, broadcast 1991-11-08. Hubert L. Dreyfus, What Computers Still Can’t Do: A Critique of Artificial Reason, 1992, repeats the story in very similar but not quite identical wording (Jeff Kaufman notes that Dreyfus drops the qualifying “legend of” description):

…But when the net produces an unexpected association, can one say that it has failed to generalize? One could equally well say that the net has all along been acting on a different definition of “type” and that that difference has just been revealed. For an amusing and dramatic case of creative but unintelligent generalization, consider one of connectionism’s first applications. In the early days of this work the army tried to train an artificial neural network to recognize tanks in a forest. They took a number of pictures of a forest without tanks and then, on a later day, with tanks clearly sticking out from behind trees, and they trained a net to discriminate the two classes of pictures. The results were impressive, and the army was even more impressed when it turned out that the net could generalize its knowledge to pictures that had not been part of the training set. Just to make sure that the net was indeed recognizing partially hidden tanks, however, the researchers took more pictures in the same forest and showed them to the trained net. They were depressed to find that the net failed to discriminate between the new pictures of trees with tanks behind them and the new pictures of just plain trees. After some agonizing, the mystery was finally solved when someone noticed that the original pictures of the forest without tanks were taken on a cloudy day and those with tanks were taken on a sunny day. The net had apparently learned to recognize and generalize the difference between a forest with and without shadows! This example illustrates the general point that a network must share our commonsense understanding of the world if it is to share our sense of appropriate generalization.

Dreyfus’s What Computers Still Can’t Do is listed as a revision of his 1972 book, What Computers Can’t Do: A Critique of Artificial Reason, but the tank story is not in the 1972 book, only the 1992 one. (Dreyfus’s version is also quoted in the 2017 NYT article and Hillis 1996’s Geography, Identity, and Embodiment in Virtual Reality, pg346.)

Laveen N. Kanal, in the Foreword to Artificial Neural Networks and Statistical Pattern Recognition: Old and New Connections, 1991, discusses some early NN/tank research (predating not just LeCun’s convolutions but backpropagation):

…[Frank] Rosenblatt had not limited himself to using just a single Threshold Logic Unit but used networks of such units. The problem was how to train multilayer perceptron networks. A paper on the topic written by Block, Knight and Rosenblatt was murky indeed, and did not demonstrate a convergent procedure to train such networks. In 1962–63 at Philco-Ford, seeking a systematic approach to designing layered classification nets, we decided to use a hierarchy of threshold logic units with a first layer of “feature logics” which were threshold logic units on overlapping receptive fields of the image, feeding two additional levels of weighted threshold logic decision units. The weights in each level of the hierarchy were estimated using statistical methods rather than iterative training procedures [L.N. Kanal & N.C. Randall, “Recognition System Design by Statistical Analysis”, Proc. 19th Conf. ACM, 1964]. We referred to the networks as two layer networks since we did not count the input as a layer. On a project to recognize tanks in aerial photography, the method worked well enough in practice that the U.S. Army agency sponsoring the project decided to classify the final reports, although previously the project had been unclassified. We were unable to publish the classified results! Then, enamored by the claimed promise of coherent optical filtering as a parallel implementation for automatic target recognition, the funding we had been promised was diverted away from our electro-optical implementation to a coherent optical filtering group. Some years later we presented the arguments favoring our approach, compared to optical implementations and trainable systems, in an article titled “Systems Considerations for Automatic Imagery Screening” by T.J. Harley, L.N. Kanal and N.C. Randall, which is included in the IEEE Press reprint volume titled Machine Recognition of Patterns edited by A. Agrawala 1977.
In the years which followed multilevel statistically designed classifiers and AI search procedures applied to pattern recognition held my interest, although comments in my 1974 survey, “Patterns In Pattern Recognition: 1968–1974” [IEEE Trans. on IT, 1974], mention papers by Amari and others and show an awareness that neural networks and biologically motivated automata were making a comeback. In the last few years trainable multilayer neural networks have returned to dominate research in pattern recognition and this time there is potential for gaining much greater insight into their systematic design and performance analysis…

Kanal & Randall 1964 matches the tank story in some ways, including the image counts, but not in others. There is no mention of failure in either the paper or Kanal’s 1991 reminiscences (rather, Kanal implies the project was highly promising), and no mention of a field deployment or additional testing which could have revealed overfitting. Given their use of binarizing, it’s not clear to me that their 2-layer algorithm even could overfit to global brightness. The photos also appear to have been taken at low enough altitude for there to be no clouds, and under similar (possibly controlled) lighting conditions. The description in Kanal & Randall 1964 is somewhat opaque to me, particularly the ‘Laplacian’ they use to binarize or convert to edges, but there’s more background in their “Semi-Automatic Imagery Screening Research Study and Experimental Investigation, Volume 1”, Harley, Bryan, Kanal, Taylor & Grayum 1962. That report indicates that in their preliminary studies they were already interested in prenormalization/preprocessing of images to correct for altitude and brightness, and in the Laplacian, along with silhouetting and “lineness editing”, noting that “The Laplacian operation eliminates absolute brightness scale as well as low-spatial frequencies which are of little consequence in screening operations.”
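To make the brightness point concrete, here is a minimal sketch of why a Laplacian-then-binarize step would discard a uniform brightness shift. It is only an illustration under stated assumptions: the 4-neighbour kernel, the zero threshold, the toy 5x5 patches, and the function names are all hypothetical, not taken from Kanal & Randall 1964 or the 1962 report, whose exact operator I have not reconstructed. A real cloudy/sunny difference also involves shadows, which are not a uniform shift and which this sketch does not address.

```python
# Illustrative sketch only. The 4-neighbour discrete Laplacian below is an
# assumption, not necessarily the operator Kanal & Randall actually used.

def laplacian(image):
    """Return the 4-neighbour Laplacian of the interior pixels of a 2D list."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(1, rows - 1):
        row = []
        for c in range(1, cols - 1):
            row.append(
                image[r - 1][c] + image[r + 1][c]
                + image[r][c - 1] + image[r][c + 1]
                - 4 * image[r][c]
            )
        out.append(row)
    return out


def binarize(response, threshold=0):
    """Map each response to 1 if its magnitude exceeds threshold, else 0."""
    return [[1 if abs(v) > threshold else 0 for v in row] for row in response]


def mean_brightness(image):
    """Average pixel value: the shortcut feature in the tank legend."""
    pixels = [v for row in image for v in row]
    return sum(pixels) / len(pixels)


# Hypothetical "cloudy day" patch containing one vertical edge.
cloudy = [[10, 10, 50, 50, 50] for _ in range(5)]
# The same scene with every pixel uniformly brighter ("sunny day").
sunny = [[v + 100 for v in row] for row in cloudy]

# Raw pixels differ in mean brightness, so a classifier on raw pixels
# could separate the two patches without looking at any structure.
brightness_gap = mean_brightness(sunny) - mean_brightness(cloudy)

# The kernel weights sum to zero, so the constant offset cancels and the
# binarized edge maps are identical.
same_edges = binarize(laplacian(sunny)) == binarize(laplacian(cloudy))
```

Here `brightness_gap` is nonzero while `same_edges` is true: whatever a downstream classifier learns from the binarized edge map, it cannot be the uniform offset, which is consistent with the doubt above that such a pipeline could overfit to global brightness.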

An anonymous reader says he heard the story in 1990: