Neural networks (NNs), recently referred to as deep learning, only work "effectively" with data that is produced by a process describable as a continuous function.

My article could actually stop here, with that one sentence. However, there is so much hype, sadly, keeping the entire AI industry busy, not to mention announcements from big players like Google and IBM. Not knowing exactly what they are doing forces us to give them the benefit of the doubt for now. Nevertheless, NNs are not a natural fit for natural language and knowledge representation, as I explain below in layman's terms.

This simple notion of a continuous function, for some reason, is not registering in the minds of people who are fascinated by the inner workings of NNs but pay no attention to the "data" aspect. It is equally possible that most engineers with hands-on NN experience have only used data from continuous-function processes, and so do not realize the immense variability and the discontinuities of language.

From the perspective of word-level analysis, natural language is not produced by a continuous function! It is produced by a process with as many discontinuities as there are linguistic rules, logical inferences, and asymptotic decision boundaries.

What is a continuous function (or continuous data)? It is a sequence in which each item is related to the ones before and after it by the underlying process. Stock prices, weather temperatures, topological maps, any kind of imagery, sound waves, projectile motion, and many other such data types are continuous, even though they can be highly nonlinear.
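
To make the distinction concrete, here is a minimal sketch (the temperature series and the tiny vocabulary are invented purely for illustration): in data from a continuous process, neighboring values are constrained by the underlying function, whereas word IDs in a sentence have no such relationship.

```python
import numpy as np

# Continuous-process data: neighboring samples are strongly related because
# the underlying function (a toy daily temperature cycle) is smooth.
hours = np.arange(0, 48)
temperature = 15 + 8 * np.sin(2 * np.pi * hours / 24)
print(np.max(np.abs(np.diff(temperature))))  # small, bounded steps

# Word-level "data": integer IDs assigned to the words of a sentence.
# The numeric distance between neighboring IDs is an artifact of the
# vocabulary, not of meaning -- there is no smooth process behind it.
vocab = {"Mary": 0, "loves": 1, "her": 2, "cats": 3, "quasar": 4071}
sentence = ["Mary", "loves", "her", "cats"]
ids = np.array([vocab[w] for w in sentence])
print(np.diff(ids))  # step sizes carry no signal-like information
```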

Natural Language is Not Continuous at the Word Level

The sequence "Mary loves her cats" is not continuous at the word level, ontologically. The proximity of the words does not guarantee that they describe each other (i.e., Mary can love anything). So the proximity of Mary and cats is incidental rather than ontological, which makes the connection arbitrary and useless, and makes NN training on it pointless. Language becomes more continuous at the concept level, such as in {onomasticon-Mary: human female} {event concept-love: emotional attachment} {noun concept-cats: pets}. The concepts next to each other show much better agreement, which points to a relationship: in this particular case, a woman is emotionally attached to pets (cats). If NNs operated at this level, then this sentence would be useful in training.
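
A rough sketch of the two views, with the concept-level entries hand-written here to mirror the example above (a real ontological parser would have to produce them automatically):

```python
# Word-level view: an arbitrary sequence of surface tokens.
word_level = ["Mary", "loves", "her", "cats"]

# Concept-level view (hypothetical ontological parse, hand-built for this
# example; the labels follow the notation used in the text).
concept_level = [
    {"token": "Mary",  "type": "onomasticon",   "concept": "human female"},
    {"token": "loves", "type": "event concept", "concept": "emotional attachment"},
    {"token": "cats",  "type": "noun concept",  "concept": "pets"},
]

# At the concept level, adjacency is meaningful: the parse asserts that a
# human female is emotionally attached to pets, regardless of which surface
# words ("adores", "kittens", ...) happened to be used.
for entry in concept_level:
    print(f'{entry["type"]:>14}: {entry["concept"]}')
```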

The difference between these two approaches is obviously huge. The former suffers from word-level discontinuity, whereas the latter shows concept-level (meaning-level) continuity. By the way, concept representation is not syntactic parsing. It is ontological parsing, which is a colossal, expensive, and labor-intensive task. That is why the NN approach, the way it has been used for the last three decades (word-level analysis), falls short of being a silver bullet.

What if We Do it, Anyway?

Below is a typical, multi-layer NN. If we feed both ends of this system with pages of text (perhaps for Q/A, or for translation from one language to another) and apply a learning algorithm like back-propagation, the NN may eventually converge. But what did we accomplish, really?

We mapped a bunch of symbols in the input space to another bunch of symbols in the output space. This is nothing but associating concepts via word (or proximity) repetition rather than actually understanding the concepts ontologically. Can this mapping recognize different senses of the words? Vectorizing the data, or using some other NN architecture, will not change this argument. Since there are discontinuities in the data, the NN will be forced to allocate a memory unit (like a neuron) for each single decision boundary (corresponding to as many discontinuities as possible). This is no different than using a database and putting every symbol into an allocated slot. It utterly defeats the purpose of NNs by nullifying their associative-memory advantage.
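
As a deliberately oversimplified sketch of this "database" behavior (toy vocabulary and a single linear layer, both invented here), a network mapping one-hot symbols to one-hot symbols with no shared structure is nothing more than a lookup matrix:

```python
import numpy as np

# Toy "training corpus": each input symbol must map to one output symbol.
pairs = [("Mary", "Marie"), ("loves", "aime"), ("cats", "chats")]
src_vocab = {w: i for i, (w, _) in enumerate(pairs)}
tgt_vocab = {w: i for i, (_, w) in enumerate(pairs)}

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

# With discontinuous symbol pairs there is no structure to share, so the
# weight matrix degenerates into one entry per memorized pair -- a lookup
# table rather than an associative memory.
W = np.zeros((len(tgt_vocab), len(src_vocab)))
for s, t in pairs:
    W[tgt_vocab[t], src_vocab[s]] = 1.0

inv_tgt = {i: w for w, i in tgt_vocab.items()}
for s, _ in pairs:
    out = W @ one_hot(src_vocab[s], len(src_vocab))
    print(s, "->", inv_tgt[int(np.argmax(out))])

# Any symbol outside the memorized set has no meaningful neighbors to fall
# back on, so the mapping has nothing useful to say about it.
```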

When it comes to testing it (recall) using input data that differs from the training set, the results will most likely be gibberish, because in discontinuous data there are no neighboring relationships to be absorbed and utilized. If you test it with very similar data, you may get some recognizable results here and there. In other words, its validity will be limited by how similar the input is to the training data set.

With a continuous data set, on the other hand, NNs show much greater ability and flexibility to reproduce what they were trained on. Their memory becomes an associative memory (unlike a database) with robustness properties, similar to the biological brain. Such a NN can interpolate, and even extrapolate to a degree, nicely. That is why NNs have had so much to offer in engineering applications, although it is evident that NNs cannot "generalize," which is another term for true learning.

Even in engineering applications using continuous data, neural networks are known to fail to generalize outside their training data range.
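
A minimal numpy sketch of this interpolation/extrapolation behavior (a toy one-hidden-layer network, invented here for illustration; not any particular production system):

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data from a continuous process: y = sin(x) on [-pi, pi].
x_train = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y_train = np.sin(x_train)

# Tiny one-hidden-layer tanh network trained with plain gradient descent.
W1 = rng.normal(0, 0.5, (1, 32)); b1 = np.zeros(32)
W2 = rng.normal(0, 0.5, (32, 1)); b2 = np.zeros(1)
lr = 0.1

for _ in range(20000):
    h = np.tanh(x_train @ W1 + b1)           # forward pass
    pred = h @ W2 + b2
    err = (pred - y_train) / len(x_train)    # mean-squared-error gradient
    gW2 = h.T @ err; gb2 = err.sum(0)
    dh = (err @ W2.T) * (1 - h ** 2)
    gW1 = x_train.T @ dh; gb1 = dh.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

def predict(x):
    return np.tanh(x @ W1 + b1) @ W2 + b2

# Inside the training range the fit is close to sin(x)...
print(float(predict(np.array([[1.0]]))), np.sin(1.0))
# ...well outside the range the prediction drifts away from the true function.
print(float(predict(np.array([[3 * np.pi]]))), np.sin(3 * np.pi))
```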

Some engineers may also overlook the fact that vectorization, dimensionality reduction, and other kernel approaches are methods of convenience, turning the original data into processable, continuous data at the expense of information loss. The rate of information loss may be tolerable for feature-detection and categorization problems, especially for imagery data, but it will not be tolerable in the case of language. Text, a sequence of words, is not the byproduct of a statistical process over words; it is the byproduct of a cognitive process over concepts. Statistics helps, but it does not solve the problem: it helps with the fat-tail phenomena of language, yet when vectorization is used the NN approach remains mostly dormant outside the training zone. The bottom line: word-level analysis (using NNs of any kind, or any statistical approach) will not fully work unless these methods move on to concept-level analysis.
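
One hedged illustration of that information loss (the vectors and sentences below are made up; any static, one-vector-per-word embedding table behaves this way): a single vector per surface form necessarily collapses every sense of a word such as "fall" onto one point.

```python
# Hypothetical static word vectors (values invented for illustration).
embeddings = {
    "fall": [0.21, -0.73, 0.05],    # season? to trip? a decline in price?
    "autumn": [0.20, -0.70, 0.10],
    "trip": [-0.40, 0.33, 0.82],
}

sentences = [
    "leaves turn red in the fall",   # season sense
    "careful or you will fall",      # motion sense
    "markets fall on bad news",      # decline sense
]

# The same vector is retrieved for "fall" in all three sentences, so the
# sense distinction present in the cognitive process is already lost
# before any network sees the data.
for s in sentences:
    print(s, "->", embeddings["fall"])
```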

Is Google Hyping it?

The recent news about Google's success in using NNs for translation between English and Japanese cannot be judged by the example published in the article, because such judgments cannot be made without actually seeing the degree of deviation from the training set. In other words, we cannot tell how similar the training data was to the translated text. Not only is this important reference point missing, the degree of generalization must also be shown. Does this system work only within a very narrow domain of the training set? When it fails, how does it fail? What is the frequency of failure? One thing is for sure: Google is not trying to impress the scientific community (or is avoiding scrutiny); otherwise they would reveal these important metrics. This was the extent of their reporting: "Rekimoto promoted his discovery to his hundred thousand or so followers on Twitter, and over the next few hours thousands of people broadcast their own experiments with the machine-translation service. Some were successful, others meant mostly for comic effect."

Data Size: A Typical Under-Estimation About Natural Languages

Another way to look at this problem is data size. Let's assume we are training a NN using a data set of 10,000 pages of text. The total number of pages of all possible knowledge in the world (ever) is an incalculable number, but let's assume it is a sextillion: 1,000,000,000,000,000,000,000. The question then is: how much of those sextillion pages can be handled by a NN trained on 10,000 pages? The answer is obviously a vanishingly small fraction of the whole.
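
The arithmetic, spelled out (using the sextillion figure assumed above):

```python
training_pages = 10_000        # assumed training set size
all_pages = 10 ** 21           # the sextillion pages assumed above

coverage = training_pages / all_pages
print(f"coverage = {coverage:.0e}")  # 1e-17 of the whole, i.e. 0.000000000000001 %
```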

On the other hand, the grammar rules and ontological semantics mastered by the human brain can handle the entire sextillion (since those pages were written by humans). If you know how to read and write, the entire sextillion will be understandable to you. This is the horrifying gap between the capabilities of the human brain and the current state of neural networks.

The assumption that we can handle natural language and knowledge representation by sheer data crunching using neural networks, which are known to be fruitful only within their training data range, is ridiculously optimistic.

Lacking the Brick & Bridge Concept

Obviously, we need a learning method that captures grammar rules and ontological semantics rather than a mapping of symbols. However, this requires a more advanced approach than our conventional NNs. I can explain this better with the brick & bridge analogy.

The main problem with NNs is that, while they are inspired by the biological structure of the human brain, they model it only at the brick level. The NN diagram above is equivalent to the brick section shown below. If you look at a group of bricks forming a bridge, there is a higher level of functionality. Unfortunately, our microscopic view of some of the 30 billion neurons in the brain is not good enough to understand the "bridge-type" structures in the brain and their individual functions, which may be a key to solving complex problems like language understanding. You may view this idea as a "network of networks," rather than as variants obtained by changing the depth or the feedforward connections.

From the inspirational point of view, we will be waiting for advances in biotech to understand these structures of the brain, which could then inform our NN implementations. Without such clues, we should call it "deep darkness" rather than "deep learning."

One thing is clear: our current level of sophistication with NNs is far too rudimentary, and any claim by a company that suggests otherwise is destined to be labeled "hype." Is Google hyping it? Most likely yes, unless they have a secret biological research facility that is a century ahead of its closest competition, and unless they have discovered the bridges.

A follow-up article, How does IBM Watson Compare to Google's Hype of Deep Learning for NLP?, has just been published; it takes the argument one step further.

NOTES for Comments

A follow-up article has been published to expand some of the arguments, specifically in reference to the Google articles.

Some comments on this article speculate about why NNs actually do work effectively for NLP. If something works, the burden of proof requires a working example, not speculation. A link to a translator, summarizer, categorizer, or search engine where, say, a dozen different senses of the English word FALL are disambiguated by a NN application for a given random sentence would be one way. Or you could point to a scientific article demonstrating disambiguation on random sentences via scientific measurements. Comments will only be meaningful and helpful to the AI community with this sort of attention.

We should approach the NLP problem outside the single-minded box of the NN approach, in which language is treated as a signal-processing problem while the process that created the signal is discarded. NNs will eventually prove effective once creative, hybrid approaches emerge that do not leave out modeling of the cognitive process.

Also, the article is not about the hype behind NNs; it is about the PR style of Google, which gives the impression that an artificial "Google" brain based on DL has finally been invented, one that can make its own decisions, write its own code, and solve the most complicated cognitive problems. I personally believe this is not an achievement to be expected at this stage of the game, given the examples presented. A little modesty and a realistic view of future outcomes are necessary. Otherwise, this type of PR potentially damages the AI industry, and people who make a living from AI technology will suffer the consequences later.
