







Vowel Recognition Network

Two neural networks were trained on vowel data available at the CMU repository. This data consists of 10 log-area parameters of the reflection coefficients of 11 different steady-state vowel sounds. Our interest in this example was to gauge the effect of using input spaces of different dimensionality: does adding dimensions improve the generalization of the network? Both networks were trained to recognize the vowel sound in "had" as contrasted with the other 10 vowel sounds.



The first network received the first two coefficients as input. After training, it achieved a better than 86% success rate on the training data. The following figure illustrates its decision regions. In the input region [3.2, 2.3] x [0.7, 1.2] we see that many disparate decision regions are used to secure perfect classification.
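A picture of this kind is straightforward to generate. The following is a minimal sketch, not the original experiment's code: a randomly initialized 2-8-1 sigmoid network stands in for the trained vowel network, and the window bounds follow the region discussed above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sketch only: visualize the binary decision regions of a two-input
# network over a rectangular window of the input plane.  `forward` is a
# stand-in for the trained vowel network; here it is a randomly
# initialized 2-8-1 sigmoid network purely for illustration.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 2)), rng.normal(size=8)
w2, b2 = rng.normal(size=8), rng.normal()

def forward(x):
    h = 1.0 / (1.0 + np.exp(-(x @ W1.T + b1)))   # hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))  # output unit

# Evaluate on a dense grid over the window of interest (bounds assumed).
x1 = np.linspace(2.3, 3.2, 300)
x2 = np.linspace(0.7, 1.2, 300)
X1, X2 = np.meshgrid(x1, x2)
grid = np.column_stack([X1.ravel(), X2.ravel()])
labels = (forward(grid) > 0.5).astype(float).reshape(X1.shape)

plt.contourf(X1, X2, labels, levels=[-0.5, 0.5, 1.5])
plt.xlabel("coefficient 1"); plt.ylabel("coefficient 2")
plt.title("Decision regions ('had' vs. rest)")
plt.show()
```

Any trained two-input network can be substituted by replacing `forward` with its own forward pass.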



The second network received the first four coefficients as inputs. It achieved a success rate of over 91% on the training data. It also achieved perfect classification in the same input region; however, it does not appear to have partitioned the space, as there is only one decision region in this area. If we slice this polytope using four hyperplanes bisecting each of its dimensions, we get only 10 sub-polytopes, rather than the 16 we would expect for a perfectly convex shape, suggesting a small degree of concavity. In addition, these sub-decision regions are mostly delimited in the third and fourth dimensions, while the first two dimensions are ignored (they span the whole space). It therefore appears that the network makes use of the added dimensions to form a more regular decision region in that difficult portion of the input space.
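A rough way to reproduce a count of this kind is to sample the bounding box of the region, classify each sample against the four bisecting hyperplanes, and count the occupied cells. The sketch below does exactly that; it is only an approximation (it counts occupied cells rather than true connected sub-polytopes, so it agrees only for regions without badly disconnected pieces), and the membership test `in_region` is a hypothetical stand-in for a thresholded network output.

```python
import numpy as np

# Rough sketch (an approximation, not the paper's procedure): slice a
# 4-D decision region with the four hyperplanes that bisect its bounding
# box, and count how many of the 2^4 = 16 cells the region occupies.
# `in_region` is a hypothetical membership test, e.g. a thresholded
# network output restricted to the class of interest.
def count_occupied_cells(in_region, lo, hi, n_samples=100_000, seed=0):
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    mid = (lo + hi) / 2.0                      # bisecting hyperplanes
    rng = np.random.default_rng(seed)
    pts = rng.uniform(lo, hi, size=(n_samples, lo.size))
    inside = in_region(pts)                    # boolean mask over samples
    # Each in-region point falls into one cell according to its side of
    # every bisecting hyperplane; count the distinct cells reached.
    occupied = np.unique(pts[inside] > mid, axis=0)
    return len(occupied)

# Toy usage: an axis-aligned box spanning the whole bounding box is
# convex and occupies all 16 cells.
box = lambda p: np.all((p > 0.1) & (p < 0.9), axis=1)
print(count_occupied_cells(box, lo=[0] * 4, hi=[1] * 4))   # -> 16
```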











Algorithm and Network Complexity

The Decision Intersection Boundary Algorithm's complexity stems from its traversal of the hyperplane arrangement induced by the first layer of hidden units. As such, that part of its complexity is comparable to that of similar algorithms, such as arrangement construction [Edelsbrunner 87], which are exponential in the input dimension.
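To put a number on this: for n first-layer hyperplanes in general position in d dimensions, every choice of d hyperplanes meets in exactly one vertex, so a traversal that visits every vertex must examine C(n, d) of them, which grows roughly as n^d. The snippet below simply evaluates this count; the hidden-unit count of 20 is an arbitrary illustration, not a figure from the experiments.

```python
from math import comb

# Vertices of an arrangement of n hyperplanes in general position in
# d dimensions: every subset of d hyperplanes intersects in one point.
def arrangement_vertices(n_hyperplanes, dim):
    return comb(n_hyperplanes, dim)

# With 20 first-layer hidden units the vertex count explodes with the
# input dimension (illustrative numbers only).
for d in (2, 4, 6, 10):
    print(d, arrangement_vertices(20, d))
# 2 190, 4 4845, 6 38760, 10 184756
```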



Do we really need to examine every vertex, though? Perhaps the network can only form a small number of decision regions, or is limited in the complexity of each decision region. The following figure demonstrates, by an inductive construction, that a network is capable of exponential complexity: a network with an exponential number of decision regions, each of which has an exponential number of vertices. The figure shows the inductive step from one dimension to two dimensions. The zeros represent the decision regions (line segments in 1-D, squares in 2-D); a code sketch of the 2-D step follows.
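One way to realize the 2-D step of such a construction (our own reconstruction, not necessarily the exact weights behind the figure) is the following: four threshold units per input dimension form a "comb" along that axis, and the zero class appears wherever every comb is low. In 2-D this yields a 3 x 3 grid of nine square regions from only eight hidden units; each added dimension multiplies the region count by three while adding only four more units.

```python
import numpy as np

# Reconstruction sketch: alternating +1/-1 threshold units along each
# axis form a comb, and the zero-output class covers the cells where
# every comb is low -- nine regions in 2-D from eight hidden units.
step = lambda z: (np.asarray(z) > 0).astype(float)

def comb(x, thresholds=(1.0, 2.0, 3.0, 4.0)):
    # Equals 1 on (1, 2) and (3, 4); 0 elsewhere.
    t1, t2, t3, t4 = thresholds
    return step(x - t1) - step(x - t2) + step(x - t3) - step(x - t4)

def output_class(x1, x2):
    # Class 0 ("zero" region) wherever both combs are low.
    return (comb(x1) + comb(x2) > 0.5).astype(int)

def count_low_intervals(values):
    # Maximal runs where a 1-D comb sweep is zero.
    low = values < 0.5
    return int(low[0] + np.sum(low[1:] & ~low[:-1]))

xs = np.linspace(0.0, 5.0, 501)
per_axis = count_low_intervals(comb(xs))
print(per_axis, per_axis ** 2)   # -> 3 9 zero-regions (3**d in d dims)
print(output_class(np.asarray(2.5), np.asarray(4.5)),   # inside a zero-region -> 0
      output_class(np.asarray(1.5), np.asarray(1.5)))   # both combs high -> 1
```

The same nine-region target reappears as the training task in the learning experiment below.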













Generalization and Learning

Generalization is the ability of the network to correctly classify points in the input space that it was not explicitly trained on. In a semi-parametric model like a neural network, generalization is the ability to describe the correct output for groups of points without explicitly accounting for each point individually. Thus, the model must employ some underlying mechanism to classify a large number of points using a smaller set of parameters.



In our feed-forward neural network model we can characterize the possible forms of generalization with two mechanisms. The first is proximity: nearby points in the same decision region are classified the same way. The second is face sharing: the same hidden-unit hyperplane is used as a bordering face either in multiple decision regions or multiple times within a single decision region.



Given these two mechanisms, how well do learning algorithms exploit them? It is intuitive that learning algorithms which pull borders (hyperplanes in this case) towards similar output points and away from different output points should be geared toward proximity generalization, forming decision regions. Face sharing, however, is more combinatorial in nature, and might not be as amenable to the continuous deformation of parameters found in many algorithms.



To illustrate this point, a group of neural networks with 8 hidden units was trained on the decision regions illustrated in the 2-D complexity figure. One thousand data points were taken as a training sample, and two hundred different networks were tested. Each was initialized with random weights between -2 and 2 and trained for 300 online epochs of backpropagation with momentum. Of the 200 networks, none managed to learn the desired 9 decision regions. The network with the best performance generated 6 decision regions (left figure), up from an initial two. However, examining its output unit's weights, we see that only one weight changed sign, the weight initially closest to zero, indicating a predisposition toward this configuration in the initial conditions. The other figures show the decision regions of the more successful networks.
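For concreteness, the setup can be sketched as follows. This is our own reconstruction: the architecture, weight range, epoch count and sample size follow the description above, while the learning rate, momentum value and input window are assumptions, since they are not given.

```python
import numpy as np

# Reconstruction sketch of the experiment (hyperparameters assumed).
rng = np.random.default_rng(0)

def target(x):
    # Label 1 outside the nine zero-regions of the comb construction.
    comb = lambda v: ((v > 1) & (v < 2)) | ((v > 3) & (v < 4))
    return (comb(x[:, 0]) | comb(x[:, 1])).astype(float)

X = rng.uniform(0.0, 5.0, size=(1000, 2))          # 1000 training points
y = target(X)

sig = lambda z: 1.0 / (1.0 + np.exp(-z))

# 2-8-1 network, weights initialized uniformly in [-2, 2].
W1 = rng.uniform(-2, 2, size=(8, 2)); b1 = rng.uniform(-2, 2, size=8)
w2 = rng.uniform(-2, 2, size=8);      b2 = rng.uniform(-2, 2)

lr, mom = 0.05, 0.9                                # assumed values
vW1 = np.zeros_like(W1); vb1 = np.zeros_like(b1)
vw2 = np.zeros_like(w2); vb2 = 0.0

for epoch in range(300):                           # 300 online epochs
    for i in rng.permutation(len(X)):
        x, t = X[i], y[i]
        h = sig(W1 @ x + b1)
        o = sig(w2 @ h + b2)
        # Backpropagation of squared error, with momentum.
        do = (o - t) * o * (1 - o)
        dh = do * w2 * h * (1 - h)
        vw2 = mom * vw2 - lr * do * h;            w2 = w2 + vw2
        vb2 = mom * vb2 - lr * do;                b2 = b2 + vb2
        vW1 = mom * vW1 - lr * np.outer(dh, x);   W1 = W1 + vW1
        vb1 = mom * vb1 - lr * dh;                b1 = b1 + vb1

pred = (sig(sig(X @ W1.T + b1) @ w2 + b2) > 0.5).astype(float)
print("training accuracy:", np.mean(pred == y))
```

The resulting network's decision regions can then be plotted, in the manner of the earlier visualization sketch, and compared against the nine-region target.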

