$\begingroup$

In the paper "Deep Learning and the Information Bottleneck Principle", the authors state the following in section II A:

Single neurons classify only linearly separable inputs, as they can implement only hyperplanes in their input space $u = wh+b$. Hyperplanes can optimally classify data when the inputs are conditionally independent.

To show this, they derive the following. Using Bayes' theorem, they get:

$p(y|x) = \frac{1}{1 + \exp\left(-\log\frac{p(x|y)}{p(x|y')} - \log\frac{p(y)}{p(y')}\right)}$ (1)

where $x$ is the input, $y$ is the class, and $y'$ is, I assume, the predicted class ($y'$ is not defined in the paper). They then state that:

$\frac{p(x|y)}{p(x|y')} = \prod_{j=1}^{N}\left[\frac{p(x_j|y)}{p(x_j|y')}\right]^{n p(x_j)}$ (2)

where $N$ is the input dimension and $n$ is something I cannot identify (again, neither is defined in the paper). Considering a sigmoidal neuron with activation function $\sigma(u) = \frac{1}{1+\exp(-u)}$ and preactivation $u$, inserting (2) into (1) gives the optimal weights $w_j = \log\frac{p(x_j|y)}{p(x_j|y')}$ and bias $b = \log\frac{p(y)}{p(y')}$ when the input values are $h_j = n p(x_j)$.
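To make the substitution concrete, here is a minimal numerical sketch of inserting (2) into (1). All numbers are made up purely for illustration: `ratios` stands in for $\frac{p(x_j|y)}{p(x_j|y')}$, `h` stands in for $n p(x_j)$, and I assume a binary problem so that $p(y') = 1 - p(y)$. It only checks the algebra; it says nothing about where (2) comes from.

```python
import math

def sigmoid(u):
    """Standard logistic function, sigma(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def likelihood_ratio_eq2(ratios, h):
    """Right-hand side of (2): product of per-dimension ratios raised to h_j."""
    return math.prod(r ** hj for r, hj in zip(ratios, h))

def posterior_eq1(likelihood_ratio, prior_ratio):
    """Equation (1), written in terms of the likelihood ratio and prior ratio."""
    return 1.0 / (1.0 + math.exp(-math.log(likelihood_ratio) - math.log(prior_ratio)))

# Hypothetical values, not taken from the paper.
ratios = [0.8, 1.5, 2.0, 0.6]   # stand-ins for p(x_j|y) / p(x_j|y')
h = [0.4, 1.2, 0.9, 1.6]        # stand-ins for n * p(x_j)
p_y = 0.3                       # assumed prior p(y); p(y') = 1 - p(y)
prior_ratio = p_y / (1.0 - p_y)

# Posterior obtained by plugging (2) into (1).
posterior = posterior_eq1(likelihood_ratio_eq2(ratios, h), prior_ratio)

# The same quantity as a sigmoidal neuron with the stated weights and bias.
w = [math.log(r) for r in ratios]
b = math.log(prior_ratio)
neuron_output = sigmoid(sum(wj * hj for wj, hj in zip(w, h)) + b)

print(posterior, neuron_output)
```

The two printed values agree up to floating-point error, because the logarithm of the product in (2) is exactly the weighted sum $\sum_j w_j h_j$.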

Now on to my questions. I understand how inserting (2) into (1) leads to the optimal weight and input values $w,b,h$. What I do not understand, however, is the following:

1. How is (1) derived using Bayes' theorem?
2. How is (2) derived? What is $n$, and what does it mean? I assume it has something to do with conditional independence.
3. Even if the dimensions of $x$ are conditionally independent, how can one state that the input is equal to its scaled probability (i.e. how can you state $h_j = n p(x_j)$)?

EDIT: The variable $y$ is a binary class variable. From this I assume that $y'$ is the "other" class. This would solve question 1. Do you agree?
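Under that reading (my assumption: $y'$ is the complementary class, so $p(y) + p(y') = 1$), I think (1) would follow from Bayes' theorem as sketched below. Please correct me if this is not what the authors intend.

$$
\begin{aligned}
p(y|x) &= \frac{p(x|y)\,p(y)}{p(x|y)\,p(y) + p(x|y')\,p(y')} \\
&= \frac{1}{1 + \frac{p(x|y')\,p(y')}{p(x|y)\,p(y)}} \\
&= \frac{1}{1 + \exp\left(-\log\frac{p(x|y)}{p(x|y')} - \log\frac{p(y)}{p(y')}\right)}
\end{aligned}
$$

The first line uses the law of total probability over the two classes for $p(x)$; the second divides numerator and denominator by $p(x|y)\,p(y)$; the third rewrites the remaining ratio as an exponential of logs.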