# Autoencoders

There are many things, systems, that have Principal Components Analysis as the result of their evolution, their computation; their dynamics. Things like neural networks, for example. So, this time, I decided to play with autoencoders.

An autoencoder is a feed forward neural network that satisfies three properties:

1. It has only one hidden layer
2. If $n_i$ is the dimension of the input layer, $n_o$ the dimension of the output layer and $p$ the dimension of the hidden one; then $n_i = n_o = n$ and $p < n$
3. The output should be as close as possible to the input (In some sense, usually the quadratic error one)

This is the dimensionality reduction setup of the autoencoder, portraying the characteristic funnel architecture shown in figure 1b; it can be seen as a sequence of two affine maps between 3 vector spaces $X$ $H$ and $Y$  as in the figure 1a.

The maps are then $f(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{a}$ and $g(\mathbf{x}) = \mathbf{B}\mathbf{x} + \mathbf{b}$, where $\mathbf{x}, \mathbf{b} \in \mathbb{R}^n$ $\mathbf{a} \in \mathbb{R}^p$, $\mathbf{A} \in \mathbb{R}^{p \times n}$ and $\mathbf{B} \in \mathbb{R}^{n\times p}$.

Depending upon the restrictions put in the hidden layer, the autoencoder learns a code for the input layers, extracting, hopefully, the most useful or relevant aspects of the data in order to satisfy property 3. In the dimensionality reduction setup, as we will see, the space $H$ is the subspace spanned by the first $p$ Principal Directions of the input data (See my ,soon to be online, post on PCA and ICA).

Autoencoders are widely used nowadays in the architecture of Deep Neural Networks (check the wiki page for more details), however, my main interest in this post comes from their structural similarity  to the Basal Ganglia, a part of the brain involved in action execution, cancellation and reward processing. Indeed, this neural system is composed of regions with inter-region connectivity that quantitatively reflect the funnelling  shown in figure 1b.

The biological plausibility of the other properties in the definition of the autoencoder, specifically the 3rd one, might be more problematic. In that case, it should be argued either that the computation in those networks aims to implement the identity map or that the pretended dimensionality reduction can emerge with other goals or mechanisms (like Hebbian learning). More details about the biological plausibility are given in the classical paper by Bar-Gad et al., 2003; for the time being I will focus in some mathematical and computational aspects.

The Autoencoder and PCA

The first result is a tour de force. What is the relationship of an optimally trained autoencoder and PCA?

Theorem. $H$ is the principal subspace spanned by the first $p$ eigenvectors of the covariance matrix of $\mathbf{X}$.

The complete proof is given in (Bourlard & Kamp, 1988). The idea is as follows. Let $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m)$ a matrix whose columns ( $\mathbf{x}_i$) are the training vectors living in $\mathbb{R}^n$. Then, assuming the maps introduced before are linear, the optimisation problem of property 3 becomes $(\mathbf{\hat{B}}, \mathbf{\hat{b}}) = \mbox{argmin}_{(\mathbf{B},\mathbf{b}) \in \mathbb{R}^{nxp} \times \mathbb{R}^n}\quad \mbox{tr} (\mathbf{X} - \mathbf{B} \mathbf{H} - \mathbf{b})^T(\mathbf{X} - \mathbf{B}\mathbf{H} - \mathbf{b})$,

where $\mathbf{H} =f(\mathbf{x})$ (as defined above) and the minimization is over the product space of the all invertible linear maps between $\mathbb{R}^p$ and $\mathbb{R}^n$, and the translation group in $\mathbb{R}^n$. Note that we do not care about the nature of the input-hidden map at all. We just need an invertible matrix $\mathbf{\hat{B}}$ and a vector $\mathbf{\hat{b}}$ that make the objective function as small as possible.

We can solve this in two steps, first getting rid of the translations and then focusing in the matrix part. Imagine a simplex (i.e. a triangle in 2d) with the columns of $\mathbf{R} = \mathbf{X} - \mathbf{B} \mathbf{H}$ as vertices; for a fixed $\mathbf{B}$, the problem ask us to find a point such that the  sum of squares of its distance to all of the vertices is minimum. That point is the centroid $\mathbf{\mu} = \mathbf{R} \mathbf{1}_n/N$ of the simplex (you can check this by differentiation w.r.t $\mathbf{b}$); the centroid of the cloud of points given by $\mathbf{R}$. Therefore, $\mathbf{\mu} = \mathbf{R} \mathbf{1}_n/N = \mbox{argmin}_{\mathbf{b} \in \mathbb{R}^n}\quad \mbox{tr} (\mathbf{R} - \mathbf{b})^T(\mathbf{R} - \mathbf{b})$

Replacing this value into the original formula, we can see that the effect of this optimal value is that of centering the corresponding matrices (Check it!). We reformulate the problem for the second step, the optimization over the space of matrices of size $n \times p$, as $\mathbf{\hat{B}} = \mbox{argmin}_{\mathbf{B} \in \mathbb{R}^{n\times p}}\quad \mbox{tr} (\bar{\mathbf{X}} - \mathbf{B}\bar{\mathbf{H}} )^T(\bar{\mathbf{X}} - \mathbf{B}\bar{\mathbf{H}})$,

Now, remembering that $p < n$, we realize that this minimum is in fact reached and that the matrix doing the job, $\mathbf{B} \mathbf{\bar{H}}$ , is precisely the best rank $p$ approximation to $\mathbf{\bar{X}}$, that is:

(*) $\mathbf{B}\bar{\mathbf{H}} = \mathbf{U}_p\mathbf{\Sigma} \mathbf{V}_p^T$,

where $\bar{\mathbf{X}} = \mathbf{U}\mathbf{\Sigma} \mathbf{V}^T$ is the Singula Value Decomposition of $\bar{\mathbf{X}}$ and $\mathbf{U}_p$, $\mathbf{V}_p$, the first $p$ columns of the unitary matrices $\mathbf{U}$ and $\mathbf{V}$(Again, see my post on PCA and ICA for more details). Matching the sizes, we find that $\mathbf{B} = \mathbf{U}_p\mathbf{T}^{-1}$

and $\mathbf{\bar{H}} = \mathbf{T}\mathbf{\Sigma}_p\mathbf{V}_p^T$,

for an arbitrary $\mathbf{T} \in GL_n$. Therefore, columns in $\mathbf{B}$ span the principal $p$ subspace. They are effectively mixed up by $\mathbf{T}$ and its inverse, so they will be, in general not the exact eigenvectors nor even orthogonal in general (But you can recover them using the Gram-Schmidt procedure plus some sort of search in the Special Orthogonal group? There is some research being done on this but that’s not of my interest right now).

The hidden layer

As we are interested in linear activation functions, the nature of the input-hidden map is elucidated, rather swiftly, by the equation $\mathbf{U}_p\mathbf{\Sigma} \mathbf{V}_p^T = \mathbf{W}_1\bar{\mathbf{X}} + \mathbf{a}\mathbf{1}^T(\mathbf{I} - \mathbf{1}\mathbf{1}^T/m)$

derived from (*). The result is (and I am afraid I am rushing now because I am getting tired of writing) $\mathbf{A} = \mathbf{T}\mathbf{U}_p^T$.

So the weights in the input-hidden layer are also a linear combination of the principal directions (Also, $\mathbf{a}$ is arbitrary, see the original paper for details). All this, taken together, tell us that the activity of in the hidden layers does look like linear combinations of the original PC and that, the change of basis matrix’s column vectors span the same space as the $p$-principal subspace. Here is some graphic representation of what we have learned so far theoretically: Figure 2. The column space of the matrix B is the same as the Principal Subspace spanned by the first p eigenvectors of the covariance matrix. In green I show the optimal bias vector, in orange the corresponding weights. In black the original Principal directions. In gray, the principal hyperplane.

Training my own autoencoder

It is time to put everything to test and try with an actual neural network. I decided to use Matlab’s Deep Neural Networks toolbox.  Here is my code (sorry for putting a picture but I still don’t have access to nice plugins to write my code): Basically, I trained a neural network to learn the identity map. I tried different algorithms and different objective functions. The first thing you notice is that the training usually gets stuck in “local minima”. This is widely known and the reason was shown quite lucidly in (Baldi & Hornik, 1989): the Frobenius norm objective function has lots of saddle points but only one true minimum. So I had to repeat the training many times for some algorithms.

This slideshow requires JavaScript.

The results, as shown in the previous figure, is that, when the minimum is reached, the eigenspace of the covariance matrix is indeed spanned by the columns/rows of the weight matrices. Also, the responses of the two neurons (yellow, next figure) in the hidden layer lie very close to the hyperplane obtained by computation of the actual eigenvectors. Of course I could have used Cannonical Correlation Analysis to compute the actual angle between subspaces but that’s enough for this post I reckon.

References

Baldi, P., & Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural networks2(1), 53-58.

Bar-Gad, I., Morris, G., & Bergman, H. (2003). Information processing, dimensionality reduction and reinforcement learning in the basal ganglia. Progress in neurobiology71(6), 439-473.

Bourlard, H., & Kamp, Y. (1988). Auto-association by multilayer perceptrons and singular value decomposition. Biological cybernetics59(4-5), 291-294.