Stacked Approximated Regression Machine: A Simple Deep Learning Approach


NOTE: It seems like there’s been a bit of skepticism from some top ML researchers, so I’m going to approach this skeptically as well.

SUPER NOTE: Looks like this paper was actually withdrawn, soooo that’s awkward. This turned out to be more of a review of sparse coding methods than of the actual paper.

Sparse Coding


Since this paper has a deep basis in sparse coding, I did a quick review of what sparse coding is based on Hugo Larochelle’s lecture. Sparse Coding belongs to the unsupervised learning family of machine learning algorithms. The main idea is to find a latent representation $h^{(t)}$ for each $x^{(t)}$ such that:

  • $h^{(t)}$ is sparse (contains many zeros)
  • we can reconstruct the original input $x^{(t)}$ as well as possible



The per-example objective being minimized is

$l(x^{(t)}) = \frac{1}{2} \| x^{(t)} - D h^{(t)} \|_2^2 + \lambda \| h^{(t)} \|_1$

where:

  • $\frac{1}{2} \| x^{(t)} - D h^{(t)} \|_2^2$ becomes the reconstruction error
  • $D h^{(t)}$ is the reconstruction
  • $\lambda$ is the reconstruction vs. sparsity control
  • $\| h^{(t)} \|_1$ is the sparsity penalty

We constrain the columns of $D$ (also called a Dictionary) to be of norm $1$ (otherwise, $D$ could grow large while $h^{(t)}$ shrinks to satisfy the prior). This is also sometimes seen as constraining the norm of the columns of $D$ to be no greater than $1$.
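As a minimal NumPy sketch of the pieces defined so far, here is the per-example objective and the column-norm projection on $D$ (function names are my own, not from the paper or lecture):

```python
import numpy as np

def sparse_coding_objective(x, D, h, lam):
    """l(x) = 1/2 ||x - D h||_2^2 + lambda ||h||_1 for a single example."""
    recon_err = 0.5 * np.sum((x - D @ h) ** 2)  # reconstruction error
    sparsity = lam * np.sum(np.abs(h))          # sparsity penalty
    return recon_err + sparsity

def normalize_columns(D):
    """Project each dictionary column back onto the unit L2-norm sphere."""
    norms = np.linalg.norm(D, axis=0, keepdims=True)
    return D / np.maximum(norms, 1e-12)  # guard against a zero column
```

With $h^{(t)} = 0$ the objective reduces to $\frac{1}{2}\|x^{(t)}\|_2^2$, which is a handy sanity check.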

So your learned reconstruction then becomes $\hat{x}^{(t)} = D h^{(t)}$, where $h^{(t)}$ is the inferred sparse code.

Optimizing objective (inferring latent representations)

Gradient Descent

One option is SGD, taking (sub)gradient steps on the objective:

$h^{(t)} \leftarrow h^{(t)} - \alpha \left( D^{T}(D h^{(t)} - x^{(t)}) + \lambda\, \text{sign}(h^{(t)}) \right)$

The problem here is that the $L_1$ norm is not differentiable at $0$, so it’s unlikely that SGD will ever land exactly on $h_k^{(t)} = 0$ (which is what we want to happen to make a sparse vector). The solution is to clamp $h_k^{(t)}$ to $0$ whenever it changes sign between updates.
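A minimal sketch of one such clamped (sub)gradient step in NumPy (the function name is my own; the clamp only applies to coordinates that were nonzero and crossed zero):

```python
import numpy as np

def sgd_sparse_step(x, D, h, alpha, lam):
    """One subgradient step on 1/2||x - D h||^2 + lam ||h||_1 w.r.t. h,
    clamping any coordinate that changed sign to exactly 0."""
    grad = D.T @ (D @ h - x) + lam * np.sign(h)
    h_new = h - alpha * grad
    # clamp: previously-nonzero coordinates that flipped sign land on 0
    flipped = (h != 0) & (np.sign(h_new) != np.sign(h))
    h_new[flipped] = 0.0
    return h_new
```

Without the clamp, coordinates would oscillate around zero instead of becoming exactly zero, and the code would never be truly sparse.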


Another option is the Iterative Shrinkage and Thresholding Algorithm (ISTA):

  • initialize $h^{(t)}$
  • while $h^{(t)}$ has not converged
    • set $h^{(t)}$ to $h^{(t)} - \alpha D^{T}(Dh^{(t)} - x^{(t)})$
    • set $h^{(t)}$ to $\text{shrink}(h^{(t)}, \alpha \lambda)$
  • return $h^{(t)}$

Here $\text{shrink}(a,b) = […, \text{sign} (a_i) \text{max} (|a_i| - b_i, 0),… ]$

This will converge if $\frac{1}{\alpha}$ is larger than the largest eigenvalue of $D^{T}D$.
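The ISTA loop above can be sketched in NumPy, with the step size $\alpha$ chosen from the convergence condition (helper names are my own; I use a fixed iteration count rather than a convergence test for simplicity):

```python
import numpy as np

def shrink(a, b):
    """Soft-thresholding: sign(a_i) * max(|a_i| - b, 0), elementwise."""
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def ista(x, D, lam, n_iters=100):
    """Infer a sparse code h for input x given dictionary D via ISTA."""
    # 1/alpha must exceed the largest eigenvalue of D^T D to converge
    L = np.linalg.eigvalsh(D.T @ D).max()
    alpha = 1.0 / (L + 1e-8)
    h = np.zeros(D.shape[1])
    for _ in range(n_iters):
        h = h - alpha * D.T @ (D @ h - x)  # gradient step on reconstruction error
        h = shrink(h, alpha * lam)         # proximal step enforces sparsity
    return h
```

Since $h^{(t)}$ starts at $0$ and ISTA is monotone at this step size, the objective can only improve over the all-zeros code.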

Learning the dictionary

The original problem can be reformulated to minimize with respect to $D$ instead, holding the inferred codes $h^{(t)}$ fixed (the sparsity penalty does not depend on $D$):

$\min_D \frac{1}{T} \sum_{t=1}^{T} \frac{1}{2} \| x^{(t)} - D h^{(t)} \|_2^2$

Projected Gradient Descent

Need to make sure that the columns of $D$ are bound to a norm of 1.

  • While $D$ hasn’t converged
    • set $D$ to $D - \alpha (D h^{(t)} - x^{(t)}) h^{(t)\top}$
    • renormalize the columns of $D$
      • for each column: $D_{.,j}$ set to $\frac{D_{.,j}}{\| D_{.,j} \|_2}$
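The steps above can be sketched as a single projected-gradient update in NumPy (one example at a time; the function name is my own):

```python
import numpy as np

def dictionary_step(x, D, h, alpha):
    """One projected-gradient step on the dictionary for a single example.

    The gradient of 1/2 ||x - D h||^2 w.r.t. D is (D h - x) h^T; after the
    step, each column is renormalized to unit L2 norm (the projection).
    """
    D = D - alpha * np.outer(D @ h - x, h)          # gradient step
    norms = np.linalg.norm(D, axis=0, keepdims=True)
    return D / np.maximum(norms, 1e-12)             # project columns to norm 1
```

In practice this step alternates with inference (e.g. ISTA) over the training set: infer $h^{(t)}$, update $D$, repeat.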


“This paper proposes the Stacked Approximated Regression Machine (SARM), a novel, simple yet powerful deep learning (DL) baseline.”

It is largely based on PCANet:

  • “A cascaded principal component analysis (PCA) is employed to learn multi-stage filter banks”
  • “no nonlinear operation involved until.. last layer, where binary hashing and histograms [used to compute outputs]”
  • Good baseline to justify use of more time consuming training required for DNNs

Other related works:

  • Autoencoders and Predictive Sparse Decomposition (stacked)
  • Wavelet Scattering Network (ScatNet)
    • Thought: How is this different from HMax or other biologically inspired models of vision, where there are fixed stacked filters? May be irrelevant, but based on the brief description in this paper, it seems similar.
  • Deep Network Implementation of Sparse Coding
Written on September 3, 2016