Sunday, June 23, 2019

Different perspectives on importance weight autoencoders

The importance-weighted autoencoder (IWAE) approach of Burda et al.[1] defines a sequence of increasingly tighter bounds on the marginal likelihood of latent variable models. Recently, Cremer et al.[2] reinterpreted the IWAE bounds as ordinary variational evidence lower bounds (ELBO) applied to increasingly accurate variational distributions. In this work, we provide yet another perspective on the IWAE bounds. We interpret each IWAE bound as a biased estimator of the true marginal likelihood where for the bound defined on K samples we show the bias to be of order O(1/K). In our theoretical analysis of the IWAE objective we derive asymptotic bias and variance expressions. Based on this analysis we develop jackknife variational inference (JVI), a family of bias-reduced estimators reducing the bias to O(K^{-(m+1)}) for any given m < K while retaining computational efficiency. Finally, we demonstrate that JVI leads to improved evidence estimates in variational autoencoders. We also report first results on applying JVI to learning variational autoencoders.
Microsoft Research


  1. Burda, Y., Grosse, R., & Salakhutdinov, R. (2016). IMPORTANCE WEIGHTED AUTOENCODERS. 
  2. Cremer, C., Morris, Q., & Duvenaud, D. (n.d.). Workshop track -ICLR 2017 REINTERPRETING IMPORTANCE-WEIGHTED AUTOENCODERS. 
  3. Debiasing Evidence Approximations: On Importance-Weighted Autoencoders and Jackknife Variational Inference

Thursday, June 20, 2019

Weight uncertainty in neural networks

Key Idea:

New methods to fulfill old dreams. Long before people modeled weights in neural networks with probabilistic distributions and output posterior mean. But the intractability is always an issue. In this paper author proposes some numerical scheme to approximate.

Image result for bayesian neural networks
Left: Traditional deterministic neural networks. Right: Bayesian neural networks. From [1]

Let $\epsilon $ be a random variable having a probability density given by $q(\epsilon)$ and let $w=t(\theta, \epsilon)$ where $t(\theta, \epsilon)$ is a deterministic function. Suppose further that the marginal probability density of $w, q(w | \theta)$, is such that $q(\epsilon) d\epsilon = q(w|\theta) dw$ . Then for function $f$ with derivatives in $w$:
$\frac{\partial}{\partial \theta} \mathbb{E}_{q(\mathbf{w} | \theta)}[f(\mathbf{w}, \theta)]=\mathbb{E}_{q(\epsilon)}\left[\frac{\partial f(\mathbf{w}, \theta)}{\partial \mathbf{w}} \frac{\partial \mathbf{w}}{\partial \theta}+\frac{\partial f(\mathbf{w}, \theta)}{\partial \theta}\right]$

\begin{aligned} \frac{\partial}{\partial \theta} \mathbb{E}_{q(\mathbf{w} | \theta)}[f(\mathbf{w}, \theta)] &=\frac{\partial}{\partial \theta} \int f(\mathbf{w}, \theta) q(\mathbf{w} | \theta) \mathrm{d} \mathbf{w} \\ &=\frac{\partial}{\partial \theta} \int f(\mathbf{w}, \theta) q(\epsilon) \mathrm{d} \epsilon \\ &=\mathbb{E}_{q(\epsilon)}\left[\frac{\partial f(\mathbf{w}, \theta)}{\partial \mathbf{w}} \frac{\partial \mathbf{w}}{\partial \theta}+\frac{\partial f(\mathbf{w}, \theta)}{\partial \theta}\right] \end{aligned}

Variational approximation to Bayesian posterior distribution on the weights:
\begin{aligned} \theta^{\star} &=\arg \min _{\theta} \mathrm{KL}[q(\mathbf{w} | \theta) \| P(\mathbf{w} | \mathcal{D})] \\

 &=\arg \min _{\theta} \int q(\mathbf{w} | \theta) \log \frac{q(\mathbf{w} | \theta)}{P(\mathbf{w}) P(\mathcal{D} | \mathbf{w})} \mathrm{d} \mathbf{w} \\

&=\arg \min _{\theta} \mathrm{KL}[q(\mathbf{w} | \theta) \| P(\mathbf{w})]-\mathbb{E}_{q(\mathbf{w} | \theta)}[\log P(\mathcal{D} | \mathbf{w})]
The resulting cost function is called variational free energy or expected lower-bound
$\mathcal{F}(\mathcal{D}, \theta)=\mathrm{KL}[q(\mathbf{w} | \theta) \| P(\mathbf{w})]
- \mathbb{E}_{q(\mathbf{w} | \theta)}[\log P(\mathcal{D} | \mathbf{w})]$

We approximate it:
$\mathcal{F}(\mathcal{D}, \theta) \approx \sum_{i=1}^{n} \log q\left(\mathbf{w}^{(i)} | \theta\right)-\log P\left(\mathbf{w}^{(i)}\right) - \log P\left(\mathcal{D} | \mathbf{w}^{(i)}\right)$

where $w^{(i)}$ denotes  the ith Monte Carlo sample drawn form the variational posterior $q\left(\mathbf{w}^{(i)} | \theta\right)$

Notice that every term of the approximate cost depends on the particular weights drawn from the variational posterior: this is an instance of a variance reduction technique known as common random numbers. The author claims that a prior without an easy-to-compute closed form complexity cost performed best.

Gaussian variational posterior:
Suppose the variational posterior is a diagonal Gaussian distribution, then a sample of the weights $w$ can be obtained by sampling a unit Gaussian, shifting it by mean $\mu$ and scaling by a standard deviation $\sigma$.

$\sigma=\log(1+ exp(\rho))$
$\theta=(\mu, \rho)$
$\mathbf{w}=t(\theta, \epsilon)=\mu+\log (1+\exp (\rho)) \circ \epsilon$

class Gaussian(object):
    def __init__(self, mu, rho):
        super().__init__() = mu
        self.rho = rho
        self.normal = torch.distributions.Normal(0,1)
    def sigma(self):
        return torch.log1p(torch.exp(self.rho))
    def sample(self):
        epsilon = self.normal.sample(self.rho.size())
        return + self.sigma * epsilon
    def log_prob(self, input):
        return (-math.log(math.sqrt(2 * math.pi))
                - torch.log(self.sigma)
                - ((input - ** 2) / (2 * self.sigma ** 2)).sum()
torch.log1p(input, output=None) 
is more accurate than
for small values of input 

Optimisation steps:

  1. Sample $\epsilon \sim \mathcal{N}(0, I)$
  2. Let $\mathbf{w}=\mu+\log (1+\exp (\rho)) \circ \epsilon$
  3. Let $\theta=(\mu, \rho)$
  4. Let $f(\mathbf{w}, \theta)=\log q(\mathbf{w} | \theta)-\log P(\mathbf{w}) P(\mathcal{D} | \mathbf{w})$
  5. Gradient w.r.t the mean:$\Delta_{\mu}=\frac{\partial f(\mathbf{w}, \theta)}{\partial \mathbf{w}}+\frac{\partial f(\mathbf{w}, \theta)}{\partial \mu}$
  6. Calculate the gradient w.r.t. the standard deviation parameter $\rho$: $\Delta_{\rho}=\frac{\partial f(\mathbf{w}, \theta)}{\partial \mathbf{w}} \frac{\epsilon}{1+\exp (-\rho)}+\frac{\partial f(\mathbf{w}, \theta)}{\partial \rho}$
  7. Update the variational parameters: 

\begin{array}{c}{\mu \leftarrow \mu-\alpha \Delta_{\mu}} \\ {\rho \leftarrow \rho-\alpha \Delta_{\rho}}\end{array}

Scale mixture prior

The author propose a scale mixture of two Gaussians densities as the prior. Each density is zero-mean-ed, but having differing variances.
P(\mathbf{w})=\prod_{j} \pi \mathcal{N}\left(\mathbf{w}_{j} | 0, \sigma_{1}^{2}\right)+(1-\pi) \mathcal{N}\left(\mathbf{w}_{j} | 0, \sigma_{2}^{2}\right)

\log P(\mathbf{w})=\sum_{j} \log \left(\pi \mathcal{N}\left(\mathbf{w}_{j} | 0, \sigma_{1}^{2}\right)+(1-\pi) \mathcal{N}\left(\mathbf{w}_{j} | 0, \sigma_{2}^{2}\right)\right)

where $w_j$ is the $j$th weight of the network. $\sigma_1^2$ and $\sigma_2^2$ are the variances of the mixture components. The first mixture component of the prior is giving a larger variance than the second, i.e., $\sigma_1 \gt \sigma_2$, providing a heavier tail in the prior density that a plain Gaussian prior.  The second mixture component has a small variance $\sigma_2 \ll 1$ causing many of the weights to a priori tightly concentrate around zero.

class ScaleMixtureGaussian(object):
    def __init__(self, pi, sigma1, sigma2):
        self.pi = pi
        self.sigma1 = sigma1
        self.sigma2 = sigma2
        self.gaussian1 = torch.distributions.Normal(0,sigma1)
        self.gaussian2 = torch.distributions.Normal(0,sigma2)
    def log_prob(self, input):
        prob1 = torch.exp(self.gaussian1.log_prob(input))
        prob2 = torch.exp(self.gaussian2.log_prob(input))
        return (torch.log(self.pi * prob1 + (1-self.pi) * prob2)).sum()

Minibatches and KL re-weighting:

If minibatches are partitioned uniformly at random, the KL cost can be distributed non-uniformly among the minibatches at each epoch. Let $\pi \in [0, 1]^M$ and $\sum_{i=1}^M \pi_i =1 $, and define:

\begin{array}{r}{\mathcal{F}_{i}^{\pi}\left(\mathcal{D}_{i}, \theta\right)=\pi_{i} \mathrm{KL}[q(\mathbf{w} | \theta) \| P(\mathbf{w})]} \\ {-\mathbb{E}_{q(\mathbf{w} | \theta)}\left[\log P\left(\mathcal{D}_{i} | \mathbf{w}\right)\right]}\end{array}

Then $\mathbb{E}_{M}\left[\sum_{i=1}^{M} \mathcal{F}_{i}^{\pi}\left(\mathcal{D}_{i}, \theta\right)\right]=\mathcal{F}(\mathcal{D}, \theta)$ where $E_M$ denotes an expectation over the random partitioning of minibatches.  The author suggests that $\pi_i= \frac{2^{M-i}}{ 2^M -1}$ works well.


  1. Wikipedia Contributors. (2019, March 11). Variance reduction. Retrieved August 8, 2019,
  2. Nitarshan Rajkumar. (2018). Weight Uncertainty in Neural Networks. Retrieved August 13, 2019,

Monday, June 17, 2019

Importance Weighted Autoencoders

Key Idea:

Original VAE optimize over variational lower bound:
$\mathcal{L}(x)= E_{q(h|x)}[\log \frac{p(x,h)}{q(h|x)}]$
This paper introduces a tighter bound (k-sample importance weighting estimate)
$\mathcal{L}_k(x)= E_{h_1, ..., h_k \sim q(h|x)}[\log \frac{1}{k} \sum_{i=1}^k \frac{p(x,h_i)}{q(h_i|x)}]$

Furthermore, the author claims:

  • $\log p(x) \ge \mathcal{L}_{k+1} \ge \mathcal{L}_k$
  • $\log p(x)= \lim_{k \to \infty} \mathcal{L}_k$ if $\frac{p(h,x)}{q(h|x)}$ is bounded

1. First, $\mathcal{L}_{k}=\mathbb{E}\left[\log \frac{1}{k} \sum_{i=1}^{k} w_{i}\right] \leq \log \mathbb{E}\left[\frac{1}{k} \sum_{i=1}^{k} w_{i}\right]=\log p(\mathbf{x})$

2. $I \subset \{1, 2, ..., k \}$ ($I$ being uniformly drawn from $\{ 1, 2, ..., k \} , |I|=m , m \le k$ )
$\mathbb{E}_{I=\left\{i_{1}, \ldots, i_{m}\right\} } \left[\frac{a_{i_{1}}+\ldots+a_{i_{m}}}{m}\right]= \frac{a_{1}+\ldots+a_{k}}{k}$
$$ \begin{align} \mathcal{L}_{k} &=\mathbb{E}_{\mathbf{h}_{1}, \ldots, \mathbf{h}_{k}}\left[\log \frac{1}{k} \sum_{i=1}^{k} \frac{p\left(\mathbf{x}, \mathbf{h}_{i}\right)}{q\left(\mathbf{h}_{i} | \mathbf{x}\right)}\right] \\ &=\mathbb{E}_{\mathbf{h}_{1}, \ldots, \mathbf{h}_{k}}\left[\log \mathbb{E}_{I=\left\{i_{1}, \ldots, i_{m}\right\}}\left[\frac{1}{m} \sum_{j=1}^{m} \frac{p\left(\mathbf{x}, \mathbf{h}_{i_{j}}\right)}{q\left(\mathbf{h}_{i_{j}} | \mathbf{x}\right)}\right]\right] \\ &\ge \mathbb{E}_{\mathbf{h}_{1}, \ldots, \mathbf{h}_{k}}\left[\mathbb{E}_{I=\left\{i_{1}, \ldots, i_{m}\right\}}\left[\log \frac{1}{m} \sum_{j=1}^{m} \frac{p\left(\mathbf{x}, \mathbf{h}_{i_{j}}\right)}{q\left(\mathbf{h}_{i_{j}} | \mathbf{x}\right)}\right]\right] \textrm{(Jensen's inequality) }\\ &=\mathbb{E}_{\mathbf{h}_{1}, \ldots, \mathbf{h}_{m}}\left[\log \frac{1}{m} \sum_{i=1}^{m} \frac{p\left(\mathbf{x}, \mathbf{h}_{i}\right)}{q\left(\mathbf{h}_{i} | \mathbf{x}\right)}\right]\\ &=\mathcal{L}_{m} \textrm{($k \ge m$)} \end{align} $$

where $h_{i_j}= h_j , j \in I \subset \{ 1, 2, ..., k \}$

3. Consider the random variable $M_{k}=\frac{1}{k} \sum_{i=1}^{k} \frac{p\left(\mathbf{x}, \mathbf{h}_{i}\right)}{q\left(\mathbf{h}_{i} | \mathbf{x}\right)}$, if $p(\mathbf{h}, \mathbf{x}) / q(\mathbf{h} | \mathbf{x})$ is bounded. Then it follows from the strong law of large numbers that $M_k$ converges to $E_{q\left(\mathbf{h}_{i} | \mathbf{x}\right)}\left[\frac{p\left(\mathbf{x}, \mathbf{h}_{i}\right)}{q\left(\mathbf{h}_{i} | \mathbf{x}\right)}\right]= p(\mathbf{x})$  almost surely.

In the paper[1], the author uses mean absolute deviation to show the overall estimation variance cannot be large.  Furthermore, the author proves that the mean absolute variance of our estimator $\mathcal{L}_k $ is no more than $2+2\delta$, where $\delta=\log p(x)- \mathcal{L}_k$

Training Procedure:

\begin{align} \nabla_{\boldsymbol{\theta}} \mathcal{L}_{k}(\mathbf{x}) &=\nabla_{\boldsymbol{\theta}} \mathbb{E}_{\mathbf{h}_{1}, \ldots, \mathbf{h}_{k}}\left[\log \frac{1}{k} \sum_{i=1}^{k} w_{i}\right] \\ &=\nabla_{\boldsymbol{\theta}} \mathbb{E}_{\boldsymbol{\epsilon}_{1}, \ldots, \epsilon_{k}}\left[\log \frac{1}{k} \sum_{i=1}^{k} w\left(\mathbf{x}, \mathbf{h}\left(\mathbf{x}, \boldsymbol{\epsilon}_{i}, \boldsymbol{\theta}\right), \boldsymbol{\theta}\right)\right] \\ &=\mathbb{E}_{\epsilon_{1}, \ldots, \epsilon_{k}}\left[\nabla_{\boldsymbol{\theta}} \log \frac{1}{k} \sum_{i=1}^{k} w\left(\mathbf{x}, \mathbf{h}\left(\mathbf{x}, \boldsymbol{\epsilon}_{i}, \boldsymbol{\theta}\right), \boldsymbol{\theta}\right)\right] \\ &=\mathbb{E}_{\epsilon_{1}, \ldots, \epsilon_{k}}\left[\sum_{i=1}^{k} \widetilde{w}_{i} \nabla_{\boldsymbol{\theta}} \log w\left(\mathbf{x}, \mathbf{h}\left(\mathbf{x}, \boldsymbol{\epsilon}_{i}, \boldsymbol{\theta}\right), \boldsymbol{\theta}\right)\right] \end{align}

where $\epsilon_{1}, \dots, \epsilon_{k}$ are auxiliary variables, $\epsilon_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$;
$w_i=w\left(\mathbf{x}, \mathbf{h}\left(\mathbf{x}, \boldsymbol{\epsilon}_{i}, \boldsymbol{\theta}\right), \boldsymbol{\theta}\right)$ are the importance weights expressed as a deterministic function;

$\widetilde{w_{i}}=\frac {w_{i}}  { \sum_{i=1}^{k} w_{i} } $ are the normalized importance weights.

In the gradient-based algorithm, we draw $k$ samples from the recognition network (i.e., draw k sets of auxiliary variables) and use the Monte Carlo estimate:
$\sum_{i=1}^{k} \widetilde{w}_{i} \nabla_{\boldsymbol{\theta}} \log w\left(\mathbf{x}, \mathbf{h}\left(\boldsymbol{\epsilon}_{i}, \mathbf{x}, \boldsymbol{\theta}\right), \boldsymbol{\theta}\right)$.
If $k=1$ then the above degenerates into the standard VAE update rule.


The implementation part is rather interesting, too.
We analyze the implementation from this place. First, two classes are defined:

  • IWAE_1 for one-stochastic-layer IWAE
  • IWAE_2 for two-stochastic-layer IWAE 
IWAE_1 and IWAE_2 are roughly similar. We focus on IWAE_1:

class IWAE_1(nn.Module):
    def __init__(self, dim_h1, dim_image_vars):
        super(IWAE_1, self).__init__()
        self.dim_h1 = dim_h1
        self.dim_image_vars = dim_image_vars

        ## encoder
        self.encoder_h1 = BasicBlock(dim_image_vars, 200, dim_h1)
        ## decoder
        self.decoder_x =  nn.Sequential(nn.Linear(dim_h1, 200),
                                        nn.Linear(200, 200),
                                        nn.Linear(200, dim_image_vars),
    def encoder(self, x):
        mu_h1, sigma_h1 = self.encoder_h1(x)
        eps = Variable(
        h1 = mu_h1 + sigma_h1 * eps                
        return h1, mu_h1, sigma_h1, eps
    def decoder(self, h1):
        p = self.decoder_x(h1)
        return p
    def forward(self, x):
        h1, mu_h1, sigma_h1, eps = self.encoder(x)
        p = self.decoder(h1)
        return (h1, mu_h1, sigma_h1, eps), (p)

forward()  method invokes both encoder() and decoder() which makes intuitive sense. encoder() returns approximated posterior Gaussian density, mean and variance of that Gaussian and auxiliary variable eps. decoder() directly invokes:

nn.Sequential(nn.Linear(dim_h1, 200),   nn.Tanh(),
                                        nn.Linear(200, 200),
                                        nn.Linear(200, dim_image_vars),
 This is perhaps why PyTorch is so popular----very intuitive framework.

The loss function is worth noticing:

    def train_loss(self, inputs):
        h1, mu_h1, sigma_h1, eps = self.encoder(inputs)
        #log_Qh1Gx = torch.sum(-0.5*((h1-mu_h1)/sigma_h1)**2 - torch.log(sigma_h1), -1)
        log_Qh1Gx = torch.sum(-0.5*(eps)**2 - torch.log(sigma_h1), -1)
        p = self.decoder(h1)
        log_Ph1 = torch.sum(-0.5*h1**2, -1)
        log_PxGh1 = torch.sum(inputs*torch.log(p) + (1-inputs)*torch.log(1-p), -1)

        log_weight = log_Ph1 + log_PxGh1 - log_Qh1Gx
        log_weight = log_weight - torch.max(log_weight, 0)[0]
        weight = torch.exp(log_weight)
        weight = weight / torch.sum(weight, 0)
        weight = Variable(, requires_grad = False)
        loss = -torch.mean(torch.sum(weight * (log_Ph1 + log_PxGh1 - log_Qh1Gx), 0))
        return loss
Notice the loss. We might think the loss should just $-\mathcal{L}_k$ and let Adam optimize it. But that's not the case. We decompose the ELBO:
\nabla_{\boldsymbol{\theta}} \log w\left(\mathbf{x}, \mathbf{h}\left(\mathbf{x}, \boldsymbol{\epsilon}_{i}, \boldsymbol{\theta}\right), \boldsymbol{\theta}\right) &=\nabla_{\boldsymbol{\theta}} \log p\left(\mathbf{x}, \mathbf{h}\left(\mathbf{x}, \boldsymbol{\epsilon}_{i}, \boldsymbol{\theta}\right) | \boldsymbol{\theta}\right)-\nabla_{\boldsymbol{\theta}} \log q\left(\mathbf{h}\left(\mathbf{x}, \boldsymbol{\epsilon}_{i}, \boldsymbol{\theta}\right) | \mathbf{x}, \boldsymbol{\theta}\right) \\

&=\nabla_{\boldsymbol{\theta}} \log p\left(\mathbf{x}| \mathbf{h}\left(\mathbf{x}, \boldsymbol{\epsilon}_{i}, \boldsymbol{\theta}\right) , \boldsymbol{\theta}\right) + \nabla_{\boldsymbol{\theta}} \log p\left( \mathbf{h}\left(\mathbf{x}, \boldsymbol{\epsilon}_{i}, \boldsymbol{\theta}\right) , \boldsymbol{\theta}\right) -\nabla_{\boldsymbol{\theta}} \log q\left(\mathbf{h}\left(\mathbf{x}, \boldsymbol{\epsilon}_{i}, \boldsymbol{\theta}\right) | \mathbf{x}, \boldsymbol{\theta}\right) \\

&= \nabla_{\boldsymbol{\theta}} \: \textrm{log_PxGh1}  + \nabla_{\boldsymbol{\theta}} \: \textrm{log_Ph1} -  \nabla_{\boldsymbol{\theta}} \: \textrm{log_Qh1Gx}


You may want to use version B of ELBO in VAE paper:

\begin{align} \mathcal{L}\left(\boldsymbol{\theta}, \boldsymbol{\phi} ; \mathbf{x}^{(i)}\right) &\approx \widetilde{\mathcal{L}}^{B}\left(\boldsymbol{\theta}, \boldsymbol{\phi} ; \mathbf{x}^{(i)}\right) \\ &=\color{blue} {-D_{K L}\left(q_{\phi}\left(\mathbf{z} | \mathbf{x}^{(i)}\right) \| p_{\boldsymbol{\theta}}(\mathbf{z})\right)+\frac{1}{L} \sum_{l=1}^{L}\left(\log p_{\boldsymbol{\theta}}\left(\mathbf{x}^{(i)} | \mathbf{z}^{(i, l)}\right)\right)} \end{align}
But this trick cannot be reapplied in $k>1$ case where $k$ is the sample number ( not mini-batch sample number).

log_weight = log_weight - torch.max(log_weight, 0)[0]

What is this line for? It's a typical way of dealing with numerical instability issue. We borrow the example from CS231n lecture notes:

"When you’re writing code for computing the Softmax function in practice, the intermediate terms $e^{f_{y_i}}$ and $\sum_j e^{f_j}$ may be very large due to the exponentials. Dividing large numbers can be numerically unstable, so it is important to use a normalization trick. Notice that if we multiply the top and bottom of the fraction by a constant C and push it into the sum, we get the following (mathematically equivalent) expression:

$ \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} = \frac{Ce^{f_{y_i}}}{C\sum_j e^{f_j}} = \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}} $

We are free to choose the value of $C$. This will not change any of the results, but we can use this value to improve the numerical stability of the computation. A common choice for $C$ is to set $\log C=−max_j f_j$. This simply states that we should shift the values inside the vector f so that the highest value is zero. In code:"
f = np.array([123, 456, 789]) # example with 3 classes and each having large scores
p = np.exp(f) / np.sum(np.exp(f)) # Bad: Numeric problem, potential blowup

# instead: first shift the values of f so that the highest number is 0:
f -= np.max(f) # f becomes [-666, -333, 0]
p = np.exp(f) / np.sum(np.exp(f)) # safe to do, gives the correct answer

However I have reasonable suspicion on this line of code. Since $p(x) \lt 1$ , $\log p(x) <0$. Subtraction by the biggest only makes each negative term fly towards the negative infinity. Is that really beneficial?

In the paper, $\tilde{w}_i =\frac{w_i}{\sum_i^k w_i}$ are called normalized importance weights. Notice in the update formula:
$\mathbb{E}_{\epsilon_{1}, \ldots, \epsilon_{k}}\left[\sum_{i=1}^{k} \widetilde{w}_{i} \nabla_{\boldsymbol{\theta}} \log w\left(\mathbf{x}, \mathbf{h}\left(\mathbf{x}, \boldsymbol{\epsilon}_{i}, \boldsymbol{\theta}\right), \boldsymbol{\theta}\right)\right]$

We don't take the gradient w.r.t. the normalized importance weights. This detail, reflected on PyTorch code is the following:

weight = torch.exp(log_weight)
weight = weight / torch.sum(weight, 0)
weight = Variable(, requires_grad = False)

requires_grad = False 
will let PyTorch ignore variable weight when doing backpropagation, which is exactly what we want.


  1. Burda, Y., Grosse, R., & Salakhutdinov, R. (2016). IMPORTANCE WEIGHTED AUTOENCODERS. 
  2. PyTorch Implementation by Xinqiang Ding
  3. CS231n course notes

Friday, June 7, 2019

Prism Code Highligher Test

p { color: red }

for i in range(250):

Official Site:
How to use:
<pre><code class="language-css line-numbers">p { color: red }</code></pre>


Variational Autoencoders 2: Implementation

Prior: isotropic multivariate Gaussian $p_{\boldsymbol{\theta}}(\mathbf{z})=\mathcal{N}(\mathbf{z} ; \mathbf{0}, \mathbf{I})$

Probabilistic encoer(recognition model): Bernoulli or Gaussian $q_{\phi}(\mathbf{z} | \mathbf{x})$

Probabilistic decoder : multivariate Gaussian(real-valued data) or Bernoulli(binary data) $p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})$

$\log q_{\phi}\left(\mathbf{z} | \mathbf{x}^{(i)}\right)=\log \mathcal{N}\left(\mathbf{z} ; \boldsymbol{\mu}^{(i)}, \boldsymbol{\sigma}^{2(i)} \mathbf{I}\right)$
So $\phi=\{ \boldsymbol{\mu}^{(i)}, \boldsymbol{\sigma}^{2(i)} \}$

The structure for decoder is similar.

We sample using the re-parameterization trick:
$\mathbf{z}^{(i, l)} \sim q_{\phi}\left(\mathbf{z} | \mathbf{x}^{(i)}\right)$
$\mathbf{z}^{(i, l)}=g_{\phi}\left(\mathbf{x}^{(i)}, \boldsymbol{\epsilon}^{(l)}\right)

=\boldsymbol{\mu}^{(i)}+\boldsymbol{\sigma}^{(i)} \odot \boldsymbol{\epsilon}^{(l)}$

where $\epsilon^{(l)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$

For implementation, I like one example from PyTorch's examples:

Some details:

  • MNIST images are gray-scale images ranging from 0 to 255. But 

will transform "a PIL Image or numpy.ndarray (H x W x C) in the range [0, 255] to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0] if the PIL Image belongs to one of the modes (L, LA, P, I, F, RGB, YCbCr, RGBA, CMYK, 1) or if the numpy.ndarray has dtype = np.uint8"

  • As you may might notice, the choice of $p(z | x)$ is a Bernoulli distribution. No good explanations have been found. MNIST orignally are gray-scale images. After conversion it's indeed in the range of $[0,1]$, but can you take a step further and claim its the probability of being some sort of black or white? Of course this doesn't make too much sense. But as a very crude approximation it might work. One guy tried to explain in his blog post[3], but I found his explanations flawed and left comments for him. You can find other discussions about this from [4] and [5]. IMHO, this is just an engineering artifact


  1. PyTorch document on ToTensor method
  2. PyTorch official VAE example 
  3. Using a Bernoulli VAE on Real-Valued Observations - Rui Shu. (2018). 
  4. r/MachineLearning - what loss to use for variational auto encoder with real numbers? (2011)
  5. Loss function autoencoder vs variational-autoencoder or MSE-loss vs binary-cross-entropy-loss

Saturday, June 1, 2019

Variational Autoencoders (1): Motivation

What is it?

This paper proposes a variational approach for training directed graphical models with continuous latent variables and intractable posteriors/marginals. The idea is to reparameterize the latent variables so that they can be written as a deterministic mapping followed by a stochastic perturbation. This allows Monte Carlo estimators of the variational lower bound to be differentiated with respect to the variational parameters.
Anonymous Openreview Reviewer
Problem Setting:

  • A massive dataset $X=\{x^{(i)} \}_{i=1}^N$ consisting of N i.i.d. samples of some continuous or discrete variables $x$
  • There exists a simple latent space $z$:
           $z \sim p_{\theta^{*}} (z)$
           $x|z \sim  p_{\theta^{*}} (x|z) $

  • Furthermore, we assume the prior $p_{\theta^{*}} (z)$ and likelihood $ p_{\theta^{*}} (x|z) $ come from parametric families of distributions $p_{\theta}(z)$ and $p_{\theta} (x|z) $. And their PDF (Probability Density Function) are differentiable almost everywhere w.r.t. to both $\theta$ and $z$

What do we want to achieve?
  • Learn (approximately) via ML/MAP estimation for the parameter $\theta$
  • Approximate $p_{\theta} (z|x)$ given $\theta$ and an observed value $x$, useful for coding or data representation task 
  • Approximately marginal inference of the variable $x$. $p_{\theta}(x)=\int p_{\theta} (x|z)p_{\theta}(z)dz$
Difficulties we are facing:
  • Intractability:  $p_{\theta}(x)=\int p_{\theta} (x|z)p_{\theta}(z)dz$, $p_{\theta}(z|x)=\frac{ \int p_{\theta} (x|z)p_{\theta}(z)dz} {p_{\theta}(x)}$ are all intractable. Therefore:
  1. Standard EM algorithm cannot be used
  2. Mean-field VB is often also impossible(it requires closed-form solutions to certain expectations of the joint PDF)[2]
  • A large dataset: Sampling based methods, e.g., MCMC, Monte Carlo EM are still possible but too slow[2]. 
  • Pure MAP will lead to over-fitting when dimension of $z$ is high[2]. 

Auto-encoding Variational Bayes


  • Learn a neural net parameterized by $\phi$  $p_{\phi}(z|x)$to approximate  the posterior 
  • Construct estimator of the variational lower bound, jointly optimizing $\phi$ and $\theta$

Graphical model under consideration. Solid lines denote the generative model $p_{\theta}(z) p_{\theta}(x|z)$ while dashed lines denote the variational approximation $q_{\phi}(z|x)$ to the intractable posterior $p_{\theta}(z|x) $ From [1]'s slides

The variational bound:
$\log p_{\theta}(x^{(1)}, ..., x^{(N)})=\sum_{i=1}^N \log p_{\theta} (x^{(i)})$
for each $i$:
\log p_{\theta}(x^{(i)})&=E_{z \sim q_{\phi}(z|x^{(i)}) } [\log p_{\theta}(x^{(i)})] \\

&=E_z [\log \frac{p_{\theta}(x^{(i)} |z) p_{\theta}(z)} {p_{\theta}(z | x^{(i)} ) }]  \: (\textrm{Bayes' Rule}) \\

&=E_z[\log \frac{p_{\theta}(x^{(i)} |z) p_{\theta}(z)} {p_{\theta}(z | x^{(i)} ) }  \frac{ q_{\phi} (z| x^{(i)}) } {q_{\phi} (z| x^{(i)}) } ] \\

&=E_z [ \log p_{\theta} ( x^{(i)} | z)] - E_z[ \log \frac{q_{\phi} (z| x^{(i)}) }{p_{\theta}(z) }] + E_z [ \log \frac{q_{\phi} (z| x^{(i)}) } {p_{\theta}(z | x^{(i)}) }] \\
&=\color{red} { E_z[ \log p_{\theta} ( x^{(i)} | z)] } - \color{green}{  D_{KL}( q_{\phi} (z| x^{(i)}) || p_{\theta}(z) ) } +\color{blue} { D_{KL}( q_{\phi} (z| x^{(i)}) || p_{\theta}(z | x^{(i)})) }\\

&= \mathcal{L}(\theta, \phi; x^{(i)}) +  D_{KL}( q_{\phi} (z| x^{(i)}) || p_{\theta}(z | x^{(i)})) \\

&\ge  \mathcal{L}(\theta, \phi; x^{(i)}) \: \textrm{(Variational lower bound)}

In the red term, $ \log p_{\theta} ( x^{(i)} | z)]$ is the probability of $x^{(i)}$ given $z$ we have sampled. This term is greater if given true ( or more "truthful") latent encoding $z|x^{(i)}$. Therefore, red term is also called reconstruction error/loss.

Green term is the KL between probabilistic encoder $q_{\phi} (z| x^{(i)})$ and prior $p_{\theta}(z)$. It encourages the model to learn some more interesting latent encoding rather than simply encoding an identity mapping. This term is called regularization term.

Blue term is the KL between two intractable posteriors. To my knowledge there isn't good way to deal with it. We simply acknowledge this is a non-negative term.

Red term minus green term yields the classical variational lower bound:
\mathcal{L}(\theta, \phi; x^{(i)}) &= E_z[ \log p_{\theta} ( x^{(i)} | z)]  -  D_{KL}( q_{\phi} (z| x^{(i)}) || p_{\theta}(z) ) \\

&=E_z[ \log p_{\theta} ( x^{(i)} ,z) - \log q_{\phi} (z| x^{(i)}) ]

Possible ways to deal with this bound:
$f_{\theta, \phi}(z)= \log p_{\theta} ( x^{(i)} ,z) - \log q_{\phi} (z| x^{(i)})  $
\mathcal{L}(\theta, \phi; x^{(i)}) &= E_{q_{\phi}(z|x^{(i)})}[f_{\theta, \phi}(z)] \\
&\approx \frac{1}{L} \sum_{i=1}^L f_{\theta, \phi}(z^{(l)}) \:\textrm{(Monte Carlo Estimator of this bound)}

where $z^{(l)} \sim q_{\phi}(z|x)$

We also need to compute the gradient:
$ \nabla_{\theta, \phi} E_{q_\phi(z)} \left [ \log p_\theta(x,z) - \log q_\phi(z|x) \right] $

For $\nabla_{\theta}$ things are not too hard:
$ \nabla_{\theta} E_{q_\phi(z)} \left [ f_{\theta, \phi}(z) \right]=E_{q_\phi(z)} \left[ \nabla_{\theta} \log p_\theta(x,z) \right]

For $\nabla_{\phi}$ things become a bit more complicated:
\nabla_{\phi} E_{q_\phi(z)} \left [ f_{\theta, \phi}(z) \right] &= \nabla_{\phi} E_{q_\phi(z)} \left [ \log p_\theta(x,z) - \log q_\phi(z|x) \right] \\

&= \nabla_{\phi} \int{\log p_\theta(x,z)  q_\phi(z|x) dz} -\nabla_{\phi} \int{q_\phi(z|x) \log q_\phi(z|x) dz }\: \textrm{(Assuming continuous  latent variable case)} \\

&=\int{\log p_\theta(x,z) \nabla_{\phi} q_\phi(z|x) dz} - \int{(\log q_\phi(z|x) +1) \nabla_{\phi}
q_\phi(z|x)  dz } \\
&= \int{ (\log p_\theta(x,z) - \log q_\phi(z|x) )  \nabla_{\phi}  q_\phi(z|x)  dz }

where we use the fact that
\int{ \nabla_{\phi}  q_\phi(z|x)  dz  } &=\nabla_{\phi}\int{   q_\phi(z|x)  dz  } \\
&=  \nabla_{\phi} 1 \\
&= 0

Again, we use the identity
$\nabla_{\phi}  q_\phi(z|x) =  q_\phi(z|x)  \nabla_{\phi} \log{q_\phi(z|x) }$
(Why? Because $\nabla_{\phi} \log{q_\phi(z|x) }= \frac{ \nabla_{\phi} q_\phi(z|x) } {q_\phi(z|x) }q_\phi(z|x) $) )

So we have:
\nabla_{\phi} E_{q_\phi(z)} \left [ f_{\theta, \phi}(z) \right] &= \nabla_{\phi} E_{q_\phi(z)} \left [ \log p_\theta(x,z) - \log q_\phi(z|x) \right] \\

&=\int{ (\log p_\theta(x,z) - \log q_\phi(z|x) )  \nabla_{\phi}  q_\phi(z|x)  dz } \\

&=\int{ (\log p_\theta(x,z) - \log q_\phi(z|x) )  q_\phi(z|x)  \nabla_{\phi} \log{q_\phi(z|x) } }\\

&=\color{red } { E_{q_\phi(z|x)} [ (\log p_\theta(x,z) - \log q_\phi(z|x) )   \nabla_{\phi}\log{q_\phi(z|x) }   ] }


An possible estimator can be:
\nabla_{\phi} E_{q_\phi(z)} \left [ f_{\theta, \phi}(z) \right]
&=\color{red } { E_{q_\phi(z|x)} [ (\log p_\theta(x,z) - \log q_\phi(z|x) )   \nabla_{\phi}\log{q_\phi(z|x) }   ] } \\

&\approx \color{blue}{\frac{1}{L} \sum_{l=1}^L f_{\theta, \phi}(z) \nabla_{\phi}   \log q_\phi(z^{(i)} |x)}


where $z^{(l)} \sim q_{\phi}(z|x)$
However, this estimator exhibits very high variance.

Re-parameterization trick:
The core of re-parameterization trick is to rewrite the $q_{\phi}(z|x)$ in terms of a deterministic mapping $g$ of both $x$ and auxiliary noise variable $\epsilon $ sampled from some probability distribution $p(\epsilon)$.
$\widetilde{\mathbf{z}}=g_{\phi}(\epsilon, \mathbf{x})$ with $\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})$

Re-parameterization moves the random part to the $\epsilon$ which is reflected in the sampling process. Picture from [1].

Now the Monte Carlo estimates of the expectations of  some function $f(z)$ w.r.t. $q_{\phi}(z|x)$ is as follows:
$\mathbb{E}_{q_{\phi}\left(\mathbf{z} | \mathbf{x}^{(i)}\right)}[f(\mathbf{z})]=\mathbb{E}_{p(\epsilon)}\left[f\left(g_{\phi}\left(\boldsymbol{\epsilon}, \mathbf{x}^{(i)}\right)\right)\right] \simeq \frac{1}{L} \sum_{l=1}^{L} f\left(g_{\phi}\left(\boldsymbol{\epsilon}^{(l)}, \mathbf{x}^{(i)}\right)\right)$

$\boldsymbol{\epsilon}^{(l)} \sim p(\boldsymbol{\epsilon})$.

And the variational lower bound can be approximated rewritten in two ways, yielding two Stochastic Gradient Variational Bayes Estimator :
$$\begin{align} \mathcal{L}\left(\boldsymbol{\theta}, \boldsymbol{\phi} ; \mathbf{x}^{(i)}\right) &\approx \widetilde{\mathcal{L}}^{A}\left(\boldsymbol{\theta}, \boldsymbol{\phi} ; \mathbf{x}^{(i)}\right) \\

&=\color{blue} { \frac{1}{L} \sum_{l=1}^{L} \log p_{\boldsymbol{\theta}}\left(\mathbf{x}^{(i)}, \mathbf{z}^{(i, l)}\right)-\log q_{\boldsymbol{\phi}}\left(\mathbf{z}^{(i, l)} | \mathbf{x}^{(i)}\right) }
 \end{align} $$

where $\mathbf{z}^{(i, l)}=g_{\phi}\left(\boldsymbol{\epsilon}^{(i, l)}, \mathbf{x}^{(i)}\right)$
and $\boldsymbol{\epsilon}^{(l)} \sim p(\boldsymbol{\epsilon})$

2. Often the KL-divergence $D_{K L}\left(q_{\phi}\left(\mathbf{z} | \mathbf{x}^{(i)}\right) \| p_{\boldsymbol{\theta}}(\mathbf{z})\right)$ can be integrated analytically (especially true when both distributions are gaussians) leading to our second version of estimator:

$$\begin{align} \mathcal{L}\left(\boldsymbol{\theta}, \boldsymbol{\phi} ; \mathbf{x}^{(i)}\right) &\approx \widetilde{\mathcal{L}}^{B}\left(\boldsymbol{\theta}, \boldsymbol{\phi} ; \mathbf{x}^{(i)}\right) \\

&=\color{blue} {-D_{K L}\left(q_{\phi}\left(\mathbf{z} | \mathbf{x}^{(i)}\right) \| p_{\boldsymbol{\theta}}(\mathbf{z})\right)+\frac{1}{L} \sum_{l=1}^{L}\left(\log p_{\boldsymbol{\theta}}\left(\mathbf{x}^{(i)} | \mathbf{z}^{(i, l)}\right)\right)}
 \end{align} $$
where $\mathbf{z}^{(i, l)}=g_{\phi}\left(\boldsymbol{\epsilon}^{(i, l)}, \mathbf{x}^{(i)}\right)$
and $\boldsymbol{\epsilon}^{(l)} \sim p(\boldsymbol{\epsilon})$

When given multiple datapoints from dataset $X={x^{(i)} }_{i=1}^N $, we can construct an estimator based on mini-batches:
$\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi} ; \mathbf{X}) \simeq \widetilde{\mathcal{L}}^{M}\left(\boldsymbol{\theta}, \boldsymbol{\phi} ; \mathbf{X}^{M}\right)=\frac{N}{M} \sum_{i=1}^{M} \widetilde{\mathcal{L}}\left(\boldsymbol{\theta}, \boldsymbol{\phi} ; \mathbf{x}^{(i)}\right)$

where $\mathbf{X}^{M}=\left\{\mathbf{x}^{(i)}\right\}_{i=1}^{M}$ is the randomly drawn sample.

$\nabla_{\boldsymbol{\theta}, \boldsymbol{\phi}} \widetilde{\mathcal{L}}\left(\boldsymbol{\theta} ; \mathbf{X}^{M}\right)$ can be taken, of course.

The intuition behind this re-parameterization trick is:
we want to find to find some $p(\epsilon)$ such that
$q_{\phi}(\mathbf{z} | \mathbf{x}) \prod_{i} d z_{i}=p(\epsilon) \prod_{i} d \epsilon_{i}$
therefore, $\int q_{\phi}(\mathbf{z} | \mathbf{x}) f(\mathbf{z}) d \mathbf{z}=\int p(\boldsymbol{\epsilon}) f(\mathbf{z}) d \boldsymbol{\epsilon}=\int p(\boldsymbol{\epsilon}) f\left(g_{\phi}(\boldsymbol{\epsilon}, \mathbf{x})\right) d \boldsymbol{\epsilon}$
and so a differentiable estimator can be constructed: $\int q_{\phi}(\mathbf{z} | \mathbf{x}) f(\mathbf{z}) d \mathbf{z} \simeq \frac{1}{L} \sum_{l=1}^{L} f\left(g_{\phi}\left(\mathbf{x}, \epsilon^{(l)}\right)\right)$ where $\boldsymbol{\epsilon}^{(l)} \sim p(\boldsymbol{\epsilon})$.

Concrete ways of finding reparameterizations can be found from the original paper[7].

But the author didn't mention why re-parameterization helps alleviate variance, possible explanations can be found in the appendix section of [8].


  1. Kingma's NIPS 2015 workshop slides
  2. Kingma's Stochastic Gradient VB. Intractable posterior distributions? Gradients to the rescue!
  4. CS231n Lecture 11 Slides 
  5. Mnih, A., & Gregor, K. (n.d.). Neural Variational Inference and Learning in Belief Networks. 
  6. Openreview page for the original paper
  7. Kingma, D., & Welling, M. (2014). Auto-Encoding Variational Bayes
  8. Rezende, D., Mohamed, S., & Wierstra, D. (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models.