Friday, October 25, 2019

Fusion of Detected Objects in Text for Visual Question Answering


B2T2: Bounding Boxes in Text Transformer 


Late fusion: Text and image that are integrated late
Early fusion: The processing of one be conditioned on the analysis of the other 

Data:= $(I , B, T, I)$

  • $I$ is an image
  • $B=[b_1, b_2, ..., b_m ]$ a list of bounding boxes referring to regions of $I$, where each $b_i $ is identified by the lower left corner, height and width
  • $T=[t_1, ..., t_n]$ is a passage of tokenized text (some tokens are not natural language but references to $B$)
  • $l$ is a numeric class label in $\{ 0, 1\}$


model the class distribution as:

$$p(l | T, I , B, R)= \frac{e^{\phi (E'(T, I, B, R))*a_l + b_l}}
{\sum_{l'}  e^{\phi (E'(T, I, B, R))*a_{l'} + b_{l'}} }

E'(T, I, B, R)= E(T) + \sum_{i=1}^m R_i [M \Phi(crop(I, b_i)) + \pi (b_i)]^T

where $M$ is a learnt matrix, $\Phi$ can be a resnet, $\pi(b_i)$ denotes the embedding of $b_i$'s shape and position information in $R^d$

Monday, October 21, 2019

Gradient Boosting Trees

What is gradient boosting

A kind of boosting using functional gradient descent. It uses weak learner to fit gradient residuals and then add to original learner, thus making an ensemble. 

Input: training set  a differentiable loss function  number of iterations M.
  1. Initialize model with a constant value:
  2. For m = 1 to M:
    1. Compute so-called pseudo-residuals:
    2. Fit a base learner (or weak learner, e.g. tree)  to pseudo-residuals, i.e. train it using the training set .
    3. Compute multiplier  by solving the following one-dimensional optimization problem:
    4. Update the model:
  3. Output 


$$\text{obj}(\theta) = L(\theta) + \Omega(\theta)$$

where $L$ is the training loss function, and $Ω$ is the regularization term. The training loss measures how predictive our model is with respect to the training data. A common choice of L is the mean squared error, which is given by
L(\theta) = \sum_i (y_i-\hat{y}_i)^2

\text{obj} = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\Omega(f_i)

Additive Training:
\begin{split}\hat{y}_i^{(0)} &= 0\\
\hat{y}_i^{(1)} &= f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)\\
\hat{y}_i^{(2)} &= f_1(x_i) + f_2(x_i)= \hat{y}_i^{(1)} + f_2(x_i)\\
\hat{y}_i^{(t)} &= \sum_{k=1}^t f_k(x_i)= \hat{y}_i^{(t-1)} + f_t(x_i)\end{split}

For step t:
\begin{split}\text{obj}^{(t)} & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\Omega(f_i) \\
          & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) + \mathrm{constant}\end{split}

For L2 loss:
\begin{split}\text{obj}^{(t)} & = \sum_{i=1}^n (y_i - (\hat{y}_i^{(t-1)} + f_t(x_i)))^2 + \sum_{i=1}^t\Omega(f_i) \\
          & = \sum_{i=1}^n [2(\hat{y}_i^{(t-1)} - y_i)f_t(x_i) + f_t(x_i)^2] + \Omega(f_t) + \mathrm{constant}\end{split}

\text{obj}^{(t)} = \sum_{i=1}^n [l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t) + \mathrm{constant}

\begin{split}g_i &= \partial_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})\\
h_i &= \partial_{\hat{y}_i^{(t-1)}}^2 l(y_i, \hat{y}_i^{(t-1)})\end{split}

\textrm{obt}^{(t)}=\sum_{i=1}^n [g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t)


Redefine each tree as:
f_t(x) = w_{q(x)}, w \in R^T, q:R^d\rightarrow \{1,2,\cdots,T\}

Here w is the vector of scores on leaves, q is a function assigning each data point to the corresponding leaf, and T is the number of leaves. In XGBoost, we define the complexity as

\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2


\begin{split}\text{obj}^{(t)} &\approx \sum_{i=1}^n [g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2\\
&= \sum^T_{j=1} [(\sum_{i\in I_j} g_i) w_j + \frac{1}{2} (\sum_{i\in I_j} h_i + \lambda) w_j^2 ] + \gamma T\end{split}

where Ij={i|q(xi)=j} is the set of indices of data points assigned to the j-th leaf. Notice that in the second line we have changed the index of the summation because all the data points on the same leaf get the same score. We could further compress the expression by defining:

$G_j = \sum_{i\in I_j} g_i$

$H_j = \sum_{i\in I_j} h_i$

\text{obj}^{(t)} = \sum^T_{j=1} [G_jw_j + \frac{1}{2} (H_j+\lambda) w_j^2] +\gamma T

The theoretical best solution is:
\begin{split}w_j^\ast &= -\frac{G_j}{H_j+\lambda}\\
\text{obj}^\ast &= -\frac{1}{2} \sum_{j=1}^T \frac{G_j^2}{H_j+\lambda} + \gamma T\end{split}

Splitting criterion:
Gain = \frac{1}{2} \left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma

Thursday, September 26, 2019

Attention Mechanism

Why attention? 

Resource saving

Only need sensors where relevant bits are
(e.g. fovea vs. peripheral vision)
• Only compute relevant bits of information
(e.g. fovea has many more ‘pixels’ than periphery)

Variable state manipulation

• Manipulate environment (for all grains do: eat)
• Learn modular subroutines (not state)

Some forms of attention that you might not notice:

  • Watson Nadaraya Estimator
  •  Pooling
  • Iterative Pooling 


  1. Alex Smola and Aston Zhang, Attention in Deep Learning, ICML 2019 Tutorial , Slides link 

Monday, September 23, 2019

Mutual Information

Mutual Information 
I(X ; Y)=D_{\mathrm{KL}}\left(P_{(X, Y)} \| P_{X} \otimes P_{Y}\right)


Intuitively, mutual information measures the information that $X$ and $Y$ share: It measures how much knowing one of these variables reduces uncertainty about the other.

\begin{aligned} \mathrm{I}(X ; Y) & \equiv \mathrm{H}(X)-\mathrm{H}(X | Y) \\ & \equiv \mathrm{H}(Y)-\mathrm{H}(Y | X) \\ & \equiv \mathrm{H}(X)+\mathrm{H}(Y)-\mathrm{H}(X, Y) \\ & \equiv \mathrm{H}(X, Y)-\mathrm{H}(X | Y)-\mathrm{H}(Y | X) \end{aligned}

where $H(Y|X)$ means Conditional Entropy, defined as:


\mathrm{H}(Y | X)=-\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)}


\begin{aligned} \mathrm{H}(Y | X) & \equiv \sum_{x \in \mathcal{X}} p(x) \mathrm{H}(Y | X=x) \\ &=-\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y | x) \log p(y | x) \\ &=-\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y | x) \\ &=-\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p(x, y) \log p(y | x) \\ &=-\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)} \\ &=\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p(x, y) \log \frac{p(x)}{p(x, y)} \end{aligned}


Monday, September 16, 2019



Assume the RNN is with the following architecture, with hyperbolic tangent activation function, output is discrete.

From time $t=1$ to $t=\tau$ , we apply the update equations:
$\begin{aligned} \boldsymbol{a}^{(t)} &=\boldsymbol{b}+\boldsymbol{W} \boldsymbol{h}^{(t-1)}+\boldsymbol{U} \boldsymbol{x}^{(t)} \\ \boldsymbol{h}^{(t)} &=\tanh \left(\boldsymbol{a}^{(t)}\right) \\ \boldsymbol{o}^{(t)} &=\boldsymbol{c}+\boldsymbol{V} \boldsymbol{h}^{(t)} \\ \hat{\boldsymbol{y}}^{(t)} &=\operatorname{softmax}\left(\boldsymbol{o}^{(t)}\right) \end{aligned}$

The total loss for a given sequence of $x$ values paired with a sequence of $y$ values would then be just the sum of the losses over all the time steps. If $L^{(t)}$  is the negative log-likelihood of $y^{(t)}$ given $x^{(1)}, x^{(2)}, ..., x^{(t)}$, then

$\begin{aligned} & L\left(\left\{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(\tau)}\right\},\left\{\boldsymbol{y}^{(1)}, \ldots, \boldsymbol{y}^{(\tau)}\right\}\right) \\=& \sum_{t} L^{(t)} \\=&-\sum_{t} \log p_{\text {model }}\left(y^{(t)} |\left\{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(t)}\right\}\right) \end{aligned}$

For each node $\boldsymbol{N}$:

$$\frac{\partial L}{\partial L^{(t)}}=1$$


LSTM architecture from Deep Learning book

Forget gate:   $$ f_{i}^{(t)}=\sigma\left(b_{i}^{f}+\sum_{j} U_{i, j}^{f} x_{j}^{(t)}+\sum_{j} W_{i, j}^{f} h_{j}^{(t-1)}\right) $$

where $\sigma$ is the sigmoid function

Input gate: 
g_{i}^{(t)}=\sigma\left(b_{i}^{g}+\sum_{j} U_{i, j}^{g} x_{j}^{(t)}+\sum_{j} W_{i, j}^{g} h_{j}^{(t-1)}\right)


Output gate:
q_{i}^{(t)}=\sigma\left(b_{i}^{o}+\sum_{j} U_{i, j}^{o} x_{j}^{(t)}+\sum_{j} W_{i, j}^{o} h_{j}^{(t-1)}\right)


LSTM Cell Internal State:
s_{i}^{(t)}=f_{i}^{(t)} s_{i}^{(t-1)}+g_{i}^{(t)} \sigma\left(b_{i}+\sum_{j} U_{i, j} x_{j}^{(t)}+\sum_{j} W_{i, j} h_{j}^{(t-1)}\right)


Output(hidden state):
h_{i}^{(t)}=\tanh \left(s_{i}^{(t)}\right) q_{i}^{(t)}


  1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT press.

Sunday, June 23, 2019

Different perspectives on importance weight autoencoders

The importance-weighted autoencoder (IWAE) approach of Burda et al.[1] defines a sequence of increasingly tighter bounds on the marginal likelihood of latent variable models. Recently, Cremer et al.[2] reinterpreted the IWAE bounds as ordinary variational evidence lower bounds (ELBO) applied to increasingly accurate variational distributions. In this work, we provide yet another perspective on the IWAE bounds. We interpret each IWAE bound as a biased estimator of the true marginal likelihood where for the bound defined on K samples we show the bias to be of order O(1/K). In our theoretical analysis of the IWAE objective we derive asymptotic bias and variance expressions. Based on this analysis we develop jackknife variational inference (JVI), a family of bias-reduced estimators reducing the bias to O(K^{-(m+1)}) for any given m < K while retaining computational efficiency. Finally, we demonstrate that JVI leads to improved evidence estimates in variational autoencoders. We also report first results on applying JVI to learning variational autoencoders.
Microsoft Research


  1. Burda, Y., Grosse, R., & Salakhutdinov, R. (2016). IMPORTANCE WEIGHTED AUTOENCODERS. 
  2. Cremer, C., Morris, Q., & Duvenaud, D. (n.d.). Workshop track -ICLR 2017 REINTERPRETING IMPORTANCE-WEIGHTED AUTOENCODERS. 
  3. Debiasing Evidence Approximations: On Importance-Weighted Autoencoders and Jackknife Variational Inference