Many machine learning resources give vague definitions, often because the authors themselves don't know the concepts well enough.


So, what is cross-entropy?

KL divergence (discrete form):

$KL(p||q)=\sum_{k=1}^K p_k \log \frac{p_k}{q_k}$

$KL(p||q)=\sum_k p_k \log p_k -\sum_k p_k \log q_k=-H(p)+ H(p,q)$

Here, $H(p)$ is the entropy of distribution $p$, and $H(p,q)$ is the cross-entropy between distributions $p$ and $q$. Notice that cross-entropy, like KL divergence, is asymmetric.

According to *Cover and Thomas 2006*, **cross-entropy is the average number of bits needed to encode data coming from a source with distribution $p$ when we use model $q$ to define our codebook.** Hence the "regular" entropy is $H(p)=H(p,p)$.
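These identities are easy to check numerically. Below is a minimal sketch (the function names and example distributions are my own, for illustration) that verifies the decomposition $KL(p||q)=-H(p)+H(p,q)$, the asymmetry, and $H(p)=H(p,p)$:

```python
import math

def entropy(p):
    # H(p) = -sum_k p_k log p_k  (natural log, so units are nats)
    return -sum(pk * math.log(pk) for pk in p)

def cross_entropy(p, q):
    # H(p, q) = -sum_k p_k log q_k
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q))

def kl(p, q):
    # KL(p||q) = sum_k p_k log(p_k / q_k)
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

# KL(p||q) = -H(p) + H(p, q), matching the decomposition above
assert math.isclose(kl(p, q), cross_entropy(p, q) - entropy(p))
# Both quantities are asymmetric
assert kl(p, q) != kl(q, p)
# The "regular" entropy: H(p) = H(p, p)
assert math.isclose(entropy(p), cross_entropy(p, p))
```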

So what is a cross-entropy loss function, then?

Be patient; first look at where maximum log-likelihood comes from.

You may have seen the derivation of MLE (Maximum Likelihood Estimation) several times. You assume:

- The data are i.i.d. (recent years have seen research on non-i.i.d. data, but that's beyond this article)
- $X=\{x^{(1)},\dots, x^{(m)}\}$
- $p_{model}(x;\theta)$ is a parametric family of probability distributions over the data

$\theta_{ML}=argmax_{\theta} p_{model}(X; \theta)=argmax_{\theta} \prod_{i=1}^m p_{model}(x^{(i)}; \theta)$

Taking the log doesn't change the arg max, since the log is monotonically increasing (and the likelihoods are all positive); the product becomes a sum:

$argmax_{\theta} \sum_{i=1}^m \log p_{model} (x^{(i)}; \theta)$

Dividing by $m$ (which also doesn't change the arg max) turns the sum into an expectation under the empirical distribution $\tilde{p}_{data}$:

$\theta_{ML}=argmax_{\theta} E_{x \sim \tilde{p}_{data}} [\log p_{model} (x; \theta)]$

Notice:

$KL(\tilde{p}_{data} || p_{model})=E_{x \sim \tilde{p}_{data}} [\log \tilde{p}_{data}(x) - \log p_{model}(x) ]$

The first term depends only on the data, so minimizing $KL(\tilde{p}_{data} || p_{model})$ w.r.t. our model is equivalent to minimizing the cross-entropy term:

$-E_{x \sim \tilde{p}_{data}} [\log p_{model} (x)]=H(\tilde{p}_{data}, p_{model})$
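To see the equivalence concretely, here is a small sketch with a toy Bernoulli model (the data and the grid search are my own illustration, not from the references). Maximizing the average log-likelihood and minimizing the cross-entropy against the empirical distribution pick out the same $\theta$:

```python
import math

# Toy data: 7 ones and 3 zeros, so the empirical distribution has p(1) = 0.7
data = [1] * 7 + [0] * 3
m = len(data)

def avg_log_likelihood(theta):
    # (1/m) * sum_i log p_model(x_i; theta) for a Bernoulli(theta) model
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data) / m

def cross_entropy(theta):
    # H(p_data, p_model) = -E_{x ~ p_data}[log p_model(x; theta)]
    p1 = sum(data) / m
    return -(p1 * math.log(theta) + (1 - p1) * math.log(1 - theta))

thetas = [i / 100 for i in range(1, 100)]  # grid over (0, 1)
theta_mle = max(thetas, key=avg_log_likelihood)
theta_ce = min(thetas, key=cross_entropy)

# Maximizing average log-likelihood == minimizing cross-entropy,
# and both recover the empirical frequency 0.7
assert theta_mle == theta_ce == 0.7
```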

This is also equivalent to MLE above. In fact, any loss consisting of a negative log-likelihood is a cross-entropy between the


**empirical distribution** and the **probability distribution** defined by the model. For example, mean squared error is the cross-entropy between the empirical distribution and a Gaussian model. The term "cross-entropy," when used to refer only to the **negative log-likelihood (NLL)** of a Bernoulli (logistic regression) or softmax distribution, is a misnomer, because cross-entropy in fact arises in machine learning wherever there is a maximum likelihood.

--------------------------------

2019.5.18 Note:

For many discriminative models, the above formulas aren't very accurate. Concretely:

- $X=\{x^{(i)}\}_{i=1}^m$, $Y=\{y^{(i)}\}_{i=1}^m$, with pairs $(x^{(i)}, y^{(i)}) \sim \tilde{p}_{data}$

- $$\theta_{ML}=argmax_{\theta} p_{model}(Y|X, \theta)=argmax_{\theta} \prod_{i=1}^m p_{model}(y^{(i)} | x^{(i)}, \theta)$$
- $\theta_{ML}=argmax_{\theta} E_{x,y \sim \tilde{p}_{data}}[ \log p_{model} (y | x, \theta)]$
- Then it can be seen as minimizing a cross-entropy: $-E_{x,y \sim \tilde{p}_{data}}[ \log p_{model} (y | x, \theta)]=H(\tilde{p}_{data}(y|x), p_{model}(y|x, \theta))$
- Or you can see it from the KL-divergence perspective: $argmin_{\theta} KL(\tilde{p}_{data}(y|x) || p_{model}(y|x))=argmin_{\theta} E_{x,y \sim \tilde{p}_{data}} [\log \tilde{p}_{data}(y|x) - \log p_{model}(y|x) ]=argmin_{\theta} H(\tilde{p}_{data}(y|x), p_{model}(y|x, \theta))$
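The mean-squared-error case mentioned above can be checked the same way. In this sketch, a conditional Gaussian model $p_{model}(y|x, w)=N(y; wx, \sigma^2)$ is fit to a hypothetical toy regression set (all numbers are my own); the average NLL equals MSE scaled by $1/(2\sigma^2)$ plus a constant, so both losses share the same minimizer:

```python
import math

# Hypothetical toy regression data (x, y); numbers are for illustration only
pairs = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8)]
sigma = 1.0  # fixed noise scale of the Gaussian model

def avg_nll(w):
    # -(1/m) sum_i log N(y_i; w*x_i, sigma^2)
    total = 0.0
    for x, y in pairs:
        total += -0.5 * math.log(2 * math.pi * sigma ** 2) \
                 - (y - w * x) ** 2 / (2 * sigma ** 2)
    return -total / len(pairs)

def mse(w):
    # Mean squared error of the linear predictor w*x
    return sum((y - w * x) ** 2 for x, y in pairs) / len(pairs)

# avg NLL = MSE / (2 sigma^2) + const, so the two differ by an affine map
const = 0.5 * math.log(2 * math.pi * sigma ** 2)
for w in [0.0, 0.5, 1.0, 1.5]:
    assert math.isclose(avg_nll(w), mse(w) / (2 * sigma ** 2) + const)

# Hence both losses are minimized by the same weight on a grid search
grid = [i / 1000 for i in range(2001)]
assert abs(min(grid, key=avg_nll) - min(grid, key=mse)) < 1e-9
```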

References:

*Cover, T. M., & Thomas, J. A. (2006). Elements of information theory. Wiley-Interscience.*

*Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.*

*Murphy, K. P. (2012). Machine learning: A probabilistic perspective. MIT Press.*
