Monday, May 27, 2019

What is reinforcement learning?

A lot of people might think this question is obvious. It's just a bunch of methods for optimizing behavior for an agent in some environment based on reward signals.

  • But what's its history? 
  • How does it relate to optimization problem or traditional optimal control problem? 
  • How is it different from supervised or unsupervised learning?
  •  Why it's named "reinforcement learning" instead of "optimal behavior learning". 
  • Does the "reinforcement" word suggest something special?
Can you answer these questions?

Approximate dynam-ic programming (ADP) has emerged as a powerful tool for tackling a diverse collection of stochastic optimization problems. Reflecting the wide diversity of problems, ADP (including research under names such as reinforcement learning, adaptive dynamic programming and neuro-dynamic programming) has become an umbrella for a wide range of algorithmic strategies. Most of these involve learning functions of some form using Monte Carlo sampling. A recurring theme in these algorithms involves the need to not just learn policies, but to learn them quickly and effectively. Learning arises in both offline settings (training an algorithm within the computer) and online settings (where we have to learn as we go). Learning also arises in different ways within algorithms, including learning the parameters of a policy, learning a value function and learning how to expand the branches of a tree.
Approximate Dynamic Programming, First edition. By Frank Lewis and Derong Liu 

This book seems to view the so-called "reinforcement learning" as an alias for approximate dynamic programming which is used for solving stochastic optimization problems. So:

Reinforcement Learning= Approximate Dynamic Programming?

The term “ADP” can be interpreted either as “Adaptive Dynamic Programming” (with apologies to Warren Powell) or as “Approximate Dynamic Programming” (as in much of my own earlier work). The long-term goal is to build systems which include both capabilities; therefore, I will simply use the acronym “ADP” itself. Various strands of the field have sometimes been called “reinforcement learning” or “adaptive critics” or “neurodynamic programming,” but the term “reinforcement learning” has had many different meanings to many different people.
Learning And Approximate Dynamic Programming. By Jennie Si, Andy Barto, Warren Powell, and Donald Wunsch 
By the way, this book  or this paper  has drawn a comparison between supervised learning and reinforcement learning. Also you can find that comparison in Sutton and Barton's Reinforcement Learning book.  

Why is it called reinforcement learning?

The term reinforcement comes from studies of animal learning in experimental psychology, where it refers to the occurrence of an event, in the proper relation to aresponse, that tends to increase the probability that the response will occur againin the same situation. The simplest reinforcement learning algorithms make use ofthe commonsense idea that if an action is followed by a satisfactory state of affairs,or an improvement in the state of affairs, then the tendency to produce that actionis strengthened, i.e., reinforced. This is the principle articulated by Thorndike inhis famous “Law of Effect” (Thorndike, 1911). Instead of the term reinforcementlearning, however, psychologists use the terms instrumental conditioning, or operantconditioning, to refer to experimental situations in which what an animal actuallydoes is a critical factor in determining the occurrence of subsequent events. Thesesituations are said to include response contingencies, in contrast to Pavlovian, orclassical, conditioning situations in which the animal’s responses do not influencesubsequent events, at least not those controlled by the experimenter. There are verymany accounts of instrumental and classical conditioning in the literature, and thedetails of animal behavior in these experiments are surprisingly complex. See, forexample, Hergenhahn & Olson, 2001. The basic principles of learning via reinforcement have had an influence on engineering for many decades (e.g., Mendel &McClaren, 1970) and on Artificial Intelligence since its very earliest days (Minsky,1954, 1961; Samuel 1959; Turing, 1950). It was in these early studies of artificiallearning systems that the term reinforcement learning seems to have originated. Sut-vi REINFORCEMENT LEARNING AND ITS RELATIONSHIP TO SUPERVISED LEARNINGton and Barto (1998) provide an account of the history of reinforcement learning inArtificial Intelligence.But the connection between reinforcement learning as developed in engineeringand Artificial Intelligence and the actual details of animal learning behavior is farfrom straightforward. In prefacing an account of research attempting to capturemore of the details of animal behavior in a computational model, Dayan (2002)stated that “Reinforcement learning bears a tortuous relationship with historical andcontemporary ideas in classical and instrumental conditioning.” This is certainly true,as those interested in constructing artificial learning systems are motivated more bycomputational possibilies than by a desire to emulate the details of animal learning.This is evident in the view of reinforcement learning as a combination of search andlong-term memory discussed above, which is a an abstract computational view thatdoes not attempt to do justice to all the subleties of real animal learning.For our mobile phone example, the principle of learning by reinforcement isinvolved in several different ways depending on what grain size of behavior weconsider. We could think of a move in a particular direction as a unit of behavior,being reinforced when reception improved, in which case we would tend to continueto move in the same direction. Another view, one that includes long-term memory,is that the tendency to make a call from a particular place is reinforced when a callfrom that place is successful, thus leading us to increase the probability that we willmake a call from that place in the future. Here we see the reinforcement processmanifested as the storing in long-term memory of the results of a successful search.Note that the principle of learning via reinforcement does not imply that only gradualor incremental changes in behavior are produced. It is possible for complete learningto occur on a single trial, although gradual changes in behavior make more sensewhen the contingencies are stochastic.

Tuesday, May 21, 2019

Max Welling's "Intelligence Per Kilowatt-Hour"

Complement of this blog post.
The talk starts from high-level physics analogy but in fact it's about energy-efficient machine learning in the end. In this blog post I am more interested in physics part.

Helmholtz: Free Energy=  Energy - Entropy 

What is the "ability to perform physical work"?
What is the "level of organization, information of  a system"? He didn't explain. 

It From Bit 

Entropic Forces and Gravity
Entropy is degree of ignorance
Rissanen: Minimum Description Length Principle
Hinton: Variational Methods
Large Scale Approximate Bayesian Learning: MCMC versus Variational Bayes  

Sunday, May 19, 2019

Those measurements for two probability distributions


Also known as Csiszár ƒ-divergences, Csiszár-Morimoto divergences or Ali-Silvey distances
Given two distributions $P$ and $Q$ which are both absolutely continuous w.r.t. a reference distribution $\mu$ on $\Omega$  and:
  • $dP=pd\mu$
  • $dQ=qd\mu$
$D_f(P||Q)=\int_\Omega f \frac{p(x)}{q(x)} q(x) d\mu(x)$

Thursday, May 9, 2019

What is a cross-entropy loss function?

A lot of machine learning resources tend to be vague in giving definitions. This is because the authors don't know these concepts well enough.

So, what is a cross-entropy?

KL divergence(discrete form):
$KL(p||q)=\sum_{k=1}^K p_k log (\frac{p_k}{q_k})$
$KL(p||q)=\sum_k p_k log p_k -\sum_k p_k log q_k=-H(p)+ H(p,q)$

Here, $H(p)$ is the entropy for distribution $p$, $H(p,q)$ is the cross entropy between distribution $p$ and $q$, notice cross-entropy, like KL divergence, is asymmetric.

According to Cover and Thomas 2006, cross entropy is the average number of bits needed to encode data coming from a source with a distribution $p$ when we use model $q$ to define our cookbook.
Hence the "regular" entropy is $H(p)=H(p,p)$.

So what is a cross-entropy loss function, then?
Be patient, look at where does  a maximum log likelihood come from.

You may have seen the derivation of MLE (Maximum Likelihood Estimation) several times. You assume:

  • Data is i.i.d distributed (recent years we have seen researches on non-i.i.d. data, but not for this article)
  • $X={x^{(1)},..., x^{(m)}}$
  • $p_{model}(x;\theta)$ is a parametric family of probability distributions over the data
$\theta_{ML}=argmax_{\theta} p_{model}(X; \theta)=argmax_{\theta} \sum_{i=1}^m p_{model}(x^{(i)}; \theta)$

It doesn't matter if you take the log likelihood because they are all positive:
$argmax_{\theta} \sum_{i=1}^m log( p_{model} (x^{(i)}; \theta))$

Divided  by $m$, it becomes:
$\theta_{ML}=argmax_{\theta} E_{x \sim  \tilde{p}_{data}} log p_{model} (x; \theta)$

$KL(\tilde{p}_{data} || p_{model})=E_{x \sim  \tilde{p}_{data}} [log \tilde{p}_{data}(x) - log p_{model}(x) ]$

Minimizing over $KL(\tilde{p}_{data} || p_{model})$ w.r.t our model is equivalent to minimizing for the cross entropy term:
$-E_{x \sim  \tilde{p}_{data}} [log p_{model} (x)]=H(\tilde{p}_{data} || p_{model})$

This is also equivalent to MLE above. In fact, any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution and probability distribution defined by the model.  e.g., mean squared error is the cross-entropy between the empirical distribution and a Gaussian model. The term "cross-entropy" used to refer negative log-likelihood (NLL) for a Bernoulli(logistic regression) or softmax distribution  is a misnomer because cross-entropy is in fact used in machine learning wherever there is a maximum likelihood.
2019.5.18 Note:
For many discriminative models, the above formulas aren't very accurate. Concretely:

  • $X=\{x^{(i)}\}_{i=1}^n \times Y=\{y^{(i)}\}_{i=1}^n \sim \tilde{p}_{data}$
  • $$\theta_{ML}=argmax_{\theta} p_{model}(Y|X, \theta)=argmax_{\theta} \sum_{i=1}^m p_{model}(y^{(i)} | x^{(i)}, \theta)$$
  • $\theta_{ML}=argmax_{\theta} E_{x,y \sim  \tilde{p}_{data}}[ log p_{model} (y | x, \theta)]$
  • Then it can seen as minimize cross-entropy: $-E_{x,y \sim  \tilde{p}_{data}}[ log p_{model} (y | x, \theta)]=H(\tilde{p}_{data}(y|x) || p_{model}(y|x, \theta))$
  • Or you can see it from KL divergence perspective: $argmin_{\theta} KL(\tilde{p}_{data}(y|x) || p_{model}(y|x))=argmin_{\theta} E_{x,y \sim  \tilde{p}_{data}} [log \tilde{p}_{data}(y|x) - log p_{model}(y|x) ]=argmin_{\theta} H(\tilde{p}_{data}(y|x) || p_{model}(y|x, \theta))$
  1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT press.
  2. Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.

Wednesday, May 1, 2019

Geometric Deep Learning?

I found a new website:
Good tutorials:

Basically in the tutorial Joan introduced the concept of manifold and graph theories. And then derived spectral graph convolution operations. What is missing is how can CNN be applied to general manifold instead of Euclidean space and graph?