According to Itti and Baldi, the surprise a datum causes is the KL-divergence between the probability distribution over models before observing that datum and the distribution after observing it.

**1. Why is this good?**

Measured from the prior P(M_i) to the posterior P(M_i|D), the KL-divergence in this case reduces to - \sum_i P(M_i) \log P(D|M_i) + \log P(D), where P(D) = \sum_i P(M_i) P(D|M_i). (Or equivalent integrals, if the model parameters are continuous.)
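
For reference, here is one way to see the reduction. It assumes the divergence is taken from the prior P(M_i) to the posterior P(M_i|D), which is the direction that matches the expression above, and it uses Bayes' rule, P(M_i|D) = P(M_i) P(D|M_i) / P(D). The last step relies on \sum_i P(M_i) = 1.

```latex
\begin{align*}
KL &= \sum_i P(M_i) \log \frac{P(M_i)}{P(M_i|D)} \\
   &= \sum_i P(M_i) \log \frac{P(D)}{P(D|M_i)} \\
   &= - \sum_i P(M_i) \log P(D|M_i) + \log P(D)
\end{align*}
```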

In English, this is the expected information content of the datum under a model drawn at random from the current model distribution, *minus* the actual information content of the datum given current beliefs (as represented by the current model distribution). The metric will be large when some models currently believed plausible rate the datum as highly improbable, so that the first term is large, while other models currently believed plausible rate it as probable, so that the second term is small. That is, the metric will be large when the datum helps distinguish between possible models.
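
To make that concrete, here is a small sketch for a finite set of models. The function name `surprise` and the numbers are made up for illustration. It just evaluates the expression above directly, in nats, and assumes every model gives the datum a nonzero probability.

```python
import math

def surprise(prior, likelihoods):
    """Evaluate log P(D) - sum_i P(M_i) log P(D|M_i), in nats.

    prior[i] is P(M_i) and likelihoods[i] is P(D|M_i).
    Assumes every likelihood is strictly positive (log(0) is undefined).
    """
    # P(D) = sum_i P(M_i) P(D|M_i)
    p_d = sum(p * l for p, l in zip(prior, likelihoods))
    # Expected log-likelihood under the current model distribution
    expected_log_lik = sum(p * math.log(l) for p, l in zip(prior, likelihoods))
    return math.log(p_d) - expected_log_lik

# Two equally plausible models (illustrative numbers only).
prior = [0.5, 0.5]

# Both models rate the datum the same: it cannot tell them apart.
agree = surprise(prior, [0.2, 0.2])

# One model rates the datum probable, the other highly improbable.
disagree = surprise(prior, [0.9, 0.001])

print(agree, disagree)
```

The first case comes out at essentially zero and the second is much larger, which is the "helps distinguish between models" behaviour described above.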

**2. How to compute it?**

Even after the mucking about above, this is a nasty sum (or integral) to compute. Itti and Baldi limited themselves to a very simple form of model for which the integral has an algebraic solution. I want to apply this to more complex models, so I've been considering numerical means of integration.

The obvious thing that springs to mind is to represent the model distribution by randomly sampling a set of models from the model distribution. To get this set of samples, I'm thinking the bastard child of the Metropolis-Hastings algorithm and downhill simplex optimization might be good.
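
As a starting point, here is a sketch of plain random-walk Metropolis sampling over a single continuous model parameter. This is not the simplex hybrid mentioned above, just the ordinary algorithm it would be built on. The function name, the step size, the burn-in length and the toy target density are all assumptions chosen for illustration.

```python
import math
import random

def metropolis_hastings(log_density, start, n_samples, step=1.0, burn_in=500, seed=0):
    """Draw samples from a 1-D distribution known up to a constant.

    log_density(x) returns the log of the unnormalised density at x.
    The proposal is a symmetric Gaussian random walk, so the Hastings
    correction cancels and only the density ratio matters.
    Assumes log_density(start) is finite.
    """
    rng = random.Random(seed)
    x = start
    log_p = log_density(x)
    samples = []
    for i in range(burn_in + n_samples):
        proposal = x + rng.gauss(0.0, step)
        log_p_new = log_density(proposal)
        # Accept with probability min(1, p_new / p_old).
        accept_prob = math.exp(min(0.0, log_p_new - log_p))
        if rng.random() < accept_prob:
            x, log_p = proposal, log_p_new
        # Keep the current state (repeats included) once burn-in is over.
        if i >= burn_in:
            samples.append(x)
    return samples

# Toy target: an unnormalised Gaussian over the model parameter, centred on 1.
def toy_log_density(m):
    return -0.5 * (m - 1.0) ** 2

model_samples = metropolis_hastings(toy_log_density, start=0.0, n_samples=5000)
sample_mean = sum(model_samples) / len(model_samples)
print(sample_mean)
```

Successive samples from a chain like this are correlated, so it generally takes more of them than independent draws would to get the same accuracy.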

With this sample set we can approximate any summation over all models, weighted by P(M_i), by a plain average over just this set, since drawing the samples from the model distribution already accounts for the weights. And all the quantities we need to compute are of this form.
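
Here is a sketch of what that looks like for the surprise metric itself. Both terms become averages over the sampled models. The function name `estimate_surprise` and the toy setup are assumptions: a Gaussian likelihood with unknown mean, and independent draws from a standard normal standing in for the output of whatever sampler is actually used.

```python
import math
import random

def estimate_surprise(model_samples, likelihood):
    """Monte Carlo estimate of log P(D) - E[log P(D|M)], in nats.

    model_samples are draws from the current model distribution.
    likelihood(m) returns P(D|M=m) for the datum of interest and
    is assumed to be strictly positive.
    """
    liks = [likelihood(m) for m in model_samples]
    n = len(liks)
    # P(D) is approximated by the average likelihood over the samples.
    log_p_d = math.log(sum(liks) / n)
    # The expected log-likelihood is approximated by the average log-likelihood.
    mean_log_lik = sum(math.log(l) for l in liks) / n
    return log_p_d - mean_log_lik

def gaussian_likelihood(datum):
    """Return m -> density of datum under a unit-variance Gaussian with mean m."""
    def lik(m):
        return math.exp(-0.5 * (datum - m) ** 2) / math.sqrt(2.0 * math.pi)
    return lik

# Stand-in for sampler output: independent draws from a standard normal.
rng = random.Random(1)
prior_samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]

expected_datum = estimate_surprise(prior_samples, gaussian_likelihood(0.0))
unusual_datum = estimate_surprise(prior_samples, gaussian_likelihood(3.0))
print(expected_datum, unusual_datum)
```

Because the log of an average is never less than the average of the logs, this estimate is never negative (apart from rounding), which matches what you would expect of a KL-divergence.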