- Introduction
- Simple estimators
- Gaussian_estimate
- Discrete_estimate
- Multivariate_gaussian_estimate
- Regression_estimate
- Hidden_factor_estimate

- Structured estimators

- Writing your own estimator

PyMML is a Python implementation of several of the MML87 estimators described in the book "Statistical and Inductive Inference by Minimum Message Length".

It may be loaded using:

>>> import mml

PyMML uses `numarray` and, for some estimators, `scipy`. It will be assumed in examples that you have imported `numarray`.

>>> from numarray import *

Nothing amazing here. `Gaussian_estimate` estimates the parameters of a normal distribution. The estimates produced are identical to those of classical statistics.

If priors are supplied, the estimator can also produce a length. This allows it to be plugged into the mixture estimator (or other such structured estimators).

**Examples:**

>>> data = array([ 1.0, 2.0, 3.0, 4.0, 5.0]) >>> estimate = mml.Gaussian_estimate(data) >>> estimate Gaussian_estimate: mean=3.00 sigma=1.58

The `mean` and `sigma` (standard deviation) estimates can be accessed like so:

>>> estimate.mean 3.0 >>> estimate.sigma 1.5811388300841898

If a length is needed, priors must be specified:

>>> estimate = mml.Gaussian_estimate(data, (0.0, 100.0), 0.1, 100.0) >>> estimate Gaussian_estimate: 26.94 nits mean=3.00 sigma=1.58 >>> estimate.length() 26.937871819253928

Here `(0.0, 100.0)` are the minimum and maximum expected values of the mean, `0.1` is the quantum of measurement, and `100.0` is the maximum expected standard deviation. The quantum of measurement defines how accurately data points need to be stored in the MML message, and also gives a lower bound on the expected standard deviation.

The unit used here, "nits", is a pun on "bits", being the number of base *e* digits required to encode some information. This is of course very silly, as there is no such thing as a base *e* digit. It's probably best seen as the negative natural logarithm of the likelihood of the message. The reason nits are used is because the maths is easier than working in bits, and working in raw probabilities quickly leads to floating point underflows.

`Discrete_estimate` takes as input a list of discrete values and gives an estimate of the probability of each value. It is based on the multinomial estimate described in the MML book, but assumes the order of values is important. This affects the message length, but not the estimated probabilities.

The form of prior used for this estimator is a "conjugate prior". That is, it may be viewed as the estimator having already seen a number of data points. Such priors tend to be reasonable, and have simple mathematical form... though to be precise, it is a slight generalization on a conjugate prior, in that we allow non-integer numbers of "prior observations". Such slightly generalized conjugate priors are common amongst the MML estimators described in the MML book.

Assuming one has seen one of each possible type of datum previously gives a uniform prior, and this is used as the default.

Note: if a prior other than the default is required, `scipy` must be installed.

**Examples:**

>>> data = array([ 1, 2, 1, 1, 0, 3, 3]) >>> mml.Discrete_estimate(data, 4) Discrete_estimate: 16.61 nits probability=[ 0.17 0.39 0.17 0.28]

Here, `4` is the number of types. Each type in the data is represented by an integer in `range(<number of types>)`.

A different prior can be specified:

>>> prior = array([ 1.0, 0.1, 1.0, 40.0 ]) >>> mml.Discrete_estimate(data, 4, prior) Discrete_estimate: 25.90 nits probability=[ 0.03 0.06 0.03 0.88]

Here `prior` specifies the prior. The values need not be integer, merely positive, though the interpretation of the prior is easier to understand if the values are integer. In the example, it is as though we have seen 1 of type 0, `0.1' of type 1, 1 of type 2, and 40 of type 3. Thus, despite the evidence of `data`, the estimator judges type 3 to be rather more common than the others, based on the prior expectation. However, while the estimator might have found the presence of type 1 items surprising, it has quickly learned that type 1 is not as rare as expected.

Note also that this estimate has a longer length than the previous one, as the data is more surprising given this prior.

`Multivariate_gaussian_estimate` gives an estimate of the mean and covariance of multi-dimensional data. Unlike `Gaussian_estimate`, this estimate uses a conjugate prior and does not correspond to classical statistical estimates.

The prior for this estimator has five parameters:

`quantum` - The measurement accuracy. This is a *volume*, and may be obtained for example by multiplying together the measurement accuracies of each individual dimension.

`m` - The number of "data points" defining the prior on the variance. Must be greater than one, and if a length estimate is required must be greater than the number of dimensions of the data. Does not need to be an integer. (A value between one and the number of dimensions in the data yields an "improper" prior.)

`V` - The total variance of the "data points" defining the prior variance. That is, `m` times the expected variance.

`m1` - The number of "data points" defining the prior on the mean. Must be positive. Does not need to be an integer.

`u0` - The prior expectation for the mean.

If only `quantum` is given, the prior expectation on the mean defaults to zero, with `m1 = 1`, and the prior expectation on the variance defaults to a spherically symmetric distribution of volume `quantum`, with `m` set to the number of dimensions in the data plus one.

Note: `scipy` must be installed to use this estimator.

**Examples:**

>>> data = array([ [1.0, 1.0], [3.0,2.5], [0.0,0.0], [7.0,7.5] ]) >>> mml.Multivariate_gaussian_estimate(data, 0.1 ** data.shape[1]) Multivariate_gaussian_estimate: 25.84 nits mean=[ 2.2 2.2] covariance=[[ 4.98 5.26] [ 5.26 5.62]]

Here, `0.1` is the measurement accuracy.

`Regression_estimate` performs linear regression, producing a linear model of some data in terms of some other given data.

A conjugate prior is used for the weights of the linear model. The effect of this is to slightly reduce the magnitude of these weights as compared to the result of classical least-squares fitting. This is not an essential feature of MML regression, if a uniform prior were used the MML estimate would be identical to classical least-squares fitting. However one would then need to define bounds on this prior, and it is not clear what a good choice of bounds would be. The conjugate form leads to a fairly natural choice of prior, and this allows more confident comparison of regression estimates based on different sets of data -- which is the main reason you would want to use MML regression.

The given data must be normalized to have mean `0.0` and variance `1.0`. If necessary, a constant term can be found by inserting a column of `1.0`s into the given_data.

The prior has one parameter, `m`, defining the expected ratio of the variance of the model to the variance of the residual. The choice of `m` is important when the size of the number of data points available is of roughly the same magnitude as the number of dimensions in the given data, but will have little effect when the data is more plentiful. The default, `m=1.0`, gives a prior expectation that the variance of the model weights will be equal to the variance of the residual. Larger `m` gives an expectation of the model weights being larger than the residual, smaller `m` gives an expectation of model weights smaller than the residual.

**Examples:**

>>> given_data = array([ [1.0, 0.0], [1.0, 1.0], [0.0,-3.0], [-2.0, 1.0], [0.5, 0.0] ]) >>> data = array([ 1.0, 0.0, 3.0, -3.0, 0.5 ])

The given data must be normalized:

>>> given_data -= sum(given_data) / given_data.shape[0] >>> given_data /= sqrt(sum(given_data*given_data) / given_data.shape[0]) >>> estimate = mml.Regression_estimate(data, given_data) >>> estimate Regression_estimate: sigma=0.82 model=[ 0.95 -1.24] >>> dot(given_data, estimate.model) array([ 0.59830942, -0.24528408, 2.27683648, -2.8020445 , 0.17218269])

Here, "sigma" is the standard deviation of residual (`data - dot(given_data, estimate.model)`).

If a length is needed, priors must be given:

>>> mml.Regression_estimate(data, given_data, 1.0, 0.1, 10.0) Regression_estimate: 26.46 nits sigma=0.82 model=[ 0.95 -1.24]

Here, `1.0` is the `m` value, `0.1` is the quantum of measurement, and `10.0` is the maximum expected sigma.

`Hidden_factor_estimate` is particularly nifty. This estimate is a case in which MML leaves other estimation techniques for dead.

The idea here is somewhat similar to the `Multivariate_gaussian_estimate`, in that we are fitting a multivariate Gaussian distribution to some multi-dimensional data. The difference is that the hidden factor estimator assumes the covariance is of a special form. It is assumed that, to the extent that different dimensions are correlated, they are correlated due to the presence of a single hidden factor. Estimates are found for the variance of each separate dimension, the direction vector of the hidden factor, and the amount of the hidden factor present in each datum.

It is possible that a hidden factor will not be found in the data. A single bit is used in the message to convey whether a hidden factor is present or not. (Note: this extra bit is not part of the length equations in the MML book. Also note that the use of a whole bit is a slight over-estimate of the length required to transmit this information. A true MML estimate would use some mind-bendingly clever coding trick to transmit that bit imprecisely. But hey, it's *one bit*, it's not a huge deal.)

Note: `scipy` must be installed to use this estimator.

**Examples:**

>>> data = array([ [-2.0,-1.0], [5.0,2.6], [10.1, 5.0], [-3.0, -1.4], [1.0, 0.6], [-8.0,-4.1] ]) >>> mml.Hidden_factor_estimate(data) Hidden_factor_estimate: mean=[ 0.52 0.28] sigma=[ 0.16 0.08] has_factor=True factor_loads=[ 6.38 3.21] factor_scores=[-0.4 0.71 1.49 -0.54 0.09 -1.35]

Here, `factor_loads` gives the hidden factor and `factor_scores` gives the amount of that factor estimated to be present in each datum. `sigma` is the standard deviation of each dimension after the hidden factor is removed.

If a length is needed, priors must be specified:

>>> mean_range = array([ [-30.0,30.0], [-30.0,30.0] ]) >>> quantum = array([ 0.01, 0.01 ]) >>> max_sigma = array([ 30.0, 30.0 ]) >>> mml.Hidden_factor_estimate(data, mean_range, quantum, max_sigma) Hidden_factor_estimate: 88.78 nits mean=[ 0.52 0.28] sigma=[ 0.16 0.08] has_factor=True factor_loads=[ 6.38 3.21] factor_scores=[-0.4 0.71 1.49 -0.54 0.09 -1.35]

Here, `mean_range` gives lower and upper bounds on the value of the mean in each dimension, `quantum` gives the quantum of measurement in each dimension, and `max_sigma` gives the maximum expected standard deviation in each dimension (after removal of the hidden factor).

`Composite_estimate` simply concatenates several estimates into the one message. Its main use is in conjunction with `Mixture_estimate`.

Data passed to this estimate should be given in the form of a `numarray` record array.

**Examples:**

>>> from numarray import records >>> data = records.array([ [0, 0.1], [0, -0.2], [0, 0.3], [0, -0.5], [0, 0.0], ... [1, 10.0], [1, 10.1], [1, 9.9], [1, 10.2], [1, 11.0] ]) >>> field_priors = [ (mml.Discrete_estimate, 2), (mml.Gaussian_estimate, (-30.0,30.0), 0.1, 30.0) ] >>> mml.Composite_estimate(data, field_priors) Composite_estimate: 68.39 nits Discrete_estimate: 9.24 nits probability=[ 0.5 0.5] Gaussian_estimate: 58.95 nits mean=5.09 sigma=5.44

Here, `field_priors` is a list of priors for each field in the record array. Individual priors are given by a tuple containing the class of estimator to use for that field followed by the parameters of the prior to be passed to that estimator.

`Mixture_estimate` performs mixture modelling. Mixture modelling was one of the first applications of MML, and is an application in which MML shines. A mixture model treats data as samples from a mixture of classes, each class having different parameters describing the distribution of values. The advantage of MML here is that it can use length calculations to determine the number of classes that can best be used to describe the data.

The model (and associated priors) used to describe the classes must be specified. This model can be any MML estimator. `Composite_estimate`-type classes may be used to perform Snob
style data clustering. `Hidden_factor_estimate`-type classes may be used to perform Factor-Snob
style clustering of multivariate data, although without Factor-Snob's ability to generate hierarchical taxonomies (note: this has not been optimized, and may be slow). The `Multivariate_gaussian_estimate` may also be used for multivariate data. A single-dimensional class model may be used to analyze single-dimensional data that is multi-modal.

(Note: use of `Regression_estimate` in a mixture model would result in a somewhat more robust linear regression, but would not be able to cluster based on `given_data`, only on `data`, and therefore would *not* give a non-linear model.)

The output of `Mixture_estimate` is a list of classes, the relative abundance of each class, and for each datum the likelihood of that datum being an instance of each identified class.

The results of this estimator are arrived at by iterative refinement. By default, 1000 iterations are used, but it is conceivable that with certain data the estimator will not arrive at the optimum estimate within this many iterations. Note that a slightly different optimization algorithm is used here than that used in Snob, and that this different algorithm has not been thoroughly tested.

**Examples:**

Using the same `data` and `field_priors` as in the example for `Composite_estimate` above:

>>> prior = (mml.Composite_estimate, field_priors) >>> estimate = mml.Mixture_estimate(data, prior) >>> estimate Mixture_estimate: 59.34 nits 50% Composite_estimate: 25.50 nits Discrete_estimate: 4.17 nits probability=[ 0.08 0.92] Gaussian_estimate: 21.12 nits mean=10.24 sigma=0.44 50% Composite_estimate: 24.04 nits Discrete_estimate: 4.17 nits probability=[ 0.92 0.08] Gaussian_estimate: 19.66 nits mean=-0.06 sigma=0.31

(This may take a little while to complete.)

Note how `Mixture_estimate` has arrived at a reasonable estimate as to the correct number of classes, as well as the parameters of each class.

`estimate.assignments` gives the likelihood of each datum belonging to each identified class. In this case the assignments are pretty definite (this will not always be so):

>>> estimate.assignments array([[ 7.11684102e-117, 1.00000000e+000], [ 8.55931957e-124, 1.00000000e+000], [ 3.86505552e-112, 1.00000000e+000], [ 1.70426389e-130, 1.00000000e+000], [ 3.32157425e-119, 1.00000000e+000], [ 1.00000000e+000, 7.61183851e-238], [ 1.00000000e+000, 1.31164746e-242], [ 1.00000000e+000, 4.17671382e-233], [ 1.00000000e+000, 2.13706329e-247], [ 1.00000000e+000, 1.41265790e-286]]) >>> argmax(estimate.assignments) array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

Further iterative refinement of the model can be performed, if desired:

>>> estimate = mml.Mixture_estimate(data, prior, 100, estimate.assignments)

Here, `100` is the number of further iterations to perform.

If the dimensions of the data are correlated within classes, a `Composite_estimate` based mixture will contain spurious classes strung like dew-drops along these correlations. When this happens, it's time to break out the heavy artillery:

>>> data = array([[9.42,9.20], [1.47,1.15], [5.45,5.85], [7.09,7.04], [9.15,9.39], ... [30.26,0.53 ], [30.84,0.82], [30.21,0.32], [30.53,0.52], [30.78,0.45]]) >>> mml.Mixture_estimate(data, (mml.Multivariate_gaussian_estimate, 0.01**data.shape[1])) Mixture_estimate: 103.08 nits 22% Multivariate_gaussian_estimate: 7.63 nits mean=[ 5.5 5.41] covariance=[[ 9.63 9.44] [ 9.44 9.26]] 30% Multivariate_gaussian_estimate: 26.45 nits mean=[ 4.02 4.1 ] covariance=[[ 8.5 8.94] [ 8.94 9.43]] 48% Multivariate_gaussian_estimate: 58.10 nits mean=[ 25.44 0.44] covariance=[[ 97.1 1.7 ] [ 1.7 0.05]] >>> mml.Mixture_estimate(data, ( mml.Hidden_factor_estimate, ... array([[-100.0,100.0],[-100.0,100.0]]), array([0.01,0.01]), array([100.0,100.0]) )) Mixture_estimate: 151.95 nits 50% Hidden_factor_estimate: 64.15 nits mean=[ 30.52 0.53] sigma=[ 0.29 0.18] has_factor=False factor_loads=[ 0. 0.] 50% Hidden_factor_estimate: 79.76 nits mean=[ 6.52 6.53] sigma=[ 0.28 0.29] has_factor=True factor_loads=[ 3.24 3.34] factor_scores=[ 0.84 -1.57 -0.26 0.16 0.83 2.75 2.88 2.71 2.79 2.82]

The bold may wish to create their own MML estimate classes. These may then, for example, be plugged into the mixture estimator to perform novel forms of clustering.

Let us assume you have already checked that your model does not break any of the MML87 assumptions, and performed the mathematical gymnastics required to pose reasonable priors and derive the Fisher information. (It is common also to find formulae for parameters that will minimize the MML87 length formula. If this is difficult, `scipy.optimize.fmin` or similar may be used to minimize the message length.)

To create a new estimator, you need to create a new sub-class of `Estimate`. `Estimate` is the base class of all PyMML estimators. Your new class will need to provide a number of terms used in the MML87 length formula (which is implemented in the method `Estimate.length`). To be precise, your class must define methods called `neglog_prior`, `neglog_fisher`, `neglog_data`, and `dimensions`.

The `__init__` method of your new class should perform the actual parameter estimation. The parameter estimates performed in `__init__` should minimize the quantity calculated by `Estimate.length`. (The function `Estimate.test` may be used to check that there is no trivial alteration to parameter values that will reduce the message length.)

The parameters `__init__` accepts should be: `data` (an array of data points), any parameters for priors, and `weighting` (an array of floats, of the same length as `data`). Your `__init__` method should first call `Estimate.__init__(self, data, weighting)`. It should then perform the parameter estimation.

The `weighting` parameter is needed in order to support partial assignment when performing mixture modelling. Its effect can be seen in `Estimate.total_neglog_data`. Parameter estimation in `__init__` should take this into account. That is, the parameters should minimize the *expected* message length, where the probability of each datum actually occurring is given by its corresponding weighting.

A list giving the name of each estimated parameter should be placed in a class attribute called `parameters`. This is used by `Estimate.__repr__`.

Note: any of the existing "simple" PyMML estimators will provide a good template to work from.