Paul Harrison

# Introduction

PyMML is a Python implementation of several of the MML87 estimators described in the book "Statistical and Inductive Inference by Minimum Message Length".

It may be loaded using:

```>>> import mml
```

PyMML uses numarray and, for some estimators, scipy. It will be assumed in examples that you have imported numarray.

```>>> from numarray import *
```

# Simple estimators

## Gaussian_estimate

Nothing amazing here. Gaussian_estimate estimates the parameters of a normal distribution. The estimates produced are identical to those of classical statistics.

If priors are supplied, the estimator can also produce a length. This allows it to be plugged into the mixture estimator (or other such structured estimators).

Examples:

```>>> data = array([ 1.0, 2.0, 3.0, 4.0, 5.0])
>>> estimate = mml.Gaussian_estimate(data)
>>> estimate
Gaussian_estimate:
mean=3.00
sigma=1.58
```

The mean and sigma (standard deviation) estimates can be accessed like so:

```>>> estimate.mean
3.0
>>> estimate.sigma
1.5811388300841898
```

If a length is needed, priors must be specified:

```>>> estimate = mml.Gaussian_estimate(data, (0.0, 100.0), 0.1, 100.0)
>>> estimate
Gaussian_estimate: 26.94 nits
mean=3.00
sigma=1.58
>>> estimate.length()
26.937871819253928
```

Here (0.0, 100.0) are the minimum and maximum expected values of the mean, 0.1 is the quantum of measurement, and 100.0 is the maximum expected standard deviation. The quantum of measurement defines how accurately data points need to be stored in the MML message, and also gives a lower bound on the expected standard deviation.

The unit used here, "nits", is a pun on "bits", being the number of base e digits required to encode some information. This is of course very silly, as there is no such thing as a base e digit. It's probably best seen as the negative natural logarithm of the likelihood of the message. The reason nits are used is because the maths is easier than working in bits, and working in raw probabilities quickly leads to floating point underflows.

## Discrete_estimate

Discrete_estimate takes as input a list of discrete values and gives an estimate of the probability of each value. It is based on the multinomial estimate described in the MML book, but assumes the order of values is important. This affects the message length, but not the estimated probabilities.

The form of prior used for this estimator is a "conjugate prior". That is, it may be viewed as the estimator having already seen a number of data points. Such priors tend to be reasonable, and have simple mathematical form... though to be precise, it is a slight generalization on a conjugate prior, in that we allow non-integer numbers of "prior observations". Such slightly generalized conjugate priors are common amongst the MML estimators described in the MML book.

Assuming one has seen one of each possible type of datum previously gives a uniform prior, and this is used as the default.

Note: if a prior other than the default is required, scipy must be installed.

Examples:

```>>> data = array([ 1, 2, 1, 1, 0, 3, 3])
>>> mml.Discrete_estimate(data, 4)
Discrete_estimate: 16.61 nits
probability=[ 0.17  0.39  0.17  0.28]
```

Here, 4 is the number of types. Each type in the data is represented by an integer in range(<number of types>).

A different prior can be specified:

```>>> prior = array([ 1.0, 0.1, 1.0, 40.0 ])
>>> mml.Discrete_estimate(data, 4, prior)
Discrete_estimate: 25.90 nits
probability=[ 0.03  0.06  0.03  0.88]
```

Here prior specifies the prior. The values need not be integer, merely positive, though the interpretation of the prior is easier to understand if the values are integer. In the example, it is as though we have seen 1 of type 0, `0.1' of type 1, 1 of type 2, and 40 of type 3. Thus, despite the evidence of data, the estimator judges type 3 to be rather more common than the others, based on the prior expectation. However, while the estimator might have found the presence of type 1 items surprising, it has quickly learned that type 1 is not as rare as expected.

Note also that this estimate has a longer length than the previous one, as the data is more surprising given this prior.

## Multivariate_gaussian_estimate

Multivariate_gaussian_estimate gives an estimate of the mean and covariance of multi-dimensional data. Unlike Gaussian_estimate, this estimate uses a conjugate prior and does not correspond to classical statistical estimates.

The prior for this estimator has five parameters:

quantum - The measurement accuracy. This is a volume, and may be obtained for example by multiplying together the measurement accuracies of each individual dimension.

m - The number of "data points" defining the prior on the variance. Must be greater than one, and if a length estimate is required must be greater than the number of dimensions of the data. Does not need to be an integer. (A value between one and the number of dimensions in the data yields an "improper" prior.)

V - The total variance of the "data points" defining the prior variance. That is, m times the expected variance.

m1 - The number of "data points" defining the prior on the mean. Must be positive. Does not need to be an integer.

u0 - The prior expectation for the mean.

If only quantum is given, the prior expectation on the mean defaults to zero, with m1 = 1, and the prior expectation on the variance defaults to a spherically symmetric distribution of volume quantum, with m set to the number of dimensions in the data plus one.

Note: scipy must be installed to use this estimator.

Examples:

```>>> data = array([ [1.0, 1.0], [3.0,2.5], [0.0,0.0], [7.0,7.5] ])
>>> mml.Multivariate_gaussian_estimate(data, 0.1 ** data.shape[1])
Multivariate_gaussian_estimate: 25.84 nits
mean=[ 2.2  2.2]
covariance=[[ 4.98  5.26]
[ 5.26  5.62]]
```

Here, 0.1 is the measurement accuracy.

## Regression_estimate

Regression_estimate performs linear regression, producing a linear model of some data in terms of some other given data.

A conjugate prior is used for the weights of the linear model. The effect of this is to slightly reduce the magnitude of these weights as compared to the result of classical least-squares fitting. This is not an essential feature of MML regression, if a uniform prior were used the MML estimate would be identical to classical least-squares fitting. However one would then need to define bounds on this prior, and it is not clear what a good choice of bounds would be. The conjugate form leads to a fairly natural choice of prior, and this allows more confident comparison of regression estimates based on different sets of data -- which is the main reason you would want to use MML regression.

The given data must be normalized to have mean 0.0 and variance 1.0. If necessary, a constant term can be found by inserting a column of 1.0s into the given_data.

The prior has one parameter, m, defining the expected ratio of the variance of the model to the variance of the residual. The choice of m is important when the size of the number of data points available is of roughly the same magnitude as the number of dimensions in the given data, but will have little effect when the data is more plentiful. The default, m=1.0, gives a prior expectation that the variance of the model weights will be equal to the variance of the residual. Larger m gives an expectation of the model weights being larger than the residual, smaller m gives an expectation of model weights smaller than the residual.

Examples:

```>>> given_data = array([ [1.0, 0.0], [1.0, 1.0], [0.0,-3.0], [-2.0, 1.0], [0.5, 0.0] ])
>>> data       = array([  1.0,        0.0,        3.0,        -3.0,        0.5       ])
```

The given data must be normalized:

```>>> given_data -= sum(given_data) / given_data.shape[0]
>>> given_data /= sqrt(sum(given_data*given_data) / given_data.shape[0])

>>> estimate = mml.Regression_estimate(data, given_data)
>>> estimate
Regression_estimate:
sigma=0.82
model=[ 0.95 -1.24]

>>> dot(given_data, estimate.model)
array([ 0.59830942, -0.24528408,  2.27683648, -2.8020445 ,  0.17218269])
```

Here, "sigma" is the standard deviation of residual (data - dot(given_data, estimate.model)).

If a length is needed, priors must be given:

```>>> mml.Regression_estimate(data, given_data, 1.0, 0.1, 10.0)
Regression_estimate: 26.46 nits
sigma=0.82
model=[ 0.95 -1.24]
```

Here, 1.0 is the m value, 0.1 is the quantum of measurement, and 10.0 is the maximum expected sigma.

## Hidden_factor_estimate

Hidden_factor_estimate is particularly nifty. This estimate is a case in which MML leaves other estimation techniques for dead.

The idea here is somewhat similar to the Multivariate_gaussian_estimate, in that we are fitting a multivariate Gaussian distribution to some multi-dimensional data. The difference is that the hidden factor estimator assumes the covariance is of a special form. It is assumed that, to the extent that different dimensions are correlated, they are correlated due to the presence of a single hidden factor. Estimates are found for the variance of each separate dimension, the direction vector of the hidden factor, and the amount of the hidden factor present in each datum.

It is possible that a hidden factor will not be found in the data. A single bit is used in the message to convey whether a hidden factor is present or not. (Note: this extra bit is not part of the length equations in the MML book. Also note that the use of a whole bit is a slight over-estimate of the length required to transmit this information. A true MML estimate would use some mind-bendingly clever coding trick to transmit that bit imprecisely. But hey, it's one bit, it's not a huge deal.)

Note: scipy must be installed to use this estimator.

Examples:

```>>> data = array([ [-2.0,-1.0], [5.0,2.6], [10.1, 5.0], [-3.0, -1.4], [1.0, 0.6], [-8.0,-4.1] ])
>>> mml.Hidden_factor_estimate(data)
Hidden_factor_estimate:
mean=[ 0.52  0.28]
sigma=[ 0.16  0.08]
has_factor=True
factor_loads=[ 6.38  3.21]
factor_scores=[-0.4   0.71  1.49 -0.54  0.09 -1.35]
```

Here, factor_loads gives the hidden factor and factor_scores gives the amount of that factor estimated to be present in each datum. sigma is the standard deviation of each dimension after the hidden factor is removed.

If a length is needed, priors must be specified:

```>>> mean_range = array([ [-30.0,30.0], [-30.0,30.0] ])
>>> quantum = array([ 0.01, 0.01 ])
>>> max_sigma = array([ 30.0, 30.0 ])
>>> mml.Hidden_factor_estimate(data, mean_range, quantum, max_sigma)
Hidden_factor_estimate: 88.78 nits
mean=[ 0.52  0.28]
sigma=[ 0.16  0.08]
has_factor=True
factor_loads=[ 6.38  3.21]
factor_scores=[-0.4   0.71  1.49 -0.54  0.09 -1.35]
```

Here, mean_range gives lower and upper bounds on the value of the mean in each dimension, quantum gives the quantum of measurement in each dimension, and max_sigma gives the maximum expected standard deviation in each dimension (after removal of the hidden factor).

# Structured estimators

## Composite_estimate

Composite_estimate simply concatenates several estimates into the one message. Its main use is in conjunction with Mixture_estimate.

Data passed to this estimate should be given in the form of a numarray record array.

Examples:

```>>> from numarray import records
>>> data = records.array([ [0, 0.1], [0, -0.2], [0, 0.3], [0, -0.5], [0, 0.0],
...                        [1, 10.0], [1, 10.1], [1, 9.9], [1, 10.2], [1, 11.0] ])
>>> field_priors = [ (mml.Discrete_estimate, 2), (mml.Gaussian_estimate, (-30.0,30.0), 0.1, 30.0) ]
>>> mml.Composite_estimate(data, field_priors)
Composite_estimate: 68.39 nits
Discrete_estimate: 9.24 nits
probability=[ 0.5  0.5]
Gaussian_estimate: 58.95 nits
mean=5.09
sigma=5.44
```

Here, field_priors is a list of priors for each field in the record array. Individual priors are given by a tuple containing the class of estimator to use for that field followed by the parameters of the prior to be passed to that estimator.

## Mixture_estimate

Mixture_estimate performs mixture modelling. Mixture modelling was one of the first applications of MML, and is an application in which MML shines. A mixture model treats data as samples from a mixture of classes, each class having different parameters describing the distribution of values. The advantage of MML here is that it can use length calculations to determine the number of classes that can best be used to describe the data.

The model (and associated priors) used to describe the classes must be specified. This model can be any MML estimator. Composite_estimate-type classes may be used to perform Snob style data clustering. Hidden_factor_estimate-type classes may be used to perform Factor-Snob style clustering of multivariate data, although without Factor-Snob's ability to generate hierarchical taxonomies (note: this has not been optimized, and may be slow). The Multivariate_gaussian_estimate may also be used for multivariate data. A single-dimensional class model may be used to analyze single-dimensional data that is multi-modal.

(Note: use of Regression_estimate in a mixture model would result in a somewhat more robust linear regression, but would not be able to cluster based on given_data, only on data, and therefore would not give a non-linear model.)

The output of Mixture_estimate is a list of classes, the relative abundance of each class, and for each datum the likelihood of that datum being an instance of each identified class.

The results of this estimator are arrived at by iterative refinement. By default, 1000 iterations are used, but it is conceivable that with certain data the estimator will not arrive at the optimum estimate within this many iterations. Note that a slightly different optimization algorithm is used here than that used in Snob, and that this different algorithm has not been thoroughly tested.

Examples:

Using the same data and field_priors as in the example for Composite_estimate above:

```>>> prior = (mml.Composite_estimate, field_priors)
>>> estimate = mml.Mixture_estimate(data, prior)
>>> estimate
Mixture_estimate: 59.34 nits
50% Composite_estimate: 25.50 nits
Discrete_estimate: 4.17 nits
probability=[ 0.08  0.92]
Gaussian_estimate: 21.12 nits
mean=10.24
sigma=0.44
50% Composite_estimate: 24.04 nits
Discrete_estimate: 4.17 nits
probability=[ 0.92  0.08]
Gaussian_estimate: 19.66 nits
mean=-0.06
sigma=0.31
```

(This may take a little while to complete.)

Note how Mixture_estimate has arrived at a reasonable estimate as to the correct number of classes, as well as the parameters of each class.

estimate.assignments gives the likelihood of each datum belonging to each identified class. In this case the assignments are pretty definite (this will not always be so):

```
>>> estimate.assignments
array([[  7.11684102e-117,   1.00000000e+000],
[  8.55931957e-124,   1.00000000e+000],
[  3.86505552e-112,   1.00000000e+000],
[  1.70426389e-130,   1.00000000e+000],
[  3.32157425e-119,   1.00000000e+000],
[  1.00000000e+000,   7.61183851e-238],
[  1.00000000e+000,   1.31164746e-242],
[  1.00000000e+000,   4.17671382e-233],
[  1.00000000e+000,   2.13706329e-247],
[  1.00000000e+000,   1.41265790e-286]])
>>> argmax(estimate.assignments)
array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
```

Further iterative refinement of the model can be performed, if desired:

```>>> estimate = mml.Mixture_estimate(data, prior, 100, estimate.assignments)
```

Here, 100 is the number of further iterations to perform.

If the dimensions of the data are correlated within classes, a Composite_estimate based mixture will contain spurious classes strung like dew-drops along these correlations. When this happens, it's time to break out the heavy artillery:

```>>> data = array([[9.42,9.20], [1.47,1.15], [5.45,5.85], [7.09,7.04], [9.15,9.39],
...               [30.26,0.53 ], [30.84,0.82], [30.21,0.32], [30.53,0.52], [30.78,0.45]])
>>> mml.Mixture_estimate(data, (mml.Multivariate_gaussian_estimate, 0.01**data.shape[1]))
Mixture_estimate: 103.08 nits
22% Multivariate_gaussian_estimate: 7.63 nits
mean=[ 5.5   5.41]
covariance=[[ 9.63  9.44]
[ 9.44  9.26]]
30% Multivariate_gaussian_estimate: 26.45 nits
mean=[ 4.02  4.1 ]
covariance=[[ 8.5   8.94]
[ 8.94  9.43]]
48% Multivariate_gaussian_estimate: 58.10 nits
mean=[ 25.44   0.44]
covariance=[[ 97.1    1.7 ]
[  1.7    0.05]]

>>> mml.Mixture_estimate(data, ( mml.Hidden_factor_estimate,
...   array([[-100.0,100.0],[-100.0,100.0]]), array([0.01,0.01]), array([100.0,100.0]) ))
Mixture_estimate: 151.95 nits
50% Hidden_factor_estimate: 64.15 nits
mean=[ 30.52   0.53]
sigma=[ 0.29  0.18]
has_factor=False
factor_loads=[ 0.  0.]
50% Hidden_factor_estimate: 79.76 nits
mean=[ 6.52  6.53]
sigma=[ 0.28  0.29]
has_factor=True
factor_loads=[ 3.24  3.34]
factor_scores=[ 0.84 -1.57 -0.26  0.16  0.83  2.75  2.88  2.71  2.79  2.82]
```

# Writing your own estimator

The bold may wish to create their own MML estimate classes. These may then, for example, be plugged into the mixture estimator to perform novel forms of clustering.

Let us assume you have already checked that your model does not break any of the MML87 assumptions, and performed the mathematical gymnastics required to pose reasonable priors and derive the Fisher information. (It is common also to find formulae for parameters that will minimize the MML87 length formula. If this is difficult, scipy.optimize.fmin or similar may be used to minimize the message length.)

To create a new estimator, you need to create a new sub-class of Estimate. Estimate is the base class of all PyMML estimators. Your new class will need to provide a number of terms used in the MML87 length formula (which is implemented in the method Estimate.length). To be precise, your class must define methods called neglog_prior, neglog_fisher, neglog_data, and dimensions.

The __init__ method of your new class should perform the actual parameter estimation. The parameter estimates performed in __init__ should minimize the quantity calculated by Estimate.length. (The function Estimate.test may be used to check that there is no trivial alteration to parameter values that will reduce the message length.)

The parameters __init__ accepts should be: data (an array of data points), any parameters for priors, and weighting (an array of floats, of the same length as data). Your __init__ method should first call Estimate.__init__(self, data, weighting). It should then perform the parameter estimation.

The weighting parameter is needed in order to support partial assignment when performing mixture modelling. Its effect can be seen in Estimate.total_neglog_data. Parameter estimation in __init__ should take this into account. That is, the parameters should minimize the expected message length, where the probability of each datum actually occurring is given by its corresponding weighting.

A list giving the name of each estimated parameter should be placed in a class attribute called parameters. This is used by Estimate.__repr__.

Note: any of the existing "simple" PyMML estimators will provide a good template to work from.