Infer.NET user guide : Learners : Bayes Point Machine classifiers

The Bayes Point Machine: A probabilistic model for classification

We are interested in finding a general and robust procedure to predict the class y to which an instance of interest belongs, given relevant feature values x of this instance and the information in D={(y_i,x_i ), i=1,...,N}, the training set of observed class labels and corresponding feature values.

It can be shown that the only coherent solution to the classification problem is to provide the predictive distribution p(y=c│x,D),  c=1,...,K} over classes, conditional on x and D. In the following we provide a direct specification of the predictive distribution p(y│x,D) based on a set of assumptions. The derived classification model thus falls into the discriminative paradigm.

Binary classification

For simplicity, we will start by describing the model for the case of two class labels y in {-1,1}. The model is defined by the following assumptions:

Assumption 1: The feature values x are always fully observed.

This means that it must always be possible to compute all feature values. If features are missing for some instances, you will either have to fill in the missing values in a pre-processing step, or use a different model for classification.

Assumption 2: The order of instances does not matter.

That is, instances are assumed to be exchangeable. As a result of this assumption, the predictive distribution p(y|x,w) is characterized, for any given x, by the same latent parameters w, also referred to as weights.

If the predictive distribution varies with time, for example in the stock market, then you will need to expand x to include time information or use a different model for classification (such as a time-series model).

Assumption 3: The predictive distribution is a linear discriminant function.

We assume the predictive distribution to have a parametric density of the form

p(y|x,w) = p(y|s = w*x)

where s is called the score of the instance.

Assumption 4: Additive Gaussian noise.

To allow for measurement and labelling error, we add Gaussian noise to the score:

p(y│s,e) = 1(y*s+ e greater 0) with p(e) = N(e|0,1) and Heavidside step function H

This is also known as a probit regression model.

The model up to this point, represented as a factor graph, is depicted below:

Factor graph of a binary BPM with factorized weights

Assumption 5: Factorized heavy-tailed prior distributions over weights w.

Finally, by specifying the prior distribution p(w), we complete the probabilistic model for the Bayes Point Machine classifier. We use a heavy-tailed prior, a normal distribution whose precision is itself a mixture of Gamma-Gamma distributions  (as illustrated in the factor graph below, where Ga denotes a Gamma density for given shape and rate). Compared with a Gaussian prior, the heavy-tailed prior is more robust towards outliers, i.e. feature values which are far from the mean of the weight distribution, 0. It is invariant to rescaling the feature values, but not invariant to their shifting (to achieve the latter, the user must add a bias - a constant feature value - for all instances). Due to the factorized structure of p(w), features should be as uncorrelated as possible.

Factor graph of a binary BPM and a mixture of Gamma-Gamma prior distributions over factorized weights

Multi-class classification

The multi-class model is very similar to the described binary Bayes Point Machine model. Instead of a single linear discriminant function, there are now K functions with their respective weight vectors w_c and scores s_c, one for each class c in {1,...,K}. The maximum (noisy) score then determines the class y.

Assumption 6: Class labels have no inherent ordering.

The multi-class model assumes that the class labels have no inherent ordering, and is thus invariant to permuting the set of labels. If the labels in your problem have an order, then you can still use this model, even though you might benefit from a different model.

Factor graph of a multi-class BPM and a mixture of Gamma-Gamma prior distributions over factorized weights

With appropriate symmetry-breaking, using the multi-class model with 2 classes is equivalent to the binary model described above. However, an exception will be thrown if you try this. Finally, the computational cost of prediction in the multi-class model is quadratic in the number of classes K, making it less ideal a choice where K is large.

Source code

The Infer.NET code for the previously described classification models lives in the Learners/ClassifierModels project. The project uses the Infer.NET compiler to generate dedicated training and prediction algorithms from the models.

References

©2009-2015 Microsoft Corporation. All rights reserved.  Terms of Use | Trademarks | Privacy Statement