Bayesian learning views the problem of constructing hypotheses from data as a subproblem of the more fundamental problem of making predictions. The idea is to use hypotheses as intermediaries between data and predictions. First, the probability of each hypothesis is estimated, given the data. Predictions are then made from the hypotheses, using the posterior probabilities of the hypotheses to weight the predictions. As a simple example, consider the problem of predicting tomorrow's weather. Suppose the available experts are divided into two camps: some propose model A, and some propose model B.

The Bayesian method, rather than choosing between A and B, gives some weight to each based on their likelihood. The likelihood will depend on how much the known data support each of the two models. Suppose that we have data D and hypotheses H_1, H_2, ..., and that we are interested in making a prediction concerning an unknown quantity X. Furthermore, suppose that each H_i specifies a complete distribution for X. Then we have

P(X|D) = Σ_i P(X|D,H_i) P(H_i|D) = Σ_i P(X|H_i) P(H_i|D)

This equation describes full Bayesian learning, and may require a calculation of P(H_i|D) for all H_i. In most cases, this is intractable; it can be shown, however, that there is no better way to make predictions. The most common approximation is to use a most probable hypothesis, that is, an H_i that maximizes P(H_i|D).
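The weighted prediction can be made concrete with a minimal sketch. The coin-bias hypothesis space, priors, and data below are illustrative assumptions, not from the text: each hypothesis H_i fixes P(heads), and the prediction for the next flip sums each hypothesis's prediction weighted by its posterior P(H_i|D).

```python
# Hypotheses: candidate coin biases, mapped to (assumed) prior probabilities.
hypotheses = {0.25: 1/3, 0.5: 1/3, 0.75: 1/3}   # {P(heads|H_i): P(H_i)}

# Observed data D: a sequence of flips (1 = heads, 0 = tails).
D = [1, 1, 0, 1]

def posterior(hypotheses, data):
    """Compute P(H_i|D) proportional to P(D|H_i) P(H_i), then normalize."""
    unnorm = {}
    for bias, prior in hypotheses.items():
        likelihood = 1.0
        for flip in data:
            likelihood *= bias if flip == 1 else (1 - bias)
        unnorm[bias] = likelihood * prior
    z = sum(unnorm.values())          # z plays the role of P(D)
    return {h: p / z for h, p in unnorm.items()}

post = posterior(hypotheses, D)

# Full Bayesian prediction for X = "next flip is heads":
# P(X|D) = sum_i P(X|H_i) P(H_i|D)
p_heads = sum(bias * p for bias, p in post.items())   # 29/46, about 0.63
```

Note that no single hypothesis is selected; all three biases contribute to the prediction in proportion to how well they explain the data.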

This is often called a maximum a posteriori or MAP hypothesis, H_MAP:

P(X|D) ≈ P(X|H_MAP) P(H_MAP|D)

The problem is now to find H_MAP. By applying Bayes' rule, we can rewrite P(H_i|D) as follows:

P(H_i|D) = P(D|H_i) P(H_i) / P(D)

Notice that in comparing hypotheses, P(D) remains fixed. Hence, to find H_MAP, we need only maximize the numerator of the fraction. The first term, P(D|H_i), represents the probability that this particular data set would have been observed, given H_i as the underlying model of the world. The second term represents the prior probability assigned to the given model.
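Because P(D) is the same for every hypothesis, the MAP hypothesis can be found by maximizing P(D|H_i) P(H_i) without ever computing the normalizer. A minimal sketch, again with an illustrative coin-bias hypothesis space and a non-uniform assumed prior:

```python
# {P(heads|H_i): prior P(H_i)} -- illustrative values, not from the text.
hypotheses = {0.25: 0.5, 0.5: 0.3, 0.75: 0.2}
D = [1, 1, 0, 1]   # observed flips (1 = heads)

def unnormalized_posterior(bias, prior, data):
    """P(D|H_i) P(H_i): the numerator of Bayes' rule."""
    likelihood = 1.0
    for flip in data:
        likelihood *= bias if flip == 1 else (1 - bias)
    return likelihood * prior

# H_MAP maximizes the numerator; P(D) is constant across hypotheses.
h_map = max(hypotheses,
            key=lambda b: unnormalized_posterior(b, hypotheses[b], D))
```

Here three heads in four flips outweigh the prior's preference for the lower bias, so the 0.75 hypothesis wins.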

Arguments over the nature and significance of this prior probability distribution, and its relation to preference for simpler hypotheses (Ockham's razor), have raged unchecked in the statistics and learning communities for decades. The only reasonable policy seems to be to assign prior probabilities based on some simplicity measure on hypotheses, such that the priors over the entire hypothesis space sum to 1. The more we bias the priors towards simpler hypotheses, the more we will be immune to noise and overfitting. Of course, if the priors are too biased, then we get underfitting, where the data is largely ignored. There is a careful trade-off to make. In some cases, a uniform prior over belief networks seems to be appropriate, as we shall see. With a uniform prior, we need only choose an H_i that maximizes P(D|H_i). This is called a maximum-likelihood (ML) hypothesis, H_ML.
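The equivalence of ML and MAP under a uniform prior is easy to check directly: multiplying every likelihood by the same constant prior cannot change which hypothesis wins the argmax. A sketch with illustrative biases and data:

```python
biases = [0.25, 0.5, 0.75]   # candidate P(heads|H_i), illustrative
D = [1, 0, 1, 1, 1]          # observed flips (1 = heads)

def likelihood(bias, data):
    """P(D|H_i) for independent flips under bias = P(heads|H_i)."""
    p = 1.0
    for flip in data:
        p *= bias if flip == 1 else (1 - bias)
    return p

# H_ML maximizes the likelihood alone.
h_ml = max(biases, key=lambda b: likelihood(b, D))

# With a uniform prior P(H_i) = 1/3, maximizing P(D|H_i) * (1/3)
# selects the same hypothesis: the constant factor drops out.
h_map_uniform = max(biases, key=lambda b: likelihood(b, D) * (1/3))
assert h_ml == h_map_uniform
```

With a non-uniform prior the two criteria can disagree, which is exactly the trade-off between fitting the data and respecting the simplicity bias described above.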