The merits of indefinite regress

The whole field of machine learning, and artificial intelligence in general, is plagued by a particular problem: the well-known curse of dimensionality. In a nutshell, the curse means that whenever we try to increase the dimension of our search spaces in order to make our models more expressive, the number of search points we have to visit before finding a satisfactory solution grows exponentially.

In practice, people are then forced to use smaller models with manageable parameter spaces, which sets up the whole problem of “narrow AI”: we end up being able to solve only narrowly defined, specific tasks.

We should know better, though, than to search more or less blindly through vast search spaces. Decades of practice in machine learning, philosophical thinking, scientific theories and theoretical computer science have demonstrated the validity of a well-known principle, Occam’s razor: “Entities must not be multiplied beyond necessity” (Non sunt multiplicanda entia sine necessitate). Especially in recent decades, Solomonoff’s theory of universal induction has shown how important it is to find simple solutions to problems, i.e. that simple solutions are a priori exponentially more probable than complex ones.

Of course, we have known this for a long time and have been earnestly trying to reduce the size of our models and the number of their parameters. We have introduced various information criteria, such as the Akaike Information Criterion and the Bayesian Information Criterion, and various penalty terms on model size. However, as we will argue, this is not nearly good enough, and we will suggest a solution.

The problem with introducing a simplicity bias comes from the fact that we do not know how to compute the simplicity or “complexity” of a data set. Fortunately, we do have a formal definition of “complexity” – the Kolmogorov complexity, which we will call algorithmic entropy (it is actually more closely related to entropy than to complexity; after all, randomness is usually not considered complex in the everyday sense of the word). The algorithmic entropy of a data string \( x\) is given by

\( C(x)=\min\{l(p):\; U(p)=x\}\)

which is the length \( l(p)\) of the shortest program \( p\) that computes the data set on a universal Turing machine \( U\). In practice, in order to know the algorithmic entropy of a data set, we would have to find the shortest program able to generate it and measure its length, i.e. the number of bits it takes to specify the program.
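Since \( C(x)\) itself is uncomputable, one can only ever obtain upper bounds on it in practice. As a hedged illustration, an off-the-shelf compressor like zlib already gives such a crude upper bound and cleanly separates regular from random data; this is merely a stand-in for finding genuinely short programs:

```python
import os
import zlib

def entropy_upper_bound_bits(data: bytes) -> int:
    """Crude upper bound on the algorithmic entropy of `data` in bits:
    the zlib-compressed form plus a fixed decompressor is one particular
    program that regenerates the data, so its length bounds C(x) from above."""
    return 8 * len(zlib.compress(data, 9))

regular = b"01" * 500          # a highly regular 1000-byte string
random_ = os.urandom(1000)     # almost surely incompressible

print(entropy_upper_bound_bits(regular))  # far fewer than 8000 bits
print(entropy_upper_bound_bits(random_))  # close to (or slightly above) 8000 bits
```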

However, finding a short description is the reason we started this whole endeavour in the first place: we need the algorithmic entropy in order to evaluate the appropriateness of candidate solutions, but we need the solutions in order to compute their entropy – a chicken-and-egg problem. And like most chicken-and-egg problems in computer science, this one may be solvable in an iterative way.

Consider a typical setup of a machine learning problem. We have chosen some model and want to fit its parameters to the data using some cost function. The size of the model is defined by the number of parameters we can vary. For example, the size of a neural network can be regulated by the number of neurons involved, which in turn determines the number of weights that need to be fitted.

Ideally, we do not want to go through the parameter space blindly but to try “simple”, low-entropy parameter settings first. For example, a convolutional layer of a neural network consists of neurons that each carry only a small number of weights, and all neurons share the same set of weights. “Simple” means that the program describing the weights will be short, since it only needs to specify the few weights of a single neuron. By trying the simple settings first we honour Occam’s razor and can stop as soon as we find a solution that fits the data well.
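To put rough numbers on the weight-sharing point (the layer sizes and the 32-bit encoding below are made-up choices, purely for illustration), compare the bits needed to write down a fully connected layer with those needed for a convolutional layer over the same input:

```python
# Hypothetical sizes, chosen only to illustrate the gap in description length.
input_size = 1024        # a 1-D input with 1024 values
kernel_size = 9          # one shared 9-weight kernel in the convolutional layer
bits_per_weight = 32     # naive 32-bit encoding per weight

# Fully connected layer: each of the 1024 output neurons has its own 1024 weights.
dense_bits = input_size * input_size * bits_per_weight   # 33,554,432 bits

# Convolutional layer: every output neuron reuses the same 9 weights.
conv_bits = kernel_size * bits_per_weight                # 288 bits

print(dense_bits, conv_bits)
```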

The problem is, however, the following. For a fixed size of the parameter set, which can be represented as a binary string of fixed length, the fraction of simple strings is very low. After all, a simple counting argument shows that there are at most \( 2^{n-c+1}-1\) programs of length at most \( n-c\), i.e. programs that could compress a string of length \( n\) by at least \( c\) bits. Since there are \( 2^n\) strings of length \( n\), the fraction of strings that can be compressed by at least \( c\) bits is at most about \( 2^{-c+1}\). There are two things to learn from this. First, the fraction of simple parameter descriptions does not depend on the length \( n\) of the parameter string, i.e. on the number of parameters. And second, the fraction drops very fast, exponentially with every compressed bit. Hence, the number of simple descriptions is very low.
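The counting bound is easy to check numerically; as claimed, it does not depend on \( n\):

```python
def max_fraction_compressible(n: int, c: int) -> float:
    """Upper bound on the fraction of n-bit strings that can be described
    by a program at least c bits shorter than the string itself."""
    programs = 2 ** (n - c + 1) - 1   # number of binary programs of length <= n - c
    strings = 2 ** n                  # number of n-bit strings
    return programs / strings

for n in (100, 1000):
    for c in (10, 20, 30):
        # roughly 2**(1 - c), identical for n = 100 and n = 1000
        print(n, c, max_fraction_compressible(n, c))
```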

Therefore, if a search of the parameter space with an appropriate bias toward simple parameter sets is to succeed, we need a way to fish out those very few simple sets from the vast number of random ones. Note also that merely reducing the number of parameters does not help much in finding the simple ones: the fraction of simple sets remains constant for any number of parameters.

To get an intuition about the problem, consider a parameter set in the form of a 100-bit string. Searching through all \( 2^{100}\) strings is hopeless, since it is such a large number. We would like to go through only the strings with an algorithmic entropy of less than 50 bits. Running up to \( 2^{50}\) programs of at most 50 bits that print 100-bit strings is manageable by comparison: we drastically reduce our parameter search space while giving priority to simpler parameter sets. However, there are two problems: more complex parameter sets cannot be represented that way, and the 50-bit programs are themselves not ordered by their entropy. Thus, only a small fraction of them is simple, and we end up searching blindly through a mostly random program space – the same problem as with the parameters. Now, if we represent the programs by a model with its own parameters, we can proceed the same way as before: build another model on top of it that generates simple parameter sets. In essence, in order to achieve a simplicity-biased search through the parameter space, it seems like a good idea to build a meta-model, a meta-meta-model and so on – an indefinite regress of models!
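As a toy sketch of what “simple strings first” can look like, the decoder below, which merely tiles a short program to 100 bits, is an invented stand-in for a richer (ideally Turing complete) program space; it shows that enumerating programs by length yields the most regular 100-bit parameter strings long before any random-looking ones:

```python
from itertools import product

TARGET_LEN = 100

def toy_decoder(program: str) -> str:
    """Invented toy decoder: tile the program's bits until 100 output bits
    are produced. It stands in for a richer program-to-parameters mapping."""
    reps = -(-TARGET_LEN // len(program))        # ceiling division
    return (program * reps)[:TARGET_LEN]

def candidates_by_description_length(max_program_bits: int):
    """Yield distinct 100-bit parameter strings, shortest descriptions first."""
    seen = set()
    for length in range(1, max_program_bits + 1):
        for bits in product("01", repeat=length):
            out = toy_decoder("".join(bits))
            if out not in seen:
                seen.add(out)
                yield out

# The first candidates are maximally regular ("000...", "0101...", ...);
# a typical random 100-bit string only appears once programs approach 100 bits.
for i, params in enumerate(candidates_by_description_length(8)):
    if i >= 5:
        break
    print(i, params)
```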

One may ask why we should build a 50-bit meta-model that prints the 100-bit parameters of the first-level model, rather than simply using 50-bit parameters without a meta-model, which amounts to the same total entropy for the whole model pyramid. The answer is that 100-bit parameters are simply more expressive, even though only a small fraction of them – at most \( 2^{50}\) sets – is ever considered.

How could this Occam-biased search be performed? Here is an outline:

  1. Take a model \( M\) with a variable set of parameters and the data set \( \vec{p}_0=\vec{x}\) to be generated. Set \( l=0\).
  2. Set \( l\leftarrow l+1\), \( n_l=1\), and define a new meta-model whose output estimates the parameters of the previous level: \( \hat{\vec{p}}_{l-1}=M(\vec{p}_l)\).
  3. Use \( \vec{p}_l\) to compute an estimate of the input, \( \hat{\vec{x}}=M(\vec{p}_1)=M(M(\vec{p}_2))=\cdots=M(M(\cdots M(\vec{p}_l)))\), and use your favourite search algorithm to minimize the objective function \( E(\vec{p}_l, \vec{x})\) by searching through the \( n_l\)-dimensional parameter space (\( n_l=\mbox{dim}(\vec{p}_l)\)).
  4. If a termination criterion is satisfied, break and return the best parameter set.
  5. Else, set \( n_l\leftarrow n_l+1\). If \( n_l>N\), go to (2), else go to (3).

This gets the main idea across. The limit \( N\) on the number of parameters per level should be chosen fairly small, just large enough that there is already a noticeable difference between simple and complex parameter sets.
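Here is a minimal sketch of the outlined loop. Every concrete choice in it is an assumption made for illustration: \( M\) is a toy polynomial evaluator, the per-level search is an exhaustive sweep over 4-bit integer parameters, the objective \( E\) is mean squared error, and the termination test is a fixed tolerance.

```python
import itertools
import numpy as np

def M(params, out_dim):
    """Toy self-describing model: treat `params` as polynomial coefficients
    (highest degree first) and evaluate them on a grid of `out_dim` points."""
    grid = np.linspace(-1.0, 1.0, out_dim)
    return np.polyval(np.asarray(params, dtype=float), grid)

def expand(p_top, lower_dims):
    """Compute M(M(...M(p_l))): map the top-level parameters down to an estimate of x."""
    est = p_top
    for d in lower_dims:                  # output dimensions of the levels below, top-down
        est = M(est, d)
    return est

def occam_search(x, N=3, max_levels=3, tol=1e-6):
    values = range(-8, 8)                 # 4-bit integer parameters, for illustration
    lower_dims = [len(x)]                 # level 1 maps its parameters straight to x-hat
    best = (np.inf, None, None)           # (error, level, parameters)
    for level in range(1, max_levels + 1):
        for n_l in range(1, N + 1):       # grow the top-level parameter count slowly
            for p_l in itertools.product(values, repeat=n_l):   # exhaustive small search
                err = float(np.mean((expand(p_l, lower_dims) - x) ** 2))
                if err < best[0]:
                    best = (err, level, p_l)
                if err < tol:
                    return best           # the simplest adequate description wins
        lower_dims = [N] + lower_dims     # freeze this level at N dims, add a meta-level on top
    return best

# Example: data generated by the quadratic 2x^2 - 1 is recovered at level 1 with 3 parameters.
x = M([2, 0, -1], 50)
print(occam_search(x))
```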

Obviously, this approach requires a model that can describe itself, i.e. its own parameters, and at each step in the hierarchy the model has to compress its input somewhat. The approach is reminiscent of deep neural networks, where the model itself is simple – a sigmoid neuron – while the parameter space, the space where the weights live, can be huge. Inputs are described by hidden neurons, which are in turn described by yet other hidden neurons, and so on. Note, however, a significant restriction in typical deep networks: the activations of the hidden neurons are taken as the representation/description of the input, while the entropy in the weights is neglected. For the present approach to work well, both the entropy in the hidden units and the entropy in the weights have to be taken into account.
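As a rough sketch of what counting both kinds of entropy could mean (the uniform 8-bit quantization below is an arbitrary assumption, not a standard recipe), one can compare the usual accounting, which only charges for the hidden code, with one that also charges for the weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, input_dim, hidden_dim = 1000, 64, 4
bits_per_number = 8                                  # assumed uniform quantization

weights = rng.normal(size=(input_dim, hidden_dim))   # weights of a small layer
hidden = rng.normal(size=(n_inputs, hidden_dim))     # hidden codes for all inputs

# "Usual" accounting: only the hidden activations count as the description.
hidden_only_bits = hidden.size * bits_per_number

# Accounting that also charges for the weights, amortized over the data set.
full_bits = hidden.size * bits_per_number + weights.size * bits_per_number

print(hidden_only_bits, full_bits)
# A model could shrink the hidden code while memorizing the data in ever larger
# weight matrices; only the second accounting would notice.
```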

Further note that the meta-models will tend to become ever smaller and that the overall description length of the model stack is limited by the entropy of the input (there is no point in creating a hierarchy of models with higher entropy than the inputs, since it will not generalize). Overall, the approach resembles a growing pyramid of models stacked onto each other as we move from simpler to more complex descriptions.

What can we hope to gain from this approach? Consider again the 100-bit parameter string example. What if the solution is a simple 100-bit parameter string? In that case, the situation has three properties:

  1. The search space of \( 2^{100}\) strings is too large to be searched through – exhaustively or in any other way. That is the curse of dimensionality.
  2. The solution is however contained in that space.
  3. The solution is algorithmically simple, since whatever actually occurs in practice tends to have a high prior probability, and a high prior probability means low algorithmic entropy.

If you think about it, this is quite a common situation. The classical solution to the problem has been to consider narrow tasks where the search space is small enough to afford a non-Occam-biased search, or to pick a simple model such as linear regression, where a strong assumption of linearity allows an efficient search in a high-dimensional parameter space.

Here, we have presented a way to go beyond those restrictions. Who knows, maybe a not-too-bad search would thus be possible even in a Turing complete search space. It looks like this is what one has to do in order to perform efficient and Occam-bias-optimal induction. Interestingly, it entails models described by meta-models, which are themselves described by other and yet other meta-models – an indefinitely deep regress. Doing this in a Turing complete language requires homoiconicity – the property of being able to treat programs as data – of which LISP is the prominent example.

Does this idea finally break out of the narrow AI paradigm? We have come up with a simple test of whether a system is narrow or not. Let a model have a set of \( n\) parameters, each taking \( D\) bits to specify. Can we think of an input best fitted by a set of \( m>n\) parameters whose description is nevertheless smaller than \( n\,D\) bits? Then we have found an input whose parameters can be described in a simple way, albeit not in the present representation, even though that representation allows for the required entropy. For example, consider the space of polynomials with three coefficients trying to fit points along the degree-6 polynomial \( 6x^6+5x^5+4x^4+3x^3+2x^2+1x^1+0x^0\). Of course, this is not possible. Let each coefficient take integer values from -8 to +7, i.e. 4 bits to specify. Then specifying a polynomial with three coefficients takes \( 3\times 4=12\) bits. However, it takes much less information to specify that particular degree-6 polynomial, since it is so regular. So far, we have not met a single practical machine learning technique or model for which a simple example could not be found that lies outside the scope of the model.

How does our indefinite regress perform in that respect? If \( M\) is the space of polynomials, limiting the entropy to 12 bits will still include simple high-degree polynomials like the degree-6 example above, since it takes only a degree-1 polynomial, a line, to specify the declining coefficients 6, 5, 4, 3, 2, 1, 0, and that line fits easily within the 12-bit budget. Thus, we seem to be holding an approach in our hands that has the potential to be truly general and to break the curse of narrow AI.
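To make the two-level description concrete, here is a tiny sketch reusing the coefficient range from the example above: a line over the coefficient index regenerates the seven coefficients 6, 5, 4, 3, 2, 1, 0, so two 4-bit numbers at the meta-level stand in for seven at the base level.

```python
import numpy as np

# Meta-level: a degree-1 polynomial (a line) over the coefficient index k.
slope, intercept = -1, 6                              # 2 parameters, 4 bits each: 8 bits
coeffs = [slope * k + intercept for k in range(7)]    # [6, 5, 4, 3, 2, 1, 0]

# Base level: the degree-6 polynomial those coefficients define (highest degree first).
x = np.linspace(-1.0, 1.0, 20)
y = np.polyval(coeffs, x)                             # 6x^6 + 5x^5 + ... + 1x + 0

print(coeffs)   # storing these directly would cost 7 * 4 = 28 bits
print(y[:3])    # a few points of the regenerated degree-6 curve
```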

On a more general note, the approach is reminiscent of the infinite regress of thought, where we can become aware of a thought, then become aware of that awareness, then of that awareness, and so on. It suggests the thought-provoking hypothesis that it may be the necessity of performing proper Occam-bias-optimal inference that has made awareness and self-awareness possible in the course of the history of the human mind.