Universal approximators vs. algorithmic completeness

Finally, it has dawned on us. A problem that we had trouble conceptualizing is the following. On the one hand, for the purposes of universal induction, it is necessary to search in an algorithmically complete space. This is currently not possible because of the huge number of possible programs to search through. Our mind then automatically wandered toward algorithms that don't go through the trouble of searching the whole space available to them. Any parametrized machine learning algorithm, be it neural networks or linear regression, finds solutions by gradient descent or even simpler methods, but none bothers to loop through all weight combinations.

The next thought was that this kind of efficient search is only possible within a fixed, algorithmically incomplete representation. However, it is well known that neural networks are universal function approximators, which sounds like algorithmic completeness, does it not? After all, Turing machines merely compute integer functions. Why then would it be so hard for neural networks to represent some structure not optimally suited to them? For example, a circle can be represented by the equation \( x^2+y^2=const\), which is difficult to represent with, say, sigmoidal neurons. We attributed this difficulty to the fact that there are different reference machines after all, and some things are easy to represent with one machine but difficult with another, and vice versa.

But now we realize that there is something fundamentally different between universal function approximators and algorithmically complete languages. After all, polynomials are universal approximators, too! But have you ever tried to fit a polynomial to an exponential function? It will severely overfit and never generalize.
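To make this concrete, here is a minimal sketch (using NumPy; the interval, degree, and sample count are arbitrary choices for illustration) that fits a low-degree polynomial to samples of \( e^x\) on a small interval and then evaluates it outside that interval: the fit is excellent where it was trained and useless beyond it.

```python
import numpy as np

# Fit a degree-5 polynomial to samples of exp(x) on [0, 2].
x_train = np.linspace(0.0, 2.0, 30)
coeffs = np.polyfit(x_train, np.exp(x_train), deg=5)
poly = np.poly1d(coeffs)

# Inside the fitting interval the approximation looks excellent ...
x_in = np.linspace(0.0, 2.0, 5)
print(np.max(np.abs(poly(x_in) - np.exp(x_in))))    # tiny error

# ... but just outside it the error grows by orders of magnitude.
x_out = np.linspace(3.0, 6.0, 5)
print(np.max(np.abs(poly(x_out) - np.exp(x_out))))  # huge error
```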

So, what is the difference between an algorithmically incomplete universal function approximator and a sub-optimal algorithmically complete reference machine? While the latter differs from another alg. complete machine merely by the constant length needed to encode a compiler between the two machines, the former requires ever more resources as the approximation precision is increased. For example, every alg. complete language will be able to represent the exponential function exactly; some languages will require longer, some shorter expressions, but the representation will be exact. The additional length in some languages is merely due to the fact that some functions are cumbersome to define in them, but once that is done, your function is represented exactly. Conversely, a universal but incomplete approximator will require ever longer representations as you increase precision, and the required size even goes to infinity as the precision requirement becomes exact. More precisely, if \( \epsilon\) is the maximum approximation error that you allow and \( N(\epsilon)\) is the size of the network, then \( N(\epsilon)\) increases as \( \epsilon\) decreases. Our point is that it should not depend on \( \epsilon\) at all. For example, if you can represent the data \( x\) perfectly in some representation with \( M\) bits, then your representation should take \( N = M + \mbox{const.}\) bits, where the constant does not depend on \( x\) (i.e. the constant is just the length of the compiler that translates between the two representations). And this is just not the case for neural networks. That is also why the Taylor expansion of the exponential function is a polynomial of infinite degree: polynomials are simply not suited to represent the exponential function.
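As a rough numerical illustration of \( N(\epsilon)\) growing without bound, the following sketch (plain Python; it uses the standard Taylor remainder bound \( e/(n+1)!\) for \( e^x\) on \( [0,1]\) and measures "size" simply by polynomial degree, which is our own simplifying assumption) computes the minimal degree needed to stay below a given error tolerance.

```python
import math

def degree_needed(epsilon):
    """Smallest Taylor degree n with truncation error e/(n+1)! <= epsilon
    for exp(x) on [0, 1]."""
    n = 0
    while math.e / math.factorial(n + 1) > epsilon:
        n += 1
    return n

for eps in (1e-2, 1e-4, 1e-8, 1e-16):
    print(eps, degree_needed(eps))
# The required degree keeps growing as epsilon shrinks (roughly 5, 7, 11, 18);
# there is no epsilon-independent constant at which it settles.
```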

But why should precision matter? Why do we care about this effect? Should we not rather care about generalization abilities? For example, imagine a family of exponentials \( e^{-x/d}\), where \( d\) is the width of the exponential. If we fit a polynomial to it, we need a different set of polynomial coefficients for each value of \( d\)! Therefore, polynomials don't really capture the structure of the function. An alg. complete language, on the other hand, will always have such a single parameter available, and once the representation is constructed, the whole family is covered. That is actually the point. In order to generalize, it is necessary to recognize the sample as an example of the family it belongs to and to parametrize that family. And neither neural networks nor polynomials can parametrize the exponential family without picking a completely different, infinite (not single!) parameter set for each member. An alg. complete language, however, can do this.
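A quick way to see this (again with NumPy; the degree and interval are arbitrary illustration choices) is to fit the same polynomial form to two members of the family \( e^{-x/d}\): the resulting coefficient vectors bear no simple relation to each other, even though the two functions differ by a single parameter.

```python
import numpy as np

# Fit a degree-6 polynomial to exp(-x/d) on [0, 5] for two widths d.
x = np.linspace(0.0, 5.0, 50)
for d in (0.5, 2.0):
    coeffs = np.polyfit(x, np.exp(-x / d), deg=6)
    print(f"d = {d}: coefficients = {np.round(coeffs, 4)}")
# Every coefficient changes when d changes; the polynomial representation
# has to be recomputed from scratch for each member of the family.
```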

We think the criterion is essentially this: if your data is drawn from a class which is parametrizable with \( n\) parameters in some Turing complete representation, then your representation should be able to express that same data with not (much) more than \( n\) parameters. Otherwise, it is “just an approximation” which may fit the data well but will certainly overfit, since it does not capture the structure of the data. This follows from universal induction. If your data \( x=yz\) is computed by a program \( p\), \( U(p)=x\), then taking a longer program \( q\) that merely computes some prefix \( y\) of \( x\), \( U(q)=yw\), will hardly make \( q\) able to predict the rest of \( x\): \( w\) will not equal \( z\). And it is well known that the shortest program computing \( x\) is, with high probability, the best predictor, so every bit wasted in program length reduces your prediction abilities exponentially, due to the \( 2^{-l(p)}\) terms in the Solomonoff prior!
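For reference, the weighting we mean is the one in the standard formulation of the Solomonoff prior, \( M(x) = \sum_{p\,:\,U(p)=x*} 2^{-l(p)}\), where the sum runs over all programs \( p\) whose output on the universal machine \( U\) begins with \( x\) and \( l(p)\) is the length of \( p\) in bits: a program that is one bit longer than necessary receives only half the weight.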

Another point is that algorithmic completeness may not even be necessary if the data created by Nature comes from an incomplete representation (if the world consisted only of circles, then you would only need to parametrize circles in order to represent the data and generalize from them perfectly). That this is the case is not even improbable, since physics imposes locality and criticality (power-law-distributed mutual information in natural data), which probably makes the data compressible by hierarchies (as reflected in our theory of incremental compression). However, this class is still quite large, and we should be able to capture its structure appropriately.