Hierarchical epsilon machine reconstruction
Having read James P. Crutchfield’s long paper “Calculi of Emergence” (1994), we have to admit that it is very inspiring. Let’s think about the implications.
The first achievement of the paper is the definition of statistical complexity. Note that Kolmogorov complexity is a misnomer, since it measures randomness rather than complexity. Complexity, on the other hand, seems to live somewhere between simplicity and randomness. Crutchfield measures it by taking a string and mapping its past histories onto a set of states, where two histories receive the same state if they predict the same distribution of futures conditioned on the past. Given that definition, the string is essentially mapped onto a Markov chain. Since a Markov chain settles into a stationary distribution over its states after running for a long time, the Shannon entropy of this distribution is the definition of statistical complexity.
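To keep the definition handy, here is a compact restatement in the notation commonly used for epsilon machines (our paraphrase, not a quotation from the paper): two pasts are lumped into the same causal state exactly when they predict the same conditional distribution over futures, and the statistical complexity is the Shannon entropy of the stationary distribution over those states.

```latex
% Causal equivalence of two pasts (semi-infinite histories):
\overleftarrow{s} \sim_\epsilon \overleftarrow{s}\,'
  \iff
  \Pr\!\left(\overrightarrow{S} \mid \overleftarrow{s}\right)
  = \Pr\!\left(\overrightarrow{S} \mid \overleftarrow{s}\,'\right)

% Statistical complexity: Shannon entropy of the stationary distribution
% over the set of causal states \mathcal{S}:
C_\mu = H[\Pr(\mathcal{S})]
      = -\sum_{\sigma \in \mathcal{S}} \Pr(\sigma)\,\log_2 \Pr(\sigma)
```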
However, this definition is only appropriate if we get a finite description. In many cases, however, a Markov chain or finite automaton is not sufficient. Essentially, when we try to increase modeling precision, for example by looking at ever longer past histories and future predictions, the number of states grows without bound. When it grows too large, the modeler is pressed to “invent” a new computational class by grouping the states together into a new automaton, e.g. one with a register or other forms of memory. The suggestion is to repeat this until a finite description is achieved. The statistical complexity is then defined as the Shannon entropy of the state distribution at this highest level of description (the lowest level that yields a finite description, to be precise).
The cool part is that this somehow filters out the random part, even including the pseudo-random part. After all, a theme running through the paper is the following: the dichotomy between simplicity and randomness is somehow paralleled by the dichotomy between order and chaos. This parallel has been intriguing us for a long time. The last time was when we asked ourselves which strings are compressible but not incrementally compressible, i.e. which don’t have any properties. Another reason is that, in our experience with machine learning, computation seems to happen best at the edge between order and chaos. The failure of Kolmogorov “complexity”, let’s call it (algorithmic) entropy, is evident here as well: chaotic sequences often have a low algorithmic entropy, even though they appear random. They appear so convincingly random that we use chaotic sequences to generate pseudo-random numbers in our computers. Somehow statistical complexity gets rid of those complications, slices away the random part and uncovers the “true” computational aspects of the system.
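As a toy illustration of that last point (our example, not the paper’s): the fully chaotic logistic map is a few lines of code, so any sequence it emits has low algorithmic entropy, yet its thresholded output looks like a fair coin.

```python
# Toy illustration: a tiny deterministic program (hence low algorithmic
# entropy / Kolmogorov complexity) whose output looks statistically random.
# The logistic map at r = 4 is fully chaotic; thresholding its orbit at 0.5
# yields an apparently random bit stream.

def logistic_bits(x0: float, n: int) -> str:
    """Generate n bits from the logistic map x -> 4x(1-x), thresholded at 0.5."""
    x, bits = x0, []
    for _ in range(n):
        x = 4.0 * x * (1.0 - x)
        bits.append('1' if x > 0.5 else '0')
    return ''.join(bits)

if __name__ == "__main__":
    stream = logistic_bits(0.37, 10_000)
    print(stream[:80])                                            # looks like coin flips
    print(f"fraction of ones: {stream.count('1') / len(stream):.3f}")  # close to 0.5
```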
Another aspect of the discussion is the interesting interplay between different perspectives: an observer modeling a natural system or data set, the AI aspiration of automating this very modeling as a means to implement universal induction, and nature’s creation of living beings that model their environments while getting ever more complex. The latter aspect is quite mesmerizing because of the interplay between order and chaos in nature: nature seems to be a process of self-organized criticality, which means that the edge between order and chaos is an attractor, i.e. a stable state. If that state is exactly where innovation happens, then it could explain why living beings become increasingly complex. And here is another contribution of the paper to that aspect: it is at the edge of chaos that the number of states tends to diverge to infinity. After all, both orderly/regular and chaotic regimes seem to lead to finite descriptions. But it is the divergence of the number of states that brings the modeling to its resource limits, forcing it to switch to a higher representation, a more powerful machine, a higher complexity level.
Yet another inspiring aspect of the paper is the parallel between “state grouping”, which is described as a process that models the symmetries of the data, and detecting features in our incremental compression (IC) theory. We mentioned above that we suspect it may be precisely the complex strings that are incrementally compressible. Given that hierarchical epsilon machine reconstruction is declared to be a process that incrementally detects symmetries, the parallel is fairly strong. Here, in the questions about incremental/hierarchical construction and in the interplay between order and chaos, lies a deep mystery to be uncovered, about evolution, the origin of complexity in nature and the essence of intelligence. We are quite confident about this.
There seems to be a deep parallel between the evolutionary process of incremental innovation and the incremental process of induction. In the paper, innovation seems to consist of two parts. One is state grouping, which is a well-defined procedure, independent of the data. The other part is the “true novelty”, which is found by random search in a small space that results from the state grouping. In IC theory, no such random search is necessary once the regularity/feature/state grouping is found. Another aspect that comes to mind is that this hierarchical epsilon machine process will often lead to a deep hierarchy for natural data, since there is structure on all scales; hence we get this open-ended hierarchical construction process. This reveals the role of modeling precision: the more precisely we model, the more we zoom in on the data, the deeper our hierarchies will have to become. Hence, it is not just living beings that become ever more complex during evolution; it is the structure of matter itself that organizes itself in such a way that it can only be modeled by a deep hierarchy. The author views living beings as “dual” to their environment (matter), which is what it means to fill a “niche” in the environment. Thus it seems that this self-organized criticality (SOC) process of nature forces matter into complex states, which in turn forces living beings to increase in complexity as well if they want to model those states properly in order to predict their environment and survive. Unfortunately, SOC is not sufficient for complexity, as evidenced by the simplicity of the showcase SOC model: the sand pile. Therefore, we don’t really know how nature forces matter into complexity.
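To make the point about the sand pile concrete, here is a minimal sketch of the standard Bak-Tang-Wiesenfeld sandpile (the textbook SOC model, not anything taken from Crutchfield’s paper): the entire mechanism is a handful of local toppling rules, yet it drives itself to a critical state with avalanches on all scales.

```python
# Minimal Bak-Tang-Wiesenfeld sandpile: drop grains on a grid; any site
# holding 4 or more grains topples, passing one grain to each neighbor
# (grains at the edge fall off). The rules are trivial, yet the pile
# self-organizes to a critical state with avalanches on all scales.
import random

N = 50                                   # grid size
grid = [[0] * N for _ in range(N)]

def drop_and_relax(i: int, j: int) -> int:
    """Add one grain at (i, j), topple until stable, return the avalanche size."""
    grid[i][j] += 1
    avalanche = 0
    unstable = [(i, j)]
    while unstable:
        x, y = unstable.pop()
        if grid[x][y] < 4:
            continue
        grid[x][y] -= 4
        avalanche += 1
        if grid[x][y] >= 4:              # still unstable after one toppling
            unstable.append((x, y))
        for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
            if 0 <= nx < N and 0 <= ny < N:
                grid[nx][ny] += 1
                if grid[nx][ny] >= 4:
                    unstable.append((nx, ny))
    return avalanche

if __name__ == "__main__":
    sizes = [drop_and_relax(random.randrange(N), random.randrange(N))
             for _ in range(20_000)]
    print(f"largest avalanche: {max(sizes)} topplings, "
          f"avalanches > 100 topplings: {sum(s > 100 for s in sizes)}")
```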
Another interesting aspect of the paper is the following. The process of hierarchical epsilon machine reconstruction requires building a machine by grouping those histories that entail the same distribution over futures. Increasing the precision, i.e. decreasing epsilon, increases the length of the look-back and look-ahead sequences, which in turn increases the number of reconstructed states. If this number grows without bound as precision increases, the advice is to switch to a more powerful machine, by grouping the new states again or by some other innovation, and to repeat this until a finite representation is reached.
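Here is a rough, first-level sketch of that grouping step (our own simplification, with made-up parameter names past_len, future_len and eps; it is not Crutchfield’s algorithm): estimate the future distribution conditioned on each observed past, merge pasts whose distributions agree to within epsilon, and report the number of states plus the entropy of their occupation weights.

```python
# Rough sketch of one level of epsilon-machine reconstruction (a simplification,
# not Crutchfield's algorithm): group pasts whose empirical future distributions
# agree to within eps, then take the entropy of the state occupation weights.
from collections import Counter, defaultdict
from math import log2

def reconstruct(s: str, past_len: int, future_len: int, eps: float):
    # Empirical distribution over length-future_len futures for each observed past.
    futures = defaultdict(Counter)
    for i in range(past_len, len(s) - future_len + 1):
        futures[s[i - past_len:i]][s[i:i + future_len]] += 1
    dists = {p: {f: c / sum(cnt.values()) for f, c in cnt.items()}
             for p, cnt in futures.items()}

    # Greedy merge: a past joins the first state whose representative
    # distribution lies within eps in total variation distance.
    states = []                          # list of (representative dist, member pasts)
    for past, dist in dists.items():
        for rep, members in states:
            keys = set(rep) | set(dist)
            tv = 0.5 * sum(abs(rep.get(k, 0.0) - dist.get(k, 0.0)) for k in keys)
            if tv <= eps:
                members.append(past)
                break
        else:
            states.append((dist, [past]))

    # Weight each state by how often its member pasts occur; entropy of the weights.
    total = sum(sum(cnt.values()) for cnt in futures.values())
    weights = [sum(sum(futures[p].values()) for p in members) / total
               for _, members in states]
    c_est = -sum(w * log2(w) for w in weights if w > 0)
    return len(states), c_est

if __name__ == "__main__":
    sample = "01" * 5000                 # a period-2 process: expect 2 causal states
    for L in (1, 2, 3, 4):
        n, c = reconstruct(sample, past_len=L, future_len=L, eps=0.05)
        print(f"past length {L}: {n} states, complexity ~ {c:.2f} bits")
```

For sources with structure on longer ranges than past_len can capture, the state count keeps growing as past_len is increased, and that divergence is exactly the signal to move up the hierarchy.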
This process reminds us of the “universal approximation” property of neural networks (see our previous post on that). The number of neurons generally also grows without bound with increasing precision. Could that, too, be due to the simple fact that the present representation is not powerful enough? Maybe the particular neural network has to switch to some more powerful computational class? After all, our observations in that post implied that universal approximation is not the same as algorithmic completeness. Crutchfield’s paper helps to flesh out the difference: a simple representation can often approximate the data arbitrarily precisely, but at the expense of a large number of parameters, while a more powerful computational class can find a finite representation. The paper suggests a way of doing this, but we are afraid that this “state grouping” procedure is too specific. Can we describe formally and more precisely what happens here?
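A toy contrast along those lines (our construction, not something from the paper or from the earlier post): a stream built from blocks of n ones followed by n zeros can be predicted almost perfectly by a machine with a single counter register, i.e. a finite description, whereas a fixed-order lookup-table predictor only catches up by letting its order, and hence its parameter count, grow with the largest n.

```python
# Toy contrast: a lookup-table (order-k Markov) predictor needs ever more
# parameters to handle blocks "1"*n + "0"*n with large n, while a predictor
# with one counter register has a fixed, finite description.
import random
from collections import Counter, defaultdict

def sample_blocks(n_max: int, blocks: int) -> str:
    return ''.join('1' * n + '0' * n
                   for n in (random.randint(1, n_max) for _ in range(blocks)))

def markov_accuracy(s: str, k: int) -> float:
    """Order-k lookup table (up to 2**k contexts), evaluated in-sample."""
    table = defaultdict(Counter)
    for i in range(k, len(s)):
        table[s[i - k:i]][s[i]] += 1
    def predict(ctx):
        return table[ctx].most_common(1)[0][0]
    hits = sum(predict(s[i - k:i]) == s[i] for i in range(k, len(s)))
    return hits / (len(s) - k)

def counter_accuracy(s: str) -> float:
    """One counter: remember how many 1s were seen, count the 0s back down."""
    hits, ones, zeros = 0, 0, 0
    for i in range(len(s) - 1):
        if s[i] == '1':
            ones += 1
            pred = '1'                   # block boundaries stay unpredictable
        else:
            zeros += 1
            pred = '0' if zeros < ones else '1'
            if zeros == ones:
                ones = zeros = 0
        hits += (pred == s[i + 1])
    return hits / (len(s) - 1)

if __name__ == "__main__":
    s = sample_blocks(n_max=10, blocks=5_000)
    for k in (2, 4, 8, 12):
        print(f"order {k:2d} ({2**k:5d} possible contexts): "
              f"accuracy {markov_accuracy(s, k):.3f}")
    print(f"single-counter predictor:             accuracy {counter_accuracy(s):.3f}")
```

The lookup table keeps improving as its order grows, but its parameter space explodes, while the counter machine stays small; this is the sense in which approximation power and representational power come apart.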
Now that we think about it, there are several ways in which hierarchical epsilon machine reconstruction (HEMR) differs from IC theory. First, although the author claims that the goal of state grouping is to find symmetries (features) of the data, this is not what happens. Rather, the attempt is to find a full description of the data in a single step of the hierarchy by using a not-so-powerful machine. After all, the goal of HEMR is to separate the random part from the “intrinsic computation” part, i.e. signal from noise. In IC theory, however, one tries to decompose the signal into various features, one of which will turn out to be noise; it is carried through the parameters up to the highest layer, since it has no features and is not compressible. Thus, IC decomposes the signal into crisp, separate parts using functions/features from a Turing-complete set right from the start, while HEMR tries to model the whole signal at once with a not-so-powerful machine and steps up the computational hierarchy if that machine doesn’t suffice. If we combine IC and HEMR in a smart way, we might get a procedure that decomposes a string into parts that can be searched for by increasingly powerful machines. This would alleviate the IC problem of having to search for features in a Turing-complete space and the HEMR problem of having to find a full description in a single step. Could we place HEMR on the firm ground of algorithmic information theory?
And what is the fundamental reason for statistical complexity to peak at intermediate entropies? It seems that this peak corresponds to the most complex intrinsic structure, which has to be modeled by the more powerful machines.